In a Fully Convolutional Network (FCN), the layers added on top of the pretrained backbone are initialized in different ways. The transposed convolutional layer, which upsamples the feature maps back to the size of the input image, is initialized with a bilinear interpolation kernel so that it starts out performing bilinear upsampling. The $$1 \times 1$$ convolutional layer, which maps the number of channels to the number of target classes, is initialized using Xavier initialization.

```python
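# PyTorch
# Bilinear weights for the upsampling layer (bilinear_kernel is a helper defined separately)
W = bilinear_kernel(num_classes, num_classes, 64)
net.transpose_conv.weight.data.copy_(W)
# Xavier initialization for the 1x1 convolutional layer
nn.init.xavier_uniform_(net.final_conv.weight)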
```

```python
# MXNet
W = bilinear_kernel(num_classes, num_classes, 64)
net[-1].initialize(init.Constant(W))
net[-2].initialize(init=init.Xavier())
```
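
The `bilinear_kernel` helper called above is not shown in this section. As a rough, framework-free sketch of the idea, the function below builds a single `kernel_size × kernel_size` bilinear filter as nested lists. The name `bilinear_filter` is hypothetical and this is not the helper's actual code; a full helper would presumably also copy this filter into the matching input/output channel pairs of the transposed convolution's weight tensor.

```python
def bilinear_filter(kernel_size):
    """Return a kernel_size x kernel_size bilinear interpolation filter."""
    factor = (kernel_size + 1) // 2
    # Centre of the filter: an integer index for odd sizes, a half index for even sizes.
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    # 1D triangular ("tent") weights that fall off linearly from the centre.
    tent = [1 - abs(i - center) / factor for i in range(kernel_size)]
    # The 2D filter is the outer product of the 1D weights with themselves.
    return [[row * col for col in tent] for row in tent]


filt = bilinear_filter(4)
print(filt[0])  # [0.0625, 0.1875, 0.1875, 0.0625]
```

The weights are largest near the centre and decay linearly towards the edges, which is what makes the layer behave like bilinear upsampling at the start of training.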

Fully Convolutional Network Initialization

To restore the spatial dimensions of the feature maps to those of the original input image, a Fully Convolutional Network (FCN) uses a transposed convolutional layer. If the height and width need to be increased by a factor of $$ s $$, the transposed convolution is given a stride of $$ s $$. To land exactly on the original dimensions, the padding is set to $$ s/2 $$ (assuming $$ s/2 $$ is an integer) and the height and width of the convolution kernel are set to $$ 2s $$. For instance, to upsample a feature map by a factor of 32, the stride is 32, the padding is 16, and the kernel size is 64.

```python
# PyTorch
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes, kernel_size=64, padding=16, stride=32))
```

```python
# MXNet
net.add(nn.Conv2DTranspose(num_classes, kernel_size=64, padding=16, strides=32))
```
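
The choice of padding and kernel size can be checked with the usual output-size formula for a transposed convolution, $$(n - 1)s - 2p + k$$, assuming no dilation and no output padding. The helper name `tconv_out_size` below is illustrative only.

```python
def tconv_out_size(n, kernel_size, padding, stride):
    """Output length of a transposed convolution (no dilation, no output padding)."""
    return (n - 1) * stride - 2 * padding + kernel_size


s = 32
# With stride s, padding s/2 and kernel size 2s, an input of length n maps to n * s.
for n in (10, 15):
    print(n, "->", tconv_out_size(n, kernel_size=2 * s, padding=s // 2, stride=s))
# 10 -> 320
# 15 -> 480
```

Substituting $$p = s/2$$ and $$k = 2s$$ gives $$(n - 1)s - s + 2s = ns$$, so the output is exactly $$s$$ times larger than the input.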

A Fully Convolutional Network (FCN) is designed for dense prediction tasks such as semantic segmentation. The model first uses a standard Convolutional Neural Network (CNN) to extract image features. It then applies a $$1 \times 1$$ convolutional layer to transform the number of channels into the number of target classes. Finally, a transposed convolutional layer scales the height and width of the feature maps back to those of the original input image. The output therefore has the same spatial dimensions as the input, and the values along the channel dimension at each spatial position hold the class predictions for the corresponding input pixel.

Fully Convolutional Network (FCN) Architecture

A primary application of transposed convolutional layers is to reverse the change in spatial dimensions introduced by a standard convolutional layer. If an input tensor $$\mathsf{X}$$ is passed through a convolutional layer $$f$$ to produce an output $$\mathsf{Y} = f(\mathsf{X})$$, a transposed convolutional layer $$g$$ can be constructed to undo this shape change. When $$g$$ has the same spatial hyperparameters (kernel size, padding, and stride) as $$f$$ and its number of output channels equals the number of channels of $$\mathsf{X}$$, the tensor $$g(\mathsf{Y})$$ has the same shape as $$\mathsf{X}$$. Only the shape is recovered; the values of $$g(\mathsf{Y})$$ generally differ from those of $$\mathsf{X}$$. The property can be checked in PyTorch or MXNet by confirming that `tconv(conv(X)).shape == X.shape` when both layers share these hyperparameters.

```python
# PyTorch
import torch
from torch import nn

X = torch.rand(size=(1, 10, 16, 16))
conv = nn.Conv2d(10, 20, kernel_size=5, padding=2, stride=3)
tconv = nn.ConvTranspose2d(20, 10, kernel_size=5, padding=2, stride=3)
tconv(conv(X)).shape == X.shape  # True
```

```python
# MXNet
from mxnet import np, npx
from mxnet.gluon import nn

npx.set_np()

X = np.random.uniform(size=(1, 10, 16, 16))
conv = nn.Conv2D(20, kernel_size=5, padding=2, strides=3)
tconv = nn.Conv2DTranspose(10, kernel_size=5, padding=2, strides=3)
conv.initialize()
tconv.initialize()
tconv(conv(X)).shape == X.shape
```
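
The shape check can also be reproduced by hand with the standard output-size formulas, assuming no dilation and no output padding (the function names are illustrative). Because the convolution rounds down, the round trip is exact only when $$n + 2p - k$$ is divisible by the stride. That holds for the input size of 16 used above but not, for example, for 17.

```python
def conv_out_size(n, kernel_size, padding, stride):
    """Output length of a convolution (rounds down)."""
    return (n + 2 * padding - kernel_size) // stride + 1


def tconv_out_size(n, kernel_size, padding, stride):
    """Output length of a transposed convolution."""
    return (n - 1) * stride - 2 * padding + kernel_size


params = dict(kernel_size=5, padding=2, stride=3)
for n in (16, 17):
    m = conv_out_size(n, **params)
    print(n, "->", m, "->", tconv_out_size(m, **params))
# 16 -> 6 -> 16
# 17 -> 6 -> 16
```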

Transposed Convolution Shape Reversal

Dive into Deep Learning

When a pretrained convolutional neural network such as ResNet-18 is used for feature extraction in a Fully Convolutional Network (FCN), its final classification layers must be removed. Specifically, the global average pooling layer and the fully connected layer are discarded, because they collapse the spatial dimensions and are not needed for dense pixel-level prediction. The remaining layers form the feature extraction backbone of the FCN, which produces feature maps with reduced spatial dimensions. For example, given an input with a height of $$320$$ and a width of $$480$$, forward propagation reduces the height and width to $$1/32$$ of their original values, giving a $$10 \times 15$$ feature map. In PyTorch or MXNet, this can be implemented by selecting all layers except the final pooling and dense layers.

```python
# PyTorch
net = nn.Sequential(*list(pretrained_net.children())[:-2])
```

```python
# MXNet
net = nn.HybridSequential()
for layer in pretrained_net.features[:-2]:
    net.add(layer)
```
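
The factor of 32 comes from five stride-2 stages in the backbone. The sketch below traces a $$320 \times 480$$ input through them using the convolution output-size formula. The layer hyperparameters are those of the standard ResNet-18 design and are stated here as an assumption, since this section does not list them.

```python
def conv_out_size(n, kernel_size, padding, stride):
    """Output length of a convolution or pooling layer (rounds down)."""
    return (n + 2 * padding - kernel_size) // stride + 1


# (name, kernel_size, padding, stride) of the stride-2 layers in a standard ResNet-18
# (assumed; the remaining layers keep the spatial size unchanged).
downsampling_layers = [
    ("7x7 conv", 7, 3, 2),
    ("3x3 max pool", 3, 1, 2),
    ("first stride-2 residual stage", 3, 1, 2),
    ("second stride-2 residual stage", 3, 1, 2),
    ("third stride-2 residual stage", 3, 1, 2),
]

h, w = 320, 480
for name, k, p, s in downsampling_layers:
    h, w = conv_out_size(h, k, p, s), conv_out_size(w, k, p, s)
    print(f"{name}: {h} x {w}")
# The last line printed is "third stride-2 residual stage: 10 x 15".
```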

Feature Extraction in Fully Convolutional Networks

In a Fully Convolutional Network (FCN), after image features have been extracted by the backbone network, a $$1 \times 1$$ convolutional layer is applied. The purpose of this layer is to transform the number of output channels from the feature extractor to match the exact number of target classes (for example, $$21$$ classes for the Pascal VOC2012 dataset) without altering the spatial dimensions of the feature maps.

```python
# PyTorch
num_classes = 21
# 512 is the number of output channels of the ResNet-18 backbone
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
```

```python
# MXNet
num_classes = 21
net.add(nn.Conv2D(num_classes, kernel_size=1))
```
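
A $$1 \times 1$$ convolution applies the same linear map across channels at every pixel, which is why it changes the channel count but not the height or width. Here is a tiny framework-free sketch with the bias omitted; `conv1x1` is an illustrative name, and the sizes are toy values rather than 512 and 21.

```python
def conv1x1(x, weight):
    """x: [C_in][H][W] nested lists; weight: [C_out][C_in]. Returns [C_out][H][W]."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [[sum(w_row[c] * x[c][i][j] for c in range(c_in)) for j in range(width)]
         for i in range(height)]
        for w_row in weight
    ]


# 2 input channels on a 2 x 3 grid, mapped to 3 output channels.
x = [[[1, 2, 3], [4, 5, 6]],
     [[10, 20, 30], [40, 50, 60]]]
weight = [[1, 0], [0, 1], [1, 1]]
y = conv1x1(x, weight)
print(len(y), len(y[0]), len(y[0][0]))  # 3 2 3
print(y[2][1])  # [44, 55, 66]
```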

Channel Transformation in Fully Convolutional Networks

When evaluating the performance of a fully convolutional network in a semantic segmentation task, accuracy is determined at the pixel level rather than the image level. The accuracy calculation assesses the proportion of individual pixels across the entire input for which the model correctly predicted the semantic class. This means the overall metric reflects the correctness of the predicted classes for all pixels simultaneously, treating each spatial position as an independent classification instance.
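
As a small illustration, pixel accuracy can be computed by comparing the predicted class map with the label map position by position. The function name and toy data are made up for this sketch. In practice the predicted map would come from taking the argmax over the channel dimension of the network output.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class equals the ground-truth class."""
    correct = total = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            correct += int(p == t)
            total += 1
    return correct / total


pred = [[0, 1, 1],
        [2, 2, 0]]
target = [[0, 1, 2],
          [2, 2, 2]]
print(pixel_accuracy(pred, target))  # 4 of the 6 pixels match
```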

Accuracy Calculation in Fully Convolutional Networks

In training a fully convolutional network for semantic segmentation, the loss function is fundamentally similar to that used in standard image classification tasks. However, because the network predicts a distinct class for every individual pixel, these predictions are represented by the output channels of the final transposed convolutional layer. Consequently, when calculating the loss—typically a cross-entropy loss—the computation must explicitly specify the channel dimension to correctly evaluate the predicted probabilities against the ground-truth classes for each spatial position.

```python
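# PyTorch
import torch.nn.functional as F

def loss(inputs, targets):
    # inputs: (N, C, H, W) logits; targets: (N, H, W) class indices; returns one loss per image
    return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)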
```

```python
# MXNet
loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1)
```
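
To make the role of the channel dimension concrete, the following framework-free sketch computes the softmax cross-entropy over the channel axis at each pixel and then averages over height and width for one image. It is an illustration with made-up names, not the library implementation.

```python
import math


def pixelwise_cross_entropy(logits, target):
    """logits: [C][H][W] scores for one image; target: [H][W] class indices.

    Returns the cross-entropy averaged over all pixels of the image.
    """
    num_classes, height, width = len(logits), len(logits[0]), len(logits[0][0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            # Gather the scores along the channel axis for this pixel.
            scores = [logits[c][i][j] for c in range(num_classes)]
            m = max(scores)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
            total += log_sum_exp - scores[target[i][j]]
    return total / (height * width)


# With identical scores for all 3 classes, every pixel's loss is log(3).
logits = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
target = [[0, 1], [2, 0]]
print(pixelwise_cross_entropy(logits, target))
```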
