3.5 DenseNet
Densely Connected Convolutional Networks
Created Date: 2025-05-18
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with \(L\) layers have \(L\) connections, one between each layer and its subsequent layer, our network has \(L(L+1)/2\) direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers.
DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance.
3.5.1 Introduction
Convolutional neural networks (CNNs) have become the dominant machine learning approach for visual object recognition. Although they were originally introduced over 20 years ago, improvements in computer hardware and network structure have enabled the training of truly deep CNNs only recently.
The original LeNet5 consisted of 5 layers, VGG featured 19, and only last year did Highway Networks and Residual Networks (ResNets) surpass the 100-layer barrier.
As CNNs become increasingly deep, a new research problem emerges: as information about the input or gradient passes through many layers, it can vanish and "wash out" by the time it reaches the end (or beginning) of the network. Many recent publications address this or related problems. ResNets and Highway Networks bypass signal from one layer to the next via identity connections.
Stochastic depth shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. FractalNets repeatedly combine several parallel layer sequences with different numbers of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network.
Although these different approaches vary in network topology and training procedure, they all share a key characteristic: they create short paths from early layers to later layers.
3.5.2 Related Work
3.5.3 DenseNets
Consider a single image \(x_0\) that is passed through a convolutional network. The network comprises \(L\) layers, each of which implements a non-linear transformation \(H_\ell(\cdot)\), where \(\ell\) indexes the layer. \(H_\ell(\cdot)\) can be a composite function of operations such as Batch Normalization (BN), rectified linear units (ReLU), Pooling, or Convolution (Conv). We denote the output of the \(\ell^{th}\) layer as \(x_\ell\).
ResNets
Traditional convolutional feed-forward networks connect the output of the \(\ell^{th}\) layer as input to the \((\ell + 1)^{th}\) layer, which gives rise to the following layer transition: \(x_\ell = H_\ell(x_{\ell - 1})\). ResNets add a skip-connection that bypasses the non-linear transformations with an identity function:

\(x_\ell = H_\ell(x_{\ell - 1}) + x_{\ell - 1}\)   (1)
An advantage of ResNets is that the gradient can flow directly through the identity function from later layers to earlier layers. However, the identity function and the output of \(H_\ell\) are combined by summation, which may impede the information flow in the network.
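As a point of reference before the dense connectivity pattern, the two layer transitions discussed above can be written as a minimal sketch; the helper names `plain_transition` and `residual_transition` are illustrative and not part of the paper.

```python
def plain_transition(x_prev, h_l):
    """Traditional feed-forward transition: x_l = H_l(x_{l-1})."""
    return h_l(x_prev)


def residual_transition(x_prev, h_l):
    """ResNet transition of eq. (1): the identity shortcut is combined with
    H_l's output by element-wise summation (shapes are assumed to match)."""
    return h_l(x_prev) + x_prev
```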
Dense connectivity
To further improve the information flow between layers we propose a different connectivity pattern: we introduce direct connections from any layer to all subsequent layers. Figure 1 illustrates the layout of the resulting DenseNet schematically. Consequently, the \(\ell^{th}\) layer receives the feature-maps of all preceding layers, \(x_0, \ldots, x_{\ell - 1}\), as input:

\(x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell - 1}])\)   (2)
where \([x_0, x_1, \ldots, x_{\ell - 1}]\) refers to the concatenation of the feature-maps produced in layers \(0, \ldots, \ell - 1\). Because of its dense connectivity we refer to this network architecture as Dense Convolutional Network (DenseNet). For ease of implementation, we concatenate the multiple inputs of \(H_\ell(\cdot)\) in eq. (2) into a single tensor.
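A minimal sketch of eq. (2), assuming PyTorch purely for the tensor concatenation; `dense_forward` and its arguments are illustrative names, and each \(H_\ell\) is assumed to preserve the spatial size of its input.

```python
import torch

def dense_forward(x0, layers):
    """Apply a dense block: layer l receives [x_0, ..., x_{l-1}] as input.

    `layers` is any sequence of callables H_1, ..., H_L mapping a tensor to a
    new feature-map with the same spatial size (so concatenation stays valid).
    """
    features = [x0]
    for h_l in layers:
        # Eq. (2): concatenate all preceding feature-maps along the channel
        # dimension into a single tensor, then apply H_l.
        x_l = h_l(torch.cat(features, dim=1))
        features.append(x_l)
    # The block's output is the concatenation of the input and every layer's output.
    return torch.cat(features, dim=1)
```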
Composite function
Motivated by [12], we define \(H_\ell(\cdot)\) as a composite function of three consecutive operations: batch normalization (BN) [14], followed by a rectified linear unit (ReLU) [6] and a 3 × 3 convolution (Conv).
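A sketch of this composite function, again assuming PyTorch; `composite_function` is an illustrative helper name, and the padding is an implementation choice assumed here so that the 3 × 3 convolution preserves spatial size, which the concatenation in eq. (2) requires.

```python
import torch.nn as nn

def composite_function(in_channels: int, out_channels: int) -> nn.Sequential:
    """H_l as defined above: BN -> ReLU -> 3x3 convolution."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        # padding=1 keeps the feature-map size unchanged so that outputs
        # can later be concatenated with the inputs.
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
    )
```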
Pooling layers
The concatenation operation used in Eq. (2) is not viable when the size of feature-maps changes. However, an essential part of convolutional networks is down-sampling layers that change the size of feature-maps. To facilitate down-sampling in our architecture, we divide the network into multiple densely connected dense blocks; see Figure 2. We refer to the layers between blocks as transition layers, which perform convolution and pooling. The transition layers used in our experiments consist of a batch normalization layer and a \(1 \times 1\) convolutional layer followed by a \(2 \times 2\) average pooling layer.
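A sketch of such a transition layer under the same PyTorch assumption; `transition_layer` is an illustrative name, and the choice of output channel count is left to the caller.

```python
import torch.nn as nn

def transition_layer(in_channels: int, out_channels: int) -> nn.Sequential:
    """Transition layer as described: BN, a 1x1 convolution, then 2x2 average
    pooling, which halves the spatial resolution between dense blocks."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2),
    )
```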
Growth rate
Bottleneck layers
Compression
3.5.4 Experiments
3.5.5 Discussion
3.5.6 Conclusion
We proposed a new convolutional network architecture, which we refer to as Dense Convolutional Network (DenseNet). It introduces direct connections between any two layers with the same feature-map size. We showed that DenseNets scale naturally to hundreds of layers, while exhibiting no optimization difficulties.
In our experiments, DenseNets tend to yield consistent improvements in accuracy with a growing number of parameters, without any signs of performance degradation or overfitting. Under multiple settings, they achieved state-of-the-art results across several highly competitive datasets.
Moreover, DenseNets require substantially fewer parameters and less computation to achieve state-of-the-art performances. Because we adopted hyperparameter settings optimized for residual networks in our study, we believe that further gains in accuracy of DenseNets may be obtained by more detailed tuning of hyperparameters and learning rate schedules.
Whilst following a simple connectivity rule, DenseNets naturally integrate the properties of identity mappings, deep supervision, and diversified depth. They allow feature reuse throughout the networks and can consequently learn more compact and, according to our experiments, more accurate models.
Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features, e.g., [4, 5]. We plan to study such feature transfer with DenseNets in future work.