[Reading] Convolutional neural networks

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, May 2017.
AlexNet
This is the classic paradigm-shift AlexNet paper. AlexNet was the first large CNN to beat traditional computer vision approaches and win the ImageNet competition, in 2012.

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” arXiv:1506.02640 [cs], Jun. 2015.
YOLO, object detection
The YOLO algorithm is an end-to-end network that receives images as input and outputs bounding box coordinates and class probabilities. Unlike previous sliding-window and region-proposal-based techniques, YOLO sees the entire image, so it is less likely to predict false positives on the background.
YOLO divides the input image into an $S \times S$ grid. A grid cell is responsible for detecting an object if the object's center point falls into that cell. Each grid cell predicts a fixed number of bounding boxes and a confidence score for each box. The bounding box coordinates are normalized so that they fall between 0 and 1. The network architecture is inspired by GoogLeNet and consists of 24 convolutional layers followed by 2 fully connected layers. The loss function is composed of classification and localization errors with different weights. The results show that the unified architecture is fast enough to process video in real time, but it struggles to precisely localize small objects.
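The grid assignment and output layout described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the defaults $S = 7$, $B = 2$, $C = 20$ are the PASCAL VOC settings reported in the YOLO paper rather than values stated in these notes.

```python
def responsible_cell(cx, cy, S=7):
    """Return (row, col) of the grid cell that contains a normalized box center."""
    # cx, cy lie in [0, 1]; clamp so a center exactly at 1.0 stays in the last cell.
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    return row, col


def output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)


print(responsible_cell(0.5, 0.5))  # (3, 3): the center cell of a 7 x 7 grid
print(output_shape())              # (7, 7, 30)
```

Only the cell returned by `responsible_cell` is trained to predict the object, which is one reason several small objects that share a cell are hard to localize.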

M. Haris, G. Shakhnarovich, and N. Ukita, “Deep Back-Projection Networks For Super-Resolution,” arXiv:1803.02735 [cs], Mar. 2018.
Super resolution
The image super-resolution (SR) task is to recover a high-resolution (HR) image from a low-resolution (LR) image. The current approach is to construct an HR image by learning a nonlinear LR-to-HR mapping, implemented as a deep neural network. Unlike previous methods, which predict the SR image in a feed-forward manner, the authors propose Deep Back-Projection Networks, which directly increase the SR features through multiple up- and down-sampling stages and feed the projection errors at each depth of the network back to revise the sampling results. The model then accumulates the self-correcting features from each up-sampling stage to create the SR image. The results show that the proposed network is effective compared with other state-of-the-art methods.

X. Wang, K. Yu, C. Dong, and C. C. Loy, “Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform,” arXiv:1804.02815 [cs], Apr. 2018.
Super resolution
Similar to the above work, this is another image super-resolution work from CVPR 2018. It aims to recover textures that are faithful to semantic classes. The categorical prior, which characterizes the semantic class of a region in an image (e.g., sky, building, plant), is crucial for constraining the plausible solution space in SR. The authors use semantic segmentation maps as the categorical prior. To condition the network on semantic segmentation probability maps, the authors apply conditional normalization, which uses a learned function of some conditions to replace the parameters of the feature-wise affine transformation in batch normalization. To do that, they design a Spatial Feature Transform (SFT) layer. The prior is modeled by a pair of affine transformation parameters $(\gamma, \beta)$, and the transformation is carried out by scaling the feature maps of a specific layer by $\gamma$ and shifting them by $\beta$, element-wise. The SFT layer is modular and can be inserted between network layers. The overall network architecture is a GAN, trained with a perceptual loss and an adversarial loss. The experiments show that segmentation maps encapsulate a rich categorical prior down to the pixel level.
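The affine modulation performed by an SFT layer can be sketched in NumPy as below. This is a rough illustration under stated assumptions: the learned condition network that maps segmentation probabilities to $(\gamma, \beta)$ is replaced by a fixed random 1 x 1 projection, and all names and sizes (`sft_layer`, `W_gamma`, `W_beta`, `K`) are hypothetical.

```python
import numpy as np


def sft_layer(features, gamma, beta):
    """Element-wise affine modulation of feature maps: gamma * F + beta."""
    # features, gamma and beta all have shape (C, H, W). gamma and beta vary
    # per pixel because they are derived from the segmentation probability maps.
    return gamma * features + beta


rng = np.random.default_rng(0)
C, H, W, K = 4, 8, 8, 3  # K = number of semantic classes (hypothetical sizes)
features = rng.standard_normal((C, H, W))
seg_probs = rng.random((K, H, W))
seg_probs /= seg_probs.sum(axis=0, keepdims=True)  # per-pixel class probabilities

# Stand-in for the learned condition network (an assumption of this sketch):
# a fixed random 1x1 projection from K class probabilities to C channels.
W_gamma = rng.standard_normal((C, K))
W_beta = rng.standard_normal((C, K))
gamma = np.einsum("ck,khw->chw", W_gamma, seg_probs)
beta = np.einsum("ck,khw->chw", W_beta, seg_probs)

out = sft_layer(features, gamma, beta)
print(out.shape)  # (4, 8, 8)
```

Because $\gamma$ and $\beta$ have the same spatial size as the features, two regions with different semantic classes are modulated differently within the same image.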

A. Dosovitskiy et al., “FlowNet: Learning Optical Flow with Convolutional Networks,” 2015, pp. 2758–2766.
FlowNet 1.0; optical flow
This work presents an end-to-end learning approach to estimating optical flow with a CNN: given a dataset consisting of image pairs and ground-truth flow fields, a network is trained to predict the $x$-$y$ flow fields directly from the images. Optical flow estimation requires precise per-pixel localization as well as finding correspondences between the two input images. This paper is FlowNet 1.0. It proposes and compares two architectures. FlowNetSimple (FlowNetS) is a generic CNN: the authors simply stack two sequentially adjacent input images together and feed them through the network. FlowNetCorr (FlowNetC) first produces representations of the two images separately, then combines them in a “correlation layer”, and then learns the higher-level representation jointly. The correlation layer performs multiplicative patch comparisons between two feature maps. It is similar to a convolution, except that a square patch of one feature map is convolved with patches of the other feature map rather than with a learned filter.
Since the image resolution is reduced after a series of convolution and pooling layers, the final refinement part of the network refines the coarse feature maps by unpooling and up-convolution. The resulting feature maps are then concatenated with the corresponding feature maps from the contractive part of the network and with an upsampled coarse flow prediction. This “down- and up-sampling” design with “skip connections” preserves both the high-level information passed from coarser feature maps and the fine local information provided by lower-layer feature maps. To train the network, the authors developed the synthetic Flying Chairs dataset, and the model achieves competitive accuracy at a frame rate of 5 to 10 fps.
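The multiplicative comparison performed by the FlowNetC correlation layer can be sketched in NumPy as below. This is a simplified illustration, not the paper's implementation: it compares single-pixel (1 x 1) patches only, uses stride 1, and the function name `correlation` and the parameter `max_disp` are hypothetical.

```python
import numpy as np


def correlation(f1, f2, max_disp=1):
    """Correlate each location of f1 with displaced locations of f2.

    f1, f2: feature maps of shape (C, H, W).
    Returns shape ((2 * max_disp + 1) ** 2, H, W), one channel per displacement.
    """
    C, H, W = f1.shape
    d = max_disp
    # Zero-pad f2 spatially so displaced lookups near the border are defined.
    f2_pad = np.pad(f2, ((0, 0), (d, d), (d, d)))
    out = []
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            # shifted[:, y, x] equals f2[:, y + dy, x + dx] (zero outside the map).
            shifted = f2_pad[:, d + dy:d + dy + H, d + dx:d + dx + W]
            # Dot product over channels: multiplicative comparison of 1x1 patches.
            out.append((f1 * shifted).sum(axis=0))
    return np.stack(out)


rng = np.random.default_rng(0)
f1 = rng.standard_normal((3, 5, 5))
f2 = np.roll(f1, shift=1, axis=2)  # f2 is f1 moved one pixel to the right
corr = correlation(f1, f2, max_disp=1)
print(corr.shape)  # (9, 5, 5)
```

The layer has no trainable weights; it only rearranges the matching problem so that later layers see one similarity score per candidate displacement.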

E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks,” arXiv:1612.01925 [cs], Dec. 2016.
FlowNet 2.0; optical flow
FlowNet 1.0 suffers from problems with small displacements and noisy artifacts in the estimated flow fields. This paper builds upon FlowNet 1.0 and presents three improvements: 1) The schedule of presenting data during training matters and can help to decrease the error. 2) Since all state-of-the-art optical flow approaches rely on iterative methods, the authors hypothesize that deep networks may also benefit from iterative refinement. To test this, they experiment with stacking multiple FlowNetS and FlowNetC networks, where each subsequent network receives the two adjacent images and the previous flow estimate as input. They use curriculum learning to train the stacked networks. 3) The authors introduce a subnetwork that specializes in small motions. The experiments show that FlowNet 2.0 performs on par with state-of-the-art methods, and that faster variants compute optical flow at up to 140 fps with accuracy matching FlowNet 1.0.
Notes: from the two works above, I got to see how deep learning approaches replace traditional CV techniques. Optical flow estimation, in my understanding, is usually done by identifying key points in the image and then tracking their motion. Instead of hand-engineering features, the authors hand-design deep network architectures that enable the model to learn the features automatically. Here I saw a paradigm shift from hand-designing algorithms that extract features directly to hand-designing architectures that learn features automatically.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Spatiotemporal Features with 3D Convolutional Networks,” arXiv:1412.0767 [cs], Dec. 2014.
3D CNN
This work develops deep 3-dimensional convolutional networks (3D ConvNets) to understand and analyze video data. The deep learning breakthroughs in the image domain focus on spatial modeling of 2D images with CNNs. However, such models are not directly suitable for videos because they lack temporal motion modeling. This work exploits 3D CNNs in the context of large-scale supervised training datasets and modern deep architectures to learn spatiotemporal features, such as objects, scenes, and actions, from videos. The authors experiment with different sizes of 3D kernels and empirically find that a 3 x 3 x 3 convolutional kernel for all layers works best among the explored architectures. The experimental results show that the 3D CNN can act as a good feature extractor and that it outperforms or is comparable with state-of-the-art methods on several benchmarks.
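To make the 3 x 3 x 3 kernel concrete, here is a naive single-channel 3D convolution in NumPy. It is only a sketch: the function name `conv3d_valid`, the clip size, and the averaging kernel are hypothetical, and the loop uses no padding and stride 1, which are simplifications of this example rather than the paper's settings.

```python
import numpy as np


def conv3d_valid(video, kernel):
    """Naive single-channel 3D convolution (cross-correlation, no padding, stride 1)."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # The kernel spans kt consecutive frames as well as a kh x kw window,
                # so each output value mixes spatial and temporal information.
                out[t, y, x] = np.sum(video[t:t + kt, y:y + kh, x:x + kw] * kernel)
    return out


clip = np.ones((16, 8, 8))         # 16 frames of 8 x 8 pixels (hypothetical sizes)
kernel = np.ones((3, 3, 3)) / 27   # 3 x 3 x 3 averaging kernel
out = conv3d_valid(clip, kernel)
print(out.shape)  # (14, 6, 6)
```

Unlike a 2D convolution applied to each frame, the output keeps a time axis, so temporal structure can still be modeled by later layers.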

X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal Unsupervised Image-to-Image Translation,” arXiv:1804.04732 [cs, stat], Apr. 2018.
GAN, image-to-image translation, style transfer
This paper provides a novel and impressive framework for unsupervised image-to-image translation. The task is as follows: given an image in the source domain, the model learns the conditional distribution of corresponding images in the target domain, without seeing any examples of corresponding image pairs. To do so, the authors assume that the image representation can be decomposed into a content code that is domain-invariant and shared across domains, and a style code that captures domain-specific properties. An image is translated by recombining its content code with a style code sampled from the style space of the target domain. The loss function consists of bidirectional reconstruction losses, which enforce reconstruction of both the image and the latent codes, and an adversarial loss.

W. Liu et al., “Decoupled Networks,” arXiv:1804.08071 [cs, stat], Apr. 2018.
decoupled convolution
This paper hacks the convolution operation in CNNs. A convolution is a linear operation: the dot product of $x$ and $w$. Note that $w \cdot x = \Vert w \Vert \Vert x \Vert \cos \theta$, where the magnitude term $\Vert w \Vert \Vert x \Vert$ reflects the within-class variation and the angular term $\cos \theta$ reflects the between-class variation. Based on this observation, the authors present the decoupled convolution operation, which replaces $w \cdot x$ with the product of a magnitude function and an angular function. The aim is to better model the between-class variation so as to make different classes more separable in high-dimensional space. Current CNNs do not address this, since the inner product mixes the within-class and between-class variation.
Previous work proposed a deep hyperspherical learning network, which sets the magnitude function to 1 so that all activation outputs depend only on the angular function. This is suboptimal in some cases, since the magnitude function is restricted to 1.
To better model the within-class variation, the authors devise two kinds of decoupled convolution operations: bounded (hyperspherical, hyperball, and hyperbolic tangent) and unbounded (linear, segmented, logarithm, and mixed) convolution. The parameters of the nonlinear operation are learnable through backpropagation. The experiments show that the model achieves comparable performance and is more robust to adversarial attacks. Because the operator is itself nonlinear, it may also make the ReLU function unnecessary. However, the nonlinearity also makes it more computationally expensive, and it may need better weight initialization, for example from pretrained weights.
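The decoupling of magnitude and angle can be sketched for a single weight vector and input patch as follows. This is a simplified illustration, not the paper's operators: the names `decoupled_response`, `standard`, `sphere`, and `tanh_bounded` are hypothetical, and the magnitude functions omit the learnable scale parameters.

```python
import numpy as np


def decoupled_response(w, x, magnitude_fn, angular_fn=np.cos, eps=1e-12):
    """Compute h(||w||, ||x||) * g(theta) for one weight vector and one input patch."""
    norm_w, norm_x = np.linalg.norm(w), np.linalg.norm(x)
    cos_theta = np.dot(w, x) / (norm_w * norm_x + eps)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return magnitude_fn(norm_w, norm_x) * angular_fn(theta)


# Magnitude functions (simplified stand-ins without learnable scale parameters).
standard = lambda nw, nx: nw * nx          # recovers the ordinary inner product
sphere = lambda nw, nx: 1.0                # hyperspherical: magnitude fixed to 1
tanh_bounded = lambda nw, nx: np.tanh(nx)  # bounded, depends on ||x|| only

w = np.array([3.0, 4.0])
x = np.array([1.0, 0.0])
print(decoupled_response(w, x, standard))      # approximately 3.0, same as w . x
print(decoupled_response(w, x, sphere))        # approximately 0.6, i.e. cos(theta)
print(decoupled_response(w, x, tanh_bounded))  # tanh(1) * cos(theta)
```

Swapping `magnitude_fn` changes how strongly the input norm influences the response while leaving the angular term untouched, which is the separation the paper argues for.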