A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, May 2017.
This is the classic, paradigm-shifting AlexNet paper. It presents the first large CNN that beat traditional computer vision approaches and won the ImageNet competition in 2012.
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” arXiv:1506.02640 [cs], Jun. 2015.
YOLO, object detection
The YOLO algorithm is an end-to-end network that receives images as input and outputs bounding box coordinates and class probabilities. Unlike previous sliding-window and region-proposal-based techniques, YOLO sees the entire image, so it is less likely to predict false positives on the background.
YOLO divides the input image into an $S \times S$ grid. A grid cell is responsible for detecting an object if the object's center point falls into that cell. Each grid cell predicts a fixed number of bounding boxes and confidence scores for these boxes. The bounding box coordinates are normalized so that they fall between 0 and 1. The network architecture is similar to GoogLeNet and consists of 24 convolutional layers followed by 2 fully connected layers. The loss function is composed of classification and localization errors with different weights. The results show that the unified architecture is fast enough to process video in real time but struggles to precisely localize small objects.
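The grid assignment and box normalization described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the default `S=7` and the choice to express the center relative to the responsible cell and the size relative to the whole image are assumptions about one common encoding.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the box center (pixels)."""
    # Clamp so a center lying exactly on the right/bottom border stays in range.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a pixel-space box as (row, col, (x, y, w, h)) with values in [0, 1]."""
    row, col = responsible_cell(cx, cy, img_w, img_h, S)
    # Center offset measured from the top-left corner of the responsible cell.
    x = cx / img_w * S - col
    y = cy / img_h * S - row
    # Width and height measured as fractions of the whole image.
    return row, col, (x, y, w / img_w, h / img_h)


if __name__ == "__main__":
    # A 112 x 56 box centered in a 448 x 448 image falls into the middle cell.
    print(encode_box(224, 224, 112, 56, 448, 448))
```

Only the cell returned by `responsible_cell` would receive a localization target for this object; all other cells treat it as background.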
M. Haris, G. Shakhnarovich, and N. Ukita, “Deep Back-Projection Networks For Super-Resolution,” arXiv:1803.02735 [cs], Mar. 2018.
The image super-resolution (SR) task is to recover a high-resolution (HR) image from a low-resolution (LR) image. The current approach is to construct an HR image by learning a non-linear LR-to-HR mapping, implemented as a deep neural network. Unlike previous methods, which predict the SR image in a feed-forward manner, the authors propose Deep Back-Projection Networks, which directly increase the SR features using multiple up- and down-sampling stages and feed the errors at each depth of the network back to revise the sampling results. The model then accumulates the self-correcting features from each upsampling stage to create the SR image. The results show that the proposed network is effective compared to other state-of-the-art methods.
X. Wang, K. Yu, C. Dong, and C. C. Loy, “Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform,” arXiv:1804.02815 [cs], Apr. 2018.
Similar to the above work, this is another image super-resolution work from CVPR 2018. It aims to recover textures that are faithful to semantic classes. The categorical prior, which characterizes the semantic class of a region in an image (e.g., sky, building, plant), is crucial for constraining the plausible solution space in SR. The authors use semantic segmentation maps as the categorical prior. To condition the network on the segmentation probability maps, the authors build on Conditional Normalization, which uses a learned function of some conditions to replace the parameters of the feature-wise affine transformation in batch normalization. To do that, they design a Spatial Feature Transform (SFT) layer. The prior is modeled by a pair of affine transformation parameters $(\gamma, \beta)$, and the transformation is carried out by applying this affine transformation to the feature maps of a specific layer. The SFT layer is modular and can be inserted between network layers. The overall network architecture is a GAN, trained with a perceptual loss and an adversarial loss. The experiments show that segmentation maps encapsulate a rich categorical prior down to the pixel level.
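The affine modulation at the heart of the SFT layer can be sketched with NumPy as below. This is a simplified sketch, not the authors' implementation: it assumes $(\gamma, \beta)$ are produced from the segmentation probability maps by a single 1x1 convolution each, and the names `sft_layer`, `w_gamma`, and `w_beta` are hypothetical.

```python
import numpy as np


def sft_layer(features, prior, w_gamma, w_beta):
    """Apply a spatially varying affine transform conditioned on a prior.

    features: (C, H, W) feature maps of some layer
    prior:    (K, H, W) segmentation probability maps for K classes
    w_gamma, w_beta: (C, K) hypothetical 1x1-convolution weights
    """
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    gamma = np.einsum("ck,khw->chw", w_gamma, prior)
    beta = np.einsum("ck,khw->chw", w_beta, prior)
    # Feature-wise affine transformation; gamma and beta differ per pixel.
    return gamma * features + beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 8, 8))
    seg = rng.dirichlet(np.ones(3), size=(8, 8)).transpose(2, 0, 1)  # (3, 8, 8)
    out = sft_layer(feats, seg, rng.normal(size=(4, 3)), rng.normal(size=(4, 3)))
    print(out.shape)
```

Because `gamma` and `beta` have the same spatial size as the features, two regions with different semantic classes are modulated differently even within the same image.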
A. Dosovitskiy et al., “FlowNet: Learning Optical Flow with Convolutional Networks,” 2015, pp. 2758–2766.
FlowNet 1.0; optical flow
This work presents an end-to-end learning approach to estimating optical flow with a CNN: given a dataset consisting of image pairs and ground-truth flow fields, a network is trained to predict the $x$-$y$ flow fields directly from the images. Optical flow estimation requires precise per-pixel localization and finding correspondences between the two input images. This paper introduces FlowNet 1.0. It proposes and compares two architectures. FlowNetSimple (FlowNetS) is a generic CNN: the authors simply stack two sequentially adjacent input images together and feed them through the network. FlowNetCorr (FlowNetC) first produces representations of the two images separately, then combines them in the “correlation layer”, and then learns higher-level representations jointly. The correlation layer performs multiplicative patch comparisons between two feature maps: it takes a square patch from one feature map and convolves it with the other feature map.
Since the image resolution is reduced after a series of convolution and pooling layers, the final refinement part of the network refines the coarse feature maps by unpooling and upconvolution. The resulting feature maps are then concatenated with the corresponding feature maps from the contractive part of the network and with an upsampled coarse flow prediction. This “down- and up-sample” design with skip connections preserves both the high-level information passed from the coarser feature maps and the fine local information provided by the lower-layer feature maps. To train the network, the authors developed the synthetic Flying Chairs dataset, and the model achieves competitive accuracy at a frame rate of 5 to 10 fps.
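The multiplicative comparison performed by the correlation layer can be illustrated with the NumPy sketch below. It is a simplification under stated assumptions: the patch is reduced to a single pixel (a 1 x 1 patch), the search is limited to a hypothetical `max_disp` neighborhood, and the function name and output layout are illustrative rather than taken from the paper.

```python
import numpy as np


def correlation(f1, f2, max_disp=1):
    """Correlate two (C, H, W) feature maps over a small displacement window.

    Returns an array of shape ((2d+1)^2, H, W). Channel i holds the dot product
    between f1 at (y, x) and f2 at (y + dy, x + dx), with (dy, dx) enumerated
    row by row from (-d, -d) to (d, d).
    """
    _, height, width = f1.shape
    d = max_disp
    # Zero-pad f2 spatially so that shifted lookups stay in bounds.
    f2_padded = np.pad(f2, ((0, 0), (d, d), (d, d)))
    out = np.empty(((2 * d + 1) ** 2, height, width), dtype=f1.dtype)
    i = 0
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = f2_padded[:, d + dy:d + dy + height, d + dx:d + dx + width]
            # Dot product over the channel axis at every pixel.
            out[i] = (f1 * shifted).sum(axis=0)
            i += 1
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(8, 16, 16))
    print(correlation(a, a, max_disp=2).shape)
```

A large response in a given channel suggests that the content at a pixel in the first image reappears at the corresponding displacement in the second image, which is the matching cue the later layers can build on.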
E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks,” arXiv:1612.01925 [cs], Dec. 2016.
FlowNet 2.0; optical flow
FlowNet 1.0 has problems with small displacements and noisy artifacts in the estimated flow fields. This paper builds upon FlowNet 1.0 and presents three improvements: 1) The schedule in which data is presented during training can help decrease the error. 2) Since all state-of-the-art optical flow approaches rely on iterative methods, the authors hypothesize that deep networks may also benefit from iterative refinement. To test this, they experiment with stacking multiple FlowNetS and FlowNetC networks, where each subsequent network receives the two adjacent images and the previous flow estimate as input. They use curriculum learning to train the stacked networks. 3) The authors introduce a sub-network specializing in small motions. The experiments show that the model performs on par with state-of-the-art methods, and that faster variants compute optical flow at up to 140 fps with accuracy matching FlowNet 1.0.
Notes: From the two works above, I learned how deep learning approaches are replacing traditional CV techniques. In my understanding, optical flow estimation is usually done by identifying key points in the image and then tracking their motion. Instead of hand-engineering features, the authors hand-design deep network architectures that enable the model to learn the features automatically. Here I see a paradigm shift from hand-designing algorithms that extract features directly to hand-designing architectures that learn features automatically.
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Spatiotemporal Features with 3D Convolutional Networks,” arXiv:1412.0767 [cs], Dec. 2014.
This work develops deep 3-dimensional convolutional networks (3D ConvNets) to understand and analyze video data. The deep learning breakthroughs in the image domain focus on spatial modeling of 2D images with CNNs. However, such models are not directly suitable for videos because they lack temporal motion modeling. This work exploits 3D CNNs in the context of large-scale supervised training datasets and modern deep architectures to learn spatio-temporal features such as objects, scenes, and actions in videos. The authors experiment with different sizes of 3D kernels and empirically find that a $3 \times 3 \times 3$ convolutional kernel for all layers works best among the explored architectures. The experimental results show that the 3D CNN acts as a good feature extractor and outperforms or is comparable to state-of-the-art methods on several benchmarks.
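To make the $3 \times 3 \times 3$ kernel concrete, the sketch below computes the output size and weight count of a single 3D convolution using the standard convolution arithmetic. The helper names are hypothetical, and the stride of 1, padding of 1, and the 16-frame 112 x 112 clip in the usage lines are assumptions chosen for illustration.

```python
def conv3d_output_shape(in_shape, kernel=(3, 3, 3), stride=(1, 1, 1),
                        padding=(1, 1, 1)):
    """Output (frames, height, width) of a 3D convolution."""
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(in_shape, kernel, stride, padding)
    )


def conv3d_weight_count(c_in, c_out, kernel=(3, 3, 3)):
    """Number of weights (biases excluded) in a 3D convolution layer."""
    k_t, k_h, k_w = kernel
    return c_in * c_out * k_t * k_h * k_w


if __name__ == "__main__":
    # With padding 1 the temporal and spatial extents are preserved.
    print(conv3d_output_shape((16, 112, 112)))
    # Each filter spans 3 frames, so it has 3x the weights of a 3 x 3 2D filter.
    print(conv3d_weight_count(c_in=3, c_out=64))
```

Unlike a 2D convolution applied frame by frame, the output here still has a time axis, so temporal information can be carried through the stacked layers.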
X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal Unsupervised Image-to-Image Translation,” arXiv:1804.04732 [cs, stat], Apr. 2018.
GAN, image-to-image translation, style transfer
This paper provides a novel and impressive framework for unsupervised image-to-image translation. The task is as follows: given an image in the source domain, the model learns the conditional distribution of corresponding images in the target domain without seeing any examples of corresponding image pairs. To do so, the authors assume that the image representation can be decomposed into a content code that is domain-invariant and shared across domains, and a style code that captures domain-specific properties. An image is translated by recombining its content code with a style code sampled from the style space of the target domain. The loss function consists of bidirectional reconstruction losses, which enforce reconstruction of both the image and the latent codes, and an adversarial loss.
W. Liu et al., “Decoupled Networks,” arXiv:1804.08071 [cs, stat], Apr. 2018.
This paper hacks the convolution operation in CNNs. The convolution is a linear operation: the dot product of $x$ and $w$. This dot product can be written as $w \cdot x = \Vert w \Vert \, \Vert x \Vert \cos\theta$, where $\Vert w \Vert \, \Vert x \Vert$ reflects the within-class variation and $\cos\theta$ reflects the between-class variation. Based on this observation, the authors present the decoupled convolution operation, which replaces $w \cdot x$ with the product of a magnitude function and an angular function. The aim is to better model the between-class variation so as to make different classes more separable in high-dimensional space. Current CNNs do not address this, since the inner product mixes the within-class and between-class variation.
Previous work proposed a deep hyperspherical learning network, which directly sets the magnitude function to 1 so that all activation outputs depend only on the angular function. This is sub-optimal in some cases, since the magnitude function is restricted to 1. To better model the within-class variation, the authors devise two kinds of decoupled convolution operations: bounded (hyperspherical, hyperball, and hyperbolic tangent) and unbounded (linear, segmented, logarithm, and mixed) convolution. The parameters of the non-linear operations are learnable through backpropagation. The experiments show that the model achieves comparable performance and is robust to adversarial attacks. Because the operators are themselves non-linear, they may also make the ReLU function unnecessary. However, the non-linearity makes them more computationally expensive, and they may need better weight initialization, for example from pre-trained weights.
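The decoupling of magnitude and angle can be sketched in plain Python as follows. This is a sketch under assumptions, not the paper's operators: the function names are hypothetical, the inputs are assumed to be nonzero vectors, and the specific magnitude functions (a constant 1, $\tanh \Vert x \Vert$, and $\Vert x \Vert$) are simplified stand-ins for the hyperspherical, hyperbolic tangent, and linear variants, without any learnable scale or radius parameters.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def angle(w, x):
    """Angle between two nonzero vectors, in radians."""
    cos_theta = sum(a * b for a, b in zip(w, x)) / (norm(w) * norm(x))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def decoupled(w, x, magnitude, angular=math.cos):
    """Decoupled operator: magnitude(||w||, ||x||) * angular(theta)."""
    return magnitude(norm(w), norm(x)) * angular(angle(w, x))


def inner_product_magnitude(nw, nx):
    return nw * nx          # recovers the ordinary dot product


def hyperspherical_magnitude(nw, nx):
    return 1.0              # bounded: output depends on the angle only


def tanh_magnitude(nw, nx):
    return math.tanh(nx)    # bounded: saturates for large ||x||


def linear_magnitude(nw, nx):
    return nx               # unbounded: grows with ||x||, ignores ||w||


if __name__ == "__main__":
    w, x = (3.0, 4.0), (4.0, 3.0)
    for h in (inner_product_magnitude, hyperspherical_magnitude,
              tanh_magnitude, linear_magnitude):
        print(h.__name__, decoupled(w, x, h))
```

Passing a different `angular` function in place of `math.cos` changes how the angle is mapped to a response, which is the second degree of freedom the decoupled formulation exposes.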