[Reading] CNN architectures in medical imaging analysis

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. ArXiv:1505.04597 [Cs]. Retrieved from http://arxiv.org/abs/1505.04597
U-Net; segmentation
The task of medical image localization is to identify the location of lesions, i.e., to classify each pixel as lesion or not. The previous sliding-window approach is computationally slow and redundant, because it divides the whole image into small patches and runs each patch through the CNN separately. It also throws away context information. U-Net builds upon the fully convolutional network (fCNN), which outputs a segmentation map for the whole image. The architecture consists of a contracting path (downsampling) to capture context and a symmetric expanding path (upsampling) for precise localization, which yields a U-shaped architecture. Through skip connections, each upsampling stage also concatenates the corresponding feature maps from the downsampling path. The results show that it performs significantly better than the sliding-window convolutional network on several biomedical segmentation tasks.
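The contracting/expanding structure and the skip connection can be illustrated at the level of tensor shapes. This is only a rough sketch: the function names are hypothetical, the convolutions are left out entirely, and nearest-neighbour upsampling stands in for the learned up-convolution.

```python
import numpy as np

def max_pool2x2(x):
    """Downsample a (C, H, W) feature map by 2 with max pooling (H, W even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbour 2x upsampling; an assumed stand-in for up-convolution."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def skip_connect(encoder_feat, decoder_feat):
    """Upsample the decoder features and concatenate the encoder features
    of matching resolution along the channel axis."""
    return np.concatenate([encoder_feat, upsample2x(decoder_feat)], axis=0)

rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 8, 8))       # features on the contracting path
bottom = max_pool2x2(enc)              # (4, 4, 4): coarser, more context
merged = skip_connect(enc, bottom)     # (8, 8, 8): context plus fine detail
print(bottom.shape, merged.shape)
```

The concatenated tensor has twice the channels, which is why the expanding path can combine coarse context with the high-resolution features it needs for precise localization.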

F. Milletari, N. Navab, and S.A. Ahmadi, “V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation,” arXiv:1606.04797 [cs], Jun. 2016.
V-Net; 3D volumetric segmentation
This work aims to build an end-to-end fully convolutional network for 3D volumetric medical image segmentation. The previous patch-wise classification approach only considered local context and discarded global information. Unlike previous approaches that process 3D input volumes slice by slice, the authors propose to use volumetric convolutions.
The overall architecture is similar to U-Net and consists of a compression path and a decompression path. The 2D convolutions are replaced by 3D convolutions using volumetric kernels. The authors do not use any pooling layers; instead, they use convolutions over non-overlapping 2 x 2 x 2 volume patches, so that the size of the resulting feature maps is halved. The rationale is to reduce the memory footprint during training. The authors also propose a novel objective function based on Dice coefficient maximization, and demonstrate fast and accurate results on prostate MRI test volumes.
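A Dice-based objective can be sketched as follows. The sketch uses the soft Dice form with squared sums in the denominator; the function names and the small `eps` term are assumptions added here for illustration, and this is not the authors' implementation.

```python
import numpy as np

def soft_dice(pred, target, eps=1e-7):
    """Soft Dice coefficient: 2 * sum(p * g) / (sum(p^2) + sum(g^2)).

    pred holds foreground probabilities, target a binary mask.
    eps (an assumption) avoids division by zero on empty volumes.
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    intersection = np.sum(pred * target)
    denom = np.sum(pred ** 2) + np.sum(target ** 2)
    return (2.0 * intersection + eps) / (denom + eps)

def dice_loss(pred, target):
    """Maximizing Dice is the same as minimizing 1 - Dice."""
    return 1.0 - soft_dice(pred, target)

mask = np.zeros((4, 4, 4))
mask[1:3, 1:3, 1:3] = 1.0              # small cubic foreground region
print(dice_loss(mask, mask))           # perfect prediction, loss near 0
print(dice_loss(np.zeros_like(mask), mask))  # nothing predicted, loss near 1
```

Because the score is a ratio of overlap to total foreground mass, it does not need class re-weighting when the foreground occupies only a small part of the volume.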

Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,” arXiv:1606.06650 [cs], Jun. 2016.
3D U-Net
This work proposes a deep network that learns to generate dense volumetric segmentations but only requires sparsely annotated 2D slices for training. This is based on the observation that in many biomedical applications only very few images are required to train a network that generalizes reasonably well, because each image already comprises repetitive structures with corresponding variation. The network is based on the U-Net architecture, with an analysis path and a synthesis path, and extends it to 3D with 3D convolutions, 3D max pooling, and 3D up-convolutional layers. It avoids bottlenecks in the network architecture and uses batch normalization for faster convergence. To enable training on sparse annotations, it uses a weighted softmax loss function that sets the weights of unlabeled pixels to zero, reduces the weights for the background, and increases the weights for the inner tubule. This makes the network learn only from the labeled pixels and hence generalize to the whole volume. The authors augment the data with rotation, scaling, gray value augmentation, and a smooth dense deformation field. The model achieves an average IoU of 0.863 in the semi-automated setup, and performance equivalent to a 2D implementation in the fully automated setup.
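The weighted softmax loss on sparse annotations can be sketched like this. The label convention (`-1` for unlabeled voxels), the example class weights, and the normalization by the sum of weights are all assumptions made for the illustration.

```python
import numpy as np

def weighted_softmax_ce(logits, labels, class_weights, unlabeled=-1):
    """Weighted softmax cross-entropy that ignores unlabeled voxels.

    logits: (N, C) scores per voxel; labels: (N,) class ids, with
    `unlabeled` marking voxels without annotation (weight zero).
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    class_weights = np.asarray(class_weights, dtype=np.float64)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    labeled = labels != unlabeled
    safe_labels = np.where(labeled, labels, 0)   # placeholder index for unlabeled
    weights = np.where(labeled, class_weights[safe_labels], 0.0)
    nll = -log_probs[np.arange(len(labels)), safe_labels]
    total = weights.sum()
    # Normalizing by the total weight is an assumption of this sketch.
    return float((weights * nll).sum() / total) if total > 0 else 0.0

logits = np.array([[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]])
labels = np.array([0, 1, -1])            # third voxel is unlabeled
weights = [0.5, 2.0]                     # background down-weighted, structure up-weighted
print(weighted_softmax_ce(logits, labels, weights))
```

Since unlabeled voxels get zero weight, their predictions contribute neither loss nor gradient, so only the annotated slices drive training.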

Kermany, D. S., Goldbaum, M., Cai, W., Valentim, C. C. S., Liang, H., Baxter, S. L., … Zhang, K. (2018). Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell, 172(5), 1122–1131.e9. https://doi.org/10.1016/j.cell.2018.02.010
classification
This work applied transfer learning to classify macular degeneration and diabetic retinopathy using retinal optical coherence tomography (OCT) images. The authors used an Inception V3 architecture pretrained on ImageNet. The pretrained layers were frozen and connected to fully connected layers. The authors found that fine-tuning the pretrained layers using backpropagation tended to decrease model performance due to overfitting. For the multi-class classification (4 classes: normal, two urgent referrals, and one routine referral), the model achieved an accuracy of 96.6%, with a sensitivity of 97.8%, a specificity of 97.4%, and a weighted error of 6.6%. This work was mainly conducted by clinicians; its main contribution is the collection of a large multi-center dataset with disease classification annotations.

M. Havaei, N. Guizard, N. Chapados, and Y. Bengio, “HeMIS: Hetero-Modal Image Segmentation,” arXiv:1607.05194 [cs], Jul. 2016.
missing imaging modalities; segmentation
One challenge in medical image analysis is missing modalities. For example, MRI has different modalities such as T1, T2, T2 FLAIR, and DWI. A typical task usually requires several modalities to be available. In clinical settings, however, missing modalities are very common, since doctors prescribe MRI sequences according to the specific disease. Two kinds of approaches exist to deal with missing modalities: 1) data synthesis; 2) a robust model that can perform inference without the missing modalities. The paper focuses on the second approach.
It proposes an end-to-end deep learning framework (HeMIS, Hetero-Modal Image Segmentation architecture) that can segment medical images from incomplete multi-modal datasets. The framework consists of three parts:

1) The back end: Each imaging modality goes through a separate convolutional pipeline independently to generate a feature map. This maps each modality into an embedding common to all modalities, within which vector algebra operations carry well-defined semantics (used in part 2).

2) The abstraction layers: For the available feature maps extracted in part 1, the mean and variance across modalities are calculated as the modality fusion.

3) The front end: The merged representation from part 2 goes through another convolutional layer and a softmax layer to produce the final segmentation output.
The objective function is a pixel-wise class cross-entropy loss. To train the network, the authors applied pseudo-curriculum training, where the model starts learning from easy scenarios before turning to more difficult ones. Here the authors first trained the fCNN with all modalities, then randomly dropped one or several modalities to make the model more robust.
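The fusion step in the abstraction layers can be sketched as below. The function name and dictionary layout are hypothetical; using the unbiased variance (`ddof=1`) and defining the variance as zero when only one modality is present are assumptions of this sketch.

```python
import numpy as np

def fuse_modalities(feature_maps):
    """Fuse per-modality feature maps into mean and variance maps.

    feature_maps: dict mapping modality name -> (C, H, W) array, or None
    when that modality is missing. Returns a (2C, H, W) array whose size
    does not depend on how many modalities are available.
    """
    available = [f for f in feature_maps.values() if f is not None]
    if not available:
        raise ValueError("at least one modality is required")
    stack = np.stack(available, axis=0)
    mean = stack.mean(axis=0)
    if len(available) > 1:
        var = stack.var(axis=0, ddof=1)      # assumed unbiased estimate
    else:
        var = np.zeros_like(mean)            # assumed: zero with one modality
    return np.concatenate([mean, var], axis=0)

rng = np.random.default_rng(0)
feats = {"T1": rng.normal(size=(4, 8, 8)),
         "T2": rng.normal(size=(4, 8, 8)),
         "FLAIR": None}                      # missing at inference time
print(fuse_modalities(feats).shape)          # same shape for any subset
```

Because the mean and variance are defined for any non-empty subset of modalities, the front end always receives an input of the same shape, which is what lets modalities be dropped during training and at inference.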


A. Chartsias, T. Joyce, M. V. Giuffrida, and S. A. Tsaftaris, “Multimodal MR Synthesis via Modality-Invariant Latent Representation,” IEEE Transactions on Medical Imaging, vol. 37, no. 3, pp. 803–814, Mar. 2018.
missing imaging modalities; MRI synthesis
As stated in the paper above, there are two main approaches to deal with incomplete imaging modalities. This paper focuses on data synthesis to impute missing images. The authors propose a deep fCNN with multiple inputs and multiple outputs for MR synthesis. This overcomes the limitation of previous image synthesis methods, which only learn mappings between pairs of image modalities.
The fCNN takes aligned images of different modalities as input, and allows users to request an arbitrary combination of modalities as output images. Its structure is somewhat similar to that of the previous paper. It is composed of three stages: encoding, representation fusion, and decoding. In the encoder, all inputs are projected into a shared latent representation space via a small U-Net architecture. In the fusion stage, a pixel-wise max function fuses the latent representations into a single representation. In the decoder, a shallower fCNN maps the fused representation to the required output modality.
The challenge for MRI synthesis is to build a model that can take any subset of the input modalities to produce its output. Simply embedding the inputs into the same representation space does not ensure that they share a meaningful latent representation, i.e., the meaning of the latent representation may still depend on its original modality. The key is the cost function, which pushes the model to produce a latent representation that is independent of the originating modality.
The cost function is the mathematical expression of the following three goals:
1) Each modality’s individual latent representation should produce all outputs as accurately as possible.
This term can be seen as the sum of each input modality’s average reconstruction error across all outputs.
2) The latent representations from all input modalities should be close in the Euclidean sense.
3) The fused latent representation should produce all outputs as accurately as possible.
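The three goals above can be written down as a toy cost. The sketch makes several assumptions that are not stated in these notes: mean absolute error as the reconstruction error, the pixel-wise variance across modalities as the closeness term, equal weighting of the three terms, and a hypothetical `decode` callable that maps a latent representation to a list of output images.

```python
import numpy as np

def synthesis_cost(latents, decode, targets):
    """Toy three-term cost for modality-invariant latent representations.

    latents: list of (C, H, W) arrays, one per input modality.
    decode:  hypothetical callable, latent -> list of output images.
    targets: list of ground-truth output images.
    """
    stack = np.stack(latents, axis=0)

    def recon_error(z):
        # Average reconstruction error across all outputs (MAE assumed).
        outputs = decode(z)
        return float(np.mean([np.mean(np.abs(o - t))
                              for o, t in zip(outputs, targets)]))

    # 1) Each modality's own latent should reconstruct all outputs.
    per_modality = sum(recon_error(z) for z in latents)
    # 2) Latents from different modalities should be close to each other.
    closeness = float(np.mean(np.var(stack, axis=0)))
    # 3) The fused (pixel-wise max) latent should reconstruct all outputs.
    fused_error = recon_error(stack.max(axis=0))
    return per_modality + closeness + fused_error

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8, 8))
toy_decode = lambda latent: [latent.mean(axis=0), latent.max(axis=0)]
targets = toy_decode(z)
print(synthesis_cost([z, z.copy()], toy_decode, targets))   # aligned latents
print(synthesis_cost([z, z + 1.0], toy_decode, targets))    # misaligned latents
```

The second term is what ties the encoders together: without it, each encoder could learn its own private code and the pixel-wise max fusion would mix incompatible representations.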
The model is evaluated on the ISLES and BRATS datasets and demonstrates statistically significant improvements over state-of-the-art methods for single-input tasks.

A. Esteva et al., “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, Feb. 2017.
dermatology, transfer learning
This is one of the early and influential applications of deep learning to medical image analysis. The model is built upon the Inception v3 CNN architecture and uses transfer learning to adapt the model to skin lesion classification. It used about 0.1 M clinical images covering over two thousand diseases. The authors compared the model’s performance with that of 21 dermatologists and showed that it achieves performance on par with all tested doctors. However, the ROC curve problematically compares the discrete judgment of doctors for each image with the continuous probability output by the CNN model. To verify the assertion that “AI surpasses doctors’ performance”, the work would need to provide a more solid and fair comparison with statistical significance.

J. Kawahara and G. Hamarneh, “Fully Convolutional Neural Networks to Detect Clinical Dermoscopic Features,” p. 8, 2018.
dermatology, superpixel classification, segmentation
This paper aims to classify clinical dermoscopic features within superpixels. It reformulates the superpixel classification task as a semantic segmentation problem. The CNN architecture is built upon the VGG16 network. The authors resize selected responses/feature maps throughout the network to match the size of the input image using bilinear interpolation. These resized feature maps are concatenated, allowing the network to directly consider feature maps from several layers. This design is motivated by the authors’ observation that the appearance of dermoscopic features is subtle and may be represented in shallower layers with high spatial resolution. The authors then use 1 x 1 convolution to reduce the number of parameters. To deal with pixel, class, and sample imbalance, the authors used a negative multi-label Sørensen-Dice-F1 loss function. The work achieved 1st place in the 2017 ISIC-ISBI Part 2: Dermoscopic Feature Classification challenge.

J. Kawahara, S. Daneshvar, G. Argenziano, and G. Hamarneh, “7Point Checklist and Skin Lesion Classification using MultiTask MultiModal Neural Nets,” IEEE Journal of Biomedical and Health Informatics, pp. 1–1, 2018.
dermatology, multitask, multimodal
To distinguish between benign and malignant skin tumors, dermatologists use rule-based diagnostic checklists to facilitate diagnosis. This paper aims to classify the 7-point melanoma checklist criteria and perform skin lesion diagnosis with a deep convolutional neural network. To enable the network to utilize multi-modal data (clinical and dermoscopic images, and patient metadata), the authors designed a multi-modal multi-task loss function that considers different combinations of the input modalities. This makes the model robust to missing data at inference time. The experimental results are provided as benchmarks along with the dataset, which is made publicly available.

Z. Li et al., “Thoracic Disease Identification and Localization with Limited Supervision,” arXiv:1711.06373 [cs, stat], Nov. 2017.
Classification and localization
Manual labels for medical images, especially labels with localization, are costly to acquire. This work aims to classify and localize diseases within one model. It adopts the idea of treating an image as a grid of cells and treating each patch as a classification target (based on ideas from YOLO, SSD, and MIL). For the model architecture, the inputs consist of two kinds of chest X-ray images: images with and without bounding boxes. The bounding box information is encoded in the patch-level disease classification: if a patch is covered by the bounding box, its label for the $k$th disease is 1. The input image first goes through a ResNet, which encodes it into a set of abstract feature maps. The feature maps are then up- or downsampled to a $P \times P$ patch grid, and pass through two convolution layers to generate an output tensor of size $P \times P \times K$, which holds the predictions for the $K$ disease labels on the $P \times P$ patches.
The model defines two loss functions, for input images with and without ground-truth bounding boxes. It defines the image-level probabilities given the image disease label with or without the bounding-box information. The probability is the product of the individual grid cells’ probabilities for a disease label. The combined loss function is the cross-entropy loss of the weighted sum of the image-level probabilities in the two cases. The proposed model achieves a significant accuracy improvement over state-of-the-art approaches on classification and localization.
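One plausible reading of the product-based image-level probabilities is sketched below for a single disease $k$. The exact form is an assumption: with a bounding box, patches inside the box should be positive and patches outside negative; without a box, a multiple-instance (noisy-OR) form is assumed, where the image is positive if at least one patch is. The function names are hypothetical.

```python
import numpy as np

def image_prob_with_box(patch_probs, box_mask):
    """Probability of the annotated pattern for one disease.

    patch_probs: (P, P) per-patch probabilities for disease k.
    box_mask:    (P, P) booleans, True where the box covers the patch.
    Assumed form: product of p inside the box times (1 - p) outside it.
    """
    inside = patch_probs[box_mask]
    outside = 1.0 - patch_probs[~box_mask]
    return float(np.prod(inside) * np.prod(outside))

def image_prob_without_box(patch_probs):
    """Assumed noisy-OR form: the image is positive for disease k
    if at least one patch is positive."""
    return float(1.0 - np.prod(1.0 - patch_probs))

probs = np.array([[0.9, 0.1],
                  [0.2, 0.1]])               # a 2 x 2 patch grid
box = np.array([[True, False],
                [False, False]])             # box covers the top-left patch
print(image_prob_with_box(probs, box))
print(image_prob_without_box(probs))
```

In practice, products over many patches become very small, so an implementation would likely work with sums of logarithms; the plain products here are kept for readability.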

T. Würfl et al., “Deep Learning Computed Tomography: Learning Projection-Domain Weights From Image Domain in Limited Angle Problems,” IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1454–1463, Jun. 2018.
Reconstruction, CT
This is an interesting work that combines the traditional image processing pipeline with a learning-based approach. It maps an analytic reconstruction pipeline to a neural network in a one-to-one correspondence.

Y. Sun, Z. Xia, and U. S. Kamilov, “Efficient and accurate inversion of multiple scattering with deep learning,” Optics Express, vol. 26, no. 11, p. 14678, May 2018.
Reconstruction
The reconstruction problem concerns inverting the signals received by sensors to recover the unknown object of interest. In medical imaging, reconstruction is often formulated as a linear inverse problem, but this leads to reconstruction artifacts. When it is formulated as a nonlinear problem, it is commonly solved as an optimization problem whose objective function typically includes a data-fidelity term and a regularizer. However, when the light scattering is strong, the problem becomes highly nonconvex, and the corresponding optimization problem is difficult to solve.
This paper aims to solve the problem in a purely data-driven fashion to meet the requirements of timely and accurate reconstruction. The received signals first go through a backpropagation step that transforms the input from the measurement domain to the image domain. The image-domain representation consists of two feature maps that represent the real and imaginary parts of the backpropagated vector. The feature maps then go through a U-Net for multi-resolution decomposition and local-global composition. Note that besides its wide use in segmentation, the U-Net is also applied to reconstruction problems. The proposed model is trained on simulated data and also succeeds in reconstructing images from real experimental data consisting of highly scattering objects. It also significantly shortens the processing time during inference compared to previous optimization and regularization approaches.

A. Bentaieb and G. Hamarneh, “Adversarial Stain Transfer for Histopathology Image Analysis,” IEEE Transactions on Medical Imaging, vol. 37, no. 3, pp. 792–802, Mar. 2018.
histopathology, color normalization, GAN, style transfer
Color information is central to the visual and automatic analysis of histopathology tissue slides. Previous approaches to deal with the color variation among different histopathological images include: 1) bypassing the problem by using only grayscale images; 2) relying on color-based data augmentation to synthesize new images in an attempt to enforce robustness to color inconsistency; 3) preprocessing with stain normalization techniques. This paper is a proof of concept of the utility of adversarial training in stain color normalization.
The generative model is a U-Net that generates fake images with the same stain as the target image while retaining the structure. The discriminative model is an AlexNet that has two tasks: 1) to discriminate generated images from real images, and 2) to perform a specific image interpretation task. The objective function consists of an adversarial loss, a regularization loss to preserve the structures of generated images, and a task-specific loss that allows the model to simultaneously analyze images and learn task-specific color transformations. The model is evaluated on three datasets and two tasks: tissue segmentation and classification. It achieves superior results in terms of accuracy and quality of normalized images compared to various baselines.

A. BenTaieb and G. Hamarneh, “Predicting Cancer with a Recurrent Visual Attention Model for Histopathology Images,” p. 8.
attention, histopathology
Deep neural networks are built for relatively small images. They do not scale to multi-gigapixel 2D data such as whole slide histopathology images. Previous research approaches this problem with a patch-based representation, which loses contextual information. It also increases the computational cost, as most tissue areas are diagnostically irrelevant.
Inspired by visual attention models and the fixation patterns of experienced histologists, this work presents an attention-based model for predicting cancer from histopathology whole slide images. The authors hypothesize that patch-based analysis of tissue slides should be a sequential process in which a predictive model decides where to focus given the context of the entire tissue and the history of seen patches. They propose a recurrent network with an attention mechanism. At each time step, it receives as input a location parameter output by the previous time step. The location parameter then points to a corresponding glimpse patch. The location parameter and the patch go through two separate networks, and the results are combined by element-wise product to form a fixed-dimensional feature vector that encodes the appearance and spatial information. Thus the system integrates features related to “where” and “what” patterns to look for when predicting the next glimpse location parameters.
The recurrent attention network maintains an internal state as memory. At each time step, based on the current feature vector, it learns to extract the most relevant information from the internal state, and outputs a prediction for the next location.
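The glimpse step described above can be sketched as follows. Everything concrete here is an assumption for illustration: locations are normalized to [-1, 1], the glimpse is a fixed-size square crop, and single linear layers with ReLU (`w_what`, `w_where`) stand in for the two separate networks.

```python
import numpy as np

def extract_glimpse(image, loc, size):
    """Crop a size x size patch centred at loc = (row, col) in [-1, 1]^2,
    shifted as needed so that it stays inside the image."""
    h, w = image.shape
    cy = int(round((loc[0] + 1.0) / 2.0 * (h - 1)))
    cx = int(round((loc[1] + 1.0) / 2.0 * (w - 1)))
    top = min(max(cy - size // 2, 0), h - size)
    left = min(max(cx - size // 2, 0), w - size)
    return image[top:top + size, left:left + size]

def glimpse_feature(image, loc, w_what, w_where, size=8):
    """Combine appearance ("what") and location ("where") features
    by element-wise product into one fixed-dimensional vector."""
    patch = extract_glimpse(image, loc, size).ravel()
    what = np.maximum(w_what @ patch, 0.0)                 # appearance branch
    where = np.maximum(w_where @ np.asarray(loc, dtype=float), 0.0)  # location branch
    return what * where

rng = np.random.default_rng(0)
image = rng.random((32, 32))                 # stand-in for a tissue image
w_what = rng.normal(size=(16, 64))           # hypothetical weights
w_where = rng.normal(size=(16, 2))
feat = glimpse_feature(image, (0.25, -0.5), w_what, w_where)
print(feat.shape)
```

The resulting vector would then be fed to the recurrent unit, which updates its internal state and emits the next location.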
Experiments on a dataset of breast tissue images showed that the proposed model is capable of selectively attending to discriminative regions of tissues and accurately identifying abnormal areas with a limited sequence of visual glimpses.

S. Izadi, K. P. Moriarty, and G. Hamarneh, “Can Deep Learning Relax Endomicroscopy Hardware Miniaturization Requirements?,” arXiv:1806.08338 [cs], Jun. 2018.
superresolution, endomicroscopy
Smaller endoscopic hardware comes at the expense of reduced image quality. The authors aim to use software-based techniques to compensate for hardware miniaturization. The end-to-end fully convolutional network consists of the following parts in sequence: 1) two convolutional layers to extract low-level features; 2) a DenseNet to provide high-level features, where each layer in a dense block receives the concatenation of the outputs of all preceding layers as its input; 3) three upsampling layers using deconvolution; 4) a single convolutional integration layer to consolidate the features across the channels into a single channel. Qualitative and quantitative results indicate that the densely connected convolutional neural network yields significantly higher PSNR and SSIM scores, resulting in super-resolved images of higher quality.
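For reference, PSNR (one of the two reported metrics) can be computed as in this short sketch. The function name and the assumption that intensities lie in [0, `max_value`] are illustrative choices, not details from the paper.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE).

    Assumes both images share the intensity range [0, max_value].
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")                  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

clean = np.zeros((4, 4))
noisy = np.full((4, 4), 0.1)                 # constant error of 0.1
print(psnr(clean, noisy))
```

Higher values mean the super-resolved image is closer to the high-resolution reference in a mean-squared-error sense; SSIM complements it by comparing local structure.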