[Reading] CNN architectures in medical imaging analysis



  1. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. ArXiv:1505.04597 [Cs]. Retrieved from http://arxiv.org/abs/1505.04597

    U-Net; Segmentation

    The task of medical image localization is to identify the location of lesions, i.e., to classify each pixel as lesion or not. The earlier sliding-window approach is computationally slow and redundant, because it divides the whole image into small patches and runs each patch through the CNN separately. It also throws away context information. U-Net builds upon the fully convolutional network (fCNN) and outputs a full segmentation map. The architecture consists of a contracting path (downsampling) to capture context and a symmetric expanding path (upsampling) for precise localization, which yields the U shape. Through skip connections, each upsampling stage also concatenates the corresponding feature map from the downsampling path. The results show that it performs significantly better than the sliding-window convolutional network on several biomedical segmentation tasks.

    U-Net architecture
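
    To make the U shape concrete, the sketch below traces the spatial size of a square feature map through the contracting and expanding paths. It assumes unpadded 3 x 3 convolutions (two per level), 2 x 2 max pooling and 2x up-convolutions, as in the architecture figure of the original paper; the function name and the 572-pixel tile used in the usage note are illustrative and not taken from these notes.

```python
def unet_output_size(input_size, depth=4, convs_per_level=2, kernel=3):
    """Trace the side length of a square feature map through a U-shaped network.

    Assumes unpadded convolutions, 2x2 max pooling and 2x up-convolutions.
    Returns (output_size, skip_sizes, crops), where crops[i] is how many
    pixels the i-th skip feature map (deepest first) must be cropped by
    before it can be concatenated with the upsampled map.
    """
    shrink = convs_per_level * (kernel - 1)  # pixels lost by the convs of one level
    size = input_size
    skip_sizes = []

    # Contracting path: convolutions, remember the map for the skip, then pool.
    for _ in range(depth):
        size -= shrink
        skip_sizes.append(size)
        if size % 2:
            raise ValueError("feature map has odd size before pooling")
        size //= 2

    size -= shrink  # convolutions at the bottom of the U

    # Expanding path: upsample, concatenate the cropped skip map, convolve.
    crops = []
    for skip in reversed(skip_sizes):
        size *= 2
        crops.append(skip - size)
        size -= shrink
    return size, skip_sizes, crops
```

    Under these assumptions, `unet_output_size(572)` gives an output of 388 pixels per side, and the skip feature maps must be cropped before concatenation because the unpadded convolutions shrink the expanding path.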

  2. F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation,” arXiv:1606.04797 [cs], Jun. 2016.

    V-Net, 3D volumetric segmentation

    This work aims to build an end-to-end fully convolutional network for 3D volumetric medical image segmentation. The previous patch-wise classification approach only considers the local context and discards global information. Unlike previous approaches that process 3D input volumes slice-wise, the authors propose to use volumetric convolutions instead.

    V-Net architecture

    The overall architecture is similar to U-Net and consists of a compression path and a decompression path. The 2D convolutions are replaced by 3D convolutions with volumetric kernels. The authors do not use any pooling layers; instead, they apply convolutions over non-overlapping 2 x 2 x 2 volume patches, so that the resolution of the resulting feature maps is halved. The rationale is to reduce the memory footprint during training. The authors also propose a novel objective function based on Dice coefficient maximization, and demonstrate fast and accurate results on prostate MRI test volumes.
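
    A Dice-based objective can be sketched as follows. This is a minimal illustration on flattened voxel lists, assuming the soft Dice form with squared terms in the denominator; the function names and the small `eps` smoothing term are assumptions added here for numerical safety, not details from these notes.

```python
def soft_dice(pred, target, eps=1e-7):
    """Soft Dice coefficient between predicted probabilities and a binary mask.

    pred and target are flattened sequences of the same length.
    """
    intersection = sum(p * g for p, g in zip(pred, target))
    denominator = sum(p * p for p in pred) + sum(g * g for g in target)
    return (2.0 * intersection + eps) / (denominator + eps)


def dice_loss(pred, target):
    """Maximizing Dice is equivalent to minimizing 1 - Dice."""
    return 1.0 - soft_dice(pred, target)
```

    Because the coefficient is a ratio of overlap to total foreground, it does not need per-class re-weighting when the foreground occupies only a small part of the volume.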

  3. Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,” arXiv:1606.06650 [cs], Jun. 2016.

    3D U-Net

    This work proposes a deep network that learns to generate dense volumetric segmentations but requires only sparsely annotated 2D slices for training. This is based on the observation that in many biomedical applications, only very few images are required to train a network that generalizes reasonably well, because each image already comprises repetitive structures with corresponding variation. The network is based on the U-Net architecture with an analysis and a synthesis path, and extends it to 3D with 3D convolutions, 3D max pooling and 3D up-convolutional layers. It avoids bottlenecks in the network architecture and uses batch normalization for faster convergence. To enable training on sparse annotations, it uses a weighted softmax loss function that sets the weights of unlabeled pixels to zero, reduces the weights for the background and increases the weights for the inner tubule. This lets the network learn from the labeled pixels only and still generalize to the whole volume. The authors augment the data with rotation, scaling, gray value augmentation, and a smooth dense deformation field. The model achieves an average IoU of 0.863 in the semi-automated setup, and performance equivalent to a 2D implementation in the fully-automated setup.
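
    The weighted softmax loss can be illustrated with a short sketch. It assumes that unlabeled voxels are marked with a sentinel value and simply skipped (which is the same as giving them weight zero), and that the loss is normalized by the sum of the weights; the `UNLABELED` marker, the function names and the normalization are assumptions made for this example.

```python
import math

UNLABELED = -1  # hypothetical marker for voxels without annotation


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_softmax_loss(logits_per_voxel, labels, class_weights):
    """Weighted cross-entropy in which unlabeled voxels contribute nothing."""
    loss, weight_sum = 0.0, 0.0
    for logits, label in zip(logits_per_voxel, labels):
        if label == UNLABELED:
            continue  # weight zero: no loss and no gradient from this voxel
        weight = class_weights[label]
        loss += -weight * math.log(softmax(logits)[label])
        weight_sum += weight
    return loss / weight_sum if weight_sum else 0.0
```

    With this masking, whatever the network predicts on unlabeled voxels has no effect on the loss, so only the annotated slices drive training.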

  4. Kermany, D. S., Goldbaum, M., Cai, W., Valentim, C. C. S., Liang, H., Baxter, S. L., … Zhang, K. (2018). Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell, 172(5), 1122-1131.e9. https://doi.org/10.1016/j.cell.2018.02.010


    This work applied transfer learning to classify macular degeneration and diabetic retinopathy using retinal optical coherence tomography (OCT) images. The authors used an Inception V3 architecture pretrained on ImageNet. The pretrained layers were frozen and connected to new fully-connected layers. The authors found that fine-tuning the pretrained layers with backpropagation tended to decrease model performance due to overfitting. For the multi-class classification (4 classes: normal, two urgent referrals, and one routine referral), the model achieved an accuracy of 96.6%, with a sensitivity of 97.8%, a specificity of 97.4%, and a weighted error of 6.6%. This work was mainly conducted by clinicians; its main contribution is the collection of a large multi-center dataset with disease classification annotations.

  5. M. Havaei, N. Guizard, N. Chapados, and Y. Bengio, “HeMIS: Hetero-Modal Image Segmentation,” arXiv:1607.05194 [cs], Jul. 2016.

    missing imaging modalities; segmentation

    One challenge in medical image analysis is missing modalities. For example, MRI has different modalities such as T1, T2, T2 FLAIR, and DWI. A typical task usually requires several modalities to be available. In clinical settings, however, missing modalities are very common, since doctors prescribe MRI sequences according to the specific disease. Two kinds of approaches exist to deal with missing modalities: 1) data synthesis; 2) a robust model that can run inference without the missing modalities. The paper focuses on the second approach.

    It proposes an end-to-end deep learning framework (HeMIS, Hetero-Modal Image Segmentation architecture) that can segment medical images from incomplete multi-modal datasets. The framework consists of three parts:

    Hetero-Modal Image Segmentation architecture

    1. The back end: Each imaging modality goes through a separate convolutional pipeline independently to generate a feature map. This maps each modality into an embedding common to all modalities, within which vector algebra operations carry well-defined semantics (in part 2).

    2. The abstraction layers: For the feature maps available from part 1, the mean and variance across modalities are calculated as the modality fusion.

    3. The front end: The merged modalities from part 2 go through another convolutional layer and a softmax layer to produce the final segmentation output.

    The objective function is a pixelwise class cross-entropy loss. To train the network, the authors applied pseudo-curriculum training, in which the model starts learning from easy scenarios before turning to more difficult ones. Here the authors first trained the fCNN with all modalities, then randomly dropped one or several modalities to make the model more robust.
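
    The abstraction layer and the random dropping of modalities can be sketched as below. Feature maps are represented as flattened lists, and the sketch assumes the unbiased variance estimate, defined as zero when only one modality is available; the function names and the `keep_prob` parameter are hypothetical.

```python
import random


def fuse_modalities(feature_maps):
    """Fuse the available modalities into [mean..., variance...].

    feature_maps maps a modality name to its flattened feature map. The
    output length is the same no matter how many modalities are present.
    """
    maps = list(feature_maps.values())
    if not maps:
        raise ValueError("at least one modality must be available")
    n = len(maps)
    columns = list(zip(*maps))  # values of all modalities at each position
    mean = [sum(col) / n for col in columns]
    if n == 1:
        variance = [0.0] * len(mean)  # assumed convention for a single modality
    else:
        variance = [sum((v - m) ** 2 for v in col) / (n - 1)
                    for col, m in zip(columns, mean)]
    return mean + variance


def drop_modalities(feature_maps, rng, keep_prob=0.5):
    """Randomly drop modalities for training, always keeping at least one."""
    kept = {k: v for k, v in feature_maps.items() if rng.random() < keep_prob}
    if not kept:
        name = rng.choice(sorted(feature_maps))
        kept = {name: feature_maps[name]}
    return kept
```

    Because the mean and variance are defined for any non-empty subset of modalities, the front end always receives an input of the same shape, which is what allows segmentation from incomplete inputs.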

  6. A. Chartsias, T. Joyce, M. V. Giuffrida, and S. A. Tsaftaris, “Multimodal MR Synthesis via Modality-Invariant Latent Representation,” IEEE Transactions on Medical Imaging, vol. 37, no. 3, pp. 803–814, Mar. 2018.

    missing imaging modalities; MRI synthesis

    As stated for the paper above, there are two main approaches to deal with incomplete imaging modalities. This paper focuses on data synthesis to impute missing images. The authors propose a deep fCNN with multiple inputs and multiple outputs for MR synthesis. This overcomes the limitation of previous image synthesis methods, which only learn mappings between pairs of image modalities.


    The fCNN takes aligned images of different modalities as input, and allows users to request an arbitrary combination of modalities as output images. Its structure is somewhat similar to that of the previous paper. It is composed of three stages: encoding, representation fusion, and decoding. In the encoder, all inputs are projected into a shared latent representation space via a small U-Net architecture. In the fusion stage, a pixel-wise max function fuses the latent representations into a single representation. In the decoder, a shallower fCNN maps the fused representation to the required output modality.

    The challenge for MRI synthesis is to build a model that can take any subset of the input modalities and still produce its outputs. Simply embedding the inputs into the same representation space does not ensure that they share a meaningful latent representation, i.e., the meaning of a latent representation may still depend on its original modality. The key is the cost function, which drives the model toward a latent representation that is independent of the originating modality.

    The cost function is the mathematical expression of the following three goals:

    1) Each modality’s individual latent representation should produce all output as accurately as possible.

    It can be seen as the sum of each input modality’s average reconstruction error across all outputs.

    2) The latent representations from all input modalities should be close in the Euclidean sense.

    3) The fused latent representation should produce all outputs as accurately as possible.

    The model is evaluated on the ISLES and BRATS datasets and demonstrates statistically significant improvements over state-of-the-art methods for single-input tasks.
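
    The pixel-wise max fusion and the three cost terms listed above can be sketched as follows. This is an illustration on flattened lists, assuming mean absolute error for the reconstruction terms, the mean per-pixel variance as the measure of latent closeness, and equal weighting of the three terms; all function names are hypothetical and `decoders` stands in for the trained decoder networks.

```python
def fuse_max(latents):
    """Pixel-wise maximum over the latent representations of all inputs."""
    return [max(values) for values in zip(*latents)]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def latent_variance(latents):
    """Mean per-pixel variance across the latent representations (goal 2)."""
    n = len(latents)
    total = 0.0
    for values in zip(*latents):
        mean = sum(values) / n
        total += sum((v - mean) ** 2 for v in values) / n
    return total / len(latents[0])


def total_cost(latents, decoders, targets):
    """Sum of the three goals; decoders[k] maps a latent to output k."""
    def recon_error(z):
        errors = [mean_abs_error(dec(z), tgt) for dec, tgt in zip(decoders, targets)]
        return sum(errors) / len(errors)

    cost_individual = sum(recon_error(z) for z in latents)  # goal 1
    cost_latent = latent_variance(latents)                  # goal 2
    cost_fused = recon_error(fuse_max(latents))             # goal 3
    return cost_individual + cost_latent + cost_fused
```

    The second term is what pushes the encoders of different modalities toward the same latent values, so that the max fusion remains meaningful whichever inputs are present.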

  7. A. Esteva et al., “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, 02 2017.

    dermatology, transfer learning

    This is one of the early and influential applications of deep learning to medical image analysis. The model is built upon the Inception v3 CNN architecture, with transfer learning to adapt the model to skin lesion classification. It used roughly 0.1 million clinical images covering over two thousand diseases. The authors compared the model’s performance with that of 21 dermatologists and showed that it achieves performance on par with all tested doctors. However, the ROC plot problematically compares the discrete judgment of the doctors for each image with the continuous probability output by the CNN model. To verify the assertion that “AI surpasses doctors’ performances”, the work would need to provide a more solid and equal comparison, with statistical significance.
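
    The asymmetry in that comparison can be seen in a small sketch: continuous scores yield one (false positive rate, true positive rate) point per threshold, whereas a set of yes/no decisions yields a single operating point. The function names and the toy data in the usage note are illustrative only.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by thresholding continuous scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def operating_point(decisions, labels):
    """The single (FPR, TPR) point produced by binary decisions."""
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = sum(1 for d, y in zip(decisions, labels) if d == 1 and y == 1)
    fp = sum(1 for d, y in zip(decisions, labels) if d == 1 and y == 0)
    return (fp / negatives, tp / positives)
```

    For example, `roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])` traces a curve of five points, while `operating_point([1, 0, 1, 1], [1, 0, 1, 0])` is the single point `(0.5, 1.0)`. Whether a point lies above or below a curve also depends on which threshold the model would actually be deployed at.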

  8. J. Kawahara and G. Hamarneh, “Fully Convolutional Neural Networks to Detect Clinical Dermoscopic Features,” p. 8, 2018.

    dermatology, superpixel classification, segmentation

    This paper aims to classify clinical dermoscopic features within superpixels. It reformulates the superpixel classification task as a semantic segmentation problem. The CNN architecture is built upon the VGG16 network. The authors resize selected responses/feature maps throughout the network to match the size of the input image using bilinear interpolation. These resized feature maps are concatenated, which allows the model to directly consider feature maps from several network layers. This design is motivated by the authors’ observation that the appearance of dermoscopic features is subtle and may be represented in shallower layers with high spatial resolution. The authors then use 1 x 1 convolutions to reduce the number of parameters. To deal with pixel, class and sample imbalance, the authors used a negative multi-label Sorensen-Dice-F1 loss function. The work achieved 1st place in the 2017 ISIC-ISBI part 2: Dermoscopic Feature Classification challenge.

  9. J. Kawahara, S. Daneshvar, G. Argenziano, and G. Hamarneh, “7-Point Checklist and Skin Lesion Classification using Multi-Task Multi-Modal Neural Nets,” IEEE Journal of Biomedical and Health Informatics, pp. 1–1, 2018.

    dermatology, multi-task, multi-modal

    To distinguish between benign and malignant skin tumors, dermatologists use rule-based diagnostic checklists to facilitate diagnosis. This paper aims to classify the 7-point melanoma checklist criteria and perform skin lesion diagnosis with a deep convolutional neural network. To enable the network to utilize multi-modal data (clinical and dermoscopic images, and patient meta-data), the authors designed a multi-modal multi-task loss function that considers different combinations of the input modalities. This makes the model robust to missing data at inference time. The experimental results are benchmarked, and the dataset is made available.

  10. Z. Li et al., “Thoracic Disease Identification and Localization with Limited Supervision,” arXiv:1711.06373 [cs, stat], Nov. 2017.

    Classification and localization

    Manual labels for medical images, especially labels with localization, are costly to acquire. This work aims to classify and localize disease within one model. It adopts the idea of handling an image as a group of grid cells and treating each patch as a classification target (based on ideas from YOLO, SSD, and MIL). For the model architecture, the inputs consist of two kinds of chest X-ray images: images with and without bounding boxes. The bounding-box information is encoded in the patch disease classification: if a patch is covered by the bounding box, its $k$th disease classification is 1. The input image first goes through a ResNet, which encodes it into a set of abstract feature maps. The feature maps are then up- or down-sampled to a $P \times P$ patch grid and pass through two convolution layers to generate the output tensor of size $P \times P \times K$, holding the $K$ disease label predictions for the $P \times P$ patches.

    The model defines two loss functions, for input images with and without ground-truth bounding boxes. It defines the image-level probability of a disease label with or without the bounding-box information. The probability is the product of the individual grid cells’ probabilities for a disease label. The combined loss function is the cross-entropy loss of the weighted sum of the image-level probabilities in the two cases. The proposed model achieves a significant accuracy improvement over state-of-the-art approaches on classification and localization.
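
    One way to read the two image-level probabilities is the multiple-instance sketch below. It assumes that, with a bounding box, the patches inside the box should be positive and the patches outside negative, and that, without a box, an image is positive if at least one patch is positive. This formulation and the function names are assumptions for illustration; these notes state only that the probability is a product over grid cells.

```python
def prob_with_box(patch_probs, in_box):
    """Image-level probability for one disease when a bounding box is given.

    patch_probs holds the flattened P x P patch probabilities; in_box marks
    the patches covered by the box.
    """
    probability = 1.0
    for p, inside in zip(patch_probs, in_box):
        probability *= p if inside else (1.0 - p)
    return probability


def prob_without_box(patch_probs):
    """Probability that at least one patch shows the disease."""
    all_negative = 1.0
    for p in patch_probs:
        all_negative *= (1.0 - p)
    return 1.0 - all_negative
```

    In practice a product over many patches becomes very small, so an implementation would likely work with sums of logarithms or rescale the per-patch terms.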

  11. T. Würfl et al., “Deep Learning Computed Tomography: Learning Projection-Domain Weights From Image Domain in Limited Angle Problems,” IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1454–1463, Jun. 2018.

    Reconstruction, CT

    This is an interesting work that combines the traditional image processing pipeline with a learning-based approach. It maps an analytic reconstruction pipeline onto a neural network in one-to-one correspondence.

  12. Y. Sun, Z. Xia, and U. S. Kamilov, “Efficient and accurate inversion of multiple scattering with deep learning,” Optics Express, vol. 26, no. 11, p. 14678, May 2018.


    The reconstruction problem concerns inverting the signals received by sensors to recover the unknown object of interest. In medical imaging, reconstruction is often formulated as a linear inverse problem, but this leads to reconstruction artifacts. When formulated as a nonlinear problem, it is commonly solved as an optimization problem. The objective function typically includes a data-fidelity term and a regularizer. However, when the light scattering is strong, the problem becomes highly nonconvex, and the corresponding optimization problem is difficult to solve.

    This paper aims to solve the problem in a purely data-driven fashion to meet the requirements of timely and accurate reconstruction. The received signals first go through a backpropagation step that transforms the input from the measurement domain to the image domain. The image-domain input consists of two feature maps that represent the real and imaginary parts of the backpropagated vector. The feature maps then go through a U-Net for multi-resolution decomposition and local-global composition. Note that besides its wide use in segmentation, the U-Net is also applied to reconstruction problems. The proposed model is trained on simulated data, yet it also succeeds in reconstructing images from real experimental data of highly scattering objects. It also significantly shortens the processing time during inference compared to previous optimization and regularization approaches.

  13. A. Bentaieb and G. Hamarneh, “Adversarial Stain Transfer for Histopathology Image Analysis,” IEEE Transactions on Medical Imaging, vol. 37, no. 3, pp. 792–802, Mar. 2018.

    histopathology, color normalization, GAN, style transfer

    Color information is central to the visual and automatic analysis of histopathology tissue slides. Previous approaches to dealing with the color variation among histopathological images include: 1) bypassing the problem by using only grayscale images; 2) relying on color-based data augmentation to synthesize new images in an attempt to enforce robustness to color inconsistency; 3) pre-processing with stain normalization techniques. This paper is a proof of concept of the utility of adversarial training for stain color normalization.

    The generative model is a U-Net that generates fake images with the same stain as the target image while retaining the structure. The discriminative model is an AlexNet that has two tasks: 1) to discriminate generated images from real images, and 2) to perform a specific image interpretation task. The objective function consists of an adversarial loss, a regularization loss to preserve the structure of the generated images, and a task-specific loss that allows the model to simultaneously analyze images and learn task-specific color transformations. The model is evaluated on three datasets and two tasks: tissue segmentation and classification. It achieves superior results in terms of accuracy and quality of the normalized images compared to various baselines.

  14. A. BenTaieb and G. Hamarneh, “Predicting Cancer with a Recurrent Visual Attention Model for Histopathology Images,” p. 8.

    attention, histopathology

    Deep neural networks are built for relatively small images. They do not scale to multi-gigapixel 2D data such as whole slide histopathology images. Previous research approaches this problem with a patch-based representation, which loses contextual information. It also increases the computational cost, as most tissue areas are diagnostically irrelevant.

    Inspired by visual attention models and the fixation patterns of experienced histologists, this work presents an attention-based model for predicting cancer from histopathology whole slide images. The authors hypothesize that patch-based analysis of tissue slides should be a sequential process in which a predictive model decides where to focus given the context of the entire tissue and the history of seen patches. They propose a recurrent network with an attention mechanism. At each time step, it receives as input a location parameter output by the previous time step. The location parameter points to a corresponding glimpse patch. The location parameter and the patch go through two separate networks, and the results are combined by element-wise product to form a fixed-dimensional feature vector that encodes appearance and spatial information. Thus the system integrates features related to the “where” and “what” patterns to look for when predicting the next glimpse location parameters.

    The recurrent attention network maintains an internal state as memory. At each time step, based on the current feature vector, it learns to extract the most relevant information from the internal state and outputs a prediction for the next location.

    Experiments on a dataset of breast tissue images showed that the proposed model is capable of selectively attending to discriminative regions of tissues and accurately identifying abnormal areas with a limited sequence of visual glimpses.

  15. S. Izadi, K. P. Moriarty, and G. Hamarneh, “Can Deep Learning Relax Endomicroscopy Hardware Miniaturization Requirements?,” arXiv:1806.08338 [cs], Jun. 2018.

    super-resolution, endomicroscopy

    Smaller endoscopic hardware comes at the expense of reduced image quality. The authors aim to use software-based techniques to relax the hardware miniaturization requirements. The end-to-end fully convolutional network consists of the following parts in sequence: 1) two convolutional layers to extract low-level features; 2) a DenseNet to provide high-level features, in which each layer of a dense block receives the concatenation of the outputs of all preceding layers as input; 3) three upsampling layers using deconvolution; 4) a single convolutional integration layer to consolidate the features across the channels into a single channel. Qualitative and quantitative results indicate that the densely-connected convolutional neural network yields significantly higher PSNR and SSIM scores, resulting in super-resolved images of higher quality.
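
    For reference, PSNR, one of the two reported quality metrics, follows directly from the mean squared error between a reference image and its reconstruction. The sketch below works on flattened pixel lists; the function name and the default `max_value` of 255 (8-bit images) are assumptions for this example.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in decibels between two flattened images."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

    A higher PSNR means a smaller pixel-wise error; SSIM complements it by comparing local structure rather than raw intensity differences.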

  16. J. J. Titano et al., “Automated deep-neural-network surveillance of cranial images for acute neurologic events,” Nature Medicine, p. 1, Aug. 2018.

    neurology, triage, natural language processing

    One of the major challenges in computer-assisted diagnosis (CAD) is false positives caused by unbalanced and ambiguous labels, which are given for the whole image rather than for the specific region containing pathology. This work sought to avoid the challenges of CAD entirely by focusing on using a CNN to optimize the triage of studies for radiologists.

    The 3D-CNN model was based on ResNet and trained to identify whether an image contained acute neurological or noncritical findings for the purpose of triage. The training set was 37,236 images annotated with an automated NLP pipeline that infers labels from the reports (silver standard). The model had an AUC of 0.88 in predicting the critical findings. It was then tested on a test set labeled by doctors (gold standard), on which it had an AUC of 0.73.

    Furthermore, to assess whether a deep-learning model could meaningfully triage studies, a randomized controlled trial (RCT) was conducted in a simulated clinical environment. On average, the model processed a study 150 times faster than humans did. The model also enabled more urgent studies to appear earlier in the prioritized workflow list. However, as the authors mention, such a system is still idealized and inconsistent with the actual workflow of radiologists. Validating it requires further study in a formal multicenter clinical trial.

  17. P. Rajpurkar et al., “CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning,” arXiv:1711.05225 [cs, stat], Nov. 2017.

    CheXNet, detection, X-ray

    This work classifies diseases from chest X-rays. The authors are from the Stanford AI in medicine and imaging group. It is another paper that compares an AI model’s performance with that of human doctors.

    To detect and classify pneumonia-like features, the authors built upon the 121-layer DenseNet pre-trained on ImageNet and fine-tuned the model. They replaced the final fully-connected layer with a new fully-connected layer with a single output and a sigmoid nonlinearity, which gives the probability of pneumonia. They named the model CheXNet.

    The loss function is the weighted binary cross-entropy loss. The loss term for each class is weighted by the frequency of the opposite class in the training set, so that the rarer class carries more weight. This is to overcome the label imbalance in the dataset (although the authors did not report the exact proportion of each class).

    The data is the ChestX-ray14 dataset, which contains 112,120 frontal-view X-rays. The labels are 14 different thoracic pathology labels extracted automatically from radiology reports. Images that contain pneumonia-like features are labeled 1, and all others are labeled 0.
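
    The weighting can be sketched as follows, assuming the positive term is scaled by the fraction of negative training examples and the negative term by the fraction of positive examples, so that the rarer class is up-weighted. The function names and the clipping constant `eps` are assumptions for this example.

```python
import math


def class_weights(labels):
    """Return (w_pos, w_neg) computed from binary training labels."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    total = len(labels)
    return n_neg / total, n_pos / total


def weighted_bce(prob, label, w_pos, w_neg, eps=1e-12):
    """Weighted binary cross-entropy for one predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(w_pos * label * math.log(prob)
             + w_neg * (1 - label) * math.log(1.0 - prob))
```

    With one positive example in four, `class_weights([1, 0, 0, 0])` returns `(0.75, 0.25)`, so a mistake on the positive example costs three times as much as the same mistake on a negative one.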

    The test set was labeled manually by 4 radiologists on all 14 pathology labels. They could access only the X-ray, without any other patient information or knowledge of the prevalence of the disease in the dataset. The ground-truth label was defined as the majority vote of the 4 radiologists plus CheXNet.

    To compare the performance of CheXNet with that of the radiologists, the authors computed the F1 scores of the 4 radiologists and of CheXNet, and used the bootstrap to construct 95% confidence intervals (CI). They found that CheXNet achieves an F1 score of 0.435, whereas the radiologists’ average is 0.387. To test whether the two F1 scores are significantly different, they computed the difference of the F1 scores on the bootstrap samples. Since the 95% CI of the difference does not contain 0, they concluded that the performance of CheXNet is statistically significantly higher than that of the radiologists.

    They also trained another model on all 14 classes; its per-class AUROC matches or outperforms the previous benchmark.
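
    The significance test can be sketched with a percentile bootstrap. This is a minimal illustration with hypothetical function names, a fixed seed and 1000 resamples chosen for this example; it does not reproduce the paper's exact procedure.

```python
import random


def f1_score(preds, truth):
    """F1 score for binary predictions against binary ground truth."""
    tp = sum(1 for p, t in zip(preds, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, truth) if p == 0 and t == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def bootstrap_f1_difference(preds_a, preds_b, truth, n_resamples=1000, seed=0):
    """95% percentile interval of F1(a) - F1(b) over resampled test sets."""
    rng = random.Random(seed)
    n = len(truth)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        t = [truth[i] for i in idx]
        a = [preds_a[i] for i in idx]
        b = [preds_b[i] for i in idx]
        diffs.append(f1_score(a, t) - f1_score(b, t))
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return lower, upper
```

    If the returned interval excludes 0, the difference between the two readers is taken to be statistically significant at the 5% level; both readers are scored on the same resampled images, so the comparison is paired.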
