My learning experience with the Udacity Computer Vision Nanodegree



I completed the Udacity Computer Vision Nanodegree in the April 2018 cohort. It is a 3-month online course that focuses mainly on deep learning techniques for computer vision, with a light touch on traditional computer vision, object tracking, and robot localization. For someone who is a novice in deep learning and computer vision, as I was, it is an excellent introductory course that brings the learner's technical skills and understanding to the next level.

Here is where I was before taking the course: I was a physician who had completed my MD and residency training in neurology. Then I decided to change my career to research, and I am now a Ph.D. student in Interactive Arts and Technology. I had just identified machine learning and computer vision as my real passions, and I wanted to focus my research on medical applications of machine learning and computer vision. At that point, however, my only formal education in computer science was a graduate machine learning course, and I desperately needed more training in state-of-the-art computer vision techniques. As a MOOCer and lifelong learner, I have acquired most of my programming and machine learning skills from online courses. This Computer Vision Nanodegree was one of the few such courses available at the time. Thanks to offerings like it, advanced education can happen in untraditional settings.

Computer vision (CV) is one of the fastest-growing fields because of the advancements in deep learning. It aims to train computers to generate knowledge from images, just as we humans perceive information through our visual system. CV has already revolutionized healthcare, from detecting retinopathy to enhancing radiologists' efficiency in screening for lung cancer. Some doctors have even claimed that the professions of pathologist and radiologist will be replaced by that of the medical imaging information specialist 1. Beyond medical imaging, computer vision has broad applications in healthcare, such as fall detection, hospital and operating room surveillance, and ICU patient monitoring.

The computer vision pipeline includes data pre-processing, selecting areas of interest, feature extraction, and prediction/recognition. In the course, I experimented with ORB (Oriented FAST and Rotated BRIEF) and HOG (histogram of oriented gradients) algorithms, which are traditional (a.k.a. non-deep learning) computer vision algorithms. The descriptors generated from these algorithms are then used for different tasks such as classification or object detection.
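To make the HOG idea concrete, here is a minimal sketch (not the course's code) of the core computation behind a histogram of oriented gradients: image gradients are computed in a small cell, and their magnitudes are accumulated into orientation bins to form a descriptor. Real HOG implementations add block normalization and interpolation, which are omitted here.

```python
import numpy as np

def hog_cell_histogram(patch, n_bins=9):
    """Toy HOG-style histogram of oriented gradients for one cell.

    Gradients are computed with simple finite differences, then their
    magnitudes are binned by unsigned orientation (0-180 degrees).
    """
    patch = patch.astype(float)
    # Finite-difference gradients in x and y.
    gx = np.zeros_like(patch)
    gy = np.zeros_like(patch)
    gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]
    gy[1:-1, :] = patch[2:, :] - patch[:-2, :]

    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees.
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0

    # Accumulate gradient magnitude into orientation bins.
    bin_width = 180.0 / n_bins
    hist = np.zeros(n_bins)
    for mag, ang in zip(magnitude.ravel(), orientation.ravel()):
        hist[int(ang // bin_width) % n_bins] += mag
    return hist

# A patch with vertical stripes: all gradient energy is horizontal,
# so it lands in the 0-degree orientation bin.
patch = np.tile([0, 0, 255, 255, 0, 0, 255, 255], (8, 1))
hist = hog_cell_histogram(patch)
print(hist.argmax())  # → 0
```

Descriptors like this, computed over a grid of cells and concatenated, are what a downstream classifier consumes in the traditional pipeline.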

After becoming familiar with the computer vision pipeline, I came to understand that the disadvantage of the above approach is that it relies heavily on feature engineering, which requires human experts to hand-craft the most distinguishable features. Deep learning aims to save that human effort and extract features automatically by training the model end-to-end, and it has now become the dominant method in computer vision research and applications. For example, to detect faces in an image, the first step in a traditional computer vision pipeline would be generating edges with human-designed filters. In contrast, we can approach the same problem by building a deep convolutional neural network: when visualized, the weights of the first few layers in the network turn out to act as edge detectors too. The deep learning model therefore automates the feature extraction step in the pipeline. As the network goes deeper, it can discover complex patterns based on the information extracted by the previous layers. This is essentially the magic that happens inside deep learning algorithms.
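The parallel between hand-designed filters and learned first-layer weights can be shown with a short sketch (mine, not the course's): the same sliding-window operation that applies a hand-crafted Sobel filter is exactly what a CNN's first convolutional layer computes, except that the CNN learns its kernel values instead of having them specified by a human.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode sliding-window filtering (cross-correlation, as
    implemented in deep learning frameworks' conv layers)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-designed Sobel filter for vertical edges -- the kind of
# pattern that learned first-layer CNN kernels often resemble.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Synthetic image: dark left half, bright right half.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

edges = convolve2d(image, sobel_x)
# The response is strong only at columns straddling the boundary.
print(np.abs(edges).max(axis=0))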

Later in the course, I learned the two primary models of deep learning: CNN (convolutional neural network) and RNN (recurrent neural network). I found the part on LSTM and attention mechanisms are the most enlightening for me. It demonstrated the ideas with simplified mathematical examples, and fostered my understanding with vivid animations and programming quizzes. I think the real highlight of my learning experience is the support from my mentor. The mentorship and code review systems are the distinct features of Udacity. It addresses the drawback of online learning, where the students can’t get feedbacks as well as offline learning. The Udacity platform has one-on-one mentors who will help along the course. When I had questions on a concept, or got stuck with the code, I seek help from my mentor and he could always clear my confuse. In the second project, which was to build image captioning model with CNN and LSTM, I got stuck on a problem for a long time, and my model constantly outputs the same captions for all the input images. Finally, I consulted my mentor, and he pointed out the problem with the learning rate decay. After adjusting the learning rate, voilà, the model finally outputs captions correctly!

  1. Jha, S., & Topol, E. J. (2016). Adapting to Artificial Intelligence: Radiologists and Pathologists as Information Specialists. JAMA, 316(22), 2353–2354. 

Leave a Comment