By Frank La Vigne | March 2018
In my last column, I explored the Jupyter Notebook, an open source, browser-based software solution that allows users to create and share documents that contain live code, visualizations and text. While not ideal for creating applications, Jupyter Notebooks are a great way to explore and experiment with data. Think of them as a kind of interactive “scratch pad” for data science. Jupyter Notebooks provide a common format for data scientists to share code, insights and documentation. Many of the popular machine learning libraries, such as CNTK and TensorFlow, provide Jupyter Notebooks as documentation to facilitate learning how to use them. Fortunately, there’s an easy way to get access to numerous sample notebooks for all of the most popular libraries—all in one place and without installing software.
Data science and machine learning require new tools, many of which may be unfamiliar to .NET developers. Fortunately, Microsoft Azure provides a virtual machine (VM) image—called the Data Science Virtual Machine (DSVM)—that’s pre-loaded with a number of tools and utilities for data science and machine learning work. The DSVM image makes it easy to get started with deep learning technologies, and includes the Microsoft Cognitive Toolkit (CNTK), TensorFlow, Keras, PyTorch and more, already built, installed and configured. Each toolkit is primed for use.
Best of all, and most relevant to this article, numerous sample Jupyter Notebooks are also included. In short, the DSVM provides an easy way to get started without having to install anything on your local machine. The DSVM image is available in Windows, Ubuntu and CentOS. In this article, I will focus on the Ubuntu version, as I find it to be the most thorough. You can find a complete list of tools installed on the DSVM image at bit.ly/2Dc5uA4.
Creating a DSVM is easy. To get started, log into the Azure Portal and click New, then type “Data Science Virtual Machine” in the search textbox and hit enter. You should see search results similar to Figure 1. Click on the item Data Science Virtual Machine for Linux (Ubuntu).
Figure 1 Searching the Marketplace for the Data Science Virtual Machine Image
To create an instance of the DSVM, follow the instructions on the Create Virtual Machine blade. Provide a name for the VM and a username, and choose “Password” for the authentication type. Create a new resource group for the DSVM or use an existing one. Note the resource group name along with the username and password, as you will need them later. Leave the other options at their default values.
Click OK to move to the next step: picking a size for the VM. For now, size doesn’t matter, so just go with a size that fits your budget. If you don’t see an affordable option, click “View all” to see all VM configurations. Click Select to choose a size and configuration. On the third step, leave everything at the defaults and click OK. The final step is the summary screen. It’s worth noting that some Azure plans do not include the DSVM, in which case your account will be charged separately. The summary screen will inform you if that’s the case. Click Create to instantiate the VM.
Connecting to the DSVM GUI
The best way to get the most out of the DSVM is to connect to it graphically. While the blade for the VM instance on the portal provides information on how to connect via an SSH service, it doesn’t offer any guidance on connecting to it via a graphical shell. Fortunately, there’s extensive documentation on how to do this using Remote Desktop for Windows (bit.ly/2Davn3j). The DSVM Ubuntu image has the xfce4 desktop environment already installed, and you created a local user account when provisioning the VM. All that’s left to do now is configure the remote desktop service to listen for incoming Remote Desktop Protocol (RDP) connections. To do this, you’ll need to enable xrdp, an open source RDP server that works with xfce.
To install xrdp on the Ubuntu DSVM, you must connect to it via an SSH service, which on Windows often requires a terminal program like PuTTY (putty.org). If your PC is running Windows 10 version 1709 or later, you can opt instead to install Ubuntu on Windows via the Store (bit.ly/2Dm8fSR). To get the connection information, click the Connect button on the blade for the DSVM instance to bring up the dialog shown in Figure 2.
Figure 2 Getting the SSH Connection Information for the DSVM
In the terminal window, enter the connection information and, when prompted, enter “yes” to trust the connection, followed by the password for the username. To install and enable the xrdp service to work with xfce4, enter the following three lines:
If prompted, respond with “Y” to allow the use of additional disk space. Once xrdp is installed on the DSVM, you need to create a rule in the Network Security Group to allow RDP traffic through to the VM. This can be done via the Azure Portal (bit.ly/2mDQ1lt), or by issuing the following statement through the Azure CLI:
Be sure to replace “ResourceGroup” and “DSVM-Name” with the name of the Resource Group and VM name from earlier. Once that process completes, open the Remote Desktop Connection application, enter the IP address of the DSVM instance and click Connect. If prompted, choose to trust the machine and enter the credentials for the account created along with the DSVM.
All this configuration activity on the Ubuntu CLI may seem superfluous for a column about artificial intelligence, but bear in mind that many tools in this space assume a basic familiarity with Linux and the Bash command-line shell. The sooner you feel comfortable with Bash and Linux, the sooner you’ll be productive.
If you wish to avoid the preceding steps to install and configure xrdp, X2Go (wiki.x2go.org) is an alternative for Mac and Windows users. X2Go communicates with the DSVM over SSH, so there’s no need to install anything on the VM or to alter the Network Security Group. Simply install X2Go on your local desktop and connect with the IP address and username for the DSVM.
Once connected to the DSVM, click on the Applications menu in the upper-left corner of the GUI. Click on Development and click on JupyterHub on the resulting submenu. JupyterHub is a multi-user hub for managing multiple instances of the single-user Jupyter Notebook server. It can be used to serve notebooks to a group of users, such as students in a class, a research group or a team of data scientists.
A terminal window now opens and the system’s default browser (Firefox) opens to http://localhost:8888/tree. Click on the CNTK folder to access the Microsoft Cognitive Toolkit samples. Next, click on the CNTK_101_LogisticRegression.ipynb file to access the Logistic Regression and ML Primer notebook, which contains a tutorial for those new to machine learning and to the CNTK. This tutorial notebook is written for Python. If you receive the error, “Can’t execvp Jupyter: No such file or directory,” you’ll have to use X2Go to continue.
Classifying Tumors
The problem posed in the CNTK_101_LogisticRegression notebook centers on classifying tumor growths as either malignant or benign. This is a classification problem, specifically a binary classification, as there are only two output classifications. To classify the type of growth in each patient, the hospital has provided the patient’s age and the size of his or her tumor. The working hypothesis is that younger patients and patients with smaller tumors are less likely to have a malignant growth. Figure 3 shows how each patient in the data set is represented as a dot in the plot in the notebook. Red dots indicate malignant growths and blue dots indicate benign. The goal is to create a binary classifier that separates the malignant growths from the benign ones, as the green line does in the scatter plot from the notebook’s Out cell, reproduced in Figure 3.
Figure 3 Scatter Plot with Binary Classifier Indicated by Green Line
The notebook includes a caution stating that the data set used here is merely a sample for educational purposes. A production classification system for determining the status of growths would involve more data points, features, test results and input from medical personnel to make the final diagnosis.
The Five Stages of a Machine Learning Project
Machine learning projects generally break down into five stages: reading the data, shaping the data, creating a model, learning the model’s parameters and evaluating the model’s performance. Reading the data entails loading the data set into a structure. For Python, this generally means a Pandas DataFrame (bit.ly/2EPC8rI). A DataFrame is essentially a tabular data structure consisting of rows and columns.
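As a rough illustration of the reading stage, the sketch below loads a small table of rows and columns using only the Python standard library. The CSV text, the column names and the read_rows helper are all hypothetical and are not taken from the notebook. With pandas installed, pandas.read_csv would load the same text into a DataFrame.

```python
import csv
import io

# Hypothetical CSV content standing in for a real data file.
CSV_TEXT = """age,tumor_size_cm,diagnosis
35,1.2,benign
62,4.8,malignant
48,2.5,benign
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, converting numeric columns."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        rows.append({
            "age": int(record["age"]),
            "tumor_size_cm": float(record["tumor_size_cm"]),
            "diagnosis": record["diagnosis"],
        })
    return rows


rows = read_rows(CSV_TEXT)
print(len(rows), "rows; columns:", list(rows[0]))
```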
The second step is shaping the data from the input format into a format that the machine learning algorithm accepts. Often, this process is referred to as “data cleansing” or “data munging.” Very often this phase consumes the bulk of the time and effort in machine learning projects.
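To make the shaping stage concrete, here is a minimal sketch under a few assumptions: the records, the choice of min-max scaling and the 0/1 label encoding are illustrative only, and the notebook may prepare its data differently.

```python
# Hypothetical raw records: (age in years, tumor size in cm, label).
raw_rows = [
    (35, 1.2, "benign"),
    (62, 4.8, "malignant"),
    (48, 2.5, "benign"),
    (71, 5.5, "malignant"),
]


def min_max_scale(values):
    """Rescale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def shape(rows):
    """Turn raw records into numeric (features, labels) lists."""
    ages = min_max_scale([r[0] for r in rows])
    sizes = min_max_scale([r[1] for r in rows])
    features = [[a, s] for a, s in zip(ages, sizes)]
    # Encode the text label as 1 (malignant) or 0 (benign).
    labels = [1 if r[2] == "malignant" else 0 for r in rows]
    return features, labels


features, labels = shape(raw_rows)
```

Scaling both features to a common range keeps one of them (here, age) from dominating the other simply because its raw numbers are larger.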
It’s not until the third stage that the actual machine learning work begins. In this notebook, I’ll create a model to separate the benign growths from the malignant ones. Logistic Regression is a simple linear model that takes input values (represented by the blue circles in Figure 4) of what I’m classifying and computes an output. Each of the input values has a different degree of weight on the output, depicted in Figure 4 as line thickness.
Figure 4 Diagram of Logistic Regression Algorithm
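The weighted-sum idea pictured in Figure 4 can be sketched in a few lines of plain Python. This is the common textbook form with a single sigmoid output, not the notebook's CNTK code, and the weights and bias shown are made-up values chosen only for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical parameters: both inputs push the output toward "malignant".
example_weights = [4.0, 4.0]
example_bias = -4.0

p_small = predict_probability([0.1, 0.1], example_weights, example_bias)
p_large = predict_probability([0.9, 0.9], example_weights, example_bias)
print(p_small, p_large)
```

A larger weight makes the output more sensitive to that input, which corresponds to the thicker lines in the diagram.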
The next step is to minimize the error, or loss, using an optimization technique. This notebook uses Stochastic Gradient Descent (SGD), a popular technique that usually begins by randomly initializing the model parameters. In this case, the model parameters are the weights and biases. For each row in the data set, the SGD optimizer can calculate the error between the predicted value and the corresponding true value. The algorithm will then apply gradient descent to create new model parameters after each observation. SGD is explained in further detail inside the notebook and in a YouTube video by Siraj Raval (bit.ly/2B8lHEz).
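The update loop described above might look like the following sketch. It assumes a single-output logistic model with log loss, and the toy data set, learning rate, epoch count and train_sgd helper are all hypothetical. The notebook relies on CNTK's own learner rather than hand-written code like this.

```python
import math
import random


def sigmoid(z):
    """Squash any real number into the interval (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd(features, labels, learning_rate=0.5, epochs=200, seed=0):
    """Fit weights and a bias by updating them after every single row."""
    rng = random.Random(seed)
    # Begin from small random weights, as SGD usually does.
    weights = [rng.uniform(-0.1, 0.1) for _ in features[0]]
    bias = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the rows in a fresh random order
        for i in order:
            x, y = features[i], labels[i]
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss w.r.t. the weighted sum
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical, already-scaled rows: [age, tumor size] with 1 = malignant.
features = [[0.1, 0.1], [0.2, 0.3], [0.3, 0.2],
            [0.7, 0.8], [0.8, 0.6], [0.9, 0.9]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_sgd(features, labels)
predictions = [
    1 if sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5 else 0
    for x in features
]
```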
The final step is to evaluate the predictive model’s performance against test data. There are essentially four outcomes for this binary classifier:
In this scenario, with malignant treated as the positive class, outcome No. 3 would be a false positive. The tumor is benign, but the algorithm flagged it incorrectly as malignant. This is also known as a Type I Error. Outcome No. 4 represents the reverse, a false negative. The tumor is malignant, but the algorithm marked it as benign, a Type II Error. For more information about Type I and Type II Errors, refer to the Wikipedia article on erroneous outcomes of statistical tests at bit.ly/2DccUU8.
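Tallying the four outcomes amounts to comparing each true label with the predicted one. The sketch below uses made-up labels and a hypothetical tally_outcomes helper, with descriptive key names rather than any library's terminology.

```python
def tally_outcomes(actual, predicted):
    """Count the four possible outcomes of a benign/malignant classifier."""
    counts = {
        "malignant_correct": 0,
        "benign_correct": 0,
        "benign_flagged_malignant": 0,
        "malignant_flagged_benign": 0,
    }
    for truth, guess in zip(actual, predicted):
        if truth == guess:
            counts[truth + "_correct"] += 1
        elif truth == "benign":
            counts["benign_flagged_malignant"] += 1
        else:
            counts["malignant_flagged_benign"] += 1
    return counts


# Hypothetical test labels and model predictions.
actual = ["malignant", "malignant", "benign", "benign", "malignant"]
predicted = ["malignant", "benign", "benign", "malignant", "malignant"]

counts = tally_outcomes(actual, predicted)
for outcome, count in counts.items():
    print(outcome, count)
```

In a medical setting the two kinds of mistakes are rarely equally costly, so it helps to count them separately rather than report a single accuracy figure.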
Using the plot in Figure 5, you can determine that the algorithm labeled three malignant tumors as benign, while not mislabeling any benign tumors as malignant.
Figure 5 Three Improperly Flagged Tumors
The best way to quickly get started with exploring the various frameworks of data science, machine learning and artificial intelligence is through the DSVM image on Azure. It requires no installation or configuration, and it can be scaled up or down based on the problem to be solved. It’s a great way for people interested in machine learning to start experimenting right away. Best of all, the DSVM includes numerous Jupyter Notebook tutorials on the most popular machine learning frameworks.
In this article, I set up a DSVM and took the first steps into exploring the CNTK. By creating a binary classifier with logistic regression and Stochastic Gradient Descent, I trained an algorithm to determine whether a tumor was benign or malignant. While this model relied on only two dimensions and may not be ready for production, it does show how this problem would be tackled in the real world.
Frank La Vigne leads the Data & Analytics practice at Wintellect and co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).
Thanks to the following technical expert for reviewing this article: Andy Leonard