Data Science In The Cloud
By: Gordon Fleetwood – ODSC data science team contributor
The comfort of working on one’s own machine cannot be overstated, but it does limit your capacity to perform computationally intensive analysis. For example, a common rule of thumb is that 25% of a computer’s memory is the upper limit for comfortably working with a given data set. For a machine with 8 gigabytes of RAM, this is 2 gigabytes. In a world where more and more information is being stored, 2 gigabytes is a single drop in the data ocean. At some point, any Data Scientist will seek out an environment with much more room for carrying out analyses. More often than not, this means Data Science in the cloud. Here I’ll introduce some of your options, so you can pick your favorite and get trained at one of our Data Science Workshops at ODSC East on May 20-22.
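The 25% rule of thumb above is simple arithmetic, and a short sketch makes it easy to try on other machine sizes. The function name and the RAM sizes other than 8 gigabytes are illustrative assumptions, not figures from any particular provider.

```python
def comfortable_dataset_gb(total_ram_gb, fraction=0.25):
    """Return the data set size (in GB) that the rule of thumb treats as comfortable."""
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    return total_ram_gb * fraction


# 8 GB is the laptop from the text; the larger sizes are hypothetical cloud machines.
for ram_gb in (8, 16, 64, 256):
    limit_gb = comfortable_dataset_gb(ram_gb)
    print(f"{ram_gb:>4} GB RAM -> about {limit_gb:g} GB of data")
```

The rule is only a heuristic, so the fraction is left as a parameter that can be tightened or relaxed for a given workload.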
The Data Science Toolbox
This project by Data Scientist Jeroen Janssens leverages Ubuntu Linux to provide a virtual environment for Data Science. It can be run locally via Vagrant and VirtualBox, but hits its stride when paired with Amazon Web Services (AWS). The initial installation is lean, with only a few core packages (pandas, matplotlib, dplyr, and ggplot2, for example) accompanying Python and R, but it is easy to add more through either language’s package manager. Moreover, the dst command line tool makes it easy to add custom code bundles, such as those accompanying a book. The only visible downside is the lack of a GUI for R, although instructions are provided for setting up an IPython notebook.
Microsoft Azure Machine Learning
The name may lead one to think that Microsoft’s offering is geared towards one particular area of Data Science, but its functionality covers all the bases. ML Studio provides a drag-and-drop graphical interface for every step of the analysis process, from importing data, preprocessing, and data visualization all the way to building models. Once built, these models can be deployed on Azure and accessed through an API. R and Python are supported, along with the now ubiquitous Jupyter notebook.
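To give a feel for what accessing a deployed model through an API can look like, here is a minimal sketch that builds an HTTP scoring request with Python’s standard library. The URL, API key, feature names, and JSON payload shape are all placeholders invented for illustration. The real request schema is defined by the deployed service, so treat this as a pattern and not as Azure’s actual interface.

```python
import json
import urllib.request


def build_scoring_request(endpoint_url, api_key, rows):
    """Build (but do not send) a POST request for a hypothetical scoring endpoint."""
    # The {"rows": [...]} payload shape is an assumption made for this sketch.
    body = json.dumps({"rows": rows}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


request = build_scoring_request(
    "https://example.invalid/score",  # placeholder, not a real service URL
    "YOUR_API_KEY",                   # placeholder credential
    [{"age": 42, "income": 55000}],   # hypothetical feature names
)
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` and decoding the JSON response. That step is left out here because the endpoint is fictional.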
IBM Bluemix
IBM’s Watson made a big splash a couple of years ago with its success on Jeopardy. Since then the company has moved the AI onto tasks such as tackling healthcare-related problems and serving as a vital backend component of its own offering for Data Science in the cloud. Using Bluemix, a user can take advantage of a range of Machine Learning APIs, such as tone and sentiment analysis as well as visual recognition. These can be integrated into data-driven apps, which can also be hosted on the platform. Beyond Watson, IBM’s Data Science offering is broad, with a number of database options and the ability to use Apache Hadoop, Apache Spark, and Geospatial Analytics.
IBM seems to have put all its eggs in the basket of Spark’s growing Machine Learning toolbox when it comes to building models. Given this, the accompanying support for Scala is unsurprising, as is the ability to use Python, given the integration of the Jupyter notebook into the platform. R can reportedly be used as well, but the process seems to be convoluted.
Amazon Web Services
Amazon Web Services is by far the most popular cloud computing option for developers. As noted above with The Data Science Toolbox, Data Scientists can use EC2 instances to set up environments. However, Amazon is looking to diversify and has built separate platforms just for analytics. At the moment most of this toolbox leans towards Data Engineering, with support for Elasticsearch, Hadoop, and real-time data streaming. Amazon Machine Learning is the exception. With this tool users can upload data; create, train, and evaluate models; and deploy those models for real-time or batch predictions. An interesting aspect of this workflow is that statistical tests are run on a data set once it is uploaded. The results are used to suggest how the user should clean and arrange the data before passing it into a model.
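To illustrate the kind of per-column summary such a service might compute when data is uploaded, here is a small sketch using only Python’s standard library. It is not Amazon’s implementation. The function name, the chosen statistics, and the sample data are assumptions made for illustration.

```python
import statistics


def profile_column(values):
    """Summarize one column: missing count, plus basic stats if all values are numeric."""
    # Treat None and empty strings as missing values.
    present = [v for v in values if v is not None and v != ""]
    summary = {"count": len(values), "missing": len(values) - len(present)}

    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            break  # at least one non-numeric value: treat the column as categorical

    if present and len(numeric) == len(present):
        summary.update(
            kind="numeric",
            min=min(numeric),
            max=max(numeric),
            mean=statistics.mean(numeric),
            median=statistics.median(numeric),
        )
    else:
        summary.update(kind="categorical", distinct=len(set(present)))
    return summary


# Hypothetical uploaded data, one list per column.
columns = {
    "age": [34, 51, None, 28],
    "plan": ["basic", "pro", "basic", ""],
}
profile = {name: profile_column(values) for name, values in columns.items()}
for name, summary in profile.items():
    print(name, summary)
```

A summary like this is enough to flag columns with many missing values or unexpected types, which is the sort of cleaning guidance the text describes.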
In addition to machine learning, Amazon also has AWS IoT, a framework to support the multitude of devices in the growing Internet of Things industry.
The options for carrying out Data Science in an online environment are wide and varied, and go beyond those presented here. When it comes to getting the power and performance one needs to carry out analyses at scale, the only real limit is the size of your wallet.