Companies Impacting Open-Source Data Science


By: Gordon Fleetwood – ODSC data science team contributor

All of these influential companies are represented at ODSC East. Over one hundred renowned Data Scientists will speak about their areas of expertise at our Open Data Science Conference.


The philosophy of open source is extremely important for progress in technology and Data Science. ODSC’s next conference, ODSC East, will be a hub for Data Science experts and practitioners from companies that have made significant open-source contributions. Here are a few of these projects.

Google: TensorFlow

Google’s announcement that it was open-sourcing TensorFlow, its machine learning library built around data flow graphs, sent shock waves throughout the Machine Learning community. Who wouldn’t want to get their hands on the system that forms the basis of Google’s own work in the field? Since its release, a raft of tutorials and libraries have been built on top of TensorFlow, not to mention the Udacity class Google built to serve as an introduction to both the library and deep learning.
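
To make the graph idea concrete, here is a minimal sketch in plain Python. It is not TensorFlow's API; the `Node` class and the `placeholder`, `constant`, `add`, and `multiply` helpers are hypothetical names invented for illustration. It only shows the general pattern of describing a computation as a graph first and feeding values in afterwards.

```python
# Conceptual sketch only: this is NOT TensorFlow's API.
# All names here (Node, placeholder, constant, add, multiply) are hypothetical.


class Node:
    """A node in a data flow graph: an operation plus the nodes that feed it."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = tuple(inputs)

    def evaluate(self, feed):
        # A value supplied by the caller takes precedence over computing it.
        if self in feed:
            return feed[self]
        # Otherwise evaluate the upstream nodes, then apply this node's operation.
        args = [node.evaluate(feed) for node in self.inputs]
        return self.op(*args)


def placeholder():
    """A node with no value of its own; one must be fed at evaluation time."""
    def missing():
        raise ValueError("no value was fed for this placeholder")
    return Node(missing)


def constant(value):
    return Node(lambda: value)


def add(a, b):
    return Node(lambda x, y: x + y, [a, b])


def multiply(a, b):
    return Node(lambda x, y: x * y, [a, b])


# Build the graph for y = w * x + b. Nothing is computed at this point.
x = placeholder()
w = constant(3.0)
b = constant(1.0)
y = add(multiply(w, x), b)

# Only now does data flow through the graph.
result = y.evaluate({x: 2.0})
print(result)
```

Separating the description of the computation from its execution is what lets a real graph-based library analyze, optimize, or distribute the work before running it; this toy version simply walks the graph recursively.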


Microsoft: CNTK

The Computational Network Toolkit (CNTK) is Microsoft’s package for deep learning based on the use of directed graphs. One of its main draws is its speed compared to other popular packages like Theano, Caffe, and TensorFlow (as reported by Microsoft itself).

Continuum Analytics: Anaconda, Blaze Ecosystem

Python has notorious packaging issues across different libraries, and there are many chilling stories online of hours spent overcoming them. It’s no wonder that Continuum Analytics’ Anaconda platform has risen to become the go-to environment, as it neatly dispels this nightmarish situation. In one distribution, users have at their disposal many of the packages used in Data Science and the ability to easily add to this environment. Another key component of Anaconda is Continuum’s own conda tool. It serves the dual role of package manager and virtual environment manager, making it the single equivalent of Python’s pip and virtualenv tools. Listing the company’s complete collection of open-source contributions would take a while, but the Blaze ecosystem needs to be mentioned as well. Blaze and its sister components make dealing with medium-sized data easier without having to resort to the sledgehammer of big data platforms.

RStudio: RStudio, Shiny

It’s hard not to consider the R language and the RStudio platform as conjoined twins. Though R has its own IDE, RStudio has all its capabilities and much more, making it the de facto environment for R programmers. The company behind RStudio is also called RStudio, and is dedicated to creating open-source software and libraries to add to the language’s constantly growing ecosystem. These libraries are often the brainchildren of RStudio’s Chief Scientist, Hadley Wickham, and names like dplyr, RMarkdown, and ggplot2 are mainstays in the R world. No discussion of RStudio’s work would be complete without mentioning Shiny, which makes building data-centric web applications in R easier than ever.

Quantopian: Zipline, Pyfolio

Quantopian is a dream for those interested in playing with financial data for fun and for profit. The company provides a playground where one has access to fourteen years of financial data to test out any algorithmic trading strategy one might devise. On top of this, the team also has a number of libraries available for public use. Their two best-known open-source projects are both Python libraries: Zipline for algorithmic trading and Pyfolio for analysis of financial portfolios.


CartoDB: CartoDB

In many ways, geospatial data analysis is a field unto itself and is not for the faint of heart. CartoDB’s eponymous platform provides a smooth experience that is open to both technical and non-technical users. One can transition between pointing and clicking and writing custom code in the blink of an eye while working to create the perfect map or visualization. On top of its wonderful features, all the code is on GitHub for anyone’s viewing pleasure.

Other notable projects include Dato’s GraphLab machine learning library and RapidMiner’s predictive analytics platform, RapidMiner Studio. The use of many of these tools will be outlined at ODSC East. Don’t miss out on this great opportunity.