All the Best Parts of Pandas for Data Science
Pandas has been hailed by many in the data science community as the missing link between Python and analysis, a tool that can dramatically reduce overhead in data science projects, improve understandability, and speed up workflows. Pandas comes loaded with a wide range of... Read more
K-Means Clustering Applied to GIS Data
GIS can be intimidating to data scientists who haven’t tried it before, especially when it comes to analytics. On its face, mapmaking seems like a huge undertaking. Plus, esoteric lingo and strange data file encodings can create a significant barrier to entry for newbies. There’s a reason why there are experts who... Read more
TensorLayer for Developing Complex Deep Learning Systems
This article describes TensorLayer, a modular Python wrapper library for TensorFlow allowing data scientists to streamline the development of complex deep learning systems. TensorLayer was released in September 2016 with a GitHub repo. A descriptive research paper followed in August 2017: TensorLayer: A Versatile Library for Efficient Deep Learning... Read more
Monthly Summary of Selected Trends, Activities and Insights for R – August 2018
Data for the trends and activities summarized here were obtained from popular websites used by the R community, such as Google, GitHub, StackOverflow, RStudio, METACRAN and R-Bloggers. StackOverflow: Number of StackOverflow Questions tagged R: 4,565 (8% down from July). Number of Answers for R questions: 4,630 (3% up from... Read more
Understanding the Hoeffding Inequality
If you read my last post on mathematically defining machine learning problems, then you’ll be familiar with the terminology here. Otherwise, I recommend you read that and then circle back here. The Hoeffding Bound is one of the most important results in machine learning theory, so you’d do well... Read more
Snakes in a Package: Combining Python and R with Reticulate
When I first started working as a data scientist (or something like it) I was told to program in C++ and Java. Then R came along and it was liberating; my ability to do data analysis increased substantially. As my applications grew in size and complexity, I started to... Read more
Three Popular Clustering Methods and When to Use Each
In the mad rush to find new ways of teasing apart labeled data, we often forget about everything we can do with unsupervised learning. Unsupervised machine learning can be very powerful in its own right, and clustering is by far the most common expression of this group of problems.... Read more
Gradient Boosting and XGBoost
In this article, I provide an overview of the statistical learning technique called gradient boosting, and also the popular XGBoost implementation, the darling of Kaggle challenge competitors. In general, gradient boosting is a supervised machine learning method for classification as well as regression problems. The overarching strategy involves producing... Read more
Machine Learning with H2O – Part 1
Big datasets pose computational problems for software such as R and Python, and even basic machine learning algorithms can seem as if they will run forever. Most of the time it is difficult even to determine how long these algorithms would take to run. Enter H2O,... Read more
Switching Between MySQL, PostgreSQL, and SQLite
How many times has one switched from Python to Java, resulting in constant backspaces to correct missing semicolons and other syntax idiosyncrasies to appease stubborn compilers? As with any language, SQL implementations also have their own quirks and tricks that can lead to irritating troubleshooting when syntax differences lead... Read more
Open Data Science - Your News Source for AI, Machine Learning & more