Monthly Summary of Selected Trends, Activities and Insights for R – August 2018
Data for the trends and activities summarized here were obtained from popular websites used by the R community, such as Google, GitHub, StackOverflow, RStudio, METACRAN and R-Bloggers. StackOverflow. Number of StackOverflow questions tagged R: 4,565 (8% down from July). Number of answers for R questions: 4,630 (3% up from... Read more
Understanding the Hoeffding Inequality
If you read my last post on mathematically defining machine learning problems, then you’ll be familiar with the terminology here. Otherwise, I recommend you read that and then circle back here. The Hoeffding Bound is one of the most important results in machine learning theory, so you’d do well... Read more
Snakes in a Package: Combining Python and R with Reticulate
When I first started working as a data scientist (or something like it) I was told to program in C++ and Java. Then R came along and it was liberating; my ability to do data analysis increased substantially. As my applications grew in size and complexity, I started to... Read more
Three Popular Clustering Methods and When to Use Each
In the mad rush to find new ways of teasing apart labeled data, we often forget about everything we can do with unsupervised learning. Unsupervised machine learning can be very powerful in its own right, and clustering is by far the most common expression of this group of problems.... Read more
Gradient Boosting and XGBoost
In this article, I provide an overview of the statistical learning technique called gradient boosting, and also the popular XGBoost implementation, the darling of Kaggle challenge competitors. In general, gradient boosting is a supervised machine learning method for classification as well as regression problems. The overarching strategy involves producing... Read more
Machine Learning with H2O – Part 1
Big datasets pose computational problems for software such as R and Python, and even basic machine learning algorithms can seem like they will run forever. Most of the time it is difficult even to determine how long these algorithms would take to run. Enter H2O,... Read more
Switching Between MySQL, PostgreSQL, and SQLite
How many times has one switched from Python to Java, resulting in constant backspaces to correct missing semicolons and other syntax idiosyncrasies to appease stubborn compilers? As with any language, SQL implementations also have their own quirks and tricks that can lead to irritating troubleshooting when syntax differences lead... Read more
Generating Gender-Neutral Face Images with Semi-Adversarial Neural Networks to Enhance Privacy
I thought that it would be nice to have short and concise summaries of recent projects handy, to share them with a more general audience, including colleagues and students. So, I challenged myself to use fewer than 1000 words without getting distracted by the nitty-gritty details and technical jargon.... Read more
SQL Equivalents in R
Whenever I’m teaching introductory courses in data science using the R language, I often encounter students who use a different language like Python or Julia, and still others who are transitioning into data science from other fields and don’t know any data science language at all. The common thread... Read more
Using TensorFlow Object Detection to do Pixel Wise Classification
In the past, I have used the TensorFlow Object Detection API to implement object detection, with the output being bounding boxes around different objects of interest in the image. For more, please look at my article. TensorFlow recently added new functionality and now we can extend the API to determine pixel... Read more