Setting Your Hypothesis Test Up For Success
I want to go deep with you on exactly how I work with stakeholders ahead of launching a hypothesis test. This step is crucial to make sure that once a test is done running, we’ll actually be able to analyze it. This includes: A well-defined hypothesis...
Organizing Your Next Data Science Project to Minimize Headaches
Call it the data scientist’s curse, but every practitioner has had a data science project that became unmanageable at some point because of poor organizational choices early on. We’ve all been at our desks at 2 a.m. changing values and re-running our scripts for the 80th...
Three Popular Clustering Methods and When to Use Each
In the mad rush to find new ways of teasing apart labeled data, we often forget about everything we can do with unsupervised learning. Unsupervised machine learning can be very powerful in its own right, and clustering is by far the most common expression of this...
Performance of ranged accesses into arrays: modulo, multiply-shift and masks
Suppose that you wish to access values in an array of size n, but instead of having indexes in [0,n), you have arbitrary non-negative integers. This sort of problem arises when you build a hash table or other array-backed data structure. The naive approach to this...
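The three techniques named in the title above can be sketched as follows. This is a minimal illustration, not the post's own code: the function names are hypothetical, the multiply-shift version assumes the input fits in 32 bits, and the mask version assumes n is a power of two.

```python
def reduce_modulo(x: int, n: int) -> int:
    """Map a non-negative integer into [0, n) with the remainder operator."""
    return x % n


def reduce_multiply_shift(x: int, n: int) -> int:
    """Map a 32-bit value into [0, n) with one multiplication and one shift.

    Assumption: 0 <= x < 2**32 and 0 < n <= 2**32. The product x * n is
    then below n * 2**32, so shifting right by 32 bits lands in [0, n).
    """
    if not 0 <= x < 2**32:
        raise ValueError("x must fit in 32 bits")
    return (x * n) >> 32


def reduce_mask(x: int, n: int) -> int:
    """Map a non-negative integer into [0, n) by keeping the low bits.

    Assumption: n is a power of two, so n - 1 is a mask of low bits.
    """
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    return x & (n - 1)
```

Note that the three functions generally send the same input to different slots: modulo and masking keep the low-order information of x, while multiply-shift depends on its high-order bits. All three only guarantee that the result is a valid index.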
Modern processors use many tricks to go faster. They are superscalar, which means that they can execute many instructions at once. They are multicore, which means that each CPU is made of several baby processors that are partially independent. And they are vectorized, which means that...
Are Vectorized Random Number Generators Actually Useful?
Our processors benefit from “SIMD” instructions. These instructions can operate on several values at once, thus greatly accelerating some algorithms. Earlier, I reported that you can multiply the speed of common (fast) random number generators such as PCG and xorshift128+ by a factor of three or...
Convert Pandas Categorical Data for SciKit-Learn
As you work with various data elements, you will come across categorical data. Some people simply discard this data in their analysis or leave it out of their models. That is certainly an option; however, many times the categorical data represents information that we would typically want to...
Training with PyTorch on Amazon SageMaker
PyTorch is a flexible open source framework for Deep Learning experimentation. In this post, you will learn how to train PyTorch jobs on Amazon SageMaker. I’ll show you how to: build a custom Docker container for CPU and GPU training, pass parameters to a PyTorch script, and save the trained model. As usual, you’ll...
Roaring Bitmaps in JavaScript
Roaring bitmaps are a popular data structure to represent sets of integers. Given such sets, you can quickly compute unions, intersections, and so forth. It is a convenient tool when doing data processing. I used to joke that Roaring bitmaps had been implemented in every language (Java,...
Dask Development Log
This work is supported by Anaconda Inc. To increase transparency, I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected. Current...