Modern processors use many tricks to go faster. They are superscalar, which means that they can execute many instructions at once. They are multicore, which means that each CPU is made of several baby processors that are partially independent. And they are vectorized, which means that they have instructions...
Are Vectorized Random Number Generators Actually Useful?
Our processors benefit from “SIMD” instructions. These instructions can operate on several values at once, thus greatly accelerating some algorithms. Earlier, I reported that you can multiply the speed of common (fast) random number generators such as PCG and xorshift128+ by a factor of three or four by vectorizing...
Convert Pandas Categorical Data for SciKit-Learn
As you work with various data elements, you will come across categorical data. Some individuals simply discard this data in their analysis or do not bring it into their models. That is certainly an option; however, many times the categorical data represents information that we would typically want to bring in to...
Training with PyTorch on Amazon SageMaker
PyTorch is a flexible open source framework for Deep Learning experimentation. In this post, you will learn how to train PyTorch jobs on Amazon SageMaker. I’ll show you how to build a custom Docker container for CPU and GPU training, pass parameters to a PyTorch script, and save the trained model. As usual, you’ll find my code...
Roaring Bitmaps in JavaScript
Roaring bitmaps are a popular data structure to represent sets of integers. Given such sets, you can quickly compute unions, intersections, and so forth. It is a convenient tool when doing data processing. I used to joke that Roaring bitmaps had been implemented in every language (Java, C, Rust, Go,...
Dask Development Log
This work is supported by Anaconda Inc. To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected. Current efforts for June...
Git Tip: Apply a Patch
I learned a new thing this weekend: we apparently can apply a patch onto a branch/fork using git apply. There are a few things to unpack here. First off, what’s a patchfile? The long story cut short is that a patchfile is nothing more than a plain text file that contains all information...
Survey Analysis in SQL and R
Charco Hui, as his Honours project in Statistics, has been writing a package for complex-survey analysis using dplyr and dbplyr. It’s here. At the moment it has only been tested with MonetDB, using the GitHub version (0.5.2) of MonetDBlite, but it should work with many other databases (not SQLite, at the moment). I hope...
Dask Scaling Limits
This work is supported by Anaconda Inc. History: For the first year of Dask’s life, it focused exclusively on single node parallelism. We felt then that efficiently supporting 100+GB datasets on personal laptops or 1TB datasets on large workstations was a sweet spot for productivity, especially when avoiding the pain...
IOTA – The Potential to Drive Data Science for IoT
I have a close circle of clued-on/tech savvy friends whose views I take seriously. For the last few weeks, one of these friends has been sending me emails extolling the merits of something called IOTA – which calls itself the next-generation Blockchain. At first, I thought of IOTA...
Open Data Science - Your News Source for AI, Machine Learning & more