Modern processors use many tricks to go faster. They are superscalar which means that they can execute many instructions at once. They are multicore, which means that each CPU is made of several baby processors that are partially independent. And they are vectorized, which means that they have instructions... Read more
Quoting and Macros in R
Miles McBain has a nice post about quoting in R and the tidyeval procedure. In it, there’s this footnote In truth there are other types of calls, and the ones Lisp nuts really bang on about are macro calls In this post I want to talk about the similarities... Read more
pyjanitor 0.3 Released!
A new release of pyjanitor is out! Two new features that I have added in include: Concatenating column names into a single column, such that each item is separated by a delimiter. Deconcatenating a column into multiple columns, separating on the basis of a delimiter. Both of these tasks come up... Read more
Testing Probability Distribution Generators
In the ‘regression tests’ that are part of any change to the base-R source code, there’s a file called p-r-random-tests.R. People notice it from time to time because the tests sometimes fail. That’s what is supposed to happen. Testing random number generators is hard, because it’s hard to specify what... Read more
Exploratory Data Analysis in R
Hi there! tl;dr: Exploratory data analysis (EDA) the very first step in a data project. We will create a code-template to achieve this with one function. Introduction EDA consists of univariate (1-variable) and bivariate (2-variables) analysis. In this post we will review some functions that lead us to the... Read more
Detecting Outliers
In this context, outliers are data observations that are distant from other observations. There are a number of reasons why variability may exist in the data that you are working on during your analysis. Outliers may cause serious problems in your efforts as a Data Scientist. title author date Detecting Outliers... Read more
Are Vectorized Random Number Generators Actually Useful?
Our processors benefit from “SIMD” instructions. These instructions can operate on several values at once, thus greatly accelerating some algorithms. Earlier, I reported that you can multiply the speed of common (fast) random number generators such as PCG and xorshift128+ by a factor of three or four by vectorizing... Read more
Pickle Isn’t Slow, It’s a Protocol
This work is supported by Anaconda Inc tl;dr: Pickle isn’t slow, it’s a protocol. Protocols are important for ecosystems. A recent Dask issue showed that using Dask with PyTorch was slow because sending PyTorch models between Dask workers took a long time (Dask GitHub issue). This turned out to be because... Read more
Convert Pandas Categorical Data for SciKit-Learn
As you encounter various data elements you should come across categorical data. Some individuals simply discard this data in their analysis or do not bring it into their models. That is certainly an option, however many times the categorical data represents information that we would typically want to bring in to... Read more
Training with PyTorch on Amazon SageMaker
PyTorch is a flexible open source framework for Deep Learning experimentation. In this post, you will learn how to train PyTorch jobs on Amazon SageMaker. I’ll show you how to: build a custom Docker container for CPU and GPU training, pass parameters to a PyTorch script, save the trained model. As usual, you’ll find my code... Read more
Open Data Science - Your News Source for AI, Machine Learning & more