Snakes in a Package: Combining Python and R with Reticulate
When I first started working as a data scientist (or something like it) I was told to program in C++ and Java. Then R came along and it was liberating; my ability to do data analysis increased substantially. As my applications grew in size and complexity,... Read more
Machine Learning with H2O
Big datasets pose computational problems for software such as R and Python, and even basic machine learning algorithms can seem like they will run forever. Most of the time it is difficult even to determine how long it would take to run... Read more
Building SAGA optimization for Dask Arrays
This work is supported by ETH Zurich, Anaconda Inc, and the Berkeley Institute for Data Science. At a recent Scikit-learn/Scikit-image/Dask sprint at BIDS, Fabian Pedregosa (a machine learning researcher and Scikit-learn developer) and Matthew Rocklin (Dask core developer) sat down together to develop an implementation of the incremental optimization algorithm SAGA on parallel... Read more
pyjanitor 0.3 Released!
A new release of pyjanitor is out! The two new features I have added are: concatenating column names into a single column, such that each item is separated by a delimiter; and deconcatenating a column into multiple columns, separating on the basis of a delimiter. Both of these... Read more
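The two operations described in this teaser can be sketched in plain pandas. This is a rough illustration, not pyjanitor's own API: it reads "concatenating" as joining the values of several columns row by row, and the helper names `concatenate` and `deconcatenate`, the column names, and the `-` delimiter are all assumptions made up for the example.

```python
import pandas as pd


def concatenate(df, columns, new_column, sep="-"):
    """Join the values of `columns` row by row into one delimited string column."""
    out = df.copy()
    out[new_column] = out[columns].astype(str).agg(sep.join, axis=1)
    return out


def deconcatenate(df, column, new_columns, sep="-"):
    """Split a delimited string column into several new columns."""
    out = df.copy()
    parts = out[column].str.split(sep, expand=True)
    parts.columns = new_columns
    return pd.concat([out, parts], axis=1)


# Hypothetical data: join year and month, then split them apart again.
df = pd.DataFrame({"year": [2018, 2018], "month": [5, 6]})
joined = concatenate(df, ["year", "month"], "period")
split = deconcatenate(joined[["period"]], "period", ["y", "m"])
```

Note that the split columns come back as strings, so a round trip would need an explicit cast to recover the original integer types.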
Detecting Outliers
In this context, outliers are data observations that are distant from other observations. There are a number of reasons why variability may exist in the data that you are working on during your analysis. Outliers may cause serious problems in your efforts as a data scientist... Read more
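One common way to make "distant from other observations" concrete is the interquartile range (IQR) rule. The sketch below is an assumption about a reasonable approach, not necessarily the method the post uses; the function name `iqr_outliers`, the multiplier `k=1.5`, and the sample data are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, and third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample: one observation sits far from the rest.
data = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(data)
```

The multiplier `k` controls how strict the rule is: larger values flag fewer points.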
Pickle Isn’t Slow, It’s a Protocol
This work is supported by Anaconda Inc. tl;dr: Pickle isn’t slow, it’s a protocol. Protocols are important for ecosystems. A recent Dask issue showed that using Dask with PyTorch was slow because sending PyTorch models between Dask workers took a long time (Dask GitHub issue). This turned out... Read more
Dask Release 0.18.0
This work is supported by Anaconda Inc. I’m pleased to announce the release of Dask version 0.18.0. This is a major release with breaking changes and new features. The last release was 0.17.5 on May 4th. This blogpost outlines notable changes since the last release blogpost for... Read more
Category Encoders V1.2.8 Release
It’s been a while since a release, but category encoders has continued to advance with the help of lots of great contributors. I’ve just released v1.2.8, primarily with bug fixes, as well as some new features allowing a user to optionally add the category names in the output... Read more
Beyond Numpy Arrays in Python: Preparing the ecosystem for GPU, distributed, and sparse arrays
Executive Summary: In recent years Python’s array computing ecosystem has grown organically to support GPU, sparse, and distributed arrays. This is wonderful and a great example of the growth that can occur in decentralized open source development. However, to solidify this growth and apply it across... Read more
Intelligently Assisted Form Fields with Henosis
Filling Out Forms Isn’t Fun. Online forms are the worst. Filling out these often long, sometimes multi-page forms can be time-consuming and laborious. Almost any other task is more enjoyable, even with the occasional prize drawing or other form of incentive. While large forms can... Read more