pyjanitor 0.3 Released!
A new release of pyjanitor is out! I have added two new features: (1) concatenating several columns into a single column, with each item separated by a delimiter, and (2) deconcatenating a column into multiple columns, splitting on a delimiter. Both of these tasks come up... Read more
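As a rough illustration of the two operations described in this teaser, here is a minimal pure-Python sketch that works on rows stored as dictionaries. The function names, parameters, and sample data are hypothetical and chosen for this example only; they are not pyjanitor's actual API, which operates on pandas DataFrames.

```python
from typing import Dict, List

Row = Dict[str, str]


def concatenate_columns(rows: List[Row], column_names: List[str],
                        new_column_name: str, sep: str = "-") -> List[Row]:
    """Join the values of several columns into one delimited column."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[new_column_name] = sep.join(str(row[name]) for name in column_names)
        result.append(new_row)
    return result


def deconcatenate_column(rows: List[Row], column_name: str,
                         new_column_names: List[str], sep: str = "-") -> List[Row]:
    """Split one delimited column into several new columns."""
    result = []
    for row in rows:
        parts = str(row[column_name]).split(sep)
        if len(parts) != len(new_column_names):
            raise ValueError(
                f"expected {len(new_column_names)} parts, got {len(parts)}: {parts!r}"
            )
        new_row = dict(row)
        new_row.update(zip(new_column_names, parts))
        result.append(new_row)
    return result


# Hypothetical sample data, for illustration only.
rows = [{"city": "Boston", "state": "MA"}, {"city": "Denver", "state": "CO"}]
joined = concatenate_columns(rows, ["city", "state"], "location", sep=", ")
split = deconcatenate_column(joined, "location", ["city_2", "state_2"], sep=", ")
```

The second call reverses the first, so `split` carries the original values again under the new column names. Raising on a mismatched number of parts is one possible design choice; a real implementation might pad or truncate instead.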
Detecting Outliers
In this context, outliers are data observations that are distant from other observations. There are a number of reasons why variability may exist in the data you are analyzing. Outliers may cause serious problems in your efforts as a Data Scientist... Read more
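To make "distant from other observations" concrete, one common rule of thumb flags values that lie more than 1.5 interquartile ranges outside the middle half of the data. The sketch below uses that rule as an assumption; the teaser does not say which detection method the post itself uses, and the function name and sample values are made up for illustration.

```python
import statistics
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return the values lying more than k * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one value far from the rest.
data = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(data)
```

Here the quartiles are 11 and 13, so the fences sit at 8 and 16 and only the value 95 is flagged. The multiplier `k` is a tunable convention, not a law; larger values flag fewer points.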
Pickle Isn’t Slow, It’s a Protocol
This work is supported by Anaconda Inc. tl;dr: Pickle isn’t slow, it’s a protocol. Protocols are important for ecosystems. A recent Dask issue showed that using Dask with PyTorch was slow because sending PyTorch models between Dask workers took a long time (Dask GitHub issue). This turned out to be because... Read more
Dask Release 0.18.0
This work is supported by Anaconda Inc. I’m pleased to announce the release of Dask version 0.18.0. This is a major release with breaking changes and new features. The last release was 0.17.5 on May 4th. This blogpost outlines notable changes since the last release blogpost for 0.17.2 on March... Read more
Category Encoders V1.2.8 Release
Been a while since a release, but category encoders has continued to advance with the help of lots of great contributors. I’ve just released v1.2.8, with primarily bugfixes, as well as some new features allowing a user to optionally add the category names in the output column names of... Read more
Beyond Numpy Arrays in Python: Preparing the ecosystem for GPU, distributed, and sparse arrays
Executive Summary: In recent years Python’s array computing ecosystem has grown organically to support GPU, sparse, and distributed arrays. This is wonderful and a great example of the growth that can occur in decentralized open source development. However, to solidify this growth and apply it across the ecosystem we... Read more
Intelligently Assisted Form Fields with Henosis
Filling Out Forms Isn’t Fun: Online forms are the worst. Filling out these often-long, sometimes multi-page forms can be time-consuming and laborious. Almost any other task is more enjoyable, even with the occasional prize drawing or other form of incentive. While large forms can and often do... Read more
Predicting code bug risk with git metadata
One of the perks of working at Civis is the quarterly ‘Hack Time’. For one week each quarter, you get to explore an offbeat idea of your choice and then present the results to your colleagues. This past quarter I spent my time exploring some off-label uses for the... Read more
Dask Release 0.17.2
This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation. I’m pleased to announce the release of Dask version 0.17.2. This is a minor release with new features and stability improvements. This blogpost outlines notable changes since the 0.17.0 release on February 12th. You can... Read more
Not all data analysis tools are created equal. Recently, I started looking into data sets to compete in Go Code Colorado (check it out if you live in CO). The problem with such diversity in data sets is finding a way to quickly visualize the data and do exploratory analysis. While... Read more