All the Best Parts of Pandas for Data Science
Pandas has been hailed by many in the data science community as the missing link between Python and analysis, a tool that can dramatically reduce overhead in data science projects, improve understandability, and speed up workflows. Pandas comes loaded with a... Read more
TensorLayer for Developing Complex Deep Learning Systems
This article describes TensorLayer, a modular Python wrapper library for TensorFlow allowing data scientists to streamline the development of complex deep learning systems. TensorLayer was released in September 2016 with a GitHub repo. A descriptive research paper followed in August 2017: TensorLayer: A Versatile Library for... Read more
Snakes in a Package: Combining Python and R with Reticulate
When I first started working as a data scientist (or something like it) I was told to program in C++ and Java. Then R came along and it was liberating; my ability to do data analysis increased substantially. As my applications grew in size and complexity,... Read more
Machine Learning with H2O
Big datasets pose computational problems for software such as R and Python, and even basic machine learning algorithms can seem like they will run forever. Most of the time it is difficult to even determine how much time it would take to run... Read more
Building SAGA optimization for Dask Arrays
This work is supported by ETH Zurich, Anaconda Inc, and the Berkeley Institute for Data Science. At a recent Scikit-learn/Scikit-image/Dask sprint at BIDS, Fabian Pedregosa (a machine learning researcher and Scikit-learn developer) and Matthew Rocklin (Dask core developer) sat down together to develop an implementation of the incremental optimization algorithm SAGA on parallel... Read more
pyjanitor 0.3 Released!
A new release of pyjanitor is out! I have added two new features: concatenating column names into a single column, such that each item is separated by a delimiter; and deconcatenating a column into multiple columns, separating on the basis of a delimiter. Both of these... Read more
Detecting Outliers
In this context, outliers are data observations that are distant from other observations. There are a number of reasons why variability may exist in the data that you are working on during your analysis. Outliers may cause serious problems in your efforts as a data scientist... Read more
Pickle Isn’t Slow, It’s a Protocol
This work is supported by Anaconda Inc. tl;dr: Pickle isn’t slow, it’s a protocol. Protocols are important for ecosystems. A recent Dask issue showed that using Dask with PyTorch was slow because sending PyTorch models between Dask workers took a long time (Dask GitHub issue). This turned out... Read more
Dask Release 0.18.0
This work is supported by Anaconda Inc. I’m pleased to announce the release of Dask version 0.18.0. This is a major release with breaking changes and new features. The last release was 0.17.5 on May 4th. This blogpost outlines notable changes since the last release blogpost for... Read more
Category Encoders V1.2.8 Release
Been a while since a release, but category encoders has continued to advance with the help of lots of great contributors. I’ve just released v1.2.8, with primarily bugfixes, as well as some new features allowing a user to optionally add the category names in the output... Read more