Alexandru Agachi of Empiric Capital on “Handling Missing Data in Python/Pandas” at ODSC Europe 2018
Key Takeaways: It’s important to describe missing data and the challenges it poses. You need to clarify the confusing terminology that adds to the field’s complexity. You should take the time to review methods for handling missing data. You need to learn how to apply robust multiple imputation... Read more
Exploring Scikit-Learn Further: The Bells and Whistles of Preprocessing
In my previous post, we constructed a simple cross-validated regression model using Scikit-Learn in 35 lines. It’s pretty amazing that we can perform machine learning with so little effort, but we just did the bare minimum in order to get a working model. Frankly, it didn’t even perform that well.... Read more
The Beginner’s Guide to Scikit-Learn
Scikit-Learn is one of the premier tools in the machine learning community, used by academics and industry professionals alike. At ODSC West, Scikit-Learn author Andreas Mueller will host a training session to give beginners a crash course. As one of the primary contributors to Scikit-Learn, Mueller is one of... Read more
All the Best Parts of Pandas for Data Science
Pandas has been hailed by many in the data science community as the missing link between Python and analysis, a tool that can dramatically reduce overhead in data science projects, improve readability, and speed up workflows. Pandas comes loaded with a wide range... Read more
TensorLayer for Developing Complex Deep Learning Systems
This article describes TensorLayer, a modular Python wrapper library for TensorFlow allowing data scientists to streamline the development of complex deep learning systems. TensorLayer was released in September 2016 with a GitHub repo. A descriptive research paper followed in August 2017: TensorLayer: A Versatile Library for Efficient Deep Learning... Read more
Snakes in a Package: Combining Python and R with Reticulate
When I first started working as a data scientist (or something like it) I was told to program in C++ and Java. Then R came along and it was liberating; my ability to do data analysis increased substantially. As my applications grew in size and complexity, I started to... Read more
Machine Learning with H2O – Part 1
Big datasets pose computation problems for software such as R and Python, and even basic machine learning algorithms can seem as if they will run forever. Most of the time it is difficult even to estimate how long these algorithms will take to run. Enter H2O,... Read more
Building SAGA optimization for Dask Arrays
This work is supported by ETH Zurich, Anaconda Inc, and the Berkeley Institute for Data Science. At a recent Scikit-learn/Scikit-image/Dask sprint at BIDS, Fabian Pedregosa (a machine learning researcher and Scikit-learn developer) and Matthew Rocklin (Dask core developer) sat down together to develop an implementation of the incremental optimization algorithm SAGA on parallel Dask datasets. The... Read more
pyjanitor 0.3 Released!
A new release of pyjanitor is out! The two new features I have added are: concatenating column names into a single column, such that each item is separated by a delimiter; and deconcatenating a column into multiple columns, separating on the basis of a delimiter. Both of these tasks come up... Read more
Detecting Outliers
In this context, outliers are data observations that are distant from other observations. There are a number of reasons why variability may exist in the data that you are working on during your analysis. Outliers may cause serious problems in your efforts as a Data Scientist... Read more