On indexing operators and composition

In this article I discuss array indexing, operators, and composition in depth. If you work through it, you should end up with a deep understanding of array indexing and of the interpretation available once we realize that indexing is an instance of function composition (or an example of permutation groups or semigroups: some […]

2017 ODSC Data Science Award: scikit-learn

The ODSC Team was thrilled to present scikit-learn the Outstanding Data Science Project award, East, in Boston on May 5th.  Scikit-learn has been instrumental in making high quality machine learning algorithms more accessible to countless data scientists, students and practitioners.  As an active ongoing project it has made a tremendous contribution to the open source […]

Generalizing Abstract Arrays: opportunities and challenges

Introduction: generic algorithms with AbstractArrays. Somewhat unusually, this blog post is future-looking: it mostly focuses on things that don’t yet exist. Its purpose is to lay out the background for community discussion about possible changes to the core API for AbstractArrays, and to serve as background reading and reference material for a more focused “julep” (a […]

Scraping CRAN with rvest

I am one of the organizers for a session at useR! 2017 this coming July that will focus on discovering and learning about R packages. How do R users find packages that meet their needs? Can we make this process easier? As somebody who is relatively new to the R world compared to many, this […]

How to visualize decision trees in Python

The decision tree classifier is one of the most widely used supervised learning algorithms. Unlike other classification algorithms, a decision tree classifier is not a black box in the modeling phase. That means we can visualize the trained decision tree to understand how it will work for the given input features. So in this article, you […]

Drawing a map of distributed data systems

How we created an illustrated guide to help you find your way through the data landscape. Designing Data-Intensive Applications, the book I’ve been working on for four years, is finally finished, and should be available in your favorite bookstore in the next week or two. An incomplete beta (Early Release) edition has been available for […]

More notebooks for Think Stats

As I mentioned in the previous post, I am getting ready to teach Data Science in the spring, so I am going back through Think Stats and updating the Jupyter notebooks. I am done with Chapters 1 through 6 now. If you are reading the book, you can get the notebooks by cloning this […]

Do Resampling Estimates Have Low Correlation to the Truth?

The Answer May Shock You. One criticism that is often leveled against using resampling methods (such as cross-validation) to measure model performance is that there is no correlation between the CV results and the true error rate. Let’s look at this with some simulated data. While this assertion is often correct, there are a few […]

The Complexities of Governing Machine Learning

Today’s businesses run on data. It’s essential for any corporation to look for insights about their customers based on the data they collect. That collected information drives everything from business strategy to customer service. In order to retrieve insights from the massive amounts of data they collect, companies are turning to machine learning, and for […]