How to visualize decision trees in Python
The decision tree classifier is one of the most widely used supervised learning algorithms. Unlike many other classification algorithms, a decision tree classifier is not a black box in the modeling phase. That means we can visualize the trained decision tree to understand how it is going to work for the given input... Read more
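As a rough illustration of what inspecting a trained tree can look like, here is a minimal sketch. It assumes scikit-learn is installed and uses its bundled iris dataset with a text rendering of the tree; the dataset, the depth limit, and the choice of a text dump instead of a graphical plot are assumptions made for this example and are not taken from the linked post.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the iris dataset stands in for whatever data the post uses.
iris = load_iris()

# A shallow tree keeps the printed rules short enough to read.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the learned splits as indented if/else style rules.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the printed rules is one split on a feature threshold, so following a sample from the top down shows which tests it passes and which class the tree assigns to it.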
Generalizing Abstract Arrays: opportunities and challenges
Introduction: generic algorithms with AbstractArrays. Somewhat unusually, this blog post is future-looking: it mostly focuses on things that don’t yet exist. Its purpose is to lay out the background for community discussion about possible changes to the core API for AbstractArrays, and to serve as background reading and reference material... Read more
Drawing a map of distributed data systems
How we created an illustrated guide to help you find your way through the data landscape. Designing Data-Intensive Applications, the book I’ve been working on for four years, is finally finished, and should be available in your favorite bookstore in the next week or two. An incomplete beta (Early... Read more
More notebooks for Think Stats
As I mentioned in the previous post, I am getting ready to teach Data Science in the spring, so I am going back through Think Stats and updating the Jupyter notebooks. I am done with Chapters 1 through 6 now. If you are reading the book, you... Read more
The Complexities of Governing Machine Learning
Today’s businesses run on data. It’s essential for any corporation to look for insights about their customers based on the data they collect. That collected information drives everything from business strategy to customer service. In order to retrieve insights from the massive amounts of data they collect, companies are... Read more
Do Resampling Estimates Have Low Correlation to the Truth?
The Answer May Shock You. One criticism that is often leveled against using resampling methods (such as cross-validation) to measure model performance is that there is no correlation between the CV results and the true error rate. Let’s look at this with some simulated data. While this assertion is... Read more
Handwritten digits recognition using Tensorflow with Python
The progress in technology over the last 10 years is unbelievable. Every corner of the world is using the latest technologies to improve existing products while also conducting immense research into inventing products that make the world a better place to live. Some of these... Read more
You Must Allow Me To Tell You How Ardently I Admire and Love Natural Language Processing
It is a truth universally acknowledged that sentiment analysis is super fun, and Pride and Prejudice is probably my very favorite book in all of literature, so let’s do some Jane Austen natural language processing. Project Gutenberg makes e-texts available for many, many books, including Pride and Prejudice which... Read more
The future of Machine Learning lies in its (human) past
Superficially different in goals and approach, two recent algorithmic advances, Bayesian Program Learning and Galileo, are examples of one of the most interesting and powerful new trends in data analysis. It also happens to be the oldest one. Bayesian Program Learning (BPL) is deservedly one of the most discussed... Read more
Thomas originally posted this article here at http://twiecki.github.io. Hierarchical models are underappreciated. Hierarchies exist in many data sets and modeling them appropriately adds a boatload of statistical power (the common metric of statistical power). I provided an introduction to hierarchical models in a previous blog post: Best Of Both Worlds:... Read more