Automating Data Wrangling – The Next Machine Learning Frontier
Editor’s note: Be sure to check out Alex’s talk at ODSC West 2019 this November, “The Last Frontier of Machine Learning – Data Wrangling.” Up to 95% of a data scientist’s time is spent data wrangling. Conversely, about 99% of data scientists hate data wrangling. That’s problematic.... Read more
The Importance of PreProcessing Data the Right Way
There are so many different aspects of training a neural network that affect its performance. Many data scientists spend too much time thinking about learning rates, neuron structures, and epochs before actually using correctly optimized data. Without properly formatting data, your neural network will be useless,... Read more
Top Data Wrangling Skills Required for Data Scientists
Whatever you want to call it – data wrangling, data munging, or data transformation – the part of the data science process sitting between data acquisition and exploratory data analysis (EDA) is one of the core skills a data scientist must have. It includes a set... Read more
No Need for Deciphering – Learn How to Make Your Own Dataset Instead
Key Takeaways: By creating, capturing, and curating data, you can practice “data creationism” and be creative with data to make your own dataset. While Iris and Titanic are well-known datasets available to experiment with machine learning and data science, challenge yourself to create your own dataset.... Read more
Bayesian Estimation, Group Comparison, and Workflow
Over the past year, having learned about Bayesian inference methods, I finally see how estimation, group comparison, and model checking build upon each other into this really elegant framework for data analysis. Parameter Estimation The foundation of this is “estimating a parameter”. In a typical situation,... Read more
How Well Did Data Scientists Predict the 2018 World Cup? (Hint: Not Very)
This year’s World Cup in Russia was the most watched sporting event in history. GlobalWebIndex reports that up to 3.4 billion people – around half of the world’s population – watched some part of the tournament. As with past World Cups, a global prediction market emerged... Read more
The Best Mario Kart Character According to Data Science
Mario Kart was a staple of my childhood — my friends and I would spend hours after school as Mario, Luigi, and other characters from the Nintendo universe racing around cartoonish tracks and lobbing pixelated bananas at each other. One thing that always vexed our little group of... Read more
Starting a Data Science Project
I spoke in a webinar this past Saturday about how to get into data science. One of the questions asked was “What does a typical day look like?” I think there is a big opportunity to explain what really happens before any machine learning takes place... Read more
Predicting the Truncated xorshift32* Random Number Generator
Software programmers need random number generators. For this purpose, they often use functions with outputs that appear random. Gerstmann has a nice post about Better C++ Pseudo Random Number Generator. He investigates the following generator: uint32_t xorshift(uint64_t *m_seed) { uint64_t result = *m_seed * 0xd989bcacc137dcd5ull; *m_seed ^=... Read more
How Quickly Can You Compute the Dot Product Between Two Large Vectors?
A dot (or scalar) product is a fairly simple operation that sums the element-wise products of two vectors: float sum = 0; for (size_t i = 0; i < len; i++) { sum += x1[i] * x2[i]; } return sum; It is nevertheless tremendously important. You know these... Read more
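The loop in the excerpt above can be written out as a complete function. What follows is only a minimal sketch in C, assuming two `float` arrays of equal length; the function name `dot_product` and the pointer check are illustrative choices, not taken from the original post, and the sketch makes no attempt at the performance tuning the article's title alludes to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums the element-wise products of two float
 * arrays that are both assumed to hold at least `len` elements. */
float dot_product(const float *x1, const float *x2, size_t len) {
    /* Null pointers are tolerated only for empty input. */
    assert(len == 0 || (x1 != NULL && x2 != NULL));

    float sum = 0.0f;
    for (size_t i = 0; i < len; i++) {
        sum += x1[i] * x2[i];  /* accumulate the i-th product */
    }
    return sum;
}
```

For example, calling it on {1, 2, 3} and {4, 5, 6} with `len` set to 3 computes 1*4 + 2*5 + 3*6 = 32.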