How to Play Fantasy Sports Strategically (and Win)
Daily Fantasy Sports is a multibillion-dollar industry with millions of annual users. The Imperial College Business School’s Martin Haugh created a framework to best those users by modeling what they’ll do and constructing a team based on it. Haugh presented his research on how to play fantasy sports strategically... Read more
Thomas Wiecki of Quantopian on ‘Minding the Gap’ Between Statistics and Machine Learning at ODSC Europe 2018
Key Takeaways: It’s important for data scientists to understand the so-called “gap” between statistics and machine learning, and how there actually is a lot of commonality between the two; it’s just a matter of how you look at things. PyMC3 is a very useful probabilistic programming framework for Python.... Read more
Exploring the Central Limit Theorem in R
The Central Limit Theorem (CLT) is arguably the most important theorem in statistics. It’s certainly a concept that every data scientist should fully understand. In this article, we’ll go over some basic theory of the CLT, explain why it’s important for data scientists, and present some R code that... Read more
Mine Like Amazon with Market Basket Analysis
Pattern mining is an incredibly simple but powerful technique for discovering co-occurrences in large datasets. The most common approach to finding those patterns is Market Basket Analysis, which is frequently pointed out as the method Amazon leverages for their “users also purchased” feature. Of course, that’s a dramatic oversimplification.... Read more
Building a Microservice for Twitter Real-Time Data Collection and Sentiment Analysis
First of all, I would like to point out that the skill of building MVPs and microservices is extremely useful for a data scientist! When you can build a prototype and test it in a working environment, it just feels so much better and allows you to better understand... Read more
Intro to Ontologies
Statistical methods are inarguably the hottest approach to evaluating datasets at scale right now. They’re not without their weaknesses though – they’re ultimately heuristic, and some methods like neural networks require tremendous amounts of data to create a well-fitted model. That’s where semantics come in. If statistical methods attempt... Read more
Tips for Linear Regression Diagnostics
I like to call linear regression the data scientist’s “workhorse.” It may not be sexy, but it’s a tried and proven technique that can be very useful. When the problem you’re trying to solve requires the prediction of a numeric response variable using multiple continuous (numeric) and/or categorical predictors,... Read more
Joint, Conditional, and Marginal Probability Distributions
Joint probability, conditional probability, and marginal probability… These are three central terms when learning about probability, and they show up in Bayesian statistics as well. However… I never really could remember what they were, especially since we were usually taught them using formulas, rather than pictures. Well, for those... Read more
The Cold Start Problem
How do you operate a data-driven application before you have any data? This is known as the cold start problem. We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to design adaptive clinical trials, such as clinical trials... Read more
Distribution of Eigenvalues for Symmetric Gaussian Matrix
Symmetric Gaussian matrices: The previous post looked at the distribution of eigenvalues for very general random matrices. In this post we will look at the eigenvalues of matrices with more structure. Fill an n by n matrix A with values drawn from a standard normal distribution and let M be the average of A and its transpose, i.e. M = ½(A + Aᵀ). The eigenvalues... Read more
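The construction described in this excerpt can be sketched in a few lines of Python. This is only an illustrative sketch, not the post's own code: the function name `symmetric_gaussian_eigenvalues`, the matrix size, and the seed are assumptions, and NumPy is assumed to be available.

```python
import numpy as np


def symmetric_gaussian_eigenvalues(n, seed=0):
    """Eigenvalues of M = (A + A^T) / 2, where A is n x n standard normal.

    The name, default seed, and return format are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))   # fill A with standard normal draws
    M = 0.5 * (A + A.T)               # average of A and its transpose
    # M is symmetric, so eigvalsh applies; it returns real eigenvalues
    # sorted in ascending order.
    return np.linalg.eigvalsh(M)


eigs = symmetric_gaussian_eigenvalues(200)
print(eigs.min(), eigs.max())
```

Because M is symmetric by construction, all of its eigenvalues are real, which is why the symmetric solver `np.linalg.eigvalsh` is used here rather than the general `np.linalg.eigvals`.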