## Mine Like Amazon with Market Basket Analysis

Modeling | Statistics | market basket analysis | posted by Spencer Norris, ODSC | October 12, 2018

Pattern mining is an incredibly simple but powerful technique for discovering co-occurrences in large datasets. The most common approach to finding those patterns is Market Basket Analysis, which is frequently pointed out as the method Amazon leverages for their “users also purchased” feature. Of course, that’s a dramatic oversimplification...
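As a rough illustration of the co-occurrence idea in this teaser, the sketch below counts how often pairs of items appear together in a basket and reports each pair's support (the fraction of baskets containing both items). The transactions and the `pair_support` helper are hypothetical and are not taken from the article; real market basket analysis typically goes on to prune with algorithms such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical (alphabetical) key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: count / n for pair, count in counts.items()}

support = pair_support(baskets)
for pair, value in sorted(support.items()):
    print(pair, value)
```

Pairs with high support are the candidates for a "users also purchased" style suggestion; a fuller treatment would also look at confidence and lift.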

## Building a Microservice for Twitter Real-Time Data Collection and Sentiment Analysis.

Modeling | Statistics | posted by Alexander Osipenko | September 21, 2018

First of all, I would like to point out that the skill of building MVPs and microservices is extremely useful for a data scientist! When you can build a prototype and test it in a working environment it just feels so much better and allows you to better understand...

## Intro to Ontologies

Modeling | Statistics | posted by Spencer Norris, ODSC | September 10, 2018

Statistical methods are inarguably the hottest approach to evaluating datasets at scale right now. They’re not without their weaknesses though – they’re ultimately heuristic, and some methods like neural networks require tremendous amounts of data to create a well-fitted model. That’s where semantics come in. If statistical methods attempt...

## Tips for Linear Regression Diagnostics

Modeling | Statistics | posted by Daniel Gutierrez, ODSC | August 29, 2018

I like to call linear regression the data scientist’s “workhorse.” It may not be sexy, but it’s a tried and proven technique that can be very useful. When the problem you’re trying to solve requires the prediction of a numeric response variable using multiple continuous (numeric) and/or categorical predictors,...

## Joint, Conditional, and Marginal Probability Distributions

Modeling | Statistics | posted by Eric Ma | August 15, 2018

Joint probability, conditional probability, and marginal probability… These are three central terms when learning about probability, and they show up in Bayesian statistics as well. However… I never really could remember what they were, especially since we were usually taught them using formulas, rather than pictures. Well, for those...
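For readers who want the three terms side by side, here is a minimal sketch using a small joint probability table. The numbers are hypothetical and chosen only so the arithmetic is easy to follow; the variable names are likewise illustrative.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): rows index X, columns index Y.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

# Marginals: sum the joint table over the other variable.
marginal_x = joint.sum(axis=1)  # P(X)
marginal_y = joint.sum(axis=0)  # P(Y)

# Conditional: P(X | Y) = P(X, Y) / P(Y), so each column sums to 1.
cond_x_given_y = joint / marginal_y

print(marginal_x)
print(marginal_y)
print(cond_x_given_y)
```

The joint table holds the probability of each (X, Y) combination, the marginals collapse one axis, and the conditional rescales each column of the joint table by the matching marginal.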

## The Cold Start Problem

Modeling | Statistics | posted by John Cook | August 10, 2018

How do you operate a data-driven application before you have any data? This is known as the cold start problem. We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to design adaptive clinical trial designs, such as clinical trials...

## Distribution of Eigenvalues for Symmetric Gaussian Matrix

Modeling | Statistics | posted by John Cook | August 7, 2018

Symmetric Gaussian matrices. The previous post looked at the distribution of eigenvalues for very general random matrices. In this post we will look at the eigenvalues of matrices with more structure. Fill an n by n matrix A with values drawn from a standard normal distribution and let M be the average of A and its transpose, i.e. M = ½(A + Aᵀ). The eigenvalues...
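The construction described in this teaser can be sketched in a few lines of NumPy. The matrix size `n = 200` and the random seed are assumptions made for the example, not values from the post.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for repeatability
n = 200                         # assumed matrix size; the teaser does not give one

# Fill an n-by-n matrix with standard normal draws.
A = rng.standard_normal((n, n))

# Average A with its transpose to get a symmetric matrix M.
M = 0.5 * (A + A.T)

# eigvalsh is meant for symmetric matrices and returns real eigenvalues
# in ascending order.
eigenvalues = np.linalg.eigvalsh(M)

print(eigenvalues.min(), eigenvalues.max())
```

Because M is symmetric, all of its eigenvalues are real, so a histogram of `eigenvalues` is enough to look at their distribution.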

## How Well Did Data Scientists Predict the 2018 World Cup? (Hint: Not Very)

Data Wrangling | Modeling | Predictive Analytics | Research | Statistics | popular culture | world cup | posted by Alex Amari | July 26, 2018

This year’s World Cup in Russia was the most watched sporting event in history. GlobalWebIndex reports that up to 3.4 billion people – around half of the world’s population – watched some part of the tournament. As with past World Cups, a global prediction market emerged allowing spectators to...

## Attribution Based on Tail Probabilities

Modeling | Statistics | posted by John Cook | July 25, 2018

If all you know about a person is that he or she is around 5′ 7″, it’s a toss-up whether this person is male or female. If you know someone is over 6′ tall, they’re probably male. If you hear they are over 7′ tall, they’re almost certainly male....

## ECDFs: “Empirical Cumulative Distribution Function”

Modeling | Statistics | posted by Eric Ma | July 23, 2018

In my two SciPy 2018 co-taught tutorials, I made the case that ECDFs provide richer information compared to histograms. My main points were: We can more easily identify central tendency measures, in particular, the median, compared to a histogram. We can much more easily identify other percentile values, compared...
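To make the point about reading percentiles off an ECDF concrete, here is a minimal sketch. The `ecdf` helper and the sample data are hypothetical and are not taken from the tutorials mentioned above.

```python
import numpy as np

def ecdf(data):
    """Return sorted values x and cumulative fractions y for an empirical CDF."""
    x = np.sort(np.asarray(data, dtype=float))
    # y[i] is the fraction of observations less than or equal to x[i].
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical sample, invented for illustration only.
x, y = ecdf([3, 1, 4, 1, 5])

# Read the median off the ECDF: the first x whose cumulative fraction reaches 0.5.
median_from_ecdf = x[np.searchsorted(y, 0.5)]
print(median_from_ecdf)
```

Any other percentile can be read the same way by swapping 0.5 for the desired fraction, which is harder to do from a binned histogram.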