Joint, Conditional, and Marginal Probability Distributions
Joint probability, conditional probability, and marginal probability… These are three central terms when learning about probability, and they show up in Bayesian statistics as well. However… I never really could remember what they were, especially since we were usually taught them using formulas, rather than pictures. Well, for those... Read more
The Cold Start Problem
How do you operate a data-driven application before you have any data? This is known as the cold start problem. We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to design adaptive clinical trial designs, such as clinical trials... Read more
Distribution of Eigenvalues for Symmetric Gaussian Matrix
Symmetric Gaussian matrices. The previous post looked at the distribution of eigenvalues for very general random matrices. In this post we will look at the eigenvalues of matrices with more structure. Fill an n by n matrix A with values drawn from a standard normal distribution and let M be the average of A and its transpose, i.e. M = ½(A + Aᵀ). The eigenvalues... Read more
How Well Did Data Scientists Predict the 2018 World Cup? (Hint: Not Very)
This year’s World Cup in Russia was the most watched sporting event in history. GlobalWebIndex reports that up to 3.4 billion people – around half of the world’s population – watched some part of the tournament. As with past World Cups, a global prediction market emerged allowing spectators to... Read more
Attribution Based on Tail Probabilities
If all you know about a person is that he or she is around 5′ 7″, it’s a toss-up whether this person is male or female. If you know someone is over 6′ tall, they’re probably male. If you hear they are over 7′ tall, they’re almost certainly male.... Read more
ECDFs: “Empirical Cumulative Distribution Function”
In my two SciPy 2018 co-taught tutorials, I made the case that ECDFs provide richer information compared to histograms. My main points were: We can more easily identify central tendency measures, in particular, the median, compared to a histogram. We can much more easily identify other percentile values, compared... Read more
How Far is xy From yx on Average for Quaternions?
Given two quaternions x and y, the product xy might equal the product yx, but in general the two results are different. How different are xy and yx on average? That is, if you selected quaternions x and y at random, how big would you expect the difference xy – yx to be? Since this difference would increase proportionately if you increased the length of x or y, we can just... Read more
Low-Rank Matrix Perturbations
Here are a couple of linear algebra identities that can be very useful, but aren’t that widely known, somewhere between common knowledge and arcane. Neither result assumes any matrix has low rank, but their most common application, at least in my experience, is in the context of something of... Read more
Linear Regression and Planet Spacing
A while back I wrote about how planets are evenly spaced on a log scale. I made a bunch of plots, based on our solar system and the extrasolar systems with the most planets, and noted that they’re all roughly straight lines. Here’s the plot for our solar system,... Read more
Statistical Software Matters
This is a picture of all the genetic associations found in genome-wide association studies, sorted by chromosome. You can find more detail at the NHGRI GWAS catalog. There are two chromosomes with many fewer associations. One is the Y chromosome. There isn’t much there because there isn’t much... Read more
Open Data Science - Your News Source for AI, Machine Learning & more