## Attribution Based on Tail Probabilities

Modeling, Statistics | posted by John Cook | July 25, 2018

If all you know about a person is that he or she is around 5′ 7″, it’s a toss-up whether this person is male or female. If you know someone is over 6′ tall, they’re probably male. If you hear they are over 7′ tall, they’re... Read more
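The reasoning in this teaser can be sketched with Bayes' rule applied to tail probabilities. The sketch below is only an illustration: the normal model, the means and standard deviations (in inches), and the equal prior are assumptions chosen for the example, not figures taken from the post.

```python
import math

def normal_sf(x, mean, sd):
    """Tail probability P(X > x) for a normal distribution."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2)))

def prob_male_given_taller(height_in, prior_male=0.5,
                           male=(70.0, 3.0), female=(64.0, 3.0)):
    """Posterior probability of 'male' given height > height_in.

    The (mean, sd) pairs are hypothetical illustrative values in inches.
    """
    tail_m = normal_sf(height_in, *male) * prior_male
    tail_f = normal_sf(height_in, *female) * (1.0 - prior_male)
    return tail_m / (tail_m + tail_f)

if __name__ == "__main__":
    for feet in (6, 7):
        h = 12 * feet
        print(f"taller than {feet} ft: P(male) = {prob_male_given_taller(h):.6f}")
```

Under these assumed parameters the posterior moves toward 1 as the threshold rises, because the ratio of the two tail probabilities grows quickly far from both means.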

## ECDFs: “Empirical Cumulative Distribution Function”

Modeling, Statistics | posted by Eric Ma | July 23, 2018

In my two SciPy 2018 co-taught tutorials, I made the case that ECDFs provide richer information compared to histograms. My main points were: We can more easily identify central tendency measures, in particular, the median, compared to a histogram. We can much more easily identify other... Read more
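As a rough illustration of the first point, here is a minimal ECDF sketch in plain Python. The function names `ecdf` and `median_from_ecdf` are hypothetical and are not taken from the tutorials; the median is read off as the first value at which the ECDF reaches 0.5.

```python
def ecdf(data):
    """Return the sorted values and the cumulative fraction at each value."""
    xs = sorted(data)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

def median_from_ecdf(xs, ys):
    """Read the median off the ECDF: first x whose cumulative fraction >= 0.5."""
    for x, y in zip(xs, ys):
        if y >= 0.5:
            return x
    raise ValueError("empty data")

if __name__ == "__main__":
    xs, ys = ecdf([3, 1, 2, 5, 4])
    print(list(zip(xs, ys)))
    print("median:", median_from_ecdf(xs, ys))
```

Unlike a histogram, this needs no bin width: every data point appears as one step of height 1/n.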

## How Far is xy From yx on Average for Quaternions?

Modeling, Statistics | posted by John Cook | July 16, 2018

Given two quaternions x and y, the product xy might equal the product yx, but in general the two results are different. How different are xy and yx on average? That is, if you selected quaternions x and y at random, how big would you expect the difference xy – yx to be? Since this difference would increase proportionately if you increased the length... Read more
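One way to explore the question numerically is a small Monte Carlo simulation. The sketch below assumes "at random" means uniformly distributed unit quaternions (obtained by normalizing four independent Gaussians); that sampling choice and the function names are assumptions for illustration, not necessarily what the post does.

```python
import math
import random

def qmul(p, q):
    """Hamilton product of quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

def commutator_norm(x, y):
    """Euclidean norm of xy - yx."""
    xy, yx = qmul(x, y), qmul(y, x)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(xy, yx)))

def random_unit_quaternion(rng):
    """Normalize four independent Gaussians to get a point on the unit 3-sphere."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def mean_commutator_norm(samples=100_000, seed=0):
    """Monte Carlo estimate of the average |xy - yx| over random unit quaternions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += commutator_norm(random_unit_quaternion(rng),
                                 random_unit_quaternion(rng))
    return total / samples

if __name__ == "__main__":
    print(mean_commutator_norm())
```

As a sanity check, the basis elements satisfy ij = k and ji = -k, so `commutator_norm` of i and j is 2, and any quaternion commutes with itself.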

## Low-Rank Matrix Perturbations

Modeling, Statistics | posted by John Cook | July 12, 2018

Here are a couple of linear algebra identities that can be very useful, but aren’t that widely known, somewhere between common knowledge and arcane. Neither result assumes any matrix has low rank, but their most common application, at least in my experience, is in the context... Read more

## Linear Regression and Planet Spacing

Modeling, Statistics | posted by John Cook | July 6, 2018

A while back I wrote about how planets are evenly spaced on a log scale. I made a bunch of plots, based on our solar system and the extrasolar systems with the most planets, and noted that they’re all roughly straight... Read more

## Statistical Software Matters

Modeling, Statistics | posted by Thomas Lumley | June 29, 2018

This is a picture of all the genetic associations found in genome-wide association studies, sorted by chromosome. You can find more detail at the NHGRI GWAS catalog. There are two chromosomes with many fewer associations. One is the Y chromosome. There isn’t much there because... Read more

## Partition Numbers and Ramanujan’s Approximation

Modeling, Statistics | posted by John Cook | June 25, 2018

The partition function p(n) counts the number of ways n unlabeled things can be partitioned into non-empty sets. (Contrast with Bell numbers that count partitions of labeled things.) There’s no simple expression for p(n), but Ramanujan discovered a fairly simple asymptotic approximation: $p(n) \sim \frac{1}{4n\sqrt{3}} \exp\left(\pi\sqrt{2n/3}\right)$. How accurate is this approximation? Here’s a little Mathematica code to see.... Read more
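The accuracy question can also be checked in Python rather than Mathematica. This is a minimal sketch, not the post's code: it counts partitions exactly with a standard dynamic-programming recurrence and compares against the standard Hardy-Ramanujan asymptotic form, which is written out in the code comment.

```python
import math

def partitions(n):
    """Exact p(n) via dynamic programming over allowed part sizes."""
    p = [1] + [0] * n          # p[m] = number of partitions of m found so far
    for part in range(1, n + 1):
        for m in range(part, n + 1):
            p[m] += p[m - part]
    return p[n]

def ramanujan_approx(n):
    """Hardy-Ramanujan asymptotic: exp(pi * sqrt(2n/3)) / (4 * n * sqrt(3))."""
    return math.exp(math.pi * math.sqrt(2 * n / 3)) / (4 * n * math.sqrt(3))

if __name__ == "__main__":
    for n in (10, 100, 1000):
        exact = partitions(n)
        print(n, exact, ramanujan_approx(n) / exact)
```

Printing the ratio of approximate to exact values, rather than the difference, makes the relative error easy to read as n grows.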

## Talking About Clinical Significance

Modeling, Statistics | posted by John Mount | June 22, 2018

In statistical work in the age of big data we often get hung up on differences that are statistically significant (reliable enough to show up again and again in repeated measurements), but clinically insignificant (visible in aggregation, but too small to make any real difference to individuals). An example would... Read more

## Stirling Numbers, Including Negative Arguments

Modeling, Statistics | posted by John Cook | June 20, 2018

Stirling numbers are something like binomial coefficients. They come in two varieties, imaginatively called the first kind and second kind. Unfortunately it is the second kind that are simpler to describe and that come up more often in applications, so we’ll start there. Stirling numbers of... Read more

## Fixed Points of Logistic Function

Modeling, Statistics | posted by John Cook | June 15, 2018

Here’s an interesting problem that came out of a logistic regression application. The input variable was between 0 and 1, and someone asked when and where the logistic transformation f(x) = 1/(1 + exp(a + bx)) has a fixed point, i.e. f(x) = x. So given logistic regression parameters a and b, when does... Read more
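A numerical sketch of the problem, using the form of f given above: since f takes values strictly between 0 and 1, f(0) - 0 is positive and f(1) - 1 is negative, so f(x) - x changes sign on [0, 1] and bisection will locate a fixed point there. The function names and tolerance are choices made for this example, and the sketch finds one fixed point without addressing whether it is the only one.

```python
import math

def f(x, a, b):
    """Logistic transformation as written in the teaser: 1 / (1 + exp(a + b x))."""
    return 1.0 / (1.0 + math.exp(a + b * x))

def fixed_point(a, b, tol=1e-12):
    """Bisection on g(x) = f(x) - x over [0, 1]; g(0) > 0 and g(1) < 0."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid, a, b) - mid > 0:
            lo = mid   # sign change lies to the right of mid
        else:
            hi = mid   # sign change lies at or to the left of mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    a, b = 1.0, -2.0   # arbitrary illustrative parameters
    x = fixed_point(a, b)
    print(x, f(x, a, b))
```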