Chi Square Goodness of Fit Test

Modeling, Statistics. Posted by ODSC Community, January 30, 2024


We recently explored and derived the Chi-Square Distribution, which you can check out here.

I highly recommend reading that post if you are unfamiliar with the Chi-Square Distribution; otherwise, this article won’t make much sense to you!

In this post, we will discuss one of the Chi-Square Tests: the goodness of fit test.

This test is used to verify whether the distribution of our sample data is in line with some expected distribution of that data. In other words, it determines whether the difference between the sample and expected distributions is due to random chance or is statistically significant.

In this article, we will work through the maths behind the goodness of fit test and walk through an example problem to build our intuition!

In-Person and Virtual Conference

April 23rd to 25th, 2024

Join us for a deep dive into the latest data science and AI trends, tools, and techniques, from LLMs to data analytics and from machine learning to responsible AI.

Assumptions of the Test

• There is ONE CATEGORICAL variable
• Observations are INDEPENDENT
• The EXPECTED FREQUENCY COUNT of each category should be AT LEAST 5
• The categories must be MUTUALLY EXCLUSIVE, so each observation is counted in exactly one category
• The data is sampled RANDOMLY

Chi-Square Test Statistic

As with every hypothesis test, we have some test statistic that we need to find. For the Chi-Square Test it is:

$$\chi^2_v = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

• v is the degrees of freedom
• O are the observed values from the sample
• E are the expected values of the population
• n is the number of categories in the variable

This formula will make much more sense when we go through an example.

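Note: the statistic approximately follows a Chi-Square distribution because the differences in the numerator are squared, and the Chi-Square distribution describes a sum of squared standard normal variables.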
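As a minimal sketch of the formula above, the statistic can be computed in a few lines of Python. The function name chi_square_statistic and the three-category counts are illustrative assumptions, not part of the original post.

```python
from typing import Sequence


def chi_square_statistic(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Return the sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of categories")
    if any(e <= 0 for e in expected):
        raise ValueError("every expected count must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for a three-category variable (not from the article).
observed = [18, 22, 20]
expected = [20, 20, 20]
statistic = chi_square_statistic(observed, expected)
print(f"chi-square statistic = {statistic:.3f}")
```

Each category contributes its own squared deviation scaled by its expected count, so a category that matches its expectation exactly adds nothing to the total.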

Hypothesis Testing Steps

• Define the null, H_0, and alternate, H_1, hypotheses.
• Decide your significance level, which is the probability threshold used to decide whether to reject the null hypothesis. A value of 5% is typically chosen, and it corresponds to a critical value that depends on the distribution.
• Compute the test statistic; in our case, it is the Chi-Square statistic presented above.
• Compare the test statistic to the critical value. If the statistic is larger, we reject the null hypothesis; otherwise, we fail to reject it (this is for a right-tailed test).
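The comparison in the last step can be sketched as follows. This is only an illustration: the helper decide is a hypothetical name, the lookup holds the usual 5% upper-tail values from a standard Chi-Square table for 1 to 5 degrees of freedom, and the two statistics passed in are made up to show both outcomes.

```python
# Upper-tail critical values of the Chi-Square distribution at a 5%
# significance level, as listed in standard Chi-Square tables.
CRITICAL_VALUES_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def decide(statistic: float, df: int) -> str:
    """Right-tailed decision: reject H_0 only if the statistic exceeds the critical value."""
    critical_value = CRITICAL_VALUES_5_PERCENT[df]
    if statistic > critical_value:
        return "reject H_0"
    return "fail to reject H_0"


# Hypothetical statistics, chosen only to show both outcomes.
print(decide(1.2, df=1))  # fail to reject H_0
print(decide(9.0, df=2))  # reject H_0
```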

To gain a more in-depth understanding of hypothesis testing and critical values, I would suggest reading my posts on Confidence Intervals and the Z-Tests, which break the above steps down even further.

There are also many YouTube videos and websites that do a great run-through of the hypothesis testing steps.

Worked Example

Let’s run through a very simple example.

A sweetshop claims that each chocolate ball bag contains 70% milk chocolate balls and 30% white chocolate balls.

We pick one chocolate ball bag which contains 50 balls. In this bag, 30 are milk chocolate and 20 are white chocolate. So, this is a 60% milk chocolate and 40% white chocolate split.

Does this observation fit in with the sweetshop’s claim?

Hypotheses

• H_0 : The sweetshop’s claim is correct: each chocolate ball bag has a 70–30% milk to white chocolate split
• H_1 : The sweetshop’s claim is incorrect: each chocolate ball bag does not have a 70–30% milk to white chocolate split

We will use a 5% significance level for this example.

Contingency Table

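We calculate the observed and expected counts of each chocolate type for our small sample and display them in a contingency table. The expected counts are 70% and 30% of the 50 balls in the bag:

Chocolate type   Observed   Expected
Milk                   30         35
White                  20         15
Total                  50         50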
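The expected counts in the contingency table follow directly from the claimed proportions and the bag size. A small sketch, with variable names chosen here purely for illustration, which also checks the common rule of thumb that every expected count is at least 5:

```python
claimed_proportions = {"milk": 0.7, "white": 0.3}  # the sweetshop's claim
observed = {"milk": 30, "white": 20}               # counts in our bag
sample_size = sum(observed.values())               # 50 balls in total

# Expected count = claimed proportion * sample size.
expected = {kind: p * sample_size for kind, p in claimed_proportions.items()}

# Rule-of-thumb assumption check on the expected counts.
assumption_met = all(count >= 5 for count in expected.values())

for kind in observed:
    print(f"{kind:<6} observed={observed[kind]:>3} expected={expected[kind]:>5.1f}")
print("expected counts all at least 5:", assumption_met)
```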

Test Statistic

Now we compute the Chi-Square test statistic using the formula we displayed above:

$$\chi^2 = \frac{(30 - 35)^2}{35} + \frac{(20 - 15)^2}{15} = \frac{25}{35} + \frac{25}{15} \approx 2.38$$

The degrees of freedom, df, formula is:

$$df = n - 1$$

where n is the number of categories in our variable, as stated before. In our case n = 2 (milk and white), so df = 1.
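For readers who prefer to check the arithmetic in code, here is a short sketch of the statistic and degrees of freedom for the sweetshop example, using the observed counts from the bag and the expected counts implied by the 70-30 claim:

```python
observed = [30, 20]      # milk, white
expected = [35.0, 15.0]  # 70% and 30% of 50 balls

# Sum of (O_i - E_i)^2 / E_i over the two categories.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: number of categories minus one.
df = len(observed) - 1

print(f"chi-square = {statistic:.3f}, df = {df}")  # chi-square = 2.381, df = 1
```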

Critical Value

Using the Chi-Square Table, the critical value corresponding to a 5% significance level with 1 degree of freedom is 3.84.

Therefore, as our statistic of roughly 2.38 is lower than the critical value of 3.84, we fail to reject the null hypothesis: the bag we sampled is consistent with the sweetshop’s claim.

Note: The Chi-Square Test is typically a one-tailed (right-tailed) test. The proof is beyond the scope of this article, but there is a great answer on StackExchange that explains why this is the case.
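The table lookup can also be reproduced with Python's standard library in this particular case. The sketch below relies on the fact that a Chi-Square variable with 1 degree of freedom is the square of a standard normal variable, so it only applies when df = 1; for more degrees of freedom a table or a statistics library would be needed.

```python
from statistics import NormalDist

alpha = 0.05
statistic = 25 / 35 + 25 / 15  # Chi-Square statistic from the worked example

# With df = 1, P(Z^2 <= c) = 1 - alpha is equivalent to
# P(|Z| <= sqrt(c)) = 1 - alpha, so sqrt(c) is the two-sided normal quantile.
z = NormalDist().inv_cdf(1 - alpha / 2)
critical_value = z ** 2

decision = "reject H_0" if statistic > critical_value else "fail to reject H_0"
print(f"critical value = {critical_value:.2f}")  # critical value = 3.84
print(f"statistic = {statistic:.2f}: {decision}")
```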

Conclusion

In this article, we have walked through how to carry out a Chi-Square goodness of fit test. This test determines whether your sample distribution of a single categorical variable is in line with the expected distribution.

In my next post we will discuss the other Chi-Square Test, the test for independence. This is frequently used for Feature Selection in Data Science!



Article originally posted here by Egor Howell. Reposted with permission.

