# Chi Square Goodness of Fit Test

Modeling | Statistics | Posted by ODSC Community, January 30, 2024

We have recently explored and derived the **Chi-Square Distribution** which you can check out here.

I highly recommend reading that post if you are unfamiliar with the Chi-Square Distribution; otherwise, this article won’t make a whole lot of sense to you!

In this post we will discuss one of the Chi-Square Tests: the goodness of fit test.

This test is used to **verify if our distribution of sample data is in line with some expected distribution of that data**. In other words, it determines whether the difference between the sample and expected distribution is due to random chance or is statistically significant.

In this article, we will dive into the maths behind the goodness of fit test and walk through an example problem to build our intuition!

# Assumptions of the Test

- There is **ONE CATEGORICAL** variable
- Observations are **INDEPENDENT**
- The **FREQUENCY COUNTS** of each category in each variable should be **GREATER THAN 5**
- The **FREQUENCY COUNTS** in each group of the data must be **MUTUALLY EXCLUSIVE**
- The data is sampled **RANDOMLY**

# Chi-Square Test Statistic

As with every hypothesis test, we have some **test statistic** that we need to find. For the Chi-Square Test it is:

$$
\chi^2_v = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}
$$

- *v* is the degrees of freedom
- *O* are the observed values from the sample
- *E* are the expected values of the population
- *n* is the number of categories in the variable

This formula will make much more sense when we go through an example.

Note that the Chi-Square distribution comes from the squaring of the numerator.
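As a minimal sketch, the statistic can be computed in a few lines of plain Python. The function name `chi_square_statistic` and the three-category counts below are made up purely for illustration; they are not the numbers from the worked example in this article.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for a variable with three categories (illustration only).
observed = [18, 22, 20]
expected = [20, 20, 20]

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1  # v = n - 1

print(f"chi-square = {statistic:.2f}, v = {degrees_of_freedom}")
```

Each category contributes its squared difference scaled by the expected count, so a category where the observed and expected counts match adds nothing to the total.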

# Hypothesis Testing Steps

- Define the null, *H_0*, and alternate, *H_1*, hypotheses.
- Decide your **significance level**, which is the **probability threshold for failing to reject or rejecting the null hypothesis**. A value of **5%** is typically chosen, which will correspond to a certain **critical value** that is distribution dependent.
- Compute the **test statistic**; in our case it will be the Chi-Square statistic that is presented above.
- Compare the test statistic value to the critical value. **If it is larger, then we reject the null hypothesis, otherwise we fail to reject the null hypothesis** (this is for a right-tailed test).
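The comparison in the final step can be sketched as a small helper. This is only an illustration: the function name `right_tailed_decision` is hypothetical, and the critical values are the usual 5% entries of a Chi-Square table for 1 to 5 degrees of freedom, hard-coded here so the snippet needs no third-party library.

```python
# 5% right-tail critical values of the Chi-Square distribution,
# copied from a standard Chi-Square table (degrees of freedom -> value).
CRITICAL_VALUES_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def right_tailed_decision(statistic, degrees_of_freedom):
    """Compare a test statistic with the 5% critical value (right-tailed)."""
    critical_value = CRITICAL_VALUES_5_PERCENT[degrees_of_freedom]
    if statistic > critical_value:
        return "reject H_0"
    return "fail to reject H_0"


# Hypothetical statistics, chosen only to show both outcomes.
print(right_tailed_decision(1.2, degrees_of_freedom=2))
print(right_tailed_decision(8.5, degrees_of_freedom=2))
```

With 2 degrees of freedom the 5% critical value is 5.991, so the first call fails to reject the null hypothesis and the second rejects it.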

To gain a more in-depth understanding of hypothesis testing and critical values, I would suggest reading my posts on Confidence Intervals and the Z-Tests which break the above steps down even further.

There are also many YouTube videos and websites that do a great run-through of the hypothesis testing steps.

# Worked Example

Let’s run through a very simple example.

A sweetshop claims that each chocolate ball bag contains 70% milk chocolate balls and 30% white chocolate balls.

We pick one chocolate ball bag which contains 50 balls. In this bag, 30 are milk chocolate and 20 are white chocolate. So, this is a 60% milk chocolate and 40% white chocolate split.

Does this observation fit in with the sweetshop’s claim?

## Hypotheses

- *H_0*: *The sweetshop’s claim is **correct** for a 70–30% milk to white chocolate split in each chocolate ball bag*
- *H_1*: *The sweetshop’s claim is **incorrect** for a 70–30% milk to white chocolate split in each chocolate ball bag*

We will use a **5% significance level** for this example.

## Contingency Table

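We calculate the observed and expected counts of each chocolate type from our small sample and display them in a contingency table. The expected counts are the claimed proportions applied to the 50 balls in the bag (70% of 50 = 35 and 30% of 50 = 15):

| | Milk chocolate | White chocolate | Total |
|---|---|---|---|
| Observed | 30 | 20 | 50 |
| Expected | 35 | 15 | 50 |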
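The expected counts are simply the claimed proportions multiplied by the bag size. A small sketch of that calculation follows; the variable names are illustrative rather than taken from the original post, and the final lines check the frequency-count assumption listed earlier.

```python
claimed_proportions = {"milk": 0.70, "white": 0.30}  # the sweetshop's claim
observed_counts = {"milk": 30, "white": 20}          # what we found in the bag

bag_size = sum(observed_counts.values())  # 50 balls

# Expected count = claimed proportion * total number of balls.
expected_counts = {
    kind: proportion * bag_size
    for kind, proportion in claimed_proportions.items()
}

for kind in observed_counts:
    print(f"{kind}: observed = {observed_counts[kind]}, "
          f"expected = {expected_counts[kind]:.0f}")

# Assumption check: every frequency count should be greater than 5.
assert all(count > 5 for count in observed_counts.values())
assert all(count > 5 for count in expected_counts.values())
```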

## Test Statistic

Now we compute the Chi-Square test statistic using the formula we displayed above:

$$
\chi^2 = \frac{(30 - 35)^2}{35} + \frac{(20 - 15)^2}{15} = \frac{25}{35} + \frac{25}{15} \approx 2.38
$$

The degrees of freedom, **df**, formula is:

$$
df = n - 1
$$

Where **n** is the number of categories in our variable, as we stated before. We have two categories (milk and white chocolate), so in our case it is simply *df = 2 − 1 = 1*.

## Critical Value

Using the Chi-Square Table, the **critical value** corresponding to a **5% significance level** with 1 degree of freedom is **3.84**.

Therefore, as our statistic is **lower than the critical value, we fail to reject the null hypothesis.**
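This comparison can be double-checked without a table. For one degree of freedom, a Chi-Square variable is the square of a standard normal variable, so its right-tail probability is `erfc(sqrt(x / 2))`, which Python's standard `math` module can evaluate. The sketch below applies only to the `df = 1` case, and the helper name `right_tail_probability_df1` is an assumption of this example.

```python
import math


def right_tail_probability_df1(x):
    """P(X > x) for a Chi-Square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


observed = [30, 20]  # milk, white
expected = [35, 15]  # 70% and 30% of 50 balls

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical_value = 3.84  # 5% significance level, 1 degree of freedom

print(f"statistic = {statistic:.3f}")
print(f"tail area beyond the critical value = "
      f"{right_tail_probability_df1(critical_value):.3f}")
print(f"tail area beyond our statistic = "
      f"{right_tail_probability_df1(statistic):.3f}")

if statistic > critical_value:
    print("Reject H_0")
else:
    print("Fail to reject H_0")
```

The tail area beyond our statistic is larger than 5%, which is another way of seeing that the statistic does not reach the critical value.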

Note: The Chi-Square Test is typically a one-tailed test. The proof is beyond the scope of this article, but there is a great answer on StackExchange that explains why this is the case.

# Conclusion

In this article we have walked through how to carry out a Chi-Square goodness of fit test. This test determines if your sample distribution of a single categorical variable is in line with the expected distribution.

In my next post we will discuss the other Chi-Square Test, the test for independence. This is frequently used for **Feature Selection** in Data Science!

*Article originally posted here by Egor Howell. Reposted with permission.*