The Importance of P-Values in Data Science
Modeling, Statistics | Posted by Daniel Gutierrez, ODSC, February 26, 2019
The field of data science makes use of concepts from a variety of disciplines, particularly computer science, mathematics, and applied statistics. One term that keeps popping up in data science circles (including many data scientist job interviews) is “p-value,” which comes from statistics. The term is frequently misunderstood, so in this article we will briefly review its correct uses and how data scientists should look at p-values.
For your reference, a formal treatment of p-values is provided by the American Statistical Association in the paper: “The ASA’s Statement on p-Values: Context, Process, and Purpose.”
Statistics Point of View
The concepts of p-value and level of significance are important aspects of hypothesis testing and statistical methods like regression. However, they can be a little tricky to understand, especially for beginners, and a good understanding of these concepts can go a long way in understanding machine learning.
[Related article: Tips for Linear Regression Diagnostics]
Let’s set up a problem at a high level of abstraction. Consider two groups within a given population: a control group and an experimental group. The experimental group is a random sample from the population on which an experiment will be performed; it is then compared with the control group. The difference between the groups is measured with a test statistic such as Student’s t-statistic (e.g., a business wants to know whether its product is bought more by men or by women).
We need to define two additional terms: the null hypothesis states that there is no difference between the two groups, while the alternative hypothesis states that there is a statistically significant difference between them.
We assume the null hypothesis is true, i.e., there is no difference between the two groups. The experiment is then performed on the experimental group, and we check whether it had any significant effect.
Now let’s consider the meaning of the p-value. We need to calculate the probability that the observed effect is attributable to chance alone: if you were to repeat the experiment many times at the same sample size, what fraction of the time would you see a difference at least this large in the experimental group purely by chance?
The p-value is used to assess the strength of the evidence against the null hypothesis. A p-value is a number between 0 and 1 that serves as a probabilistic weight for the hypothesis; sometimes it is also expressed as a percentage. A p-value greater than 0.05 means that, if the null hypothesis were true, a difference at least as large as the one observed would arise by chance more than 1 time in 20. The cutoff 0.05 is the one typically used and is known as the level of significance (α). In a regression problem, you want the p-value to be much less than 0.05 for a variable to be considered significant. Typically, a small p-value (< 0.05) suggests that the null hypothesis should be rejected, while a large p-value (> 0.05) means we fail to reject the null hypothesis for lack of evidence against it. Values at or near 0.05 are a judgment call the data scientist has to make.
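The repeated-experiment intuition above can be sketched as a permutation test: shuffle the group labels many times and count how often a difference at least as large as the observed one arises purely by chance. The data below are hypothetical, and this is a minimal pure-Python sketch rather than a production analysis.

```python
import random
import statistics

random.seed(0)

# Hypothetical measurements for a control and an experimental group.
control = [12.1, 11.4, 12.8, 11.9, 12.3, 11.7, 12.0, 11.6, 12.4, 11.8]
experimental = [12.9, 13.1, 12.6, 13.4, 12.8, 13.0, 12.7, 13.3, 12.5, 13.2]

observed = statistics.mean(experimental) - statistics.mean(control)

# Permutation test: reshuffle the pooled values into two groups many times
# and record how often a difference at least as extreme occurs by chance.
pooled = control + experimental
n_control = len(control)
n_trials = 10_000
count = 0
for _ in range(n_trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[n_control:]) - statistics.mean(pooled[:n_control])
    if abs(diff) >= abs(observed):
        count += 1

# The fraction of chance shuffles at least as extreme is the p-value.
p_value = count / n_trials
print(f"observed difference = {observed:.2f}, p-value = {p_value:.4f}")

alpha = 0.05
print("reject null hypothesis" if p_value < alpha else "fail to reject null hypothesis")
```

Because the two groups here barely overlap, the estimated p-value comes out far below α = 0.05, so we would reject the null hypothesis for this made-up data.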
[Related article: The Difference Between Data Scientists and Data Engineers]
Data Science Point of View
Now let’s consider the use of p-values in data science settings. For this example, we’ll use the R environment. Using the Boston data set found in the MASS package, we’ll fit a simple linear model with the predictor variable rm and the response variable medv. In the summary function output, the p-values for the coefficients are very small, indicating that the observed association is very unlikely to be due to chance. If the number is very small, R will display the p-value in scientific notation, as in 2e-16, i.e., 2×10⁻¹⁶.
Essentially, we interpret the p-value. A small p-value indicates that it would be unlikely to observe such a substantial association between the predictor and the response by chance if no real association existed. Consequently, if the p-value is small enough, we reject the null hypothesis and conclude that a relationship exists between the two variables.
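The article’s example uses R’s lm() and summary() on the Boston data. As a language-neutral sketch of what summary() reports for a slope, here is the same calculation in pure Python on synthetic data (all numbers hypothetical), approximating the t distribution by the normal, which is reasonable at this sample size:

```python
import math
import random

random.seed(1)

# Synthetic stand-in for the Boston example: x plays the role of rm
# (rooms per dwelling), y the role of medv (median home value).
n = 200
x = [random.uniform(4.0, 8.0) for _ in range(n)]
y = [9.1 * xi - 34.7 + random.gauss(0.0, 6.0) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Ordinary least squares estimates of slope and intercept.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual variance, standard error of the slope, and its t-statistic.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
se_slope = math.sqrt(sse / (n - 2) / sxx)
t_stat = slope / se_slope

# Two-sided p-value; with n - 2 = 198 degrees of freedom the t
# distribution is close enough to normal to use the normal tail (erfc).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))
print(f"slope = {slope:.2f}, t = {t_stat:.1f}, p-value = {p_value:.2e}")
```

Because the synthetic data has a strong built-in relationship, the p-value is vanishingly small, so we would reject the null hypothesis that the slope is zero, mirroring the 2e-16 style output from R’s summary().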
Caveats for Using p-values in Data Science
Data scientists are sometimes advised to take p-values with a grain of salt when working on machine learning problems. This is because p-values are often misinterpreted: a frequently cited journal article argues that p-values are logically flawed when used informally, without careful attention to the underlying statistical assumptions.