Statistical Power

Building the intuition you need to master this important concept

Felipe Cezar
Analytics Vidhya

--

Statistical Power is a foundational concept in a data scientist's toolkit. If you want to incorporate it into your problem-solving skills, you need to build an intuition for it.

First Things First

Statistical Power is the probability that we will correctly reject the Null Hypothesis.

Assuming that β is the probability that you fail to reject the null hypothesis when it is actually false, then Power equals 1 − β.

Here is how you can calculate it in Python with statsmodels:

from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# solves for whichever variable is left out (n, effect size, alpha, power)
power_analysis.solve_power(effect_size=.2, nobs1=80, alpha=.05)
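
In practice, you more often run the calculation the other way: fix the power you want and solve for the sample size. A minimal sketch with the same statsmodels class (the effect size and target power here are illustrative values, not from the article):

```python
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Power achieved with 80 observations per group at effect size 0.2
achieved = power_analysis.solve_power(effect_size=0.2, nobs1=80, alpha=0.05)

# Sample size per group needed to reach 80% power for the same effect size
needed_n = power_analysis.solve_power(effect_size=0.2, power=0.8, alpha=0.05)

print(f"power with n=80: {achieved:.3f}")
print(f"n needed for 80% power: {needed_n:.1f}")
```

`solve_power` solves for whichever of its main arguments you leave out, so the same call answers both questions.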
You, trying to understand how this is supposed to be “intuitive”

I know! We are still far away from making the concept intuitive, but hopefully it will be a lot clearer by the time you finish reading this article.

Thinking It Through

We want to get to a point where you intuitively understand that Power relates to the overlap between the sampling distributions you are testing for your Alternative Hypothesis.

Notice the dark blue area. This is the overlap we will be discussing

In order to get there, we will need to use some key concepts that I hope you are already familiar with, such as Hypothesis Testing, Significance Level, p-value, Sampling Distributions and the Central Limit Theorem.

Let’s do so with an illustrative example:

Imagine that we divide a 7th grade class into two groups: for one week group A will have a 30 minutes guided meditation session everyday before every class, and group B will carry on as usual. At the end of the week, both groups take a math test, and we find that group A’s average score is 5% higher than group B’s.
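
To make the experiment concrete, here is how the comparison between the two groups might be tested with an independent two-sample t-test. The scores below are simulated placeholders, not real data, and scipy is assumed to be available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical math test scores for the two groups
group_a = rng.normal(loc=73.5, scale=10, size=30)  # meditation group
group_b = rng.normal(loc=70.0, scale=10, size=30)  # control group

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

The p-value from this test is what we will be placing inside (or outside) the rejection area in the discussion that follows.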

Null Hypothesis

In our little experiment, the null hypothesis (Ho) is that there is no significant difference in the mean test scores. In other words, both groups' mean scores belong to the same sampling distribution.

If we ran the statistical test and calculated a p-value that fell inside the grey area, we would not reject Ho. Alternatively, if the p-value fell inside the red area, then we would reject Ho.

Alternative Hypothesis

On the other hand, our alternative hypothesis (Ha) is that there is a difference in the mean score of the tests. In other words, each group mean score belongs to a distinct sampling distribution.

You will understand why this dark blue overlap is there soon…

Now, let’s assume that the alternative hypothesis is true, meaning that meditation does have an effect and therefore the two groups’ scores belong to distinct distributions.

In that case, if we fail to reject the null hypothesis, we are making an error — more specifically, a Type II error. But how could we make that mistake?

We would make this mistake if the calculated p-value fell inside the dark blue overlap.

That is because the dark blue overlap lies outside the red rejection area but is still consistent with the two-distribution hypothesis (Ha). In other words, we fail to reject the null hypothesis whenever the p-value is above the significance threshold, even if the observed result actually came from the alternative distribution.
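
We can make the link between β and the overlap tangible with a simulation: assume Ha is true, run the experiment many times, and count how often the test fails to reject. The parameters below (effect size, sample size, number of simulations) are illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, effect = 0.05, 30, 0.6  # illustrative values
n_sims = 5000

# Simulate experiments where Ha is true: the means really differ by `effect`
fails_to_reject = 0
for _ in range(n_sims):
    a = rng.normal(effect, 1, n)
    b = rng.normal(0, 1, n)
    _, p = stats.ttest_ind(a, b)
    if p > alpha:            # the p-value lands in the "dark blue overlap"
        fails_to_reject += 1

beta = fails_to_reject / n_sims
print(f"estimated beta = {beta:.3f}, power = {1 - beta:.3f}")
```

The fraction of simulated experiments that fail to reject estimates β directly, and 1 − β matches what a power calculation would give for the same parameters.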

How To Increase Power

There are three things that will affect Power:

  1. Effect Size
  2. Significance Level
  3. Sample Size

Effect Size

Before we discuss it, we must first acknowledge that effect size is not under our control. In our 7th grade example, the effect size would be the difference in each group's mean score. This difference is only a reflection of the meditation (or not, depending on the statistical test's conclusion).

Effect size will affect Power because it will influence that dark blue overlap we keep talking about: assuming that there are two distributions (Ha), the distance between each distribution center (mean score) is directly related to that difference in mean scores, i.e. effect size. This will become more obvious in the charts below.

Here are two scenarios, each one with a different effect size. See if you can spot the difference:

This is the overlap for an effect size of 0.6
And this is the overlap for an effect size of 0.3

The effect size, here calculated as Cohen’s d, can be understood as a measure of the distance between the curves. The closer the curves are to each other, the less distinguishable they are. In other words, the closer the curves, the bigger the overlap.
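
Cohen’s d is simply the difference in means scaled by the pooled standard deviation. A minimal sketch (the sample scores passed in at the end are made-up numbers):

```python
import numpy as np

def cohens_d(a, b):
    """Effect size: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

d = cohens_d([75, 80, 85, 90], [70, 72, 78, 80])
print(round(d, 3))
```

Because d is expressed in standard-deviation units, the same d means the same separation between the two curves regardless of the original score scale.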

Significance Level

Also known as alpha (α), it affects Power because it affects the dark blue overlap. I know, this is getting repetitive.

The rejection area and the dark blue overlap are mutually exclusive: when a p-value falls inside the red area (i.e. the rejection area), there can be no Type II error, because we will be rejecting the null hypothesis. Therefore, when α increases, the dark blue overlap must shrink:

The blue overlap cannot intersect the red area.
So if red grows, dark blue overlap shrinks.

It is important to bear in mind that increasing alpha with the intent of increasing power is not good practice. You want the significance level to reflect other criteria, not to hack the Power of the test.
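
You can verify the relationship between α and Power directly with statsmodels (the effect size and sample size here are illustrative):

```python
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Power grows as alpha grows -- but only by accepting more Type I error
powers = {}
for alpha in (0.01, 0.05, 0.10):
    powers[alpha] = power_analysis.solve_power(effect_size=0.3, nobs1=50, alpha=alpha)
    print(f"alpha = {alpha:.2f} -> power = {powers[alpha]:.3f}")
```

The numbers confirm the picture above: a bigger red area leaves less room for the dark blue overlap, but the trade-off is more false positives.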

Sample Size

It relates to Power because it affects the dark blue overlap ;-)

But how so, you ask? Great question, my friend. This is the trickiest one to understand, but also the most important, because it is usually the only thing a data scientist can control in order to achieve a certain Power level. In our 7th grade example, we could increase the number of students in each group if we wanted, and that would increase the Power of the test.

Before we go down this road, let’s take a detour. Let’s think about how the sample size (n) affects a sampling distribution of means.

Assume that U is a set of integers that goes from 0 to 99.

From this set, we will draw 20 samples and calculate their means. We will do it 4 times, but each time we will draw samples of different sizes.

I have written some (very) simple Python code to do this for us:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')
U = pd.Series(range(100))
# for each sample size n, draw 20 samples and keep their means
sampling_distrs = {}
for n in (1, 2, 3, 4):
    samples = [U.sample(n=n).values for _ in range(20)]
    sampling_distrs[n] = [sample.mean() for sample in samples]
# plots all 4 sampling distributions
f, axes = plt.subplots(4, sharex=True, sharey=True, figsize=(12, 15))
f.suptitle('Distributions For 20 Sample Means', fontsize=19)
for ax, n in zip(axes, sampling_distrs):
    sns.distplot(sampling_distrs[n], ax=ax)
    ax.set_title(f'Samples of Size n={n}')
plt.show()

A key point to bear in mind here is that the sampling distribution of means will approach a Normal Distribution as n grows, as stated by the Central Limit Theorem. That is true even though the set U is not normally distributed.

Now let’s see how the sampling distribution of the means will look like for each sample size (n):

Notice how the distribution starts to look more Normal, and becomes narrower and taller, as n increases. But why is that?

Think of the distribution for n=1: out of the 20 samples we draw, we are bound to have some extreme values like 99, even if they are rare.

However, when n=2, even though extreme values will also come up, it will be even rarer for both of them to be extreme. The result is that the means from the 20 samples with n=2 will be more concentrated around the center.

If n=3, this effect will be even more pronounced, as observed in the charts above.
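
This narrowing has a precise rate: the standard deviation of the sampling distribution of the mean (the standard error) is σ/√n. A quick empirical check on the same set U, using many repeated draws (20,000 per sample size is my choice here, and I sample with replacement for simplicity, unlike the pandas code above):

```python
import numpy as np

rng = np.random.default_rng(1)
U = np.arange(100)          # same population: integers 0..99
sigma = U.std()             # population standard deviation (~28.87)

results = {}
for n in (1, 4, 16):
    # draw 20,000 samples of size n and take their means
    means = rng.choice(U, size=(20_000, n)).mean(axis=1)
    results[n] = means.std()
    print(f"n={n:2d}: empirical SE = {results[n]:.2f}, sigma/sqrt(n) = {sigma / np.sqrt(n):.2f}")
```

Quadrupling n halves the spread of the sample means, which is exactly why the curves above get taller and narrower.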

The effect n has on the sampling distribution will also hold true for our intuition of the Power analysis: when we increase the sample size, the distributions assumed for the alternative hypothesis become narrower and more concentrated around their centers. That in turn makes the dark blue overlap shrink.

The overlap for a sample size of n=30
And the overlap for n=60
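
Holding effect size and α fixed, you can watch Power climb with n in statsmodels (d = 0.5 is an assumed effect size, chosen for illustration):

```python
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Larger samples -> narrower sampling distributions -> smaller overlap -> more Power
powers = {}
for n in (30, 60, 120):
    powers[n] = power_analysis.solve_power(effect_size=0.5, nobs1=n, alpha=0.05)
    print(f"n = {n:3d} per group -> power = {powers[n]:.3f}")
```

This is the calculation a data scientist actually runs before an experiment: pick a plausible effect size and α, then grow n until the power is acceptable.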

By the way, the same intuition holds true for the shape of the t-distribution you learned about in the past: the higher the degrees of freedom (and consequently n), the more Normal the curve becomes.

It Is All About Overlap

Hopefully you now have an intuitive knowledge of what Power is, and how you can influence it. It all comes down to this statement:

β (beta) is directly related to the dark blue overlap.

This should make it easier for you to grasp what I mentioned at the start of the article:

Statistical Power is the probability that we will correctly reject the Null Hypothesis.

Assuming that β is the probability that you fail to reject the null hypothesis when it is actually false, then Power equals 1 − β.

If you want to play around with different combinations of effect size, alpha and sample size, visit this amazing resource created by Kristoffer Magnusson.
