Nowadays, many companies - Netflix, Amazon, Uber, but also smaller ones - constantly run experiments (A/B tests) in order to try out new features and implement those that users like best and that, in the end, lead to revenue growth. Data scientists' role is to help evaluate these experiments - in other words, to verify whether the results of these tests are reliable and can/should be used in the decision-making process.
In this article, I provide an introduction to power analysis. In short, power is used to report confidence in the conclusions drawn from the results of an experiment. It can also be used for estimating the sample size required for the experiment, i.e., a sample size with which - at a given level of confidence - we should be able to detect an effect. An effect can mean many things, for instance, more frequent conversion within a group, but also a higher average spend among customers going through a certain signup flow in an online shop.
Firstly, I introduce a bit of theory and then carry out an example of power analysis in Python. You can find the link to my repo at the end of the article.
In order to understand power analysis, I believe it is important to understand three related concepts: significance level, Type I/II errors and the effect size.
In hypothesis testing, the significance level (often denoted by the Greek letter alpha) is the probability of rejecting the null hypothesis (H0) when it is in fact true. A metric closely related to the significance level is the p-value, which is the probability of obtaining a result at least as extreme as the one observed (a result even further from the null hypothesis), provided that H0 is true. What does that mean in practice? When we draw a random sample from a population, it is always possible that the observed effect occurred only due to sampling error.
The result of an experiment (or for example a linear regression coefficient) is statistically significant when the associated p-value is smaller than the chosen alpha. The significance level should be specified before setting up the study and depends on the field of research/business needs.
The second concept worth mentioning is the types of errors we can commit while statistically testing a hypothesis. When we reject a true H0 we are talking about a Type I error (false positive). This is the error connected to the significance level (see above). The other case occurs when we fail to reject a false H0, which is considered to be a Type II error (false negative). You may recall these notions from a confusion matrix!
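To make the link between the significance level and the Type I error concrete, here is a minimal simulation sketch. It is not taken from the article's repo: the helper name `type_i_error_rate`, the normally distributed data and the group size of 50 are all assumptions made for illustration. Both groups are drawn from the same distribution, so every rejection of H0 is a false positive.

```python
import numpy as np
from scipy import stats


def type_i_error_rate(alpha=0.05, n_per_group=50, n_experiments=2000, seed=0):
    """Share of simulated A/A experiments (no true effect) that reject H0."""
    rng = np.random.default_rng(seed)
    # Both groups come from the same distribution, so H0 is true by construction.
    group_a = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
    group_b = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
    # One independent two-sample t-test per simulated experiment (row).
    p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue
    return float(np.mean(p_values < alpha))


false_positive_rate = type_i_error_rate()
print(f"Rejected a true H0 in {false_positive_rate:.1%} of simulated experiments")
```

The share of rejections should land close to the chosen alpha, which is exactly what the significance level promises in the long run.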
The last thing to consider is the effect size, which is the quantified magnitude of a phenomenon present in the population. Effect size can be calculated using different metrics depending on the context, for example:
Difference in means between two groups, e.g., Cohen's d
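As an illustration, Cohen's d for two independent groups can be sketched as the difference in means divided by the pooled standard deviation. The function name `cohens_d` and the sample values below are made up for this example.

```python
import numpy as np


def cohens_d(x, y):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / (n_x + n_y - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)


# Made-up spend values for two small groups of customers.
treatment = [12.0, 15.0, 14.0, 10.0, 13.0]
control = [11.0, 12.0, 10.0, 9.0, 13.0]
d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f}")
```

Because d is expressed in units of standard deviation, it can be compared across metrics measured on different scales.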
Now that we have revised the key concepts related to power analysis, we can finally talk about statistical power. The statistical power of a hypothesis test is simply the probability that the given test correctly rejects the null hypothesis (which means the same as accepting the H1) when the alternative is in fact true.
Higher statistical power of an experiment means a lower probability of committing a Type II error. It also means a higher probability of detecting an effect when there is an effect to detect (true positive). This can be illustrated by the following formula:
Power = Pr(reject H0 | H1 is true) = 1 - Pr(fail to reject H0 | H0 is false)
In practice, results from experiments with too little power can easily lead to wrong conclusions, which in turn affect the decision-making process. That is why only results with an acceptable level of power should be taken into consideration. It is quite common to design experiments with a power level of 80%, which translates to a 20% probability of committing a Type II error.
Power analysis is built from the following four building blocks: significance level, effect size, power and sample size.
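For the two-sample t-test, the power formula can be evaluated directly with the noncentral t distribution. The sketch below is my own illustration under stated assumptions: equal group sizes, equal variances and a two-sided test. The helper name `two_sample_t_power` is hypothetical.

```python
import numpy as np
from scipy import stats


def two_sample_t_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, equal-size, equal-variance two-sample t-test."""
    df = 2 * n_per_group - 2
    # Under H1 the test statistic follows a noncentral t distribution
    # with noncentrality parameter d * sqrt(n / 2).
    ncp = effect_size * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Pr(reject H0 | H1 is true): the statistic falls beyond either critical value.
    return float(stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp))


power = two_sample_t_power(effect_size=0.8, n_per_group=26)
type_ii_error = 1 - power
print(f"Power: {power:.3f}, probability of a Type II error: {type_ii_error:.3f}")
```

With a large effect (0.8) and 26 observations per group, the power comes out close to the common 80% target, leaving roughly a 20% chance of a Type II error.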
I have not talked about sample size before, as it is pretty self-explanatory. The only thing worth adding is that some tests consider sample size jointly from two groups, while for others sample sizes must be specified separately (in the case when they are not equal).
These four metrics are all related to each other. As an example: increasing the significance level (that is, tolerating more Type I errors) increases power, whereas decreasing it makes the test stricter and lowers power, all else being equal. Likewise, a larger sample makes an effect of a given size easier to detect.
The idea of power analysis boils down to the following: given three of the four metrics, we estimate the missing one. This comes in handy in two ways:
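A quick numerical check of how the building blocks move together, using `TTestIndPower.power` from statsmodels. The effect size of 0.5 and the sample sizes are arbitrary values chosen for this sketch.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power at a fixed effect size and sample size, from a strict to a loose alpha.
power_by_alpha = {
    alpha: float(analysis.power(effect_size=0.5, nobs1=50, alpha=alpha))
    for alpha in (0.01, 0.05, 0.10)
}

# Power at a fixed effect size and alpha, for growing samples (per group).
power_by_n = {
    n: float(analysis.power(effect_size=0.5, nobs1=n, alpha=0.05))
    for n in (25, 50, 100)
}

for alpha, value in power_by_alpha.items():
    print(f"alpha = {alpha:.2f} -> power = {value:.3f}")
for n, value in power_by_n.items():
    print(f"n per group = {n:>3} -> power = {value:.3f}")
```

In this sketch power rises both when alpha is loosened from 1% to 10% and when the sample grows, with the other inputs held fixed.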
when we are designing an experiment, we can assume what level of significance, power and effect size is acceptable to us and - as a result - estimate how big a sample we need to gather for such an experiment to yield valid results.
when we are validating an experiment, we can see if, given the used sample size, effect size and significance level, the probability of committing a Type II error is acceptable from the business perspective.
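The validation case can be sketched with `TTestIndPower.solve_power`, which solves for whichever argument is left as `None`. The inputs below (40 observations per group, an effect size of 0.5) are made-up numbers, not results from a real experiment.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Validating an experiment: sample size, effect size and alpha are known,
# so power is the missing metric.
achieved_power = float(
    analysis.solve_power(effect_size=0.5, nobs1=40, alpha=0.05, power=None)
)
type_ii_risk = 1 - achieved_power

# The same call can instead solve for the smallest effect size that this
# sample could detect with 80% power.
min_detectable_effect = float(
    analysis.solve_power(effect_size=None, nobs1=40, alpha=0.05, power=0.8)
)

print(f"Achieved power: {achieved_power:.2f} (Type II risk: {type_ii_risk:.2f})")
print(f"Minimum detectable effect at 80% power: {min_detectable_effect:.2f}")
```

If the achieved power is well below the level the business is comfortable with, a non-significant result says little about whether the feature actually works.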
Aside from calculating one value for a given metric, we can perform a kind of sensitivity analysis by carrying out power analysis multiple times (for different values of the components) and presenting the results on a plot. This way we can see - for example - how the necessary sample size changes with an increase or decrease of the significance level. This can naturally be extended to a 3D plot covering three metrics.
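A small sensitivity analysis along these lines could look as follows. It is only a sketch: the medium effect size of 0.5, the 80% power target and the three alpha values are assumptions picked for illustration.

```python
import math

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Required sample size per group for several significance levels,
# holding the effect size (0.5) and the target power (80%) fixed.
required_n = {
    alpha: math.ceil(float(analysis.solve_power(effect_size=0.5, alpha=alpha, power=0.8)))
    for alpha in (0.01, 0.05, 0.10)
}

for alpha, n in required_n.items():
    print(f"alpha = {alpha:.2f} -> at least {n} observations per group")
```

The stricter the significance level, the more observations are needed to keep the same power. Plotting these values instead of printing them gives the kind of chart described above.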
Example in Python
In this example, I carry out power analysis for the case of the independent two-sample t-test (equal sample sizes and variances). The statsmodels library contains functions for conducting a power analysis for several of the most commonly used statistical tests.
Let's start with an easy example by assuming that we would like to know how big a sample we need to collect for our experiment, if we accept power at the level of 80%, a significance level of 5% and an expected effect size of 0.8. Solving for the sample size with statsmodels, we arrive at a required sample size of about 25 observations per group.
Having done that, it is time to take it a step further. We would like to see how the power changes when we modify the rest of the building blocks. To do so we plot power with respect to the other parameters. I begin the analysis by inspecting how the sample size influences the power (while keeping the significance level and the effect size at certain levels). I have chosen [0.2, 0.5, 0.8] as the considered effect size values, as these correspond to the thresholds for small/medium/large effects, as defined in the case of Cohen's d.
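The calculation can be sketched with `TTestIndPower.solve_power` from statsmodels. The variable names are my own, and the snippet is a reconstruction rather than the exact code from the repo.

```python
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Leave nobs1 unspecified so that solve_power solves for the sample size
# of the first group (the groups are of equal size, since ratio defaults to 1).
sample_size = float(
    power_analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
)
print(f"Required sample size per group: {sample_size:.2f}")
```

The solver returns a fractional number of observations, so in practice the result should be rounded up to the next whole number.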
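One way to draw such a chart is the `plot_power` method of `TTestIndPower`. This is a sketch, not necessarily the code behind the original figures: the range of sample sizes, the figure size and the labels are my own choices, and the `Agg` backend line is only there so that the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.stats.power import TTestIndPower

fig, ax = plt.subplots(figsize=(8, 5))

# One power curve per effect size, with the sample size on the x-axis.
TTestIndPower().plot_power(
    dep_var="nobs",
    nobs=np.arange(5, 100),
    effect_size=np.array([0.2, 0.5, 0.8]),
    alpha=0.05,
    ax=ax,
    title="Power of the two-sample t-test (alpha = 0.05)",
)
ax.set_xlabel("Sample size per group")
ax.set_ylabel("Power")
```

Passing `dep_var="effect_size"` or `dep_var="alpha"` instead puts one of the other building blocks on the x-axis.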
From the plots, we can infer that an increase in the sample/effect size leads to an increase in power. In other words, the bigger the sample, the higher the power, keeping other parameters constant. Below I also present the plots for two remaining building blocks on the x-axis and the results are pretty self-explanatory.
Finally, I would like to expand the analysis to three dimensions. To do so, I fix the significance level at 5% (which is often used in practice) and create a grid of possible combinations of the sample and effect sizes. Then I need to obtain the power values for each combination. To do this I use numpy's meshgrid and vectorize.
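A minimal sketch of that grid computation is shown below. The grid ranges and the helper name `power_grid` are assumptions made for illustration.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()


@np.vectorize
def power_grid(effect_size, nobs1):
    """Power for one (effect size, sample size) pair at alpha = 5%."""
    return analysis.power(effect_size=effect_size, nobs1=nobs1, alpha=0.05)


effect_sizes = np.linspace(0.2, 1.0, 9)  # 0.2, 0.3, ..., 1.0
sample_sizes = np.arange(10, 101, 10)    # 10, 20, ..., 100 per group

# meshgrid builds every combination; vectorize applies the scalar function to each cell.
es_grid, n_grid = np.meshgrid(effect_sizes, sample_sizes)
power_values = power_grid(es_grid, n_grid)
print(power_values.shape)
```

Each row of `power_values` corresponds to one sample size and each column to one effect size, which is the layout that surface plots typically expect.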
For creating the 3D plot I chose plotly, as it makes it really easy to quickly obtain nice, interactive plots, which can then be embedded in this post. To use it this way you must create a free plotly account and obtain an API key.
Summing up, power analysis is nowadays mostly used in the case of A/B testing, and can be used both when planning an experiment/study and when evaluating the results. As many companies use the frequentist approach to hypothesis testing, it is definitely good to know how to carry out power analysis and how to present its implications.