I receive a lot of queries regarding sample size calculation for this article. Recently, someone asked a question that involved calculating sample size with Epi Info 7. I believe many more would benefit from a public response, hence this article.
Epi Info™ is a public domain set of software tools developed by the United States’ Centers for Disease Control and Prevention (CDC) for use by public health professionals and researchers. The latest version of Epi Info is Epi Info 7.
It provides for easy data entry form and database construction, a customized data entry experience, and data analyses with epidemiologic statistics, maps, and graphs for public health professionals who may lack an information technology background. It also includes a tool for sample size calculation.
A. Getting and installing Epi Info 7
Epi Info™ can be downloaded from here.
The CDC has produced several tutorial videos for Epi Info 7 that can be viewed here.
If you want to view the video providing instructions on downloading the software, you may do so here.
B. Sample size calculation
B.1. Background information
The investigator wishes to determine the prevalence of Non-Alcoholic Fatty Liver Disease (NAFLD) among persons with Coronary Artery Disease (CAD).
Therefore, the study population is patients with CAD.
Since one intends to determine the prevalence of NAFLD among CAD patients, the outcome of interest is NAFLD.
Similarly, the exposure is CAD (those who have CAD are ‘exposed’).
By extension, those without CAD (general population) would constitute the ‘unexposed’ group.
As the investigator wishes to determine the prevalence of NAFLD, the appropriate study design is a cross-sectional study (cross-sectional studies are also called ‘prevalence’ studies).
B.2. Launching Cross-Sectional Study within the StatCalc tool of Epi Info 7
Step 1: Launch Epi Info 7 (please watch video above on downloading and installing Epi Info 7)
Step 2: Select StatCalc from the menu of options (shown in red)
Step 3: Select Cross Sectional Study from the options (shown in red)
B.3. Requirements for calculating sample size using Epi Info 7 (cross-sectional studies)
To calculate sample size using Epi Info 7, one needs to provide the following information (shown in red):
Confidence level: usually set at 95%
Power: usually set at 80%
Ratio of unexposed to exposed: depends upon the outcome of interest and study population
% outcome in unexposed group: the proportion of unexposed people with the outcome of interest (in this case, the proportion of general population with NAFLD)
% outcome in exposed group: the proportion of exposed people with the outcome of interest (in this case, the proportion of CAD patients with NAFLD)
The values of Odds Ratio and Risk ratio will be populated automatically based on the other values supplied.
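For readers who want to see where these two automatically populated values come from, here is a minimal sketch using the standard textbook definitions of risk ratio and odds ratio. It is an illustration rather than Epi Info's own code; the function names are hypothetical, and the inputs (9% and 69.2%) are the values used later in this article.

```python
def risk_ratio(pct_exposed: float, pct_unexposed: float) -> float:
    """Risk ratio: proportion with outcome in exposed / proportion in unexposed."""
    return pct_exposed / pct_unexposed


def odds_ratio(pct_exposed: float, pct_unexposed: float) -> float:
    """Odds ratio: odds of outcome in exposed / odds of outcome in unexposed."""
    p1 = pct_exposed / 100.0
    p2 = pct_unexposed / 100.0
    return (p1 / (1 - p1)) / (p2 / (1 - p2))


# Illustrative inputs: 69.2% outcome in exposed, 9% outcome in unexposed.
print(round(risk_ratio(69.2, 9.0), 2))
print(round(odds_ratio(69.2, 9.0), 2))
```

Both functions fail with a division by zero if the unexposed percentage is 0 (and the odds ratio also fails at 100%), so real inputs should lie strictly between these limits.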
B.4. Obtaining the values for sample size calculation
Although we already know what values to supply for confidence level and power, other values are unknown. These need to be determined from literature.
In this case, we need to determine two values:
- the proportion of NAFLD in the general population
- the proportion of NAFLD among patients with CAD
The proportion of NAFLD in the general population is reported to be between 5% and 30%.
The proportion of NAFLD among CAD patients is reported to be between 69.2% and 80.4%.
The above study is from South Korea, so we must try to obtain literature from India for better estimation.
B.5. Performing the sample size calculation
Having obtained all the information required, we can now proceed with sample size estimation.
Step 1: Selecting the desired confidence level.
The default value is 99.9%, but this may inflate the estimate. Therefore, we click on the drop-down menu and choose 95% instead.
Step 2: Supplying the desired power.
Typically, the power is kept at 80%. Increasing the value will increase the sample size.
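Confidence level and power enter sample size formulas as standard normal quantiles (z-values), which is why raising either one raises the estimate. The sketch below shows the conversion using Python's standard library. The function names are hypothetical, and a two-sided confidence level is assumed.

```python
from statistics import NormalDist


def z_for_confidence(confidence_pct: float) -> float:
    """Two-sided z-value for a confidence level given in percent."""
    alpha = 1 - confidence_pct / 100
    return NormalDist().inv_cdf(1 - alpha / 2)


def z_for_power(power_pct: float) -> float:
    """One-sided z-value for a power given in percent."""
    return NormalDist().inv_cdf(power_pct / 100)


for confidence in (95, 99, 99.9, 99.99):
    print("confidence", confidence, round(z_for_confidence(confidence), 3))

for power in (80, 90):
    print("power", power, round(z_for_power(power), 3))
```

The z-value grows as either setting is pushed towards 100%, and the required sample size grows with it.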
Step 3: Supplying the ratio of unexposed to exposed individuals.
Here, one must provide a single value, not a ratio (1; not 1:1). If there will be fewer unexposed than exposed individuals, the value will be less than one. In the present example, the ratio is approximately 30:70 (The study population will consist of CAD patients. Among them, those without NAFLD would be around 30%, while those with NAFLD would be around 70%). Performing the calculation (30/70), one obtains approximately 0.43, which is rounded to 0.4 and supplied in the appropriate cell.
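The conversion from a ratio such as 30:70 to the single value the cell expects is a simple division, sketched below. The function name is hypothetical.

```python
def ratio_to_single_value(unexposed: float, exposed: float) -> float:
    """Convert an unexposed:exposed ratio (e.g. 30:70) to a single number."""
    return unexposed / exposed


# 30:70 gives roughly 0.43; this article rounds it to 0.4.
print(round(ratio_to_single_value(30, 70), 2))

# 1:1 gives 1, meaning equal numbers of unexposed and exposed.
print(ratio_to_single_value(1, 1))
```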
Step 4: Supplying the percentage of outcome among unexposed.
We already know from literature that this value lies between 5% and 30% (NAFLD in general population). Since 5% is rather low, we will use 9% instead.
Once this value has been entered, the remaining cells are automatically populated. The default value for percentage outcome in exposed group is 0%. The values in the grid reflect sample size estimates based on this data. Since the value for percentage outcome in exposed group is non-zero, we will ignore the output for now.
Step 5: Supplying the percentage outcome in exposed group
After supplying the value for percentage outcome in exposed group, we can now examine the output in the adjoining grid.
The first column provides estimates based on the approach described by Kelsey et al. According to this approach, the total sample size required is 27 subjects.
The second column provides estimates based on the approach described by Fleiss et al. They described two approaches: one without continuity correction, and another with continuity correction. The second column provides estimates without continuity correction. Here, the total sample size is 23 subjects.
The third column provides estimates with continuity correction, and estimates 30 subjects for the study.
Details of the approaches may be found here.
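As a cross-check on the three columns, the following sketch implements the commonly published Kelsey and Fleiss formulas for comparing two proportions. It is an assumption that Epi Info uses exactly these formulas and this rounding (each group rounded up to the next whole subject); the function name is hypothetical. With the inputs used above (95% confidence, 80% power, ratio 0.4, 9% and 69.2%), it gives the same totals as the grid.

```python
import math
from statistics import NormalDist


def sample_sizes(confidence_pct, power_pct, ratio, pct_unexposed, pct_exposed):
    """Return {method: (exposed, unexposed, total)} for three approaches."""
    z_a = NormalDist().inv_cdf(1 - (1 - confidence_pct / 100) / 2)
    z_b = NormalDist().inv_cdf(power_pct / 100)
    p1 = pct_exposed / 100      # outcome proportion in exposed
    p2 = pct_unexposed / 100    # outcome proportion in unexposed
    r = ratio                   # unexposed per exposed subject
    diff = abs(p1 - p2)
    p_bar = (p1 + r * p2) / (r + 1)  # weighted average proportion

    # Kelsey: exposed group size before rounding
    kelsey = (z_a + z_b) ** 2 * p_bar * (1 - p_bar) * (r + 1) / (r * diff ** 2)

    # Fleiss without continuity correction
    fleiss = (
        z_a * math.sqrt((r + 1) * p_bar * (1 - p_bar))
        + z_b * math.sqrt(r * p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / (r * diff ** 2)

    # Fleiss with continuity correction, built on the uncorrected value
    fleiss_cc = fleiss / 4 * (
        1 + math.sqrt(1 + 2 * (r + 1) / (fleiss * r * diff))
    ) ** 2

    results = {}
    for name, n1 in (("Kelsey", kelsey), ("Fleiss", fleiss),
                     ("Fleiss with CC", fleiss_cc)):
        exposed = math.ceil(n1)
        # round() guards against floating-point noise before rounding up
        unexposed = math.ceil(round(r * exposed, 9))
        results[name] = (exposed, unexposed, exposed + unexposed)
    return results


for method, sizes in sample_sizes(95, 80, 0.4, 9, 69.2).items():
    print(method, sizes)
# Totals: Kelsey 27, Fleiss 23, Fleiss with CC 30
```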
Note: The above estimates are the lowest possible for a study on this topic.
Step 6: Refining the estimate
This requires one to manipulate the values to obtain a reasonable estimate. While we will not alter the values obtained from literature, we can increase the others.
First, we will sequentially increase the value for confidence level, and see how that alters the estimate:
When the confidence level is increased from 95% to 99%, the estimate increases to a maximum of 42 subjects (above).
When the confidence level is increased to 99.9%, the maximum estimate is 59 (above).
With the confidence level at 99.99%, the maximum estimate touches 76 subjects (above).
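The effect of the confidence level can be reproduced with a short loop. The sketch below uses the commonly published Fleiss formula with continuity correction, on the assumption that it corresponds to the third column of the grid; the function name is hypothetical. With power 80%, ratio 0.4, and outcome percentages of 9% and 69.2%, the totals agree with the maximum estimates quoted above.

```python
import math
from statistics import NormalDist


def fleiss_cc_total(confidence_pct, power_pct, ratio, pct_unexposed, pct_exposed):
    """Total sample size (exposed + unexposed), Fleiss with continuity correction."""
    z_a = NormalDist().inv_cdf(1 - (1 - confidence_pct / 100) / 2)
    z_b = NormalDist().inv_cdf(power_pct / 100)
    p1, p2, r = pct_exposed / 100, pct_unexposed / 100, ratio
    diff = abs(p1 - p2)
    p_bar = (p1 + r * p2) / (r + 1)
    # Uncorrected Fleiss size for the exposed group
    n = (
        z_a * math.sqrt((r + 1) * p_bar * (1 - p_bar))
        + z_b * math.sqrt(r * p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / (r * diff ** 2)
    # Continuity correction
    n_cc = n / 4 * (1 + math.sqrt(1 + 2 * (r + 1) / (n * r * diff))) ** 2
    exposed = math.ceil(n_cc)
    unexposed = math.ceil(round(r * exposed, 9))
    return exposed + unexposed


for confidence in (95, 99, 99.9, 99.99):
    print(confidence, fleiss_cc_total(confidence, 80, 0.4, 9, 69.2))
# Totals: 30, 42, 59, 76
```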
What would happen if the ratio of unexposed to exposed is reduced further (even fewer unexposed compared to exposed)?
The maximum estimate now touches 86 subjects (above).
Is this the maximum possible sample size estimate for the study? Perhaps not.
Remember, the prevalence of NAFLD in the general population ranges from 5% to 30%. While we supplied the lower value, we did not do so for the higher value. Let us see what happens when the higher value (30%) is supplied.
First, we will keep the confidence level at 95% and set the percentage of outcome in unexposed to 30%.
As can be seen above, the maximum estimate is 72 subjects.
Let us increase the confidence level value to 99%, keeping everything else the same.
Now the maximum estimate is 101 (above).
What if the confidence level value were to be increased further? Let us see what happens when the value is increased to 99.9%.
As can be seen above, all estimates are in excess of 130, with the maximum being 142.
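The same check can be run for the 30% scenario, this time showing how the total splits between the two groups. As before, this is a sketch using the commonly published Fleiss formula with continuity correction, with a hypothetical function name; the ratio of 0.4 is an assumption carried over from the earlier steps. The totals it prints agree with the maximum estimates reported above for 95%, 99%, and 99.9% confidence.

```python
import math
from statistics import NormalDist


def fleiss_cc_groups(confidence_pct, power_pct, ratio, pct_unexposed, pct_exposed):
    """Return (exposed, unexposed) group sizes, Fleiss with continuity correction."""
    z_a = NormalDist().inv_cdf(1 - (1 - confidence_pct / 100) / 2)
    z_b = NormalDist().inv_cdf(power_pct / 100)
    p1, p2, r = pct_exposed / 100, pct_unexposed / 100, ratio
    diff = abs(p1 - p2)
    p_bar = (p1 + r * p2) / (r + 1)
    n = (
        z_a * math.sqrt((r + 1) * p_bar * (1 - p_bar))
        + z_b * math.sqrt(r * p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / (r * diff ** 2)
    n_cc = n / 4 * (1 + math.sqrt(1 + 2 * (r + 1) / (n * r * diff))) ** 2
    exposed = math.ceil(n_cc)
    unexposed = math.ceil(round(r * exposed, 9))
    return exposed, unexposed


for confidence in (95, 99, 99.9):
    exposed, unexposed = fleiss_cc_groups(confidence, 80, 0.4, 30, 69.2)
    print(confidence, exposed, unexposed, exposed + unexposed)
# Totals: 72, 101, 142
```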
If the confidence level were increased to 99.99%, the sample size would increase correspondingly.
If the value of power were increased from 80% to 90%, the estimated sample size would increase further.
What is the maximum possible sample size with the available data?
As can be seen above, the estimates are now around 500 subjects.
The final sample size chosen should be the largest feasible value, considering available resources (time, materials, manpower, money).
The estimation of sample size is informed by existing literature. A thorough review of literature should be performed before determining the values for use in calculation.
Increasing the confidence level increases the estimated sample size.
Increasing power will increase the sample size.
If the difference in values of percentage outcome in exposed and unexposed is small, the sample size will increase, and vice versa (lower sample size estimate when the values were 9% and 69.2%, as compared to 30% and 69.2%).
Larger sample sizes are preferred as they have greater power to detect a difference when it exists.
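The three relationships summarised above (confidence level, power, and the gap between the two percentages) can be verified numerically. The sketch below uses the commonly published Kelsey formula as one convenient choice; the function name is hypothetical, and the baseline inputs are those used in this article (95% confidence, 80% power, ratio 0.4, 9% and 69.2%).

```python
import math
from statistics import NormalDist


def kelsey_total(confidence_pct, power_pct, ratio, pct_unexposed, pct_exposed):
    """Total sample size (exposed + unexposed) using the Kelsey formula."""
    z_a = NormalDist().inv_cdf(1 - (1 - confidence_pct / 100) / 2)
    z_b = NormalDist().inv_cdf(power_pct / 100)
    p1, p2, r = pct_exposed / 100, pct_unexposed / 100, ratio
    p_bar = (p1 + r * p2) / (r + 1)
    n1 = (z_a + z_b) ** 2 * p_bar * (1 - p_bar) * (r + 1) / (r * (p1 - p2) ** 2)
    exposed = math.ceil(n1)
    unexposed = math.ceil(round(r * exposed, 9))
    return exposed + unexposed


baseline = kelsey_total(95, 80, 0.4, 9, 69.2)
print("baseline:", baseline)
print("confidence 99%:", kelsey_total(99, 80, 0.4, 9, 69.2))
print("power 90%:", kelsey_total(95, 90, 0.4, 9, 69.2))
print("smaller difference (30% vs 69.2%):", kelsey_total(95, 80, 0.4, 30, 69.2))
```

Each of the three changes should print a total larger than the baseline.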
Link to Epi Info user guide:
Link to previous article on sample size calculation for cross sectional studies: