Disclaimer: This article is primarily intended for an online group of postgraduate students in Community Medicine that I am involved with. The group was created to provide supplemental instruction to members on topics of common interest. Instruction is in bite-sized portions, since all members are busy PG students. Conceptual understanding is emphasized. Membership of that (WhatsApp) group is by invitation only. However, others interested in participating in the discussions and related activities in Google Classroom may indicate the same by sending me a message on Facebook.
This article will focus on situations wherein the outcome variable is continuous, and normally distributed.
When I say a variable is normally distributed, I mean that the data follow a normal (Gaussian) distribution: the mean, median and mode coincide (are the same or very similar), the histogram shows a symmetric, bell-shaped curve, and so on.
Why is it important to assess if a continuous variable is normally distributed or not? Simply because it influences the choice of statistical test of significance available for hypothesis testing. When data are normally distributed, one can employ ‘parametric’ tests of significance. Typically, when using such tests of significance, the mean assumes importance (the test(s) see(s) if there is a difference between means). It is often said that when the sample size exceeds 30, the values follow a normal distribution. Strictly speaking, a large sample makes the distribution of the sample mean approximately normal (the central limit theorem); it does not make the individual values normally distributed. One must therefore verify if the data are approximately normal before applying parametric tests of significance.
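As a rough first look at the "mean and median coincide" idea, the sketch below compares the mean, median and skewness of two small samples. It assumes Python with only the standard library; the function name `describe_shape` and both data sets are made up for illustration. This is a descriptive check only, not a formal normality test (statistical packages offer tests such as Shapiro-Wilk for that).

```python
import statistics

def describe_shape(values):
    """Return (mean, median, skewness) for a list of numbers.

    Skewness here is the simple moment-based coefficient m3 / m2**1.5;
    a value near 0 suggests a symmetric distribution.
    """
    n = len(values)
    mean = statistics.mean(values)
    median = statistics.median(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    skewness = m3 / m2 ** 1.5
    return mean, median, skewness

# Hypothetical systolic blood pressure readings (roughly symmetric)
blood_pressure = [112, 118, 121, 124, 126, 128, 130, 133, 136, 142]

# Hypothetical incomes with two very high earners (right-skewed)
income = [10, 12, 13, 14, 15, 16, 18, 20, 45, 90]

bp_mean, bp_median, bp_skew = describe_shape(blood_pressure)
inc_mean, inc_median, inc_skew = describe_shape(income)

print("Blood pressure: mean", bp_mean, "median", bp_median, "skewness", round(bp_skew, 2))
print("Income:         mean", inc_mean, "median", inc_median, "skewness", round(inc_skew, 2))
```

In the symmetric sample the mean and median agree and the skewness is close to zero; in the skewed sample the mean is pulled well above the median, which is a warning sign against using parametric tests without further checks.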
Let us consider various scenarios when the outcome variable is continuous, and normally distributed.
Scenario 1: One wants to know if there is a difference in mean values between two groups that are unrelated (The outcome/ response variable is continuous and normally distributed; predictor variable is categorical, and has two levels.).
Example: One wants to know if the mean income varies by sex. Here, the response variable is income, and is normally distributed. The predictor variable is sex, and has two levels- male and female. The income of males is not related to that of females and vice-versa (the two are ‘independent’ of each other).
Appropriate statistical test: Independent-samples t Test
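To show what this test actually computes, here is a minimal sketch of the classic (pooled-variance) independent-samples t statistic, assuming Python with only the standard library. The function name `independent_t` and the income figures are hypothetical, and the pooled version additionally assumes that the two groups have similar variances. In practice, statistical software reports the p-value directly (for instance, SciPy's `scipy.stats.ttest_ind`).

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for two unrelated groups.

    Uses the pooled-variance (Student) form, which assumes the two
    groups have roughly equal variances.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_error, df

# Hypothetical monthly incomes (in thousands) for illustration only
income_male = [42, 45, 47, 50, 52, 55]
income_female = [40, 43, 44, 48, 49, 52]

t_stat, df = independent_t(income_male, income_female)
print("t =", round(t_stat, 3), "with", df, "degrees of freedom")
```

The t statistic is the difference between the two means divided by its standard error; it is then compared against the t distribution with the stated degrees of freedom to obtain the p-value.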
Scenario 2: One wants to know if there is a difference in mean values before and after some intervention (The outcome/ response variable is continuous and normally distributed).
Example: One wants to know if the mean weight varies before and after exercise. Here, the response variable is body weight after exercise, and is normally distributed. Unlike the previous scenario, the predictor variable is also body weight, but before exercise. Both measurements are taken on the same individuals, so each post-exercise weight is paired with (‘dependent’ on) that person's pre-exercise weight.
Appropriate statistical test: Paired-samples t Test
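The paired test works on the within-person differences rather than on the two sets of values separately. The sketch below illustrates this, assuming Python with only the standard library; the function name `paired_t` and the weights are made up for illustration. Statistical software (for instance, SciPy's `scipy.stats.ttest_rel`) would also report the p-value.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for paired measurements.

    Each position in `before` and `after` must refer to the same person.
    """
    if len(before) != len(after):
        raise ValueError("Paired data need the same number of observations")
    differences = [a - b for a, b in zip(after, before)]
    n = len(differences)
    mean_diff = statistics.mean(differences)
    std_error = statistics.stdev(differences) / math.sqrt(n)
    return mean_diff / std_error, n - 1

# Hypothetical body weights (kg) of six people, before and after exercise
weight_before = [82, 75, 90, 68, 77, 85]
weight_after = [80, 74, 87, 68, 75, 82]

t_stat, df = paired_t(weight_before, weight_after)
print("t =", round(t_stat, 3), "with", df, "degrees of freedom")
```

A negative t statistic here simply reflects that, in these made-up data, weights tended to be lower after exercise than before.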
Scenario 3: One wants to know if there is a difference in mean values between more than two groups that are unrelated (The outcome/ response variable is continuous and normally distributed; predictor variable is categorical, and has more than two levels.).
Example: One wants to know if the mean weight varies by social class (there are more than two social classes- Low, Middle, High). The social classes are mutually exclusive- one cannot belong to more than one social class at any point in time.
Appropriate statistical test: Analysis of Variance (ANOVA)
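One-way ANOVA compares the variation between group means with the variation within the groups, and summarizes the comparison as an F statistic. The sketch below assumes Python with only the standard library; the function name `one_way_anova` and the weights are hypothetical. Statistical software (for instance, SciPy's `scipy.stats.f_oneway`) would also give the p-value.

```python
import statistics

def one_way_anova(groups):
    """Return (F statistic, between-groups df, within-groups df)."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.mean(all_values)

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2
        for group in groups
    )
    # Variation of individual values around their own group mean
    ss_within = sum(
        (x - statistics.mean(group)) ** 2
        for group in groups
        for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical body weights (kg) by social class
weight_low = [58, 60, 62, 64]
weight_middle = [63, 65, 67, 69]
weight_high = [68, 70, 72, 74]

f_stat, df_between, df_within = one_way_anova([weight_low, weight_middle, weight_high])
print("F =", round(f_stat, 2), "with", df_between, "and", df_within, "degrees of freedom")
```

A large F statistic means the group means differ by more than would be expected from the spread within the groups.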
Note: When the categorical predictor variable has more than two levels, the means cannot be compared with a single two-group test; instead, ANOVA (which can be viewed as a regression method) tests whether the means of all the levels are equal. A significant ANOVA result only indicates that there is a significant difference between at least two levels; it does not identify which levels are responsible for the same. In order to determine the specific levels causing statistical significance, one needs to apply one of many post-hoc tests (Tukey's HSD, for example). This will pinpoint the pairs of levels having a significant difference in values.
Link to previous article in the series: