6 Week 6: Hypothesis Testing

You’ve made it to the final lab of this course, yay!

The lab this week will be all about hypothesis testing, one of the most ubiquitous elements of statistical theory, and the conclusion of your journey into inferential statistics (for now…). Almost any quantitative research paper you read will feature a hypothesis test in one form or another. Moreover, the quantitative research you are going to conduct in other courses or capstone will likely rely heavily on the framework of hypothesis testing. During this lab we will first discuss the theories at the foundation of hypothesis testing. In the second part of the lab we are going to practice conducting null hypothesis significance tests (NHSTs) in R.

To get started, download the lab template here (right click: save as) or from Canvas. Copy the lab template to your lab folder and double click the Lab project file (not the .Rmd) to open RStudio.

6.1 Learning goals

During this lab you will do the following:

  1. Discuss various concepts related to hypothesis testing
  2. Conduct one-sample, paired-samples and independent samples \(t\)-tests in R

6.2 Statistical theory: hypothesis testing

6.2.1 Question 1

What is a research hypothesis? What is a statistical hypothesis? What is the difference between the two? Explain in your own words.

6.2.2 Question 2

What are the null and alternative hypotheses for each of the following cases?

6.2.2.1 Case 1

You flip a coin 50 times and count the number of heads flipped (\(X = 12\)). Suppose you want to conduct a hypothesis test to determine if the underlying probability of flipping heads for this particular coin is equal to \(0.30\).

6.2.2.2 Case 2

A random survey of 75 death row inmates revealed that the mean length of time on death row is 17.4 years with a standard deviation of 6.3 years. Suppose you want to conduct a hypothesis test to determine if the mean time on death row is 15 years.16

6.2.2.3 Case 3

You are a researcher investigating the effect of age on working memory. You have collected data on how many random digits can be remembered for a brief period of time in a group of 20 young adults (\(\bar{X} = 7.2\) digits) and 20 elderly participants (\(\bar{X} = 6.9\) digits). The standard deviation is 1 digit in both groups. Suppose you want to conduct a hypothesis test to determine if young adults have a better working memory than elderly adults.

6.2.2.4 Case 4

A course coordinator is interested in whether their course about statistics actually has an effect on the statistical ability of students. The course coordinator has 200 students take a quiz about statistics before the course starts (mean pre-course quiz grade = 50 points) and has the same students take a quiz after completion of the course (mean post-course quiz grade = 75 points). The course coordinator made sure the questions were different between the quizzes, but equivalent in difficulty. The standard deviation of the difference is 10 points. Suppose the course coordinator wants to conduct a hypothesis test to determine if the course had any effect on the statistical ability of students (i.e. quiz grades getting better or worse).

6.2.3 Question 3

What is the appropriate statistical test to conduct for each of the statistical hypotheses stated in Question 2? Explain why each particular statistical test is appropriate. Assume the population standard deviation is unknown for cases 2, 3 and 4.

6.2.4 Question 4

What is meant by the critical region for a test statistic? What does it mean if the test statistic of your sample falls within the critical region? Explain in your own words.

6.2.5 Question 5

What is the meaning of the p-value of a statistical test? Explain in your own words.

6.2.6 Question 6

The following is the output of the statistical tests for the hypotheses stated in Question 2. Comment on the results of each statistical test. Do you reject or retain the null hypothesis in each case?

6.2.6.1 Case 1

> binom.test(X,N,theta)

Exact binomial test

data:  X and N
number of successes = 12, number of trials = 50, p-value = 0.4407
alternative hypothesis: true probability of success is not equal to 0.3
95 percent confidence interval:
  0.1306099 0.3816907
sample estimates:
probability of success 
  0.24 

6.2.6.2 Case 2

> t.test(time_deathrow, mu = 15)

One Sample t-test

data:  time_deathrow
t = 3.2991, df = 74, p-value = 0.001493
alternative hypothesis: true mean is not equal to 15
95 percent confidence interval:
  15.9505 18.8495
sample estimates:
mean of x 
 17.4 

6.2.6.3 Case 3

> t.test(young_adults, elderly, alternative = "greater")

Welch Two Sample t-test

data:  young_adults and elderly
t = 0.94868, df = 38, p-value = 0.1744
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
  -0.2331456        Inf
sample estimates:
mean of x mean of y 
  7.2       6.9 

6.2.6.4 Case 4

> t.test(pre, post, paired = TRUE)

Paired t-test

data:  pre and post
t = -80.58, df = 199, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -25.6118 -24.3882
sample estimates:
mean of the differences 
  -25

6.2.7 Question 7

What does it mean if the result of a statistical test is significant? Does statistical significance say anything about whether the result is important?

6.2.8 Question 8

What type of error is committed in each of the following cases? Which of the two error types has the greater consequence in this scenario?17

6.2.8.1 Case 1

You are a beginning rock climber and have bought some second-hand climbing equipment on Marktplaats. Assume your null hypothesis about the safety of the rock climbing equipment is: the climbing equipment is safe. What type of error do you commit if you think the rock climbing equipment is not safe, when in fact it is safe?

6.2.8.2 Case 2

You are a beginning rock climber and have bought some second-hand climbing equipment on Marktplaats. Assume your null hypothesis about the safety of the rock climbing equipment is: the climbing equipment is safe. What type of error do you commit if you think the rock climbing equipment is safe, when in fact it is not safe?

6.2.9 Question 9

What does the power of a statistical test refer to? Explain in your own words.

6.3 \(t\)-tests in R

I’ll be honest. Conducting a \(t\)-test in R is sometimes slightly bothersome. Part of that has to do with the fact that the t.test() function was not really written with “tidy data” in mind (which we have become used to over the past couple of weeks), but also because t.test() can do more than just one thing. The function can actually perform several different \(t\)-tests, so to use it you have to be quite specific in setting its arguments, depending on which \(t\)-test you want to perform. How to use t.test() is explained quite extensively in the textbook, so I will not go over all the details here, but I will highlight some important tips to successfully conduct \(t\)-tests in R.

Remember that the three types of \(t\)-test we can conduct are: the one-sample \(t\)-test, the paired samples \(t\)-test and the independent samples \(t\)-test. I will focus on the independent samples \(t\)-test here; how to use the other \(t\)-tests should follow from that (a quick sketch of all three is shown below). More details on how to use the different \(t\)-tests are given in the relevant sections of the textbook.
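
As a quick orientation, here is a minimal sketch of what the three calls look like. The vectors below are placeholder data generated purely for illustration; they are not part of this lab:

set.seed(1)
scores  <- rnorm(15, mean = 102, sd = 10)     # placeholder: one sample of measurements
pre     <- rnorm(15, mean = 50, sd = 10)      # placeholder: first measurement per participant
post    <- pre + rnorm(15, mean = 5, sd = 5)  # placeholder: second measurement, same participants
group_a <- rnorm(15, mean = 70, sd = 7)       # placeholder: first independent group
group_b <- rnorm(15, mean = 75, sd = 7)       # placeholder: second independent group

# One-sample t-test: is the mean of scores equal to a hypothesized value (here 100)?
t.test(scores, mu = 100)

# Paired-samples t-test: did the same participants change from pre to post?
t.test(pre, post, paired = TRUE)

# Independent-samples t-test: do two separate groups differ?
t.test(group_a, group_b)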

Suppose we want to compare the mean grade of one group of students to a different group of students. Maybe the two groups were instructed using a different method (a “classic” method and a “modern” method) and we have some reason to believe the instruction method has an effect on the exam grade. We conduct an independent samples \(t\)-test as follows:

set.seed(123)
classic <- rnorm(n = 10, mean = 70, sd = 7) # generate 10 grades for the classic group
modern <- rnorm(n = 10, mean = 75, sd = 7) # generate 10 grades for the modern group

results <- t.test(classic, modern)

print(results)
## 
##  Welch Two Sample t-test
## 
## data:  classic and modern
## t = -1.9029, df = 17.872, p-value = 0.07329
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12.4973418   0.6213934
## sample estimates:
## mean of x mean of y 
##  70.52238  76.46035

I want to briefly explain some of the arguments t.test() can take before having a detailed look at the results. All arguments were set to their defaults in the example above (i.e. they were not explicitly changed from the default setting), but sometimes you have to change an argument from its default depending on the \(t\)-test you are conducting. The most common arguments to change are listed below; a short sketch that combines them follows the list.

  • alternative: specify which alternative hypothesis to test. By default you test a two-sided hypothesis (“two.sided”), but you can also test a one-sided hypothesis (“greater” or “less”).
  • mu: specify the population mean under the null hypothesis (one-sample \(t\)-test) or the hypothesized difference between groups (independent or paired-samples \(t\)-test). The value of mu is 0 by default.
  • paired: indicate whether the groups are paired or not (not paired by default, i.e. independent)
  • conf.level: set the confidence level of the confidence interval (0.95 by default)
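
As a hedged illustration of how these arguments fit together, the call below reuses the classic and modern vectors generated above and changes every default at once; the one-sided direction and the 90% confidence level are chosen purely for demonstration:

# Illustration only: test whether the modern group scores higher than the classic group
t.test(modern, classic,
       alternative = "greater",  # one-sided alternative: mean(modern) > mean(classic)
       mu = 0,                   # hypothesized difference under the null (the default)
       paired = FALSE,           # the two groups are independent (the default)
       conf.level = 0.90)        # 90% confidence interval instead of the default 95%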

With that out of the way, let’s focus on the results of the independent samples \(t\)-test above. The \(t\)-test tested whether the difference between groups was equal to 0 or not (two-sided). The output reveals we shouldn’t reject the null hypothesis, because our p-value is not smaller than the value of \(\alpha\) commonly used (0.05). We saved the results of the \(t\)-test in a variable called results. If we want to select a specific result of the \(t\)-test (for example the confidence interval), we can do that as follows:

results$conf.int
## [1] -12.4973418   0.6213934
## attr(,"conf.level")
## [1] 0.95

The output of a \(t\)-test doesn’t look very pretty and you might want to make it a bit more aesthetically pleasing before you incorporate this result in a research paper. You can do this by putting the results of the \(t\)-test in a dataframe and printing it as a table as follows:

results_df <- data.frame(results$statistic,
                         results$parameter,
                         results$p.value,
                         results$conf.int[1],
                         results$conf.int[2])

knitr::kable(results_df,col.names = c("Statistic","d.f.","p","CI lower","CI upper"))
|   | Statistic |     d.f. |         p |  CI lower | CI upper  |
|---|-----------|----------|-----------|-----------|-----------|
| t | -1.902867 | 17.87244 | 0.0732939 | -12.49734 | 0.6213934 |
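
As an optional aside, if you happen to have the broom package installed, its tidy() function collects the main results of a \(t\)-test into a one-row data frame in a single step. This is an alternative, not something this lab requires:

# Optional alternative, assuming the broom package is installed
# (run install.packages("broom") first if it is not):
broom::tidy(results)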

While this is all working quite nicely, the data provided to t.test() in the example above was in the form of two vectors. The data type we have been working with most during this course is the data frame. For example, the Gapminder and Stroop data are in the form of a data frame. We need to do a bit of wrangling before t.test() will work with a data frame. I will briefly demonstrate how to use t.test() with data frames in wide and long format in the examples below. First, with a wide format:

# For this example, we put classic and modern in a data frame as follows:
df_grades <- data.frame(classic, modern)

# This data frame is in a wide format:
head(df_grades)
##    classic   modern
## 1 66.07667 83.56857
## 2 68.38876 77.51870
## 3 80.91096 77.80540
## 4 70.49356 75.77478
## 5 70.90501 71.10911
## 6 82.00545 87.50839
# We can use t.test() with a data frame in wide format as follows:
results <- t.test(df_grades$classic, df_grades$modern)
print(results)
## 
##  Welch Two Sample t-test
## 
## data:  df_grades$classic and df_grades$modern
## t = -1.9029, df = 17.872, p-value = 0.07329
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12.4973418   0.6213934
## sample estimates:
## mean of x mean of y 
##  70.52238  76.46035

So far, so good. What if our data frame has a long format?

library(tidyverse)
# Pivot our data frame from wide to long:
df_grades_long <- df_grades %>% 
  pivot_longer(cols = c(classic, modern), names_to = "Instruction", values_to = "Grade")

# This data frame is in a long format:
head(df_grades_long)
## # A tibble: 6 x 2
##   Instruction Grade
##   <chr>       <dbl>
## 1 classic      66.1
## 2 modern       83.6
## 3 classic      68.4
## 4 modern       77.5
## 5 classic      80.9
## 6 modern       77.8

We can use t.test() with a data frame in long format in two ways. The first way is not the preferred way of doing it, but I will show it for the sake of completeness. It goes as follows:

# Select data of the classic instruction group
classic_group <- df_grades_long %>% 
  filter(Instruction == "classic") %>%
  select(Grade)

# Select data of the modern instruction group
modern_group <- df_grades_long %>% 
  filter(Instruction == "modern") %>%
  select(Grade)

# Do the t-test
results <- t.test(classic_group$Grade, modern_group$Grade)
print(results)
## 
##  Welch Two Sample t-test
## 
## data:  classic_group$Grade and modern_group$Grade
## t = -1.9029, df = 17.872, p-value = 0.07329
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12.4973418   0.6213934
## sample estimates:
## mean of x mean of y 
##  70.52238  76.46035

The preferred way is to use the formula and data arguments of t.test():

results <- t.test(formula = Grade ~ Instruction, data = df_grades_long)
print(results)
## 
##  Welch Two Sample t-test
## 
## data:  Grade by Instruction
## t = -1.9029, df = 17.872, p-value = 0.07329
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12.4973418   0.6213934
## sample estimates:
## mean in group classic  mean in group modern 
##              70.52238              76.46035

This way of doing a \(t\)-test is especially relevant for you when working with the Gapminder data in your assignment, which is a data frame in long format.
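
To give a rough idea of what that could look like, here is a hypothetical sketch using the gapminder package. The column names (lifeExp, continent, year) and the choice of continents and year are assumptions made for illustration; check the column names in your own assignment data and make sure the grouping variable you test has exactly two levels.

# Hypothetical sketch (not your assignment data): compare mean life expectancy
# between two continents in a single year, using the formula interface.
# Assumes the gapminder package is installed; its columns include lifeExp, continent, year.
library(gapminder)
library(tidyverse)

gap_2007 <- gapminder %>% 
  filter(year == 2007, continent %in% c("Europe", "Americas"))

t.test(formula = lifeExp ~ continent, data = gap_2007)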

6.4 Stroop data

Let’s have another look at the Stroop experiment data from last week. The Stroop data should still be saved in your lab folder, so load the data as we did during the last lab:

library(tidyverse)
library(readxl)
# The path given to read_xlsx() depends on where you placed the file and what you named it
df_stroop_wide <- read_xlsx("data/stroop.xlsx")
df_stroop_long <- pivot_longer(df_stroop_wide, cols = c(Congruent, Incongruent), names_to = "Condition", values_to = "RT")

If you don’t have a copy of last week’s data saved, you can download an example file from Canvas under the lecture for Week 6.

6.4.1 \(t\)-test exercises

A colleague hypothesizes that the average reaction time for the Congruent condition is 1000 ms in the population (\(\mu = 1000\)). Suppose you want to test if there is a statistically significant difference between the mean reaction time in the Congruent condition of your sample and this population mean.

  1. State the null and alternative hypothesis for this test.
  2. Conduct the appropriate statistical test to test these hypotheses.
  3. What do you conclude about the null hypothesis?

We continue working with the Stroop data from last week, but this time we are interested in testing if there is a statistically significant difference in reaction time between the Congruent and Incongruent conditions. We strongly expect a positive difference and only want to test positive Stroop effects (i.e. Incongruent - Congruent > 0). Complete the following exercises:

  1. State the null and alternative hypothesis for this test.
  2. Conduct the appropriate statistical test to test these hypotheses.
  3. What do you conclude about the null hypothesis?

Finally, we are going to finish this lab by doing something “illegal.” Don’t worry, you aren’t going to get arrested, but I do want you to understand the gravity of the statistical crime we are going to commit. It’s for a good reason, so we are allowing it for once…

We know the data in the Stroop experiment was collected in pairs: you took part in both the Congruent and Incongruent conditions. But how would we analyze the Stroop data if we disregarded this fact and pretended the data was collected from independent groups instead? So, let’s pretend, for didactic purposes, that the reaction times for the Congruent condition were collected from one group of students and those for the Incongruent condition from a different group of students. We want to test if there is a difference between the two groups.

  1. State the null and alternative hypothesis for this test.
  2. Conduct the appropriate statistical test to test these hypotheses.
  3. What do you conclude about the null hypothesis?

When you have completed all exercises and are happy with your progress today, please knit your document and submit it to Canvas. If you finish before the time is up, think about what topics from this course you would like to revise, work on your assignment, or help out your fellow students.


  16. Case from Introductory Statistics by OpenStax

  17. Cases from Introductory Statistics by OpenStax