5 Week 5: Estimation and Sampling Theory

In the textbook we learned that we can use various concepts from sampling theory (e.g. the law of large numbers, sampling distributions and the central limit theorem) to estimate something about a population from a sample. In the first part of the lab we are going to discuss sampling theory and sampling distributions in more detail. During the second part of the lab we are going to use R to estimate things about a population using our own data data. You will notice this lab contains fewer “ready-made” code examples for you to copy/paste and adjust than in previous weeks. As we progress through the course, we expect you to be able to code more “from scratch” (with the help of a few hints), based on what you’ve learned in previous weeks and the resources available to you.

To get started, download the lab template here (right click: save as) or from Canvas. Copy the lab template to your lab folder and double click the Lab project file (not the .Rmd) to open RStudio.

5.1 Learning goals

During this lab you will do the following:

Discuss various concepts related to sampling theory, sampling distributions and estimation
Work with real data to estimate population parameters from a sample using R

5.2 Collecting some data

Make sure you have completed the experiment and have registered your reaction times in the online spreadsheet provided by your tutor. Completing the experiment should take approximately 5 minutes. Your tutor will compile an excel file with the experimental data to complete the exercises at the end of this lab.

5.3 Statistical theory: sampling and estimation

5.3.1 Question 1

What is the difference between a population and a sample? Explain in your own words.

5.3.2 Question 2

Identify what type of sampling is conducted in the two cases below. Why is the sampling strategy in case 2 preferred when your population is not homogeneous?

5.3.2.1 Case 1

Suppose a researcher is interested in the average annual income of people living in The Netherlands. The researcher contacts the Belastingdienst (the Dutch tax authorities) to obtain a sample of income data. The Belastingdienst agrees to share 1,000 (anonymized) tax returns. The tax returns are selected completely at random and each member of the population has an equal probability of being sampled.

5.3.2.2 Case 2

You work for a polling organization in the USA and are tasked with surveying the popularity of the current president. You have resources to survey 500 participants. Instead of selecting 500 participants completely at random from the population, you make an effort to (randomly) sample 100 participants from rural areas and 400 participants from urban areas (the approximate ratio between rural inhabitants and urban inhabitants in the USA).

5.3.3 Question 3

Identify the source of sampling bias in the following two cases. Explain whether sampling bias could be a threat to the validity of each study.

5.3.3.1 Case 1

The course coordinator of a large course at EUC is investigating optical illusions. During a lecture, they include a slide containing a Ponzo illusion (see image below). Afterwards, they ask students whether the yellow lines are of equal length or not and they register the answers.

5.3.3.2 Case 2

A capstone student wants to investigate the use of recreational drugs by students in The Netherlands. They send out a survey about recreational drug use to 30 of their friends currently studying at EUC. They also ask their friends to forward the survey to anyone they know currently studying in The Netherlands.

5.3.4 Question 4

What is the difference between a population parameter and a sample statistic? Explain in your own words.

5.3.5 Question 5

In the example below we have R toss a fair coin ten times and calculate the mean amount of heads. The mean amount of heads (i.e. the proportion of heads) is calculated by: \(\frac{\text{number of heads}}{\text{number of flips}}\)

set.seed(123)
theta <- 0.5 # fair coin so P(Heads) = 0.5
N <- 10 # number of flips

flips <- rbinom(n = N, size = 1, prob = theta)
print(flips)

##  [1] 0 1 0 1 1 0 1 1 1 0

print(mean(flips))

## [1] 0.6

In this example that works out to be: \(\frac{\text{6 heads}}{\text{10flips}} = 0.6\)

In the long run, what do you expect the proportion of heads to be?
Adjust the code above to flip 100 coins and 10,000 coins respectively. What is the proportion of heads for 100 and 10,000 coins?
Explain why the proportion of heads gets closer and closer to \(0.5\) as the sample size increases.

5.3.6 Question 6

What is the difference between the distribution of your sample and the sampling distribution of the sample mean? How does a sampling distribution relate to the population? Explain in your own words.

5.3.7 Question 7

Consider the following two distributions. The first distribution is the sampling distribution of the sample mean of rolling one die 100,000 times. Of course, the sample mean of rolling one die is just the value of a single roll.

# set parameters of imaginary dice rolling experiment
n_rolls <- 100000
n_dice <- 1

# sample 100,000 dice rolls (single die)
mean_rolls <- replicate(n_rolls, mean(sample(1:6, size = n_dice, replace = TRUE)))

# plot histogram of sample means
# sample mean of 1 die is simply the value of a single roll
dice_rolls <- data.frame(mean_rolls)
ggplot(dice_rolls,aes(x=mean_rolls)) + 
  geom_histogram(binwidth = 1,fill="white",color="black")

The second distribution is the sampling distribution of the sample mean of rolling four dice 100,000 times. Here, we have R roll 4 dice 100,000 times and calculate the mean of each sample of 4 dice.

# set parameters of imaginary experiment
n_rolls <- 100000
n_dice <- 4

# sample 100,000 dice rolls (four dice)
mean_rolls <- replicate(n_rolls, mean(sample(1:6, size = n_dice, replace = TRUE)))

# plot histogram of sample means
dice_rolls <- data.frame(mean_rolls)
ggplot(dice_rolls,aes(x=mean_rolls)) + 
  geom_histogram(binwidth = 0.25,fill="white",color="black")

Why does the shape of the second distribution differ from the first distribution?
What is the relationship between the sample size (i.e. the amount of dice rolled per sample) and the standard deviation of the sampling distribution?

5.3.8 Question 8

Suppose we have information about the height of all Dutch males over 18 years of age. Height is normally distributed; the mean height and standard deviation of the population is as follows:

\[\mu = 180 cm\] \[ \sigma = 10 cm\]

What is the standard deviation of the sampling distribution of the sample mean when samples of size 16 are randomly drawn from the population?
What is the standard deviation of the sampling distribution of the sample mean when samples of size 100 are randomly drawn from the population?

5.3.9 Question 9

Suppose you collect the mean reaction time in a sample of 9 participants with a mean of 200ms (\(\bar{X} = 200 \text{ ms}\)). You know the true population standard deviation is equal to 25 ms (\(\sigma = 25 \text{ ms}\)).

What is the 95% confidence interval for the mean of this sample?
What happens to the confidence interval of the mean if the sample size increases to 25?

5.4 Stroop data

Let’s have a look at the data from our experiment and estimate something about the population based on our sample¹⁵. In a typical Stroop experiment, subjects name the color of words as fast as they can. The trick is that sometimes the color of the word is the same as the name of the word, and sometimes it is not, like you’ve probably noticed. Here is an example of the what you saw:

Congruent trials occur when the color and word match. So, the correct answers for each of the congruent stimuli shown would be to say, red, green, blue and yellow. Incongruent trials occur when the color and word mismatch. The correct answers for each of the incongruent stimuli would be: blue, yellow, red, green. What happens is that people are faster to name the color of the congruent items compared to the color of the incongruent items. This difference (incongruent reaction time - congruent reaction time) is called the Stroop effect.

The data your tutor provided can be loaded as follows:

library(readxl)
# Depending where you have placed and how you have named your file
df_stroop_wide <- read_xlsx("data/stroop.xlsx")

Having a quick look at the dataframe will reveal the data is formatted in a so-called “wide” format. That is, each row are the repeated responses of a single participant and each response is in a separate column (variable).

head(df_stroop_wide)

## # A tibble: 6 x 3
##   Participant Congruent Incongruent
##         <dbl>     <dbl>       <dbl>
## 1           1      1200        1500
## 2           2      1100        1350
## 3           3       900        1000
## 4           4       555        1111
## 5           5      1200        1500
## 6           6      1100        1350

Data entered into excel is often in the wide format, but a lot of functions in R require data to be in a so-called “long” format. In the long format, each variable forms a column, each observation forms a row, and each cell is a single measurement. To give a concrete example, compare the table of the average yearly temperature in three countries below:

Table 5.1: Average temperature in three countries for 1995-1997 in wide format
Country	1995	1996	1997
Sweden	5	12	4
Denmark	5	7	8
Norway	3	11	9

with the following table:

Table 5.2: Average temperature in three countries for 1995-1997 in long format
Country	Year	Average temperature
Sweden	1995	5
Denmark	1995	5
Norway	1995	3
Sweden	1996	12
Denmark	1996	7
Norway	1996	11
Sweden	1997	4
Denmark	1997	8
Norway	1997	9

Both tables contain the same data, but Table 5.1 is in wide format, while Table 5.2 is in long format.

There are various advantages working with data in a long format (explained here), but what is important to know is that most operations on your data are easier in the long format, while some operations are easier in the wide format. Furthermore, tidyverse functions like ggplot() expect data in the long format, so it’s important to learn how to go from wide format data to long format data (and vice versa). To functions to switch between data formats are pivot_longer() (wide to long) and pivot_wider() (long to wide).

To change the format of our data from wide to long, we use the pivot_longer() function as follows:

df_stroop_long <- df_stroop_wide %>% 
  pivot_longer(cols = c(Congruent,Incongruent),names_to = "Condition",values_to = "RT")

head(df_stroop_long)

## # A tibble: 6 x 3
##   Participant Condition      RT
##         <dbl> <chr>       <dbl>
## 1           1 Congruent    1200
## 2           1 Incongruent  1500
## 3           2 Congruent    1100
## 4           2 Incongruent  1350
## 5           3 Congruent     900
## 6           3 Incongruent  1000

Our data is now in the long format, like Table 5.2.

Let’s perform some operations on our data. The reaction times in our Stroop experiment are given in milliseconds, which is how reaction times are usually represented, but suppose we want to convert them to seconds instead. The mutate() function allows us to do that easily in long format data as follows:

df_stroop_long_sec <- df_stroop_long %>% 
  mutate(RT_sec = RT / 1000)

head(df_stroop_long_sec)

## # A tibble: 6 x 4
##   Participant Condition      RT RT_sec
##         <dbl> <chr>       <dbl>  <dbl>
## 1           1 Congruent    1200   1.2 
## 2           1 Incongruent  1500   1.5 
## 3           2 Congruent    1100   1.1 
## 4           2 Incongruent  1350   1.35
## 5           3 Congruent     900   0.9 
## 6           3 Incongruent  1000   1

Note: mutate() creates an additional variable (RT_sec) with the reaction times in seconds and preserves the variable RT. Use transmute() if you would like to drop the existing reaction time variable (RT) from your dataframe.

To pivot back from the long format to the wide format, we can use pivot_wider() as follows:

df_stroop_wide_sec <- df_stroop_long_sec %>% 
  pivot_wider(names_from = Condition, values_from = c(RT,RT_sec))

head(df_stroop_wide_sec)

## # A tibble: 6 x 5
##   Participant RT_Congruent RT_Incongruent RT_sec_Congruent RT_sec_Incongruent
##         <dbl>        <dbl>          <dbl>            <dbl>              <dbl>
## 1           1         1200           1500            1.2                 1.5 
## 2           2         1100           1350            1.1                 1.35
## 3           3          900           1000            0.9                 1   
## 4           4          555           1111            0.555               1.11
## 5           5         1200           1500            1.2                 1.5 
## 6           6         1100           1350            1.1                 1.35

Now, let’s perform an operation on the wide format data. Suppose we want to calculate the difference between the Congruent and Incongruent condition in seconds. This can be achieved in wide format data with mutate() as follows:

df_stroop_wide_sec_diff <- df_stroop_wide_sec %>% 
  mutate(RT_sec_diff = RT_sec_Congruent - RT_sec_Incongruent)

head(df_stroop_wide_sec_diff)

## # A tibble: 6 x 6
##   Participant RT_Congruent RT_Incongruent RT_sec_Congruent RT_sec_Incongru…
##         <dbl>        <dbl>          <dbl>            <dbl>            <dbl>
## 1           1         1200           1500            1.2               1.5 
## 2           2         1100           1350            1.1               1.35
## 3           3          900           1000            0.9               1   
## 4           4          555           1111            0.555             1.11
## 5           5         1200           1500            1.2               1.5 
## 6           6         1100           1350            1.1               1.35
## # … with 1 more variable: RT_sec_diff <dbl>

Notice that the particular expression RT_sec_Congruent - RT_sec_Incongruent in mutate() only made sense in the context of wide format data.

We could have achieved the above with different operations on the long format data, but the lesson is that you should pay attention to the format of your data: not all data you import is in the long format, while some functions require it, and the format of your data dictates how and which operations you can perform.

Complete the Stroop data exercises below in your lab template.

5.4.1 Stroop data exercises

Calculate the mean and standard deviation of the Stroop effect (i.e. the difference between the incongruent and congruent condition). There are several ways to achieve this and you can use either the wide format or long format data. Of course, there are differences how to approach this based on the format of your data. Pick the option that suits you most.
Calculate the 95% confidence interval of the Stroop effect (hint: use qt() with the appropriate degrees of freedom to find the correct value for a 95% confidence interval).
Why do you need to use a t-distribution for calculating the confidence interval in this particular instance?

This next exercise is combining everything we have learned so far.

Have a look at the following paper. The authors of this paper investigated the Stroop effect in two different conditions and the main results are summarized in Figure 1.

Adjust/complete the code example below to recreate Figure 1 of the paper linked. We only have two conditions, so your figure will only include two bars + errorbars (congruent and incongruent).

df_summary <- df_stroop_long %>%
  group_by(...) %>%
  summarise(n = ...,
            meanRT = ...,
            sdRT = ...) %>%
  mutate(se = ...)

ggplot(df_summary , aes(x=... ,y= ...)) +
  geom_col() +
  geom_errorbar(aes(ymin = ... , ymax = ... ))

When you have completed all exercises and are happy with your progress today, please knit your document and submit it to Canvas. If you finish before the time is up, start with the required readings of Week 6, work on your assignment, or help out your fellow students.

Example adapted from Lab 10 in “Answering questions with data” by Matthew Crump↩︎