Chi-square

Oct. 1

The Chi-square Test

Bridge In

Learning Outcomes

Upon completing this module, students will be able to:

  • Conduct \(\chi^2\) tests in R.
  • Create appropriate data visualizations for comparisons of categorical data.
  • Report results of \(\chi^2\) tests in APA style.
  • Accurately interpret the results of \(\chi^2\) tests.
  • Evaluate the strength of evidence in studies that use \(\chi^2\) tests of independence for categorical data.

The Study: Reminders Through Association

Summary

Someone briefly remind us of the design of the study (focus on study 5).

What We’re Reproducing

Results from two \(\chi^2\) tests from study 5.

We’ll Do Together

“…participants are more likely to follow through when they are assigned a cue-based reminder (in the forced-reminder through-association condition, 87%) than when no cue-based reminder is available (none condition, 59%), \(\chi^2\)(1, N = 305) = 30.22, p < .001.”

On Your Own

“…those in the costly-reminder-through-association condition were not only more likely to earn the bonus (74%) than those in the none condition (59%), \(\chi^2\)(1, N = 297) = 7.23, p = .007,…”

Let’s Get Started

tRy it! Setup

Complete the steps in the “Setup” portion of the lab activity.

  1. Download “RTA_study5.csv” and “codebook database.xlsx” from Canvas.
  2. Import “RTA_study5.csv” to R.
  3. Convert the following variables to factors: condition, choice, and correct. Use “codebook database.xlsx” to identify the appropriate factor labels.

Import the Data

Import “RTA_study5.csv” to R.

dta <- read.csv("data/RTA_study5.csv")

Convert Variables to Factors

Convert the following variables to factors: condition, choice, and correct. Use “codebook database.xlsx” to identify the appropriate factor labels.

Convert condition to Factor

The codebook tells us the levels/labels for condition.

dta$condition <- factor(dta$condition,
  levels = 1:4,
  labels = c("Free", "None", "Costly", "All")
)
##  [1] Free   Costly Costly None   Free   Free   All    All    None   All   
## Levels: Free None Costly All
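
To see what levels and labels are doing, here is a minimal sketch on a toy vector. The codes below are hypothetical stand-ins, not the study data; only the level/label mapping is taken from the slide above.

```r
# Hypothetical numeric codes standing in for the raw condition column
codes <- c(1, 3, 3, 2, 4)

# levels = the values present in the raw data
# labels = the names that replace them, in the same order
condition <- factor(codes,
  levels = 1:4,
  labels = c("Free", "None", "Costly", "All")
)

condition
table(condition) # counts per label

# A code that is not listed in levels becomes NA,
# so unexpected NAs suggest the levels do not match the codebook
bad <- factor(c(1, 5),
  levels = 1:4,
  labels = c("Free", "None", "Costly", "All")
)
anyNA(bad)
```

Checking for NAs after the conversion is one quick way to confirm that the codes in the data match the codebook.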

Convert choice to Factor

What does choice tell us?

dta$choice <- factor(dta$choice,
  levels = c(0, 1),
  labels = c("did not take reminder", "took reminder")
)
##  [1] took reminder         took reminder         did not take reminder
##  [4] did not take reminder did not take reminder did not take reminder
##  [7] did not take reminder did not take reminder did not take reminder
## [10] took reminder        
## Levels: did not take reminder took reminder

Convert correct to Factor

What does correct tell us?

dta$correct <- factor(dta$correct,
  levels = c(0, 1),
  labels = c("incorrect", "correct")
)
##  [1] correct   incorrect incorrect correct   correct   correct   incorrect
##  [8] correct   correct   correct  
## Levels: incorrect correct

Reproduce Results

Result 1

“…participants are more likely to follow through when they are assigned a cue-based reminder (in the forced-reminder through-association condition, 87%) than when no cue-based reminder is available (none condition, 59%), \(\chi^2\)(1, N = 305) = 30.22, p < .001.”

tRy it! Subset & Drop Levels

  1. Create a subset of your data.frame that includes only the relevant levels of condition.
  2. Use droplevels() to drop the extra levels of condition.

Hint: You can use either | or %in% to subset with one line of code. Otherwise, you could do it in two steps.

Subset

Option 1: Multiple Steps

dta1 <- dta
dta1 <- subset(dta1, condition != "Free")
dta1 <- subset(dta1, condition != "Costly")

Subset (2)

Option 2: Using |

Remember that | means “or”.

dta1 <- subset(dta, condition == "All" | condition == "None")

Can you imagine a situation where this approach might be unwieldy?

Subset (3)

Option 3: Using %in%

dta1 <- subset(dta, condition %in% c("All", "None"))
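
A quick sketch showing that Options 2 and 3 select the same rows. The data frame toy is made up for illustration; it is not the study data.

```r
# Toy data frame (hypothetical values) with a four-level factor
toy <- data.frame(
  condition = factor(
    c("Free", "None", "Costly", "All", "None", "All"),
    levels = c("Free", "None", "Costly", "All")
  ),
  correct = c(1, 0, 1, 1, 1, 0)
)

by_or <- subset(toy, condition == "All" | condition == "None")
by_in <- subset(toy, condition %in% c("All", "None"))

# Both approaches keep exactly the same rows
identical(by_or, by_in)

# With %in%, the wanted levels live in one vector,
# so adding a level means editing one place
keep <- c("All", "None")
by_keep <- subset(toy, condition %in% keep)
nrow(by_keep)
```

With many levels to keep, the | version needs one comparison per level, while the %in% version only needs a longer vector.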

Check Your Work (an aside)

Why should you check your work as you go?

  • Just because R didn’t return an error doesn’t mean your code did what you wanted.
  • Sometimes errors later in the code are the result of an unrecognized mistake earlier.
  • Code that does the wrong thing but doesn’t return an error is harder to catch.
  • It saves time “debugging” down the line.

Did Our Subsetting Work?

We don’t need to see this in your code

We can check with the function all(), which returns:

  • TRUE if all values in the vector are TRUE.
  • FALSE if any values in the vector are FALSE.
  • NA otherwise.
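
The three cases can be sketched with small logical vectors (made up for illustration):

```r
# Every value is TRUE
all_true <- all(c(TRUE, TRUE, TRUE))

# A single FALSE decides the answer, even when an NA is present
any_false <- all(c(TRUE, FALSE, NA))

# No FALSE, but the NA could be either, so the answer is unknown
undecided <- all(c(TRUE, NA))

# na.rm = TRUE drops missing values before checking
ignoring_na <- all(c(TRUE, NA), na.rm = TRUE)

c(all_true, any_false, undecided, ignoring_na)
```

An NA result is itself a useful warning: it usually means the variable being checked has missing values.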

Did Our Subsetting Work? (2)

all(dta1$condition %in% c("All", "None"))
## [1] TRUE

all() can be a useful tool for testing your code.

Did Our Subsetting Work? (3)

Alternatively, we can use summary(), which will count the number of times each factor level occurs.

summary(dta1$condition)
##   Free   None Costly    All 
##      0    153      0    152

Drop Extra Levels

Why do we need to do this?

levels(dta1$condition)
## [1] "Free"   "None"   "Costly" "All"

Drop Extra Levels (2)

dta1$condition <- droplevels(dta1$condition)
levels(dta1$condition)
## [1] "None" "All"
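
A small sketch of why the unused levels matter, using made-up vectors cond and resp rather than the study data: levels with no observations still appear as empty rows in a contingency table until they are dropped.

```r
# Hypothetical subset: only None and All remain, but all four levels are kept
cond <- factor(c("None", "All", "None", "All"),
  levels = c("Free", "None", "Costly", "All")
)
resp <- factor(c("incorrect", "correct", "correct", "correct"))

# Before dropping: a 4 x 2 table with two all-zero rows
before <- table(cond, resp)
dim(before)

# After dropping: a 2 x 2 table with only the levels that occur
after <- table(droplevels(cond), resp)
dim(after)
```

Those empty rows would otherwise carry through to the mosaic plot and the test.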

Mosaic Plots

A Mosaic Plot

tRy it! Make a Mosaic Plot

  1. Read the documentation for mosaicplot(). Which argument(s) is/are required for the default S3 method?
  2. Create a contingency table using the function recommended in the mosaicplot() documentation.
  3. Make a mosaic plot.
  4. Remove the main title.
  5. Give appropriate labels to the x and y axes.

Making a Mosaic Plot

  • How many required arguments for mosaicplot()? How do you know?
  • What is the required argument?
  • What value does it expect?

Contingency Table First

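# Basic contingency table (rows = condition, columns = response)
cont_table <- table(dta1$condition, dta1$correct)
# Same table with named dimensions, which label the printed output
cont_table <- table(Condition = dta1$condition, Response = dta1$correct)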

Now Make the Plot

mosaicplot(cont_table)

Change the Labels

mosaicplot(table(dta1$condition, dta1$correct),
  main = NA,
  xlab = "Condition",
  ylab = "Response"
)

Bar Plot Using ggplot2

A ggplot2 Bar Plot

The Basics

library(ggplot2) # load ggplot2 before calling ggplot()
ggplot(dta1, aes(x = condition, fill = correct)) +
  geom_bar(position = "fill")

The Percentages

What Percentages?

“…participants are more likely to follow through when they are assigned a cue-based reminder (in the forced-reminder through-association condition, 87%) than when no cue-based reminder is available (none condition, 59%), \(\chi^2\)(1, N = 305) = 30.22, p < .001.”

How do we interpret these percentages?

Calculating These Percentages

There are several ways to do this, e.g., using the contingency table.

cont_table
##          Response
## Condition incorrect correct
##      None        63      90
##      All         20     132
cont_table / rowSums(cont_table)
##          Response
## Condition incorrect   correct
##      None 0.4117647 0.5882353
##      All  0.1315789 0.8684211
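
Another option, sketched here as an alternative rather than the approach used in the lab, is prop.table() with margin = 1 for row proportions. The table is rebuilt by hand from the counts printed above so the example runs without the data file.

```r
# Rebuild the contingency table from the counts shown above
# (matrix() fills column by column: incorrect first, then correct)
cont_table <- as.table(matrix(
  c(63, 20, 90, 132),
  nrow = 2,
  dimnames = list(
    Condition = c("None", "All"),
    Response = c("incorrect", "correct")
  )
))

# margin = 1 divides each cell by its row total
row_props <- prop.table(cont_table, margin = 1)
row_props

# As whole percentages, for comparison with the paper
round(100 * row_props)
```

The correct column should line up with the 59% and 87% reported in the quoted result.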

Using Subsets

  • Recall that FALSE = 0, and TRUE = 1.
  • Therefore, the mean of a logical vector is the proportion of TRUE values.
  • For example:
mean(c(TRUE, TRUE, TRUE, FALSE))
## [1] 0.75
mean(c(1, 1, 1, 0))
## [1] 0.75

Using Subsets (2)

mean(dta1$correct[dta1$condition == "All"] == "correct")
## [1] 0.8684211
mean(dta1$correct[dta1$condition == "None"] == "correct")
## [1] 0.5882353

Using Subsets (3)

With tapply()

tapply(dta1$correct == "correct", dta1$condition, mean)
##      None       All 
## 0.5882353 0.8684211

Actually Doing a Chi-square Test

tRy it! Chi-square Test

Use chisq.test() to conduct a \(\chi^2\) test to compare the groups.

  • Consult the documentation for chisq.test() to learn how this is done.
  • By default, chisq.test() applies a continuity correction for 2x2 tables. The authors of the paper do not apply this correction, so you will need to change the default argument to reproduce their results.

Chi-square Test

There are two equivalent ways of doing this.

Option 1

result1 <- chisq.test(x = cont_table, correct = FALSE)

Option 2

result1 <- chisq.test(x = dta1$condition,
  y = dta1$correct,
  correct = FALSE
)
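
To see what the correct argument changes, the sketch below runs the test both ways on a table rebuilt from the counts reported in these slides. The object names res_plain and res_yates are just illustrative.

```r
# Contingency table rebuilt from the counts in the slides
cont_table <- as.table(matrix(
  c(63, 20, 90, 132),
  nrow = 2,
  dimnames = list(
    Condition = c("None", "All"),
    Response = c("incorrect", "correct")
  )
))

# Without the continuity correction (what the paper reports)
res_plain <- chisq.test(cont_table, correct = FALSE)

# With the continuity correction (the default for 2x2 tables)
res_yates <- chisq.test(cont_table, correct = TRUE)

# The corrected statistic is smaller, so the two will not match
c(plain = unname(res_plain$statistic), yates = unname(res_yates$statistic))
```

If your statistic comes out a little lower than the published value, the default correction is a likely culprit.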

Results of the Chi-square

Test Statistic

result1
## 
##  Pearson's Chi-squared test
## 
## data:  dta1$condition and dta1$correct
## X-squared = 30.22, df = 1, p-value = 3.857e-08

N

sum(cont_table)
## [1] 305
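
One possible way to assemble the pieces into an APA-style string is sketched below. format_p() is a hypothetical helper written for this example, not a built-in function, and the table is rebuilt from the counts shown earlier.

```r
cont_table <- as.table(matrix(
  c(63, 20, 90, 132),
  nrow = 2,
  dimnames = list(
    Condition = c("None", "All"),
    Response = c("incorrect", "correct")
  )
))
res <- chisq.test(cont_table, correct = FALSE)

# Hypothetical helper: APA style drops the leading zero
# and reports very small values as p < .001
format_p <- function(p) {
  if (p < .001) {
    "p < .001"
  } else {
    paste0("p = ", sub("^0", "", sprintf("%.3f", p)))
  }
}

# Pull df, N, and the statistic out of the result object
apa <- sprintf(
  "chi-square(%d, N = %d) = %.2f, %s",
  as.integer(res$parameter),
  as.integer(sum(cont_table)),
  unname(res$statistic),
  format_p(res$p.value)
)
apa
```

In a written report, replace "chi-square" with the \(\chi^2\) symbol.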

The Chi-square Distribution

[Figure: the chi-square distribution]

Assumptions of Chi-square

What Are the Assumptions?

There are two main assumptions:

  1. Expected frequencies are sufficiently large.
  2. Observations are independent.

Expected Frequencies

At least 80% of cells should have an expected frequency greater than 5.

result1$expected
##               dta1$correct
## dta1$condition incorrect  correct
##           None  41.63607 111.3639
##           All   41.36393 110.6361
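
As a sketch of where these numbers come from, each expected frequency is (row total × column total) / N, and the test statistic sums (observed − expected)² / expected over the cells. The code below reproduces both by hand from the counts in the slides; the object names are illustrative.

```r
# Observed counts from the contingency table in the slides
observed <- matrix(
  c(63, 20, 90, 132),
  nrow = 2,
  dimnames = list(c("None", "All"), c("incorrect", "correct"))
)
n <- sum(observed)

# Expected count per cell: row total * column total / N
expected <- outer(rowSums(observed), colSums(observed)) / n
expected

# Share of cells with an expected frequency greater than 5
mean(expected > 5)

# Pearson chi-square statistic, degrees of freedom, and p value
stat <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p <- pchisq(stat, df = df, lower.tail = FALSE)
c(statistic = stat, df = df, p = p)
```

The values should agree with result1$expected and with the uncorrected chisq.test() output.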

Expected Frequencies too Small?

  • Fisher’s exact test computes exact p values.
  • Does not use expected frequencies to compute p.
  • Generally preferred over chi-square test (especially for small samples).

tRy it! Fisher’s Exact Test

Use fisher.test() to compare the groups using Fisher’s exact test.

Fisher’s Exact Test

fisher.test(x = cont_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  cont_table
## p-value = 3.919e-08
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.537016 8.614158
## sample estimates:
## odds ratio 
##   4.596306

Odds Ratio

Effect size.

  • OR = 1: Exposure is unrelated to the odds of the outcome.
  • OR > 1: Exposure is associated with higher odds of the outcome.
  • OR < 1: Exposure is associated with lower odds of the outcome.

Odds Ratio For This

  • OR = 1: Participants in the All condition had the same odds of answering correctly as those in the None condition.
  • OR > 1: Participants in the All condition had higher odds of answering correctly.
  • OR < 1: Participants in the All condition had lower odds of answering correctly.
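
A rough sketch of the odds ratio computed by hand from the counts in the contingency table. Note that fisher.test() reports a conditional maximum likelihood estimate, so its value differs slightly from this simple sample odds ratio.

```r
cont_table <- as.table(matrix(
  c(63, 20, 90, 132),
  nrow = 2,
  dimnames = list(
    Condition = c("None", "All"),
    Response = c("incorrect", "correct")
  )
))

# Odds of a correct response within each condition (correct : incorrect)
odds_all <- cont_table["All", "correct"] / cont_table["All", "incorrect"]
odds_none <- cont_table["None", "correct"] / cont_table["None", "incorrect"]

# Sample odds ratio: odds in All relative to odds in None
sample_or <- odds_all / odds_none
sample_or

# Conditional MLE reported by fisher.test(), for comparison
fisher_or <- unname(fisher.test(cont_table)$estimate)
fisher_or
```

Both estimates are well above 1, consistent with higher odds of a correct response in the All condition.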