Data Visualization

Sep. 24, 2020

Housekeeping

Recordings

  • Where do I find the recordings?
  • When will the recordings become available?

Asking for Help

Before emailing for help, try the following:

  • Restart R and rerun your code.
  • Read the error message.
  • Search the internet.
  • Read the assigned readings.
  • Ask classmate(s) for help.

How to Ask for Help

If you have done your best to solve the problem and still need help, please email us. A good email will include:

  • Concise, detailed description of the problem.
  • Attached script (with comments!).
  • Email sent to both Rebecca and Zak.

Please do all you can to make it as easy as possible for us to help you. Answering each request takes a long time, and there are only two of us.

Review the Assignment

Part 1: Data Import

Import the Data

  1. Import passwords.sav into R as a data.frame. Save the data.frame as an object named pw.
library(foreign)
pw <- foreign::read.spss(file = "data/passwords.sav",
  to.data.frame = TRUE
)

Inspect the Data Frame

  1. Use View(), head(), and str() to inspect pw.
head(pw)
##   rank  password            category value time_unit offline_crack_sec rank_alt
## 1    1 password  password-related     6.91   years            2.17e+00        1
## 2    2 123456    simple-alphanumeric 18.52   minutes          1.11e-05        2
## 3    3 12345678  simple-alphanumeric  1.29   days             1.11e-03        3
## 4    4 1234      simple-alphanumeric 11.11   seconds          1.11e-07        4
## 5    5 qwerty    simple-alphanumeric  3.72   days             3.21e-03        5
## 6    6 12345     simple-alphanumeric  1.85   minutes          1.11e-06        6
##   strength font_size
## 1        8        11
## 2        4         8
## 3        4         8
## 4        4         8
## 5        8        11
## 6        4         8

str(pw)
## 'data.frame':    507 obs. of  9 variables:
##  $ rank             : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ password         : chr  "password " "123456   " "12345678 " "1234     " ...
##  $ category         : chr  "password-related   " "simple-alphanumeric" "simple-alphanumeric" "simple-alphanumeric" ...
##  $ value            : num  6.91 18.52 1.29 11.11 3.72 ...
##  $ time_unit        : chr  "years  " "minutes" "days   " "seconds" ...
##  $ offline_crack_sec: num  2.17 1.11e-05 1.11e-03 1.11e-07 3.21e-03 1.11e-06 3.21e-03 2.17 2.17 8.35e-02 ...
##  $ rank_alt         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ strength         : num  8 4 4 4 8 4 8 4 7 8 ...
##  $ font_size        : num  11 8 8 8 11 8 11 8 11 11 ...
##  - attr(*, "codepage")= int 65001

Password Lengths

  1. The column password is a character vector. Use the function nchar() to count the length of each password.
nchar(pw$password)
##   [1] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
##  [38] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
##  [75] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [112] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [149] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [186] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [223] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [260] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [297] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [334] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [371] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [408] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [445] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [482] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

Why Are They All the Same?

  1. Something seems fishy… Look more closely at pw$password. Why is the result of nchar() the same for each password? This is explained in the “Details” section of the documentation for foreign::read.spss().

“Fixed length strings (including value labels) are padded on the right with spaces by SPSS, and so are read that way by R.”

Trim That Whitespace

  1. Use trimws() to remove the leading and trailing whitespace from each character string of pw$password.
pw$password <- trimws(pw$password)
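
To see the padding problem and the fix in isolation, here is a minimal sketch on a small made-up vector. The object padded is a hypothetical stand-in for pw$password, not the real data.

```r
# Hypothetical stand-in for pw$password: each string is padded to 9 characters.
padded <- c("password ", "123456   ", "1234     ")

# Before trimming, every element has the same length.
lengths_before <- nchar(padded)

# trimws() strips leading and trailing whitespace (which = "both" is the default).
trimmed <- trimws(padded)
lengths_after <- nchar(trimmed)

print(lengths_before)
print(lengths_after)
```

Note that trimws() returns a new vector, so the result has to be assigned to keep it.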

Average Strength of Short Passwords

  1. Calculate the average strength of passwords with exactly 4 characters.
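# Option 1: subset the data frame, then take the mean.
pw_with_4_chars <- subset(pw, nchar(password) == 4)

mean(pw_with_4_chars$strength)
## [1] 4.428571
# Option 2: index the strength vector directly.
mean(pw$strength[nchar(pw$password) == 4])
## [1] 4.428571
# Option 3: store the lengths in a new column, then index.
pw$nchar <- nchar(pw$password)

mean(pw$strength[pw$nchar == 4])
## [1] 4.428571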

Working with Factors

The objective of this section of the assignment is to determine the average strength of the different categories of passwords.

Trim Trailing Whitespace

  1. Use trimws() to remove leading and trailing whitespace from pw$category.
pw$category <- trimws(pw$category)

Convert to Factor

  1. Convert pw$category to a factor with the following levels and labels:
Number  Level                Label
1       name                 Name
2       cool-macho           Cool/macho
3       simple-alphanumeric  Simple alphanumeric
4       fluffy               Fluffy
5       sport                Sport
6       nerdy-pop            Nerdy pop
7       animal               Animal
8       password-related     Password-related
9       food                 Food
10      rebellious-rude      Rebellious/rude

Convert to Factor (Code)

pw$category <- factor(pw$category,
  levels = c(
    "name",
    "cool-macho",
    "simple-alphanumeric",
    "fluffy",
    "sport",
    "nerdy-pop",
    "animal",
    "password-related",
    "food",
    "rebellious-rude"
  ),
  labels = c(
    "Name",
    "Cool/macho",
    "Simple alphanumeric",
    "Fluffy",
    "Sport",
    "Nerdy pop",
    "Animal",
    "Password-related",
    "Food",
    "Rebellious/rude"
  )
)
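
The order of the steps matters: any value that does not exactly match one of the levels becomes NA, so whitespace has to be trimmed before calling factor(). A minimal sketch with made-up values (the vector category below is hypothetical and uses only three of the ten levels):

```r
# Made-up values; the third entry still has trailing whitespace.
category <- c("name", "food", "name   ", "sport")

# Without trimming, "name   " matches no level and is converted to NA.
untrimmed <- factor(category,
  levels = c("name", "sport", "food"),
  labels = c("Name", "Sport", "Food")
)

# With trimming, every value matches a level.
trimmed <- factor(trimws(category),
  levels = c("name", "sport", "food"),
  labels = c("Name", "Sport", "Food")
)

# useNA = "ifany" shows NA as its own row when any are present.
print(table(untrimmed, useNA = "ifany"))
print(table(trimmed, useNA = "ifany"))
print(levels(trimmed))
```

Running table() with useNA = "ifany" on pw$category after the conversion is a quick way to check that no category was lost.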

Average Crack Time

  1. Calculate the average offline crack time of passwords in each of the 10 categories. Optional: Use by(), aggregate(), or tapply() to do this with just one line of code.
mean(pw$offline_crack_sec[pw$category == "Name"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Cool/macho"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Simple alphanumeric"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Fluffy"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Sport"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Nerdy pop"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Animal"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Password-related"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Food"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Rebellious/rude"], na.rm = TRUE)

Using by()

by(pw$offline_crack_sec,
  INDICES = pw$category,
  FUN = mean, na.rm = TRUE
)
## pw$category: Name
## [1] 0.2794351
## ------------------------------------------------------------ 
## pw$category: Cool/macho
## [1] 0.3471083
## ------------------------------------------------------------ 
## pw$category: Simple alphanumeric
## [1] 0.6123755
## ------------------------------------------------------------ 
## pw$category: Fluffy
## [1] 0.1607586
## ------------------------------------------------------------ 
## pw$category: Sport
## [1] 1.106457
## ------------------------------------------------------------ 
## pw$category: Nerdy pop
## [1] 1.288615
## ------------------------------------------------------------ 
## pw$category: Animal
## [1] 0.2435385
## ------------------------------------------------------------ 
## pw$category: Password-related
## [1] 2.247346
## ------------------------------------------------------------ 
## pw$category: Food
## [1] 0.1999104
## ------------------------------------------------------------ 
## pw$category: Rebellious/rude
## [1] 0.4044709

Using aggregate()

aggregate(offline_crack_sec ~ category,
  data = pw,
  FUN = mean,
  na.rm = TRUE
)
##               category offline_crack_sec
## 1                 Name         0.2794351
## 2           Cool/macho         0.3471083
## 3  Simple alphanumeric         0.6123755
## 4               Fluffy         0.1607586
## 5                Sport         1.1064572
## 6            Nerdy pop         1.2886154
## 7               Animal         0.2435385
## 8     Password-related         2.2473456
## 9                 Food         0.1999104
## 10     Rebellious/rude         0.4044709

Using tapply()

tapply(pw$offline_crack_sec,
  INDEX = pw$category,
  FUN = mean, na.rm = TRUE
)
##                Name          Cool/macho Simple alphanumeric              Fluffy 
##           0.2794351           0.3471083           0.6123755           0.1607586 
##               Sport           Nerdy pop              Animal    Password-related 
##           1.1064572           1.2886154           0.2435385           2.2473456 
##                Food     Rebellious/rude 
##           0.1999104           0.4044709
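
Because tapply() returns a named vector of group means, the result can be passed straight to functions such as sort() or round(). A minimal sketch on a made-up data frame (toy and its values are hypothetical, not the password data):

```r
# Made-up data with one missing crack time.
toy <- data.frame(
  category = factor(c("Name", "Name", "Food", "Food", "Sport")),
  offline_crack_sec = c(0.2, 0.4, 0.1, NA, 1.1)
)

# One mean per category; na.rm = TRUE is passed on to mean().
avg <- tapply(toy$offline_crack_sec,
  INDEX = toy$category,
  FUN = mean, na.rm = TRUE
)

# Order the categories from slowest to fastest to crack.
avg_sorted <- sort(avg, decreasing = TRUE)
print(avg_sorted)
```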

Plots

Histogram

  1. Create a histogram showing the distribution of strength for passwords with strength ≤ 10.
pwsub <- subset(pw, strength <= 10)
hist(pwsub$strength)

Histogram (Output)

Scatter Plot

  1. Create a scatter plot showing the relationship between password strength and password length. Include only passwords with strength ≤ 10.
plot(
  x = pwsub$strength,
  y = nchar(pwsub$password),
  xlab = "Password Strength",
  ylab = "Password Length"
)

Scatter Plot (Output)

Common Assignment Mistakes

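# Mistake: to.data.frame = TRUE sits outside the call, so read.spss() never receives it.
foreign::read.spss(file = "passwords.sav")
to.data.frame = TRUE
# Pass it as an argument inside the call instead.
foreign::read.spss(file = "passwords.sav", to.data.frame = TRUE)

# Mistake: read.spss() is called before library(foreign), so it fails in a fresh session.
pw <- read.spss(file = "passwords.sav", to.data.frame = TRUE)
library(foreign)

# Mistake: file.choose() opens an interactive dialog, so the script cannot be rerun unattended.
pw <- read.spss(file.choose(),
  to.data.frame = TRUE)

# Mistake: nchar("password") measures the literal string "password", not the column.
nchar("password")
nchar(pw$password)

# Mistake: the trimmed result is printed but never assigned back to pw$password.
trimws(pw$password)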

Don’t Include in Your Script

Don’t include these:

  • Calls to install.packages().
  • Calls to help() or ?.
  • Broken code/the mistakes you made along the way.

Today’s Lab: Data Visualization

The Datasaurus Dozen

Different Datasets

Datasaurus

Same Statistics!

  • Mx = 54.26
  • My = 47.83
  • SDx = 16.76
  • SDy = 26.93
  • r = -0.06

Learning Outcomes

  1. Give reasons why data visualization is a useful tool for science communication.
  2. Know and apply the APA guidelines for figures in APA manuscripts.
  3. Understand the basic “grammar of graphics” used by ggplot2.
  4. Apply the basic “grammar of graphics” to visualize data in R.

APA Guidelines for Figures

What not to do…

Not an APA figure

Guidelines

  • Simple, legible font
  • Axes labeled with units of measurement
  • All features labeled and explained
  • Figure notes appear below
  • Accessible use of colour

APA Figure Example

Sample APA figure

APA Table Example

Sample APA table

The Grammar of Graphics

The Grammar of Graphics

wilkinson-gg

Layered Grammar of Graphics

gg

Example

Scale and Coords

Example scale and coordinate system

Geometric Objects

Example geoms with aesthetic mappings

Example

Example plot

tRy it! Install and Load ggplot2

Install ggplot2.

install.packages("ggplot2")

Load ggplot2.

library(ggplot2)

tRy it! Data Import

Download “anscombe_long.csv” from Canvas and import it to R. Assign the resulting data.frame object the name anscombe_long.

anscombe_long <- read.csv("data/anscombe_long.csv")
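
If the CSV is not at hand, a similar object can be built from the anscombe dataset that ships with R. This is only a sketch: it assumes the Canvas file has the columns id, dataset, x, and y, with dataset coded 1 to 4, which may not match the real file exactly.

```r
# Built-in wide-format data: columns x1..x4 and y1..y4, 11 rows.
wide <- datasets::anscombe

# Stack the four x/y pairs on top of each other (assumed long layout).
anscombe_long <- data.frame(
  id = rep(seq_len(nrow(wide)), times = 4),
  dataset = rep(1:4, each = nrow(wide)),
  x = unlist(wide[paste0("x", 1:4)], use.names = FALSE),
  y = unlist(wide[paste0("y", 1:4)], use.names = FALSE)
)

str(anscombe_long)
```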

tRy it! Convert to Factor

Convert the column dataset to a factor with levels 1 = “I”, 2 = “II”, 3 = “III”, and 4 = “IV”.

anscombe_long$dataset <- factor(anscombe_long$dataset,
  levels = 1:4,
  labels = c("I", "II", "III", "IV")
)

The Data: Anscombe’s Quartet

“Four x-y datasets which have the same traditional statistical properties (mean, variance, correlation, regression line, etc.), yet are quite different.”

The Data: Anscombe’s Quartet

id dataset x y
1 I 10 8.04
2 I 8 6.95
3 I 13 7.58
4 I 9 8.81
5 I 11 8.33
6 I 14 9.96
7 I 6 7.24
8 I 4 4.26
9 I 12 10.84
10 I 7 4.82
11 I 5 5.68
1 II 10 9.14
2 II 8 8.14
3 II 13 8.74
4 II 9 8.77
5 II 11 9.26
6 II 14 8.10
7 II 6 6.13
8 II 4 3.10
9 II 12 9.13
10 II 7 7.26
11 II 5 4.74
1 III 10 7.46
2 III 8 6.77
3 III 13 12.74
4 III 9 7.11
5 III 11 7.81
6 III 14 8.84
7 III 6 6.08
8 III 4 5.39
9 III 12 8.15
10 III 7 6.42
11 III 5 5.73
1 IV 8 6.58
2 IV 8 5.76
3 IV 8 7.71
4 IV 8 8.84
5 IV 8 8.47
6 IV 8 7.04
7 IV 8 5.25
8 IV 19 12.50
9 IV 8 5.56
10 IV 8 7.91
11 IV 8 6.89

Same Statistical Properties?

tapply(anscombe_long$x,
  INDEX = anscombe_long$dataset,
  FUN = mean
)
##   I  II III  IV 
##   9   9   9   9
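
The same check can be extended from the mean of x to standard deviations and the correlation. The sketch below uses the anscombe dataset built into R rather than the Canvas file, and summarise_pair is a hypothetical helper name chosen for illustration.

```r
# Built-in wide-format data: columns x1..x4 and y1..y4.
wide <- datasets::anscombe

# Hypothetical helper: summary statistics for the i-th x/y pair.
summarise_pair <- function(i) {
  x <- wide[[paste0("x", i)]]
  y <- wide[[paste0("y", i)]]
  c(Mx = mean(x), SDx = sd(x), My = mean(y), SDy = sd(y), r = cor(x, y))
}

# sapply() returns one column per dataset; t() turns them into rows.
stats <- t(sapply(1:4, summarise_pair))
rownames(stats) <- c("I", "II", "III", "IV")

print(round(stats, 2))
```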

Same Statistical Properties?

Table 1

Statistical Properties of x and y in Four Datasets

Dataset  Mx  SDx   My   SDy   cor(x, y)
I        9   3.32  7.5  2.03  0.82
II       9   3.32  7.5  2.03  0.82
III      9   3.32  7.5  2.03  0.82
IV       9   3.32  7.5  2.03  0.82

So… how are they different?

Plot 1

histograms of y

Components Covered

Components of gg covered in plot 1

Data

gg

Plot 1: Data

ggplot(data = anscombe_long)

Aesthetics

aesthetics

What Are Aesthetics?

From the documentation for ggplot2::aes():

“Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms.”

Aesthetic Options

Argument  The Value Is Mapped To
x         Where the geom is placed along the x-axis.
y         Where the geom is placed along the y-axis.
colour    The line colour of the geom.
fill      The fill of the geom.
linetype  Different linetypes (e.g., solid, dashed, or dotted).
shape     Different shapes (e.g., square, circle, diamond).
size      The size of the geom.
alpha     The transparency of the geom.

Plot 1: Aesthetics

ggplot(data = anscombe_long, aes(x = y))
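
An aesthetic can either be mapped to a column (inside aes()) or set to a single fixed value (outside aes()). A minimal sketch of the difference, using a made-up data frame (toy and its columns are hypothetical):

```r
library(ggplot2)

# Made-up data with a grouping column.
toy <- data.frame(
  x = 1:6,
  y = c(2, 4, 3, 6, 5, 7),
  group = rep(c("a", "b"), each = 3)
)

# Mapped: colour depends on the values of a column, so it goes inside aes().
p_mapped <- ggplot(toy, aes(x = x, y = y, colour = group)) +
  geom_point(size = 3)

# Set: every point gets the same colour, so it goes outside aes().
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(colour = "#457b9d", size = 3)

print(p_mapped)
print(p_set)
```

Only the mapped version produces a legend, because only there does colour carry information about the data.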

Geometric Objects

geoms

What are Geometric Objects?

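Geometric objects (geoms) are the visual marks, such as points, bars, and lines, that represent the data in a plot. There are many, and we’ll learn more over the course of the lab. Today, we’ll learn about these: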

  • geom_histogram()
  • geom_point()
  • geom_jitter()
  • geom_smooth()
  • geom_boxplot()

Plot 1: Geometric Objects

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram()

Plot 1: Geometric Objects

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1)

Learning About Geoms

Learn more about a geom by reading its documentation. Let’s start with geom_histogram().

?geom_histogram

The documentation lists the required and optional aesthetics for a geom.

tRy it! Consult the Documentation

Scour the documentation for geom_histogram(). Which aesthetics can be mapped to geom_histogram()?

tRy it! Consult the Documentation (Answer)

geom_histogram aesthetics

Facets

facets

Plot 1: Facets

plot1

Adding Facets to Plot 1

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(facets = "dataset")

Adding Facets to Plot 1, Scales

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(facets = "dataset", scales = "free")
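
facet_wrap() also accepts a one-sided formula, and scales can be freed on one axis only with "free_x" or "free_y". A small sketch on made-up data (toy is hypothetical, with two groups on very different ranges):

```r
library(ggplot2)

# Made-up data: two groups whose y values sit on different ranges.
toy <- data.frame(
  dataset = rep(c("I", "II"), each = 4),
  y = c(1, 2, 2, 3, 10, 12, 12, 15)
)

# Formula notation; all panels share the same axes (the default, scales = "fixed").
p_fixed <- ggplot(toy, aes(x = y)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ dataset)

# Only the x-axis is allowed to differ between panels.
p_free_x <- ggplot(toy, aes(x = y)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ dataset, scales = "free_x")

print(p_fixed)
print(p_free_x)
```

Shared axes make panels directly comparable, while free axes make the shape within each panel easier to see. Which one is appropriate depends on the comparison the figure is meant to support.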

The Glamour of Graphics

From a conference talk given by William Chase.

glamour of graphics talk

Grammar vs. Glamour

Grammar of Graphics

“A tool that enables us to concisely describe the components of a graphic.”

Glamour of Graphics

“A tool that enables us to concisely describe the design of a graphic.”

Histogram Colour and Fill

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1, colour = "white", fill = "red") +
  facet_wrap(facets = "dataset", scales = "free")

Histogram Colour and Fill (plot)

Specify Colour by Hex

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1, colour = "#1d3557", fill = "#457b9d") +
  facet_wrap(facets = "dataset", scales = "free")

Specify Colour by Hex (plot)

Finding Colours for Plots

Colour Palettes

Colour Blind Filter

Titles and Labels

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1, colour = "#1d3557", fill = "#457b9d") +
  facet_wrap(facets = "dataset", scales = "free") +
  labs(
    title = "Anscombe's Quartet",
    subtitle = "Distributions of y",
    x = NULL,
    y = NULL
  )

Titles and Labels (plot)

ggplot Themes

Use ggplot themes to change the design elements of your plot. There are many, and their names all start with theme_. For example, our plot uses theme_minimal(), which is a good choice because it removes a lot of extraneous elements.

Add a Theme

ggplot(anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1, colour = "#1d3557", fill = "#457b9d") +
  facet_wrap(facets = "dataset", scales = "free") +
  labs(
    title = "Anscombe's Quartet",
    subtitle = "Distributions of y",
    x = NULL,
    y = NULL
  ) +
  theme_minimal()

Add a Theme (plot)

We’re Done! Save it!

ggsave(filename = "plot1.png")
ggsave(filename = "plot1.png", width = 9, height = 6, dpi = 320)
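
By default, ggsave() saves the last plot that was displayed, and width and height are in inches unless units says otherwise. A sketch that instead stores the plot in an object and passes it explicitly (the data frame toy, the object p, and the temporary output path are all made up for illustration):

```r
library(ggplot2)

# Made-up data and plot, stored in an object instead of only printed.
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Write to a temporary folder so the sketch leaves no files behind in the project.
out_file <- file.path(tempdir(), "plot1.png")

# plot = p makes it explicit which plot is saved.
ggsave(filename = out_file, plot = p, width = 9, height = 6, dpi = 320)

print(file.exists(out_file))
```

Passing plot = explicitly avoids saving the wrong figure when several plots have been drawn in the same session.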

Plot 2: Boxplot

Boxplots

Five-number summary + outliers

  1. Minimum (excluding outliers)
  2. Maximum (excluding outliers)
  3. Median
  4. Lower quartile (median of the lower half of the dataset)
  5. Upper quartile (median of the upper half of the dataset)
  6. Outliers
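
These numbers can be computed in base R with boxplot.stats(). The sketch below uses the y values of dataset IV from Anscombe's quartet and the common 1.5 × IQR rule for outliers. Quartile conventions differ slightly between functions, so the values drawn by a given plotting function may not always match exactly.

```r
# y values of dataset IV in Anscombe's quartet.
y_iv <- c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)

# boxplot.stats() returns the whisker ends, hinges, and median in $stats
# and the points beyond 1.5 * IQR from the hinges in $out.
bp <- boxplot.stats(y_iv)

# Order: lower whisker, lower hinge, median, upper hinge, upper whisker.
print(bp$stats)

# Points flagged as outliers.
print(bp$out)
```

Here the largest value is flagged as an outlier, so the upper whisker stops at the largest remaining value instead.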

tRy it! Recreate the Example Boxplot

ggplot(anscombe_long, aes(x = dataset, y = y)) +
  geom_boxplot()

Scatter Plots with Trendlines

Plot 3

tRy it! Create Plot 3

Plot 3 uses two new geoms: geom_smooth() and geom_point().

Create Plot 3

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_smooth() +
  geom_point() +
  facet_wrap("dataset") +
  theme_minimal()

Create Plot 3 (plot)

Adjust geom_smooth

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_smooth(method = "lm") +
  geom_point() +
  facet_wrap("dataset") +
  theme_minimal()

Adjust geom_smooth (plot)

Remove CI Around Smooth

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point() +
  facet_wrap("dataset") +
  theme_minimal()
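
The straight line drawn by geom_smooth(method = "lm") is a least-squares regression line, so it can be checked numerically with lm(). A sketch using the anscombe dataset built into R (not the Canvas file) to compare the fitted lines across the four datasets:

```r
# Built-in wide-format data: columns x1..x4 and y1..y4.
wide <- datasets::anscombe

# Fit y ~ x separately for each of the four pairs and keep the coefficients.
coefs <- t(sapply(1:4, function(i) {
  x <- wide[[paste0("x", i)]]
  y <- wide[[paste0("y", i)]]
  setNames(coef(lm(y ~ x)), c("intercept", "slope"))
}))
rownames(coefs) <- c("I", "II", "III", "IV")

print(round(coefs, 2))
```

The four fitted lines are nearly identical even though the point clouds look nothing alike, which is why plotting the points alongside the trendline matters.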

Remove CI Around Smooth (plot)

Summary

What Did We Learn Today?

  • Why does data visualization matter?
  • What are APA guidelines for figures?
  • Basics of ggplot2 and layered grammar of graphics.
  • Learned the geoms:
    • geom_histogram()
    • geom_boxplot()
    • geom_point()
    • geom_smooth()

Assignment

What’s Due?

  • Lab assignment due Monday at 11:59 PM.
  • Quiz 1 is due by next Wednesday at 11:59 PM.

Quiz 1 Information

  • Applied quiz.
  • Covers content in labs 1–3.
  • Includes:
    • Importing data.
    • Converting variables to factors.
    • Data visualization with ggplot2.

Timed quiz. You can access it in Canvas. Open book, open notes, open internet. Not open friends.

See You Next Time!