Data Visualization

Sep. 24, 2020

Housekeeping

Recordings

Where do I find the recordings?
When will the recordings become available?

Asking for Help

Before emailing for help, try the following:

Restart R and rerun your code.
Read the error message.
Search the internet.
Read the assigned readings.
Ask classmate(s) for help.

How to Ask for Help

If you have done your best to solve the problem and still need help, please email us. A good email will include:

Concise, detailed description of the problem.
Attached script (with comments!).
Email sent to both Rebecca and Zak.

Please make it as easy as possible for us to help you. Troubleshooting takes a long time, and there are only two of us.

Review the Assignment

Part 1: Data Import

Import the Data

Import passwords.sav into R as a data.frame. Save the data.frame as an object named pw.

library(foreign)

pw <- foreign::read.spss(file = "data/passwords.sav",
  to.data.frame = TRUE
)

Inspect the Data Frame

Use View(), head(), and str() to inspect pw.

head(pw)

##   rank  password            category value time_unit offline_crack_sec rank_alt
## 1    1 password  password-related     6.91   years            2.17e+00        1
## 2    2 123456    simple-alphanumeric 18.52   minutes          1.11e-05        2
## 3    3 12345678  simple-alphanumeric  1.29   days             1.11e-03        3
## 4    4 1234      simple-alphanumeric 11.11   seconds          1.11e-07        4
## 5    5 qwerty    simple-alphanumeric  3.72   days             3.21e-03        5
## 6    6 12345     simple-alphanumeric  1.85   minutes          1.11e-06        6
##   strength font_size
## 1        8        11
## 2        4         8
## 3        4         8
## 4        4         8
## 5        8        11
## 6        4         8

Inspect the Data Frame

Use View(), head(), and str() to inspect pw.

str(pw)

## 'data.frame':    507 obs. of  9 variables:
##  $ rank             : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ password         : chr  "password " "123456   " "12345678 " "1234     " ...
##  $ category         : chr  "password-related   " "simple-alphanumeric" "simple-alphanumeric" "simple-alphanumeric" ...
##  $ value            : num  6.91 18.52 1.29 11.11 3.72 ...
##  $ time_unit        : chr  "years  " "minutes" "days   " "seconds" ...
##  $ offline_crack_sec: num  2.17 1.11e-05 1.11e-03 1.11e-07 3.21e-03 1.11e-06 3.21e-03 2.17 2.17 8.35e-02 ...
##  $ rank_alt         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ strength         : num  8 4 4 4 8 4 8 4 7 8 ...
##  $ font_size        : num  11 8 8 8 11 8 11 8 11 11 ...
##  - attr(*, "codepage")= int 65001

Password Lengths

The column password is a character vector. Use the function nchar() to count the length of each password.

nchar(pw$password)

##   [1] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
##  [38] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
##  [75] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [112] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [149] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [186] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [223] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [260] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [297] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [334] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [371] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [408] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [445] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [482] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

Why are they all the Same?

Something seems fishy… Look more closely at pw$password. Why is the result of nchar() the same for each password? This is explained in the “Details” section of the documentation for foreign::read.spss().

“Fixed length strings (including value labels) are padded on the right with spaces by SPSS, and so are read that way by R.”

Trim That Whitespace

Use trimws() to remove the leading and trailing whitespace from each character string of pw$password.

pw$password <- trimws(pw$password)
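
A minimal sketch of the padding problem, using three made-up padded strings rather than the real passwords.sav data (the names padded and trimmed are hypothetical). Each string is padded with spaces to the same width, so nchar() reports the same length until the whitespace is trimmed.

```r
# Hypothetical stand-ins for SPSS fixed-width strings, each padded to 9 characters
padded <- c("password ", "123456   ", "1234     ")

# Before trimming, every string has the same length
nchar(padded)

# trimws() removes leading and trailing whitespace by default (which = "both")
trimmed <- trimws(padded)

# After trimming, the lengths reflect the actual passwords
nchar(trimmed)
```

Note that trimws() returns a new vector. The original is unchanged until the result is assigned back.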

Average Strength of Short Passwords

Calculate the average strength of passwords with exactly 4 characters.

pw_with_4_chars <- subset(pw, nchar(password) == 4)

mean(pw_with_4_chars$strength)

## [1] 4.428571

mean(pw$strength[nchar(pw$password) == 4])

## [1] 4.428571

pw$nchar <- nchar(pw$password)

mean(pw$strength[pw$nchar == 4])

## [1] 4.428571

Working with Factors

The objective of this section of the assignment is to determine the average strength of the different categories of passwords.

Print Unique Values of Category

Use unique(pw$category) to print the unique values of category to the console.

unique(pw$category)

##  [1] "password-related   " "simple-alphanumeric" "animal             "
##  [4] "sport              " "cool-macho         " "name               "
##  [7] "fluffy             " "food               " "nerdy-pop          "
## [10] "rebellious-rude    " "                   "

Trim Trailing Whitespace

Use trimws() to remove leading and trailing whitespace from pw$category.

pw$category <- trimws(pw$category)

Convert to Factor

Convert pw$category to a factor with the following levels and labels:

Level Number	Level	Label
1	name	Name
2	cool-macho	Cool/macho
3	simple-alphanumeric	Simple alphanumeric
4	fluffy	Fluffy
5	sport	Sport
6	nerdy-pop	Nerdy pop
7	animal	Animal
8	password-related	Password-related
9	food	Food
10	rebellious-rude	Rebellious/rude

Convert to Factor (Code)

pw$category <- factor(pw$category,
  levels = c(
    "name",
    "cool-macho",
    "simple-alphanumeric",
    "fluffy",
    "sport",
    "nerdy-pop",
    "animal",
    "password-related",
    "food",
    "rebellious-rude"
  ),
  labels = c(
    "Name",
    "Cool/macho",
    "Simple alphanumeric",
    "Fluffy",
    "Sport",
    "Nerdy pop",
    "Animal",
    "Password-related",
    "Food",
    "Rebellious/rude"
  )
)
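
One thing to keep in mind: any value that is not listed in levels becomes NA when the factor is created. The output of unique(pw$category) above includes a blank category, which is an empty string after trimming and is not among the ten levels. A small sketch with made-up values (the vector raw is hypothetical, not the course data) shows the behaviour:

```r
# Hypothetical category values, including one empty string
raw <- c("name", "food", "", "animal", "name")

f <- factor(raw,
  levels = c("name", "animal", "food"),
  labels = c("Name", "Animal", "Food")
)

# The empty string is not one of the levels, so it becomes NA
f

# Count the values per level, including the NA
table(f, useNA = "ifany")
```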

Average Crack Time

Calculate the average offline crack time of passwords in each of the 10 categories. Optional: Use by(), aggregate(), or tapply() to do this with just one line of code.

mean(pw$offline_crack_sec[pw$category == "Name"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Cool/macho"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Simple alphanumeric"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Fluffy"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Sport"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Nerdy pop"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Animal"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Password-related"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Food"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Rebellious/rude"], na.rm = TRUE)

Using `by()`

by(pw$offline_crack_sec,
  INDICES = pw$category,
  FUN = mean, na.rm = TRUE
)

## pw$category: Name
## [1] 0.2794351
## ------------------------------------------------------------ 
## pw$category: Cool/macho
## [1] 0.3471083
## ------------------------------------------------------------ 
## pw$category: Simple alphanumeric
## [1] 0.6123755
## ------------------------------------------------------------ 
## pw$category: Fluffy
## [1] 0.1607586
## ------------------------------------------------------------ 
## pw$category: Sport
## [1] 1.106457
## ------------------------------------------------------------ 
## pw$category: Nerdy pop
## [1] 1.288615
## ------------------------------------------------------------ 
## pw$category: Animal
## [1] 0.2435385
## ------------------------------------------------------------ 
## pw$category: Password-related
## [1] 2.247346
## ------------------------------------------------------------ 
## pw$category: Food
## [1] 0.1999104
## ------------------------------------------------------------ 
## pw$category: Rebellious/rude
## [1] 0.4044709

Using `aggregate()`

aggregate(offline_crack_sec ~ category,
  data = pw,
  FUN = mean,
  na.rm = TRUE
)

##               category offline_crack_sec
## 1                 Name         0.2794351
## 2           Cool/macho         0.3471083
## 3  Simple alphanumeric         0.6123755
## 4               Fluffy         0.1607586
## 5                Sport         1.1064572
## 6            Nerdy pop         1.2886154
## 7               Animal         0.2435385
## 8     Password-related         2.2473456
## 9                 Food         0.1999104
## 10     Rebellious/rude         0.4044709

Using `tapply()`

tapply(pw$offline_crack_sec,
  INDEX = pw$category,
  FUN = mean, na.rm = TRUE
)

##                Name          Cool/macho Simple alphanumeric              Fluffy 
##           0.2794351           0.3471083           0.6123755           0.1607586 
##               Sport           Nerdy pop              Animal    Password-related 
##           1.1064572           1.2886154           0.2435385           2.2473456 
##                Food     Rebellious/rude 
##           0.1999104           0.4044709

Plots

Histogram

Create a histogram showing the distribution of password strength. Include only passwords with strength ≤ 10.

pwsub <- subset(pw, strength <= 10)

hist(pwsub$strength)

Histogram (Output)

Scatter Plot

Create a scatter plot showing the relationship between password strength and password length. Include only passwords with strength ≤ 10.

plot(
  x = pwsub$strength,
  y = nchar(pwsub$password),
  xlab = "Password Strength",
  ylab = "Password Length"
)

Scatter Plot (Output)

Common Assignment Mistakes

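# Mistake: to.data.frame = TRUE sits outside the call, so it is never passed to read.spss().
foreign::read.spss(file = "passwords.sav")
to.data.frame = TRUE

# The argument is now inside the call, but the result is only printed, not saved to pw.
foreign::read.spss(file = "passwords.sav", to.data.frame = TRUE)

# Mistake: library(foreign) has to run before read.spss() is called without foreign::.
pw <- read.spss(file = "passwords.sav", to.data.frame = TRUE)
library(foreign)

# Mistake: file.choose() needs a manual file selection on every run, so the script is not reproducible.
pw <- read.spss(file.choose(),
  to.data.frame = TRUE)

# Mistake: this counts the characters in the literal string "password", not in the column.
nchar("password")

# Counts the characters of every value in the password column.
nchar(pw$password)

# Mistake: the trimmed values are printed but never assigned back to pw$password.
trimws(pw$password)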

Don’t Include in Your Script

Don’t include these:

Calls to install.packages().
Calls to help() or ?.
Broken code/the mistakes you made along the way.

Today’s Lab: Data Visualization

The Datasaurus Dozen

Different Datasets

Datasaurus

Same Statistics!

M_x = 54.26
M_y = 47.83
SD_x = 16.76
SD_y = 26.93
r = -0.06

Learning Outcomes

Give reasons why data visualization is a useful tool for science communication.
Know and apply the APA guidelines for figures in APA manuscripts.
Understand the basic “grammar of graphics” used by ggplot2.
Apply the basic “grammar of graphics” to visualize data in R.

APA Guidelines for Figures

What not to do…

Not an APA figure

Guidelines

Simple, legible font
Axes labeled with units of measurement
All features labeled and explained
Figure notes appear below
Accessible use of colour

APA Figure Example

Sample APA figure

APA Table Example

Sample APA table

The Grammar of Graphics

wilkinson-gg

Layered Grammar of Graphics

Example

Scale and Coords

Example scale and coordinate system

Geometric Objects

Example geoms with aesthetic mappings

Example

Example plot

tRy it! Install and Load `ggplot2`

Install ggplot2.

install.packages("ggplot2")

Load ggplot2.

library(ggplot2)

tRy it! Data Import

Download “anscombe_long.csv” from Canvas and import it to R. Assign the resulting data.frame object the name anscombe_long.

anscombe_long <- read.csv("data/anscombe_long.csv")

tRy it! Convert to Factor

Convert the column dataset to a factor with levels 1 = “I”, 2 = “II”, 3 = “III”, and 4 = “IV”.

anscombe_long$dataset <- factor(anscombe_long$dataset,
  levels = 1:4,
  labels = c("I", "II", "III", "IV")
)

The Data: Anscombe’s Quartet

“Four x-y datasets which have the same traditional statistical properties (mean, variance, correlation, regression line, etc.), yet are quite different.”

The Data: Anscombe’s Quartet

id	dataset	x	y
1	I	10	8.04
2	I	8	6.95
3	I	13	7.58
4	I	9	8.81
5	I	11	8.33
6	I	14	9.96
7	I	6	7.24
8	I	4	4.26
9	I	12	10.84
10	I	7	4.82
11	I	5	5.68
1	II	10	9.14
2	II	8	8.14
3	II	13	8.74
4	II	9	8.77
5	II	11	9.26
6	II	14	8.10
7	II	6	6.13
8	II	4	3.10
9	II	12	9.13
10	II	7	7.26
11	II	5	4.74
1	III	10	7.46
2	III	8	6.77
3	III	13	12.74
4	III	9	7.11
5	III	11	7.81
6	III	14	8.84
7	III	6	6.08
8	III	4	5.39
9	III	12	8.15
10	III	7	6.42
11	III	5	5.73
1	IV	8	6.58
2	IV	8	5.76
3	IV	8	7.71
4	IV	8	8.84
5	IV	8	8.47
6	IV	8	7.04
7	IV	8	5.25
8	IV	19	12.50
9	IV	8	5.56
10	IV	8	7.91
11	IV	8	6.89
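
If the Canvas file is not at hand, a comparable data frame can be sketched from the anscombe dataset that ships with base R. This is an assumption about equivalence, not the course file itself: the built-in data is in wide format (columns x1 to x4 and y1 to y4), so the sketch stacks it into the id/dataset/x/y layout shown above.

```r
# Built-in wide-format data: columns x1..x4 and y1..y4, 11 rows each
data(anscombe)

dataset_labels <- c("I", "II", "III", "IV")

# Stack the four x columns and the four y columns into long format
anscombe_long <- data.frame(
  id = rep(1:11, times = 4),
  dataset = factor(rep(dataset_labels, each = 11), levels = dataset_labels),
  x = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  y = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

head(anscombe_long)
str(anscombe_long)
```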

Same Statistical Properties?

tapply(anscombe_long$x,
  INDEX = anscombe_long$dataset,
  FUN = mean
)

##   I  II III  IV 
##   9   9   9   9

Same Statistical Properties?

Table 1

Statistical Properties of x and y in Four Datasets

	Mx	SDx	My	SDy	cor(x, y)
I	9	3.32	7.5	2.03	0.82
II	9	3.32	7.5	2.03	0.82
III	9	3.32	7.5	2.03	0.82
IV	9	3.32	7.5	2.03	0.82
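
The values in Table 1 can be reproduced with a few lines of base R. The sketch below is one possible approach, not necessarily how the table was made, and it assumes the built-in anscombe dataset matches the course CSV. It rebuilds the long-format data, then computes the mean, standard deviation, and correlation within each dataset.

```r
# Rebuild the long-format data from the built-in wide-format anscombe dataset
dataset_labels <- c("I", "II", "III", "IV")
anscombe_long <- data.frame(
  dataset = factor(rep(dataset_labels, each = 11), levels = dataset_labels),
  x = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  y = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

# Split by dataset, summarise each piece, and bind the rows back together
stats <- do.call(rbind, lapply(
  split(anscombe_long, anscombe_long$dataset),
  function(d) {
    data.frame(
      Mx = mean(d$x), SDx = sd(d$x),
      My = mean(d$y), SDy = sd(d$y),
      r = cor(d$x, d$y)
    )
  }
))

round(stats, 2)
```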

So… how are they different?

Plot 1

histograms of y

Components Covered

Components of gg covered in plot 1

Data

Plot 1: Data

ggplot(data = anscombe_long)

Aesthetics

aesthetics

What Are Aesthetics?

From the documentation for ggplot2::aes():

“Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms.”

Aesthetic Options

Argument	The Value is Mapped to:
x	Where the geom is placed along the x-axis.
y	Where the geom is placed along the y-axis.
colour	The line colour of the geom.
fill	The fill of the geom.
linetype	Different linetypes (e.g., solid, dashed, or dotted).
shape	Different shapes (e.g., square, circle, diamond).
size	The size of the geom.
alpha	The transparency of the geom.
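
A hedged sketch of the difference between mapping an aesthetic to a variable and setting it to a fixed value. It assumes ggplot2 is installed and uses the built-in anscombe dataset as a stand-in for the course CSV. The object names p_mapped and p_set are hypothetical.

```r
library(ggplot2)

# Stand-in for the course data, built from the wide-format anscombe dataset
dataset_labels <- c("I", "II", "III", "IV")
anscombe_long <- data.frame(
  dataset = factor(rep(dataset_labels, each = 11), levels = dataset_labels),
  x = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  y = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

# Mapped: colour depends on the data, so it goes inside aes() and gets a legend
p_mapped <- ggplot(anscombe_long, aes(x = x, y = y, colour = dataset)) +
  geom_point()

# Set: one fixed colour for every point, so it goes outside aes()
p_set <- ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_point(colour = "#457b9d")

p_mapped
p_set
```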

Plot 1: Aesthetics

ggplot(data = anscombe_long, aes(x = y))

Geometric Objects

geoms

What are Geometric Objects?

There are many, and we’ll learn more over the course of the lab. Today, we’ll learn about these:

geom_histogram()
geom_point()
geom_jitter()
geom_smooth()
geom_boxplot()

Plot 1: Geometric Objects

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram()

Plot 1: Geometric Objects

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1)

Learning About Geoms

Learn more about a geom by visiting the documentation for that geom. Let’s start with geom_histogram()

?geom_histogram

The documentation will tell you required and optional aesthetics for a geom.

tRy it! Consult the Documentation

Scour the documentation for geom_histogram(). Which aesthetics can be mapped to geom_histogram()?

tRy it! Consult the Documentation (Answer)

geom_histogram aesthetics

The Glamour of Graphics

From a conference talk given by William Chase.

glamour of graphics talk

Grammar vs. Glamour

Grammar of Graphics

“A tool that enables us to concisely describe the components of a graphic.”

Glamour of Graphics

“A tool that enables us to concisely describe the design of a graphic.”

Histogram Colour and Fill

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1, colour = "white", fill = "red") +
  facet_wrap(facets = "dataset", scales = "free")

Histogram Colour and Fill (plot)
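
The facet_wrap() layer in the code above splits the plot into one panel per level of dataset. The sketch below compares the default shared axes with scales = "free", where each panel gets its own axis ranges. It assumes ggplot2 is installed and uses the built-in anscombe dataset as a stand-in for the course CSV. The object names p_fixed and p_free are hypothetical.

```r
library(ggplot2)

# Stand-in for the course data, built from the wide-format anscombe dataset
dataset_labels <- c("I", "II", "III", "IV")
anscombe_long <- data.frame(
  dataset = factor(rep(dataset_labels, each = 11), levels = dataset_labels),
  y = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

base_plot <- ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1, colour = "white", fill = "red")

# Default: all panels share the same x and y axis ranges
p_fixed <- base_plot + facet_wrap(facets = "dataset")

# Free scales: each panel chooses its own axis ranges
p_free <- base_plot + facet_wrap(facets = "dataset", scales = "free")

p_fixed
p_free
```

Shared axes make panels easier to compare directly. Free scales show more detail within each panel.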

Specify Colour by Hex

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1, colour = "#1d3557", fill = "#457b9d") +
  facet_wrap(facets = "dataset", scales = "free")

Specify Colour by Hex (plot)

Finding Colours for Plots

Colour Palettes

https://coolors.co/
http://colours.cafe/ for links to FB, Instagram, and Twitter.

https://www.toptal.com/designers/colorfilter/

Titles and Labels

ggplot(data = anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1, colour = "#1d3557", fill = "#457b9d") +
  facet_wrap(facets = "dataset", scales = "free") +
  labs(
    title = "Anscombe's Quartet",
    subtitle = "Distributions of y",
    x = NULL,
    y = NULL
  )

Titles and Labels (plot)

ggplot Themes

Use ggplot themes to change design elements of your plot. There are many and they all start with theme_. For example, our plot uses theme_minimal(), which is a good theme because it removes a lot of extraneous elements.

Add a Theme

ggplot(anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1, colour = "#1d3557", fill = "#457b9d") +
  facet_wrap(facets = "dataset", scales = "free") +
  labs(
    title = "Anscombe's Quartet",
    subtitle = "Distributions of y",
    x = NULL,
    y = NULL
  ) +
  theme_minimal()

Add a Theme (plot)

We’re Done! Save it!

ggsave(filename = "plot1.png")

ggsave(filename = "plot1.png", width = 9, height = 6, dpi = 320)
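
By default, ggsave() saves the most recently displayed plot. A slightly more explicit sketch stores the plot in an object and passes it through the plot argument. The example assumes ggplot2 is installed, uses the built-in anscombe dataset as a stand-in for the course CSV, and writes to a temporary folder. The names p and out_file are hypothetical.

```r
library(ggplot2)

# Stand-in for the course data: the four y columns stacked into one vector
dataset_labels <- c("I", "II", "III", "IV")
anscombe_long <- data.frame(
  dataset = factor(rep(dataset_labels, each = 11), levels = dataset_labels),
  y = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

# Store the plot in an object instead of only printing it
p <- ggplot(anscombe_long, aes(x = y)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(facets = "dataset")

# Save that specific plot; width and height are in inches by default
out_file <- file.path(tempdir(), "plot1.png")
ggsave(filename = out_file, plot = p, width = 9, height = 6, dpi = 320)

file.exists(out_file)
```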

Plot 2: Boxplot

Boxplots

Five-number summary + outliers

Minimum (excluding outliers)
Maximum (excluding outliers)
Median
Lower quartile (median of the lower half of the dataset)
Upper quartile (median of the upper half of the dataset)
Outliers
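
The numbers behind a boxplot can be inspected in base R. The sketch below uses y from dataset III of the built-in anscombe data (an assumed stand-in for the course CSV) with fivenum() and boxplot.stats(). Note that geom_boxplot() computes its quartiles in a slightly different way from these base functions, so the box edges may not match exactly.

```r
# y values for dataset III from the built-in anscombe data
y3 <- anscombe$y3

# Minimum, lower hinge, median, upper hinge, maximum (outliers included)
fivenum(y3)

# boxplot.stats() separates the whisker ends from the outliers
bs <- boxplot.stats(y3)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# Points beyond 1.5 times the box height from the box, drawn individually
bs$out
```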

tRy it! Recreate the Example Boxplot

ggplot(anscombe_long, aes(x = dataset, y = y)) +
  geom_boxplot()

Scatter Plots with Trendlines

Plot 3

tRy it! Create Plot 3

Plot 3 uses two new geoms: geom_smooth() and geom_point().

Create Plot 3

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_smooth() +
  geom_point() +
  facet_wrap("dataset") +
  theme_minimal()

Create Plot 3

Adjust geom_smooth

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_smooth(method = "lm") +
  geom_point() +
  facet_wrap("dataset") +
  theme_minimal()

Adjust geom_smooth

Remove CI Around Smooth

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point() +
  facet_wrap("dataset") +
  theme_minimal()

Remove CI Around Smooth
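
The straight lines drawn by geom_smooth(method = "lm") come from a linear regression fitted separately within each facet. A rough base R check, assuming the built-in anscombe dataset matches the course CSV, fits lm() to each of the four datasets and compares the coefficients. The name fits is hypothetical.

```r
# Fit y ~ x separately for each of the four datasets in the wide-format data
fits <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  coef(lm(y ~ x))
})

rownames(fits) <- c("intercept", "slope")
colnames(fits) <- c("I", "II", "III", "IV")

# The four regression lines are nearly identical despite very different scatter plots
round(fits, 2)
```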

Summary

What Did We Learn Today?

Why does data visualization matter?
What are APA guidelines for figures?
Basics of ggplot2 and layered grammar of graphics.
Learned the geoms:

geom_histogram()
geom_boxplot()
geom_point()
geom_smooth()

Assignment

What’s Due?

Lab assignment due Monday at 11:59 PM.
Quiz 1 is due by next Wednesday at 11:59 PM.

Quiz 1 Information

Applied quiz.
Covers content in labs 1–3.
Includes:

Importing data.
Converting variables to factors.
Data visualization with ggplot2.

Timed quiz. You can access it in Canvas. Open book, open notes, open internet. Not open friends.
