Data Visualization

Sep. 24, 2020

Housekeeping

Recordings

  • Where do I find the recordings?
  • When will the recordings become available?

Asking for Help

Before emailing for help, try the following:

  • Restart R and rerun your code.
  • Read the error message.
  • Search the internet.
  • Read the assigned readings.
  • Ask classmate(s) for help.

How to Ask for Help

If you have done your best to solve the problem and still need help, please email us. A good email will include:

  • Concise, detailed description of the problem.
  • Attached script (with comments!).
  • Email sent to both Rebecca and Zak.

Please do all you can to make it as easy as possible for us to help you. Tracking down a problem in someone else's code takes a long time, and there are only two of us.

Review the Assignment

Part 1: Data Import

Import the Data

  1. Import passwords.sav into R as a data.frame. Save the data.frame as an object named pw.
library(foreign)
pw <- foreign::read.spss(file = "data/passwords.sav",
  to.data.frame = TRUE
)
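
If the import fails, the file path often does not match the working directory. The following is a minimal sketch, assuming the file sits at data/passwords.sav relative to the working directory, as in the call above. The name sav_path is only illustrative.

```r
# Hypothetical path; adjust it to wherever passwords.sav lives on your machine.
sav_path <- file.path("data", "passwords.sav")

if (file.exists(sav_path)) {
  # Same call as above: read the SPSS file straight into a data.frame.
  pw <- foreign::read.spss(file = sav_path, to.data.frame = TRUE)
} else {
  # Print the working directory so the relative path can be checked.
  message("Cannot find ", sav_path, " from working directory ", getwd())
}
```

Checking with file.exists() first turns a long error into a short message that shows where R is actually looking.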

Inspect the Data Frame

  1. Use View(), head(), and str() to inspect pw.
head(pw)
##   rank  password            category value time_unit offline_crack_sec rank_alt
## 1    1 password  password-related     6.91   years            2.17e+00        1
## 2    2 123456    simple-alphanumeric 18.52   minutes          1.11e-05        2
## 3    3 12345678  simple-alphanumeric  1.29   days             1.11e-03        3
## 4    4 1234      simple-alphanumeric 11.11   seconds          1.11e-07        4
## 5    5 qwerty    simple-alphanumeric  3.72   days             3.21e-03        5
## 6    6 12345     simple-alphanumeric  1.85   minutes          1.11e-06        6
##   strength font_size
## 1        8        11
## 2        4         8
## 3        4         8
## 4        4         8
## 5        8        11
## 6        4         8

str(pw)
## 'data.frame':    507 obs. of  9 variables:
##  $ rank             : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ password         : chr  "password " "123456   " "12345678 " "1234     " ...
##  $ category         : chr  "password-related   " "simple-alphanumeric" "simple-alphanumeric" "simple-alphanumeric" ...
##  $ value            : num  6.91 18.52 1.29 11.11 3.72 ...
##  $ time_unit        : chr  "years  " "minutes" "days   " "seconds" ...
##  $ offline_crack_sec: num  2.17 1.11e-05 1.11e-03 1.11e-07 3.21e-03 1.11e-06 3.21e-03 2.17 2.17 8.35e-02 ...
##  $ rank_alt         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ strength         : num  8 4 4 4 8 4 8 4 7 8 ...
##  $ font_size        : num  11 8 8 8 11 8 11 8 11 11 ...
##  - attr(*, "codepage")= int 65001

Password Lengths

  1. The column password is a character vector. Use the function nchar() to count the length of each password.
nchar(pw$password)
##   [1] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
##  [38] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
##  [75] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [112] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [149] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [186] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [223] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [260] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [297] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [334] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [371] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [408] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [445] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [482] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

Why Are They All the Same?

  1. Something seems fishy… Look more closely at pw$password. Why is the result of nchar() the same for each password? This is explained in the “Details” section of the documentation for foreign::read.spss().

“Fixed length strings (including value labels) are padded on the right with spaces by SPSS, and so are read that way by R.”

Trim That Whitespace

  1. Use trimws() to remove the leading and trailing whitespace from each character string of pw$password.
pw$password <- trimws(pw$password)
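
To see the effect of the padding in isolation, here is a small sketch with made-up strings (not the real data), each padded to a width of 9 the way the imported values are. It also shows the which argument of trimws(), which controls the side that gets trimmed.

```r
# Made-up strings padded to a fixed width of 9, mimicking the SPSS import.
padded <- c("password ", "123456   ", "1234     ")

nchar(padded)                 # every element counts as 9 characters
trimmed <- trimws(padded)     # which = "both" is the default
nchar(trimmed)                # lengths now differ: 8, 6, 4

# which = "left" strips only leading whitespace, so the right padding stays.
nchar(trimws(padded, which = "left"))
```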

Average Strength of Short Passwords

  1. Calculate the average strength of passwords with exactly 4 characters.
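# Option 1: subset the rows first, then average the strength column.
pw_with_4_chars <- subset(pw, nchar(password) == 4)

mean(pw_with_4_chars$strength)
## [1] 4.428571

# Option 2: index the strength vector directly with a logical condition.
mean(pw$strength[nchar(pw$password) == 4])
## [1] 4.428571

# Option 3: store the lengths in a new column, then index on that column.
pw$nchar <- nchar(pw$password)

mean(pw$strength[pw$nchar == 4])
## [1] 4.428571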
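
The approaches above all rely on the same logical vector. A minimal sketch on a made-up data frame (the name toy and its values are invented for illustration) makes that vector explicit:

```r
# Made-up data, not the real passwords file.
toy <- data.frame(
  password = c("1234", "abcd", "qwerty", "love"),
  strength = c(4, 6, 8, 8),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the password has exactly 4 characters.
is_four <- nchar(toy$password) == 4

# Approach 1: subset the rows, then average the column.
m1 <- mean(subset(toy, is_four)$strength)

# Approach 2: index the strength vector directly.
m2 <- mean(toy$strength[is_four])

c(m1, m2)  # both are 6
```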

Working with Factors

The objective of this section of the assignment is to determine the average strength of the different categories of passwords.

Trim Trailing Whitespace

  1. Use trimws() to remove leading and trailing whitespace from pw$category.
pw$category <- trimws(pw$category)

Convert to Factor

  1. Convert pw$category to a factor with the following levels and labels:
Level Number  Level                Label
1             name                 Name
2             cool-macho           Cool/macho
3             simple-alphanumeric  Simple alphanumeric
4             fluffy               Fluffy
5             sport                Sport
6             nerdy-pop            Nerdy pop
7             animal               Animal
8             password-related     Password-related
9             food                 Food
10            rebellious-rude      Rebellious/rude

Convert to Factor (Code)

pw$category <- factor(pw$category,
  levels = c(
    "name",
    "cool-macho",
    "simple-alphanumeric",
    "fluffy",
    "sport",
    "nerdy-pop",
    "animal",
    "password-related",
    "food",
    "rebellious-rude"
  ),
  labels = c(
    "Name",
    "Cool/macho",
    "Simple alphanumeric",
    "Fluffy",
    "Sport",
    "Nerdy pop",
    "Animal",
    "Password-related",
    "Food",
    "Rebellious/rude"
  )
)
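
How levels and labels pair up is easier to see on a tiny example. The sketch below uses made-up values, including one that matches no level, to show why the whitespace has to be trimmed first: any value that does not exactly match a level becomes NA.

```r
# Made-up raw values; "typo" is deliberately not among the levels.
raw <- c("sport", "name", "sport", "typo")

f <- factor(raw,
  levels = c("name", "sport"),   # values as they appear in the data
  labels = c("Name", "Sport")    # display names, in the same order
)

levels(f)        # "Name" "Sport"
as.integer(f)    # 2 1 2 NA: position of each value among the levels
sum(is.na(f))    # 1: the unmatched value became NA
```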

Average Crack Time

  1. Calculate the average offline crack time of passwords in each of the 10 categories. Optional: Use by(), aggregate(), or tapply() to do this with just one line of code.
mean(pw$offline_crack_sec[pw$category == "Name"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Cool/macho"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Simple alphanumeric"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Fluffy"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Sport"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Nerdy pop"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Animal"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Password-related"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Food"], na.rm = TRUE)
mean(pw$offline_crack_sec[pw$category == "Rebellious/rude"], na.rm = TRUE)
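
The ten calls above differ only in the category name, so one way to avoid the copy-and-paste is to loop over the factor levels. This is a sketch on made-up data (toy and its values are invented); sapply() is one option among several and is not required by the assignment.

```r
# Made-up data standing in for pw.
toy <- data.frame(
  category = factor(c("Name", "Sport", "Name", "Sport", "Food")),
  offline_crack_sec = c(0.2, 1.0, 0.4, NA, 0.1)
)

# Apply the same mean() call to every level of the factor.
means <- sapply(levels(toy$category), function(lv) {
  mean(toy$offline_crack_sec[toy$category == lv], na.rm = TRUE)
})

means  # named numeric vector, one entry per level
```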

Using by()

by(pw$offline_crack_sec,
  INDICES = pw$category,
  FUN = mean, na.rm = TRUE
)
## pw$category: Name
## [1] 0.2794351
## ------------------------------------------------------------ 
## pw$category: Cool/macho
## [1] 0.3471083
## ------------------------------------------------------------ 
## pw$category: Simple alphanumeric
## [1] 0.6123755
## ------------------------------------------------------------ 
## pw$category: Fluffy
## [1] 0.1607586
## ------------------------------------------------------------ 
## pw$category: Sport
## [1] 1.106457
## ------------------------------------------------------------ 
## pw$category: Nerdy pop
## [1] 1.288615
## ------------------------------------------------------------ 
## pw$category: Animal
## [1] 0.2435385
## ------------------------------------------------------------ 
## pw$category: Password-related
## [1] 2.247346
## ------------------------------------------------------------ 
## pw$category: Food
## [1] 0.1999104
## ------------------------------------------------------------ 
## pw$category: Rebellious/rude
## [1] 0.4044709

Using aggregate()

aggregate(offline_crack_sec ~ category,
  data = pw,
  FUN = mean,
  na.rm = TRUE
)
##               category offline_crack_sec
## 1                 Name         0.2794351
## 2           Cool/macho         0.3471083
## 3  Simple alphanumeric         0.6123755
## 4               Fluffy         0.1607586
## 5                Sport         1.1064572
## 6            Nerdy pop         1.2886154
## 7               Animal         0.2435385
## 8     Password-related         2.2473456
## 9                 Food         0.1999104
## 10     Rebellious/rude         0.4044709

Using tapply()

tapply(pw$offline_crack_sec,
  INDEX = pw$category,
  FUN = mean, na.rm = TRUE
)
##                Name          Cool/macho Simple alphanumeric              Fluffy 
##           0.2794351           0.3471083           0.6123755           0.1607586 
##               Sport           Nerdy pop              Animal    Password-related 
##           1.1064572           1.2886154           0.2435385           2.2473456 
##                Food     Rebellious/rude 
##           0.1999104           0.4044709
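
The three functions compute the same means but return different kinds of objects, which matters for what you do next. A short sketch on made-up data (toy is invented for illustration) compares tapply() and aggregate():

```r
# Made-up data standing in for pw.
toy <- data.frame(
  category = factor(c("Name", "Sport", "Name", "Sport")),
  offline_crack_sec = c(0.2, 1.0, 0.4, 3.0)
)

by_tapply <- tapply(toy$offline_crack_sec, INDEX = toy$category, FUN = mean)
by_aggregate <- aggregate(offline_crack_sec ~ category, data = toy, FUN = mean)

# tapply() gives a named array: look up a category by name.
by_tapply["Sport"]

# aggregate() gives a data.frame: sort or filter it like any other.
by_aggregate[order(by_aggregate$offline_crack_sec, decreasing = TRUE), ]
```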

Plots

Histogram

  1. Create a histogram showing the distribution of password strength. Include only passwords with strength ≤ 10.
pwsub <- subset(pw, strength <= 10)
hist(pwsub$strength)
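
hist() accepts arguments for the bins and labels. The sketch below uses made-up strength values and assumes that bins of width 1 suit the data; the title and axis label are only suggestions.

```r
# Made-up strength values; two of them exceed 10 and are filtered out.
toy <- data.frame(strength = c(1, 4, 4, 7, 8, 8, 8, 10, 25, 48))
toy_sub <- subset(toy, strength <= 10)

h <- hist(toy_sub$strength,
  breaks = 0:10,                      # one bin per unit of strength
  main = "Password Strength (<= 10)",
  xlab = "Strength"
)

h$counts  # number of passwords in each bin
```

Saving the return value of hist() is optional, but it lets you inspect the bin counts behind the picture.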

Histogram (Output)

Scatter Plot

  1. Create a scatter plot showing the relationship between password strength and password length. Include only passwords with strength ≤ 10.
plot(
  x = pwsub$strength,
  y = nchar(pwsub$password),
  xlab = "Password Strength",
  ylab = "Password Length"
)
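
If many passwords share the same strength and length, their points are drawn on top of each other and the plot hides how common each combination is. One possible remedy, sketched here on made-up data (toy and its columns are invented), is to add a little random noise with jitter():

```r
set.seed(1)  # make the random jitter reproducible

# Made-up data: both variables take only a few whole-number values.
toy <- data.frame(
  strength = c(4, 4, 4, 8, 8, 8, 7, 7),
  len = c(6, 6, 6, 6, 6, 8, 7, 7)
)

plot(
  x = jitter(toy$strength),   # add a little random noise to each point
  y = jitter(toy$len),
  xlab = "Password Strength",
  ylab = "Password Length"
)
```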

Scatter Plot (Output)

Common Assignment Mistakes

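# to.data.frame = TRUE sits outside the parentheses, so read.spss() never receives it.
foreign::read.spss(file = "passwords.sav")
to.data.frame = TRUE
# The argument is inside the call here, but the result is not assigned to an object.
foreign::read.spss(file = "passwords.sav", to.data.frame = TRUE)

# read.spss() is called before library(foreign) has attached the package.
pw <- read.spss(file = "passwords.sav", to.data.frame = TRUE)
library(foreign)

# file.choose() opens an interactive file picker, so the script cannot run unattended.
pw <- read.spss(file.choose(),
  to.data.frame = TRUE)

# nchar("password") counts the literal string "password", not the column pw$password.
nchar("password")
nchar(pw$password)

# trimws() returns a trimmed copy; without assignment, pw$password is unchanged.
trimws(pw$password)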

Don’t Include in Your Script

Don’t include these:

  • Calls to install.packages().
  • Calls to help() or ?.
  • Broken code/the mistakes you made along the way.

Today’s Lab: Data Visualization

The Datasaurus Dozen

Different Datasets