Data Wrangling with Tidyverse

Bella Ratmelia

Today’s Outline

  1. Loading our data into RStudio environment
  2. Data wrangling with dplyr and tidyr (part of the tidyverse package)

Checklist when you start RStudio

  • Load the project we created last session and open the R script file.
  • Make sure that the Environment panel is empty (click on the broom icon to clear it)
  • Clear the Console and Plots too.
  • Re-run the library(tidyverse) and read_csv portions from the previous session's script

Refresher: Loading from CSV into a dataframe

Use read_csv from the readr package (part of tidyverse) to load our data into a dataframe

# import tidyverse library
library(tidyverse)

# read the CSV and save into a dataframe called chile_data
chile_data <- read_csv("data/chile_voting.csv")

# "peek at the data, pay attention to the data types!
glimpse(chile_data)

Cleaning data for analysis

  • Why do it in R? Because it’s much more efficient to do so in R, especially if your data is large (e.g. millions of rows, hundreds of columns) and you have repetitive clean-up tasks.
  • Incorrect or inconsistent data can lead to false conclusions, so it’s important to clean and prep it correctly.
  • Having a clear understanding of the desired data shape is essential, as real data often differs from what you imagine! Refer to the codebook, the actual questionnaire, or the appendix for guidance.
  • Data cleaning techniques differ based on the problems, data type, and the research questions you are trying to answer. Various methods are available, each with its own trade-offs.

About dplyr and tidyr

  • dplyr provides the main verbs for data manipulation: filter(), select(), mutate(), arrange(), group_by(), and summarise().

  • tidyr helps you reshape and tidy data: pivot_longer(), pivot_wider(), and drop_na().

  • Both packages are loaded automatically as part of the tidyverse package.

About the data

The data is from a national survey conducted in April and May of 1988 by FLACSO/Chile, capturing voting intentions for the 1988 Chilean plebiscite. The dataset contains information about respondents’ demographic characteristics and their voting intentions. This data can also be found in the carData package! (more on this later)

Key variables in the dataset:

Explanatory notes on each column:

  • region: Region of voters. C (Central), M (Metropolitan Santiago area), N (North), S (South), SA (city of Santiago)
  • population: Population size of respondent’s community
  • sex: Sex of voters. F (female), M (male)
  • age: Age in years
  • education: Education level of voters. P (Primary), PS (Post-secondary), S (Secondary)
  • income: Monthly income, in Pesos
  • statusquo: Scale of support for the status quo, as a numerical value
  • vote: Voter’s decision. A (will abstain), N (will vote no), U (undecided), Y (will vote yes)

Prelim checks for your data

It’s good practice to do some preliminary checks on your data to get a better sense of it!

A few things that you can do (a quick code sketch follows this list):

  • Check for duplicates

  • Check for missing values

  • Check on overall distributions of the categorical data

  • Plot the distribution of the numerical/continuous data
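
A minimal sketch of these checks in R, assuming chile_data has already been loaded with read_csv():

# how many fully duplicated rows?
sum(duplicated(chile_data))

# how many missing values in each column?
colSums(is.na(chile_data))

# overall distribution of a categorical column
chile_data |> count(vote)

# distribution of a numerical column
hist(chile_data$age)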

Data wrangling activities specific to our data

Scenario: We are a junior Research Assistant (RA) on a research team currently studying the 1988 Chilean plebiscite. The team’s primary goal is to identify any discernible and interesting patterns in the voter demographic data.

As a start, we have been asked to do the following data cleaning and processing tasks:

  1. Remove all rows with empty values (NA)

  2. Separate the demographic information (region, sex, age, education) into a separate CSV called chile-demographics.csv

  3. Retrieve only respondents aged 30 or older, arrange them from oldest to youngest, and save the result into a separate CSV called chile-30plus.csv

  4. Convert character columns such as region, sex, education, and vote to uppercase.

  5. Convert the categorical columns such as region, sex, education, and vote to factors. Make sure education is ordered.

  6. Create a new column called age_group that categorizes age into groups: “18-29”, “30-44”, “45-59”, “60+”.

  7. Create a new column called high_income that is TRUE if income is above the median, FALSE otherwise.

  8. Create a new column called support_level that categorizes statusquo into “Unsupportive” (<= 0) and “Supportive” (> 0). Save all of these changes you made from step 3 onwards into a new CSV called chile_voting_processed.csv

  9. Generate summary stats of income grouped by region and education. The summary stats should include mean, median, max, min, std, and n (number of observations).

  10. Reshape the data to have region as rows and vote options as columns, with the values being the count of votes for each option in each region.

Let’s wrangle our data!

Task #1

A strategy I’d like to recommend: briefly read over the dplyr + tidyr documentation, either the PDF or HTML version, and keep it open in a separate tab so that you can refer to it quickly.

Remove all rows with empty values (NA)

chile_data <- chile_data |> 
  drop_na()

The number of observations after all rows containing NAs are removed:

dim(chile_data)
[1] 2431    8

Interlude: Pipe Operator ( |> )

  • The pipe operator (|>) allows us to chain multiple operations without creating intermediate dataframes.

  • Super handy when we perform several data wrangling tasks using tidyverse in sequence.

  • Helps with readability, especially for complex operations.

  • Keyboard shortcut in RStudio: Ctrl+Shift+M on Windows, Cmd+Shift+M on Mac (it inserts %>% by default; tick “Use native pipe operator” under Tools > Global Options > Code to insert |> instead)

Without the pipe, notice that we have to create “temp” dataframes called age_df and age_df_desc:

age_df <- filter(chile_data, age >= 30)
age_df_desc <- arrange(age_df, desc(age))
write_csv(age_df_desc, "data-output/chile-30plus.csv")

With the pipe, no “temp” dataframe is needed! :D

chile_data |> 
    filter(age >= 30) |> 
    arrange(desc(age)) |> 
    write_csv("data-output/chile-30plus.csv")

Task #2

Separate the demographic information (region, sex, age, education) into a separate CSV called chile-demographics.csv

# not using the |> operator here!

demographics_df <- select(chile_data, region, sex, age, education)
write_csv(demographics_df, "data-output/chile-demographics.csv")
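
For comparison, an equivalent sketch of the same task written with the pipe:

chile_data |> 
    select(region, sex, age, education) |> 
    write_csv("data-output/chile-demographics.csv")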

The first few rows in the CSV:

# A tibble: 2,431 × 4
  region sex     age education
  <chr>  <chr> <dbl> <chr>    
1 N      M        65 P        
2 N      M        29 PS       
3 N      F        38 P        
# ℹ 2,428 more rows

Task #3

Retrieve only respondents aged 30 or older, arrange them from oldest to youngest, and save the result into a separate CSV called chile-30plus.csv

chile_data |> 
    filter(age >= 30) |> 
    arrange(desc(age)) |> 
    write_csv("chile-30plus.csv")

The first few rows in the resulting CSV:

# A tibble: 1,581 × 8
  region population sex     age education income statusquo vote 
  <chr>       <dbl> <chr> <dbl> <chr>      <dbl>     <dbl> <chr>
1 N          125000 F        70 P          15000     1.49  Y    
2 N          250000 F        70 P          75000     1.40  Y    
3 N          250000 M        70 P           7500    -0.366 N    
# ℹ 1,578 more rows

Group exercise 1 (solo attempts ok)

Time: 5 minutes!

Retrieve only respondents from the Metropolitan Santiago area (M), with Secondary education (S), who are undecided (U) about their vote. Keep only the region, education, vote, and income columns, arrange the rows in descending order of income, and save the result in a new dataframe called santiago_undecided.

Show answer
santiago_undecided <- chile_data |> 
    filter(region == "M" & education == "S" & vote == "U") |> 
    select(region, education, vote, income) |> 
    arrange(desc(income)) 

The first few rows of santiago_undecided:

# A tibble: 5 × 4
  region education vote  income
  <chr>  <chr>     <chr>  <dbl>
1 M      S         U      35000
2 M      S         U      15000
3 M      S         U      15000
# ℹ 2 more rows

Task #4

Convert character columns such as region, sex, education, and vote to uppercase.

cols_to_change <- c("region", "sex", "education", "vote")
chile_data <- chile_data |> 
    mutate(across(all_of(cols_to_change), toupper))

The end result:

# A tibble: 2,431 × 4
  region sex   education vote 
  <chr>  <chr> <chr>     <chr>
1 N      M     P         Y    
2 N      M     PS        N    
3 N      F     P         Y    
# ℹ 2,428 more rows

Task #5

Convert the categorical columns such as region, sex, education, and vote to factors. Make sure education is an ordered factor.

chile_data <- chile_data |> 
    mutate(across(all_of(cols_to_change), as.factor))

# make education an ordered factor (P < S < PS)
chile_data <- chile_data |> 
    mutate(education = factor(education, 
                         levels = c("P", "S", "PS"), 
                         ordered = TRUE))

The structure of the education column after the change:

tibble [2,431 × 1] (S3: tbl_df/tbl/data.frame)
 $ education: Ord.factor w/ 3 levels "P"<"S"<"PS": 1 3 1 1 2 1 3 2 1 1 ...
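
(The output above can be reproduced with something like the following; the exact call is an assumption, since the slide only shows the result.)

# inspect the structure of the education column
chile_data |> select(education) |> str()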

Task #6

Add a new column to chile_data called age_group that categorizes age into groups: “18-29”, “30-44”, “45-59”, “60+”.

chile_data <- chile_data |> 
    mutate(age_group = case_when(
        age < 30 ~ "18-29",
        age < 45 ~ "30-44",
        age < 60 ~ "45-59",
        TRUE ~ "60+"
    ))
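
As an aside, the same grouping could be written with base R’s cut(); a minimal sketch, using a hypothetical column name age_group2 and assuming all respondents are at least 18 (which holds for this survey). Note that cut() returns a factor rather than a character column.

# alternative: cut() with explicit right-closed breaks
chile_data |> 
    mutate(age_group2 = cut(age, 
                            breaks = c(17, 29, 44, 59, Inf), 
                            labels = c("18-29", "30-44", "45-59", "60+")))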

The new age_group column should look something like this:

# A tibble: 2,431 × 2
    age age_group
  <dbl> <chr>    
1    65 60+      
2    29 18-29    
3    38 30-44    
4    49 45-59    
# ℹ 2,427 more rows

Task #7

Create a new column called high_income that is TRUE if income is above the median, FALSE otherwise.

Hint: you need to calculate the median income first.

median_income <- median(chile_data$income, na.rm = TRUE)
chile_data <- chile_data |> 
    mutate(high_income = income > median_income)
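
As a side note, the median can also be computed inline inside mutate(), since summary functions like median() operate on the whole column; an equivalent sketch:

chile_data |> 
    mutate(high_income = income > median(income, na.rm = TRUE))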

The original income column and the new high_income column:

# A tibble: 3 × 2
  income high_income
   <dbl> <lgl>      
1  15000 FALSE      
2  15000 FALSE      
3  15000 FALSE      

Task #8 - Can you solve this?

Create a new column called support_level that categorizes statusquo into “Unsupportive” (<= 0) and “Supportive” (> 0).

Save all of these changes you made from step 3 onwards into a new CSV called chile_voting_processed.csv

Show answer
chile_data <- chile_data |> 
    mutate(support_level = case_when(
        statusquo <= 0 ~ "Unsupportive",
        statusquo > 0 ~ "Supportive",
        TRUE ~ NA_character_
    ))

write_csv(chile_data, "data-output/chile_voting_processed.csv")
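
Since there are only two outcomes (and rows with NAs were already dropped in Task #1), an equivalent sketch using if_else() would also work:

chile_data |> 
    mutate(support_level = if_else(statusquo > 0, "Supportive", "Unsupportive"))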

The new support_level column:

# A tibble: 2,431 × 2
  statusquo support_level
      <dbl> <chr>        
1      1.01 Supportive   
2     -1.30 Unsupportive 
3      1.23 Supportive   
# ℹ 2,428 more rows

Task #9

Generate summary stats of income grouped by region and education. The summary stats should include mean, median, max, min, std, and n (number of observations).

library(gt)
chile_data |> 
    group_by(region, education) |> 
    summarise(observation = n(), 
              mean_income = mean(income, na.rm = TRUE),
              median_income = median(income, na.rm = TRUE), 
              highest = max(income, na.rm = TRUE),
              lowest = min(income, na.rm = TRUE),
              std_dev = sd(income, na.rm = TRUE))
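
The gt package loaded above is not actually used in this chunk; if you want the summary rendered as a formatted table rather than the plain tibble shown on the next slide, the result can be piped into gt(), for example:

# optional: render a (simplified) summary as a formatted gt table
chile_data |> 
    group_by(region, education) |> 
    summarise(mean_income = mean(income, na.rm = TRUE), .groups = "drop") |> 
    gt()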

Task #9

# A tibble: 15 × 8
# Groups:   region [5]
   region education observation mean_income median_income highest lowest std_dev
   <fct>  <ord>           <int>       <dbl>         <dbl>   <dbl>  <dbl>   <dbl>
 1 C      P                 238      16450.         15000  200000   2500  18483.
 2 C      S                 234      35887.         35000  200000   2500  36540.
 3 C      PS                 76      64375          35000  200000   7500  50705.
 4 M      P                  43      21163.         15000  125000   2500  21967.
 5 M      S                  26      29038.         25000   75000   7500  19787.
 6 M      PS                  6      60000          55000  125000  15000  39875.
 7 N      P                 133      19173.         15000   75000   2500  15823.
 8 N      S                 111      31937.         35000  125000   2500  26351.
 9 N      PS                 61      53934.         35000  200000   7500  41772.
10 S      P                 310      15677.          7500  200000   2500  19074.
11 S      S                 250      29800          15000  200000   2500  30450.
12 S      PS                 95      56658.         35000  200000   2500  52879.
13 SA     P                 278      19191.         15000  125000   2500  15796.
14 SA     S                 389      41292.         35000  200000   2500  41052.
15 SA     PS                181      82859.         75000  200000   7500  66268.

Task #10

Reshape the data to have region as rows and vote options as columns, with the values being the count of votes for each option in each region.

To achieve this clean table look, we can use the gt package or the knitr package!

chile_data |>
    group_by(region, vote) |>
    summarise(count = n(), .groups = "drop") |>
    pivot_wider(names_from = vote, values_from = count, values_fill = 0) |> 
    knitr::kable()
region      A      N      U      Y
C          44    209    130    165
M           2     17     19     37
N          29     98     46    132
S          39    209    140    267
SA         63    334    216    235
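
As a side note, dplyr’s count() is shorthand for the group_by() + summarise(n()) combination used above; an equivalent sketch (count() names its output column n):

chile_data |> 
    count(region, vote) |> 
    pivot_wider(names_from = vote, values_from = n, values_fill = 0)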

Long vs Wide Data

Long data:

  • Each row is a unique observation.

  • There is a separate column indicating the variable or type of measurement

  • This format is easier for R to work with and more suitable for visualizations

Wide data:

  • Each row is a unique subject or unit, with its measurements spread across several columns

  • Each column is a variable: the more variables you have, the “wider” the data

  • This format is more intuitive for humans!

Long vs Wide Data: Examples

Long data:

# A tibble: 20 × 3
   region vote  count
   <fct>  <fct> <int>
 1 C      A        44
 2 C      N       209
 3 C      U       130
 4 C      Y       165
 5 M      A         2
 6 M      N        17
 7 M      U        19
 8 M      Y        37
 9 N      A        29
10 N      N        98
11 N      U        46
12 N      Y       132
13 S      A        39
14 S      N       209
15 S      U       140
16 S      Y       267
17 SA     A        63
18 SA     N       334
19 SA     U       216
20 SA     Y       235

Wide data:

# A tibble: 5 × 5
  region     A     N     U     Y
  <fct>  <int> <int> <int> <int>
1 C         44   209   130   165
2 M          2    17    19    37
3 N         29    98    46   132
4 S         39   209   140   267
5 SA        63   334   216   235
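
To go from wide back to long, tidyr’s pivot_longer() reverses pivot_wider(). A minimal sketch, assuming the wide table above is stored in a dataframe called vote_wide (a hypothetical name):

# reshape the wide vote counts back into long format
vote_wide |> 
    pivot_longer(cols = -region, names_to = "vote", values_to = "count")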

Group exercise 3 (solo attempts ok)

Time: 5 minutes

Generate summary stats of age grouped by region and sex. The summary stats should include mean, median, max, min, std, and n (number of observations). It should look something like this:

Show answer
chile_data |> 
    group_by(region, sex) |> 
    summarise(observation = n(), 
              mean_age = mean(age, na.rm = TRUE),
              median_age = median(age, na.rm = TRUE), 
              oldest = max(age, na.rm = TRUE),
              youngest = min(age, na.rm = TRUE),
              std_dev = sd(age, na.rm = TRUE))

Group exercise 3 (solo attempts ok)

# A tibble: 10 × 8
# Groups:   region [5]
   region sex   observation mean_age median_age oldest youngest std_dev
   <fct>  <fct>       <int>    <dbl>      <dbl>  <dbl>    <dbl>   <dbl>
 1 C      F             269     37.7       36       70       18    14.3
 2 C      M             279     39.1       36       70       18    15.8
 3 M      F              43     37.0       36       68       18    14.1
 4 M      M              32     36         33.5     65       18    13.5
 5 N      F             159     38.7       36       70       18    13.8
 6 N      M             146     38.7       35       70       18    16.6
 7 S      F             330     36.7       35       70       18    13.9
 8 S      M             325     38.0       36       70       18    14.9
 9 SA     F             449     38.9       38       70       18    14.0
10 SA     M             399     39.0       37       70       18    15.1

Bonus: Deleting columns from a dataframe

Let’s say I have this column called wrong_column that I want to remove:

chile_data <- chile_data |> mutate(wrong_column = "random values")
chile_data |> select(region, wrong_column) |> print(n = 3)
# A tibble: 2,431 × 2
  region wrong_column 
  <fct>  <chr>        
1 N      random values
2 N      random values
3 N      random values
# ℹ 2,428 more rows

Remove the unwanted column by selecting it with a minus sign (-):

chile_data <- chile_data |> 
    select(-wrong_column)
# A tibble: 2,431 × 1
  region
  <fct> 
1 N     
2 N     
3 N     
# ℹ 2,428 more rows

End of Session 2!

Next session: Descriptive stats and data visualization with the ggplot2 package.