Data Wrangling with Tidyverse

Bella Ratmelia

Today’s Outline

Loading our data into RStudio environment
Data wrangling with dplyr and tidyr (part of the tidyverse package)

Checklist when you start RStudio

Load the project we created last session and open the R script file. Click on File > Recent Projects... > Select the our project from last week.
Make sure that Environment panel is empty (click on broom icon to clean it up).
Clear the Console and Plots too.
Re-run the library(tidyverse) and read_csv portion in the previous session (the code is also on the next slide if you missed last week’s session)

Refresher: Loading from CSV into a dataframe

Use read_csv from readr package (part of tidyverse) to load our World Values Survey data. More information about the data can be found under the Dataset tab in the course website.

# import tidyverse library
library(tidyverse)

# read the CSV and save into a dataframe called wvs_data
wvs_data <- read_csv("data/wvs-wave7-sg-ca-nz.csv")

# "peek" at the data, pay attention to the data types!
glimpse(wvs_data)

Cleaning data for analysis

Why do it in R? Because it’s much efficient to do so in R, especially if your data is large (e.g. millions of rows, thousands of columns) and you have repetitive clean up tasks.
Incorrect or inconsistent data can lead to false conclusions, so it’s important to clean and prep it correctly.
Having a clear understanding of the desired data shape is essential as real data often differs from what you imagine! Refer to codebook, actual questionnaire, appendix for guidance.
Data cleaning techniques differ based on the problems, data type, and the research questions you are trying to answer. Various methods are available, each with its own trade-offs.

About dplyr and tidyr

Packages from tidyverse. (click here to go to the tidyverse homepage)
Posit have created cheatsheets here! (you can have this open in another tab for reference for this session!)
- dplyr cheatsheet | pdf version (I personally prefer this PDF version since it’s more visual)
- tidyr cheatsheet | pdf version
Most of the time, these are the ones that you will use quite often:
- drop_na() - remove rows with null values
- select() - to select column(s) from a dataframe
- filter() - to filter rows based on criteria
- mutate() - to compute new columns or edit existing ones
- if_else() and case_when() - to be used with mutate when we want to compute/edit columns based on multiple criteria
- group_by() and summarize() - group data and summarize each group

Scenario: Data wrangling activities with WVS data

Scenario: We are research assistants analyzing patterns in values and satisfaction across different countries and demographic groups.

Our team has been assigned to explore and get insights on how specific factors (employment, work importance, marital status, political alignment, financial satisfaction, and religiosity) may relate to life satisfaction among different generations (Gen Z, Millennials, Gen X, and Baby Boomer), including how they may differ between the 3 countries.

To ensure the analysis quality, we were instructed to discard incomplete data.

We can break down the tasks as such:

Remove all rows with missing values (NA)
Check for duplicates
Select only the relevant columns: demographic columns, work_importance, life_satisfaction, financial_satisfaction, religiousity, political_scale, marital_status, and employment.
Filter for respondents aged 18 or older. Optionally, we can then arrange the dataset by age (oldest to youngest)
Reverse-code importance variables so that higher numbers consistently represent higher levels of the measured construct. In other words, 1 = Not at all important and 4 = Very important.
Create age groups for each generation: “18-28”, “29-44”, “45-60”, “61+”

Once we did all of the wrangling above, we can save this “wrangled” version into another CSV

Let’s wrangle our data!

Task #1

A strategy I’d like to recommend: briefly read over the dplyr + tidyr documentation, either the PDF or HTML version, and have them open on a separate tab so that you can refer to it quickly.

Remove all rows with empty values (NA) with drop_na()

wvs_data <- wvs_data |> 
  drop_na()

The number of observations after removing NAs:

dim(wvs_data)

[1] 6403   16

Interlude: Pipe Operator ( |> )

The pipe operator (|>) allows us to chain multiple operations without creating intermediate / temporary dataframes.
Super handy when we perform several data wrangling tasks using tidyverse in sequence.
Helps with readability, especially for complex operations.
Keyboard shortcut: Ctrl+Shift+M on Windows, Cmd+Shift+M on Mac

Without pipe operator
With pipe operator

Notice that we have to create a “temp” dataframes called wvs_data_clean in this method.

wvs_data <- drop_na(wvs_data)
wvs_data_clean <- wvs_data_clean(wvs_data, desc(age))
write_csv(wvs_data_clean, "data-output/wvs-clean.csv")

No “temporary” dataframe needed here! :D

wvs_data |> 
    drop_na() |> 
    distinct(ID, .keep_all = TRUE) |> 
    write_csv("data-output/wvs-30plus.csv")

Task #2

Check for duplicates with distinct()

wvs_data <- wvs_data |> 
  distinct(ID, .keep_all = TRUE)

(Our data has no duplicates, but this is still a good practice to do, especially if we were combining data from multipe sources)

Task #3:

Select only the relevant columns: demographic columns, life_satisfaction, work_importance, financial_satisfaction, religiousity, political_scale, marital_status, and employment.

We can achieve this with select()!

wvs_data <- wvs_data |>
    select(ID, country, sex, birthyear, age, life_satisfaction, work_importance, financial_satisfaction, religiousity, political_scale, marital_status, employment)

Preview of the filtered data:

Rows: 6,403
Columns: 12
$ ID                     <dbl> 124070003, 124070004, 124070005, 124070006, 124…
$ country                <chr> "CAN", "CAN", "CAN", "CAN", "CAN", "CAN", "CAN"…
$ sex                    <chr> "Female", "Male", "Female", "Male", "Male", "Ma…
$ birthyear              <dbl> 1944, 1951, 1984, 1975, 1988, 1982, 1981, 1985,…
$ age                    <dbl> 76, 69, 35, 45, 32, 38, 38, 34, 65, 31, 27, 33,…
$ life_satisfaction      <dbl> 5, 8, 9, 7, 1, 7, 9, 6, 7, 6, 7, 10, 8, 10, 6, …
$ work_importance        <dbl> 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 4, 1, 3, 3,…
$ financial_satisfaction <dbl> 8, 2, 8, 8, 9, 8, 9, 5, 6, 9, 5, 7, 9, 9, 7, 8,…
$ religiousity           <chr> "A religious person", "A religious person", "An…
$ political_scale        <dbl> 10, 5, 2, 8, 6, 6, 7, 5, 6, 4, 5, 4, 1, 7, 2, 8…
$ marital_status         <chr> "Separated", "Married", "Living together as mar…
$ employment             <chr> "Retired/pensioned", "Retired/pensioned", "Part…

Task #4:

Filter for respondents aged 18 or older. Optionally, we can then arrange the dataset by age (oldest to youngest)

wvs_data <- wvs_data |> 
    filter(age >= 18) |> 
    arrange(desc(age))

Checking the structure:

# A tibble: 6 × 12
         ID country sex    birthyear   age life_satisfaction work_importance
      <dbl> <chr>   <chr>      <dbl> <dbl>             <dbl>           <dbl>
1 124071434 CAN     Male        1927    93                 7               1
2 554070524 NZL     Female      1928    91                 5               4
3 702070265 SGP     Male        1930    90                 8               1
4 124071321 CAN     Female      1931    89                 6               3
5 124073566 CAN     Female      1930    89                 6               4
6 124074356 CAN     Male        1931    89                 8               4
# ℹ 5 more variables: financial_satisfaction <dbl>, religiousity <chr>,
#   political_scale <dbl>, marital_status <chr>, employment <chr>

Task #5

Reverse-code work_importance variables so that higher numbers consistently represent higher levels of the measured construct. In other words, 1 = Not at all important and 4 = Very important.

We can achieve this with mutate()!

wvs_data <- wvs_data |>
    mutate(work_importance_r = 5 - work_importance)

Preview of age groups:

# A tibble: 6,403 × 2
  work_importance_r work_importance
              <dbl>           <dbl>
1                 4               1
2                 1               4
3                 4               1
4                 2               3
5                 1               4
# ℹ 6,398 more rows

Task #6

Create age groups for each generation: “18-28”, “29-44”, “45-60”, “61+”

wvs_data <- wvs_data |>
    mutate(age_group = case_when(
        age <= 28 ~ "18-28",
        age <= 44 ~ "29-44",
        age <= 60 ~ "45-60",
        TRUE ~ "61+"
    ))

Preview of age groups:

# A tibble: 6,403 × 2
    age age_group
  <dbl> <chr>    
1    93 61+      
2    91 61+      
3    90 61+      
4    89 61+      
# ℹ 6,399 more rows

Checkpoint 1 - saving our hard work into a CSV file

We have done some cleaning! Let’s save this cleaned data into a separate CSV file called “wvs_cleaned_v1.csv”

wvs_data |> write_csv("data-output/wvs_cleaned_v1.csv")

Check the data output folder to make sure the CSV is created!

Group Exercise 1 (5 minutes)

Now that we have a new file, load this new wvs_cleaned_v1.csv into a new dataframe called wvs_cleaned. Filter to respondents from Singapore who are currently employed full time. Show only the Respondent ID, country, the employment status, and age. Use glimpse() or print() to check the result! (you can chain these functions at the end)

Step 1: Load the new file

Show answer

library(tidyverse)
wvs_cleaned <- read_csv("data-output/wvs_cleaned_v1.csv")

Step 2: Do the filtering and selecting, and then show result

Show answer

library(tidyverse)
wvs_cleaned <- read_csv("data-output/wvs_cleaned_v1.csv")

wvs_cleaned |> 
    filter(country == "SGP" & employment == "Full time") |> 
    select(ID, country, employment, age) |> 
    glimpse()

Rows: 857
Columns: 4
$ ID         <dbl> 702071000, 702070773, 702070363, 702070746, 702070026, 7020…
$ country    <chr> "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SG…
$ employment <chr> "Full time", "Full time", "Full time", "Full time", "Full t…
$ age        <dbl> 81, 79, 73, 70, 69, 68, 68, 68, 68, 68, 68, 67, 67, 67, 67,…

Scenario: Data wrangling activities with WVS data (continued)

Once we are done with the wrangling part, we can proceed with simple descriptive analysis!

Before we proceed further, convert the appropriate categorical variables (country, religiousity, sex, marital_status, employment) to Factor
Generate summary statistics of life_satisfaction grouped by country
Create a new column called satisfaction_group that indicate whether each respondent has higher or lower than average life_satisfaction
Reshape the data to show average satisfaction scores by country and age group

Task #7

Let’s use the wvs_cleaned dataframe for this task.

Convert the appropriate categorical variables (country, religiousity, sex, marital_status, employment) to Factor.

We can do this with mutate() and as_factor() from forcats, another sub-package within tidyverse.

wvs_cleaned <- wvs_cleaned |> 
    mutate(
        country = as_factor(country),
        religiousity = as_factor(religiousity),
        sex = as_factor(sex),
        marital_status = as_factor(marital_status),
        employment = as_factor(employment)
    )

# check conversion result
str(wvs_cleaned)

Rstudio may auto-suggest as.factor() from base R. You can use this as well, but as_factor() is preferred since we are using tidyverse approach.

Task #7

tibble [6,403 × 14] (S3: tbl_df/tbl/data.frame)
 $ ID                    : num [1:6403] 1.24e+08 5.54e+08 7.02e+08 1.24e+08 1.24e+08 ...
 $ country               : Factor w/ 3 levels "CAN","NZL","SGP": 1 2 3 1 1 1 2 1 1 2 ...
 $ sex                   : Factor w/ 2 levels "Male","Female": 1 2 1 2 2 1 2 1 2 1 ...
 $ birthyear             : num [1:6403] 1927 1928 1930 1931 1930 ...
 $ age                   : num [1:6403] 93 91 90 89 89 89 89 88 88 88 ...
 $ life_satisfaction     : num [1:6403] 7 5 8 6 6 8 9 8 8 10 ...
 $ work_importance       : num [1:6403] 1 4 1 3 4 4 2 1 4 2 ...
 $ financial_satisfaction: num [1:6403] 7 5 9 9 8 8 10 10 5 10 ...
 $ religiousity          : Factor w/ 4 levels "Not a religious person",..: 1 2 2 1 1 2 2 2 2 2 ...
 $ political_scale       : num [1:6403] 5 3 3 2 5 5 8 10 5 5 ...
 $ marital_status        : Factor w/ 6 levels "Widowed","Single",..: 1 1 1 1 2 3 1 1 1 1 ...
 $ employment            : Factor w/ 8 levels "Part time","Retired/pensioned",..: 1 2 2 2 2 2 2 2 2 2 ...
 $ work_importance_r     : num [1:6403] 4 1 4 2 1 1 3 4 1 3 ...
 $ age_group             : chr [1:6403] "61+" "61+" "61+" "61+" ...

Task #7 - the shortcut

If we have a lot of columns to convert, that might be troublesome to type! This is where across() can come in handy.

Let’s first define a character vector that contains the names of columns we plan to convert.

columns_to_convert <- c("country", "religiousity", "sex", "marital_status", "employment")

We will use this vector with mutate() and across(). We tell tidyverse to convert all of the columns with the help of all_of()

wvs_cleaned <- wvs_cleaned |> 
    mutate(across(all_of(columns_to_convert), as_factor))

# check conversion result
str(wvs_cleaned)

across() is used for applying the same function to multiple columns in a single mutate() or summarise() operation.

Task #7 - the shortcut

tibble [6,403 × 14] (S3: tbl_df/tbl/data.frame)
 $ ID                    : num [1:6403] 1.24e+08 5.54e+08 7.02e+08 1.24e+08 1.24e+08 ...
 $ country               : Factor w/ 3 levels "CAN","NZL","SGP": 1 2 3 1 1 1 2 1 1 2 ...
 $ sex                   : Factor w/ 2 levels "Male","Female": 1 2 1 2 2 1 2 1 2 1 ...
 $ birthyear             : num [1:6403] 1927 1928 1930 1931 1930 ...
 $ age                   : num [1:6403] 93 91 90 89 89 89 89 88 88 88 ...
 $ life_satisfaction     : num [1:6403] 7 5 8 6 6 8 9 8 8 10 ...
 $ work_importance       : num [1:6403] 1 4 1 3 4 4 2 1 4 2 ...
 $ financial_satisfaction: num [1:6403] 7 5 9 9 8 8 10 10 5 10 ...
 $ religiousity          : Factor w/ 4 levels "Not a religious person",..: 1 2 2 1 1 2 2 2 2 2 ...
 $ political_scale       : num [1:6403] 5 3 3 2 5 5 8 10 5 5 ...
 $ marital_status        : Factor w/ 6 levels "Widowed","Single",..: 1 1 1 1 2 3 1 1 1 1 ...
 $ employment            : Factor w/ 8 levels "Part time","Retired/pensioned",..: 1 2 2 2 2 2 2 2 2 2 ...
 $ work_importance_r     : num [1:6403] 4 1 4 2 1 1 3 4 1 3 ...
 $ age_group             : chr [1:6403] "61+" "61+" "61+" "61+" ...

Task #8

Generate summary statistics such as count (n), mean, median, and standard deviation of life_satisfaction grouped by country.

We can achieve this with group_by() and summarise()

wvs_data |>
    group_by(country, age_group) |>
    summarise(
        n = n(),
        mean_satisfaction = mean(life_satisfaction),
        median_satisfaction = median(life_satisfaction),
        sd_satisfaction = sd(life_satisfaction)
    ) |>
    arrange(country, desc(mean_satisfaction))

What if we want to save this into a CSV? What if we also want to group by country AND age_group?

Task #8

# A tibble: 12 × 6
# Groups:   country [3]
   country age_group     n mean_satisfaction median_satisfaction sd_satisfaction
   <chr>   <chr>     <int>             <dbl>               <dbl>           <dbl>
 1 CAN     61+        1013              7.55                   8            1.59
 2 CAN     29-44      1232              6.99                   7            1.80
 3 CAN     45-60      1061              6.92                   7            1.87
 4 CAN     18-28       712              6.60                   7            1.86
 5 NZL     61+         292              7.96                   8            1.66
 6 NZL     45-60       222              7.38                   8            1.90
 7 NZL     29-44       119              7.32                   8            1.69
 8 NZL     18-28        27              6.70                   7            1.98
 9 SGP     61+         418              7.17                   7            1.84
10 SGP     29-44       511              7.09                   7            1.70
11 SGP     45-60       550              7.05                   7            1.83
12 SGP     18-28       246              6.85                   7            1.68

Task #9

Create a new column called satisfaction_group that indicate whether each respondent has higher or lower than average life_satisfaction

mean_satisfaction <- mean(wvs_data$life_satisfaction, na.rm = TRUE)

wvs_data |>
    mutate(satisfaction_group = if_else(
        life_satisfaction > mean_satisfaction, # the condition to evaluate
        "higher", # if condition is fulfilled, do this
        "lower" # otherwise, do this
    ))

Task #9

# A tibble: 6,403 × 15
          ID country sex    birthyear   age life_satisfaction work_importance
       <dbl> <chr>   <chr>      <dbl> <dbl>             <dbl>           <dbl>
 1 124071434 CAN     Male        1927    93                 7               1
 2 554070524 NZL     Female      1928    91                 5               4
 3 702070265 SGP     Male        1930    90                 8               1
 4 124071321 CAN     Female      1931    89                 6               3
 5 124073566 CAN     Female      1930    89                 6               4
 6 124074356 CAN     Male        1931    89                 8               4
 7 554070216 NZL     Female      1930    89                 9               2
 8 124070295 CAN     Male        1932    88                 8               1
 9 124072193 CAN     Female      1958    88                 8               4
10 554070038 NZL     Male        1931    88                10               2
# ℹ 6,393 more rows
# ℹ 8 more variables: financial_satisfaction <dbl>, religiousity <chr>,
#   political_scale <dbl>, marital_status <chr>, employment <chr>,
#   work_importance_r <dbl>, age_group <chr>, satisfaction_group <chr>

Task #10

Show the average satisfaction scores by country and age group in wide data format

wvs_data |>
    group_by(country, age_group) |>
    summarise(
        avg_satisfaction = mean(life_satisfaction, na.rm = TRUE),
    ) |>
    pivot_wider(
        names_from = age_group,
        values_from = avg_satisfaction
    )

# A tibble: 3 × 5
# Groups:   country [3]
  country `18-28` `29-44` `45-60` `61+`
  <chr>     <dbl>   <dbl>   <dbl> <dbl>
1 CAN        6.60    6.99    6.92  7.55
2 NZL        6.70    7.32    7.38  7.96
3 SGP        6.85    7.09    7.05  7.17

Long vs Wide Data

Long data:

Each row is a unique observation.
There is a separate column indicating the variable or type of measurements
This format is more “understandable” by R, more suitable for visualizations.

Wide data:

Each row is a value in variables.
Each column is a value in variables –> the more values you have, the “wider” is the data
The cells represent theThis format is more intuitive for humans!

Long vs Wide Data: Examples

Long data:

Observations (Long)
country	age_group	count
CAN	18-28	712
CAN	29-44	1232
CAN	45-60	1061
CAN	61+	1013
NZL	18-28	27
NZL	29-44	119
NZL	45-60	222
NZL	61+	292
SGP	18-28	246
SGP	29-44	511
SGP	45-60	550
SGP	61+	418

Wide data:

Observations (Wide)
country	18-28	29-44	45-60	61+
CAN	712	1232	1061	1013
NZL	27	119	222	292
SGP	246	511	550	418

Group exercise 2 (solo attempts ok)

Time: 5 minutes

Generate a summary stats of age grouped by country and sex. The summary stats should include mean, median, max, min, std, and n (number of observations). It should look something like this:

Show answer

wvs_data |> 
    group_by(country, sex) |> 
    summarise(observation = n(), 
              mean_age = mean(age, na.rm = TRUE),
              median_age = median(age, na.rm = TRUE), 
              oldest = max(age, na.rm = TRUE),
              youngest = min(age, na.rm = TRUE),
              std_dev = sd(age, na.rm = TRUE))

Group exercise 2 (solo attempts ok)

# A tibble: 6 × 8
# Groups:   country [3]
  country sex    observation mean_age median_age oldest youngest std_dev
  <chr>   <chr>        <int>    <dbl>      <dbl>  <dbl>    <dbl>   <dbl>
1 CAN     Female        1959     44.2         42     89       18    17.0
2 CAN     Male          2059     48.8         49     93       18    16.4
3 NZL     Female         354     56.6         58     91       19    15.5
4 NZL     Male           306     57.3         59     88       20    15.4
5 SGP     Female         919     47.7         47     87       21    15.3
6 SGP     Male           806     47.9         48     90       21    16.2

Bonus: Deleting columns from dataframe

Let’s say I have this column called wrong_column that I want to remove:

wvs_data <- wvs_data |> mutate(wrong_column = "random values")
wvs_data |> select(country, wrong_column) |> print(n = 3)

# A tibble: 6,403 × 2
  country wrong_column 
  <chr>   <chr>        
1 CAN     random values
2 NZL     random values
3 SGP     random values
# ℹ 6,400 more rows

Remove the wrong column with subset `-`:

wvs_data <- wvs_data |> 
    select(-wrong_column)

# A tibble: 6,403 × 1
  country
  <chr>  
1 CAN    
2 NZL    
3 SGP    
# ℹ 6,400 more rows

Recap

Import and read data into RStudio: Load external data files (like CSV or Excel) into R using functions such as read_csv() to make the data available for analysis.
Data wrangling with dplyr and tidyr (part of the tidyverse package): Use tidyverse functions to tidy and reshape datasets and perform tasks like selecting, filtering, and summarizing data.
Remove all rows with missing values (NA) and check for duplicates: Delete rows containing missing data using na.omit() and identify or remove duplicate rows with functions like distinct().
Select only the relevant columns: Use select() to keep only the columns needed for the analysis, focusing on important variables.
Filter data based on criterion and arrange data: Apply filter() to keep rows meeting specific conditions and use arrange() to sort data by one or more columns.
Change factor level values: Modify the categories of a factor using factor() or recode() to rename or reorder levels for clearer analysis.
Create new columns: Generate or modify columns using mutate(), often by transforming or combining existing columns.

End of Session 2!

Next session: Descriptive statistics and data visualization with ggplot2 package - we’ll create visualizations to explore patterns in life satisfaction, values, and demographics across countries!

Data Wrangling with Tidyverse

Today’s Outline

Checklist when you start RStudio

Refresher: Loading from CSV into a dataframe

Cleaning data for analysis

About dplyr and tidyr

Scenario: Data wrangling activities with WVS data

Let’s wrangle our data!

Task #1

Interlude: Pipe Operator ( |> )

Task #2

Task #3:

Task #4:

Task #5

Task #6

Checkpoint 1 - saving our hard work into a CSV file

Group Exercise 1 (5 minutes)

Scenario: Data wrangling activities with WVS data (continued)

Task #7

Task #7

Task #7 - the shortcut

Task #7 - the shortcut

Task #8

Task #8

Task #9

Task #9

Task #10

Long vs Wide Data

Long vs Wide Data: Examples

Group exercise 2 (solo attempts ok)

Group exercise 2 (solo attempts ok)

Bonus: Deleting columns from dataframe

Remove the wrong column with subset -:

Recap

End of Session 2!

Remove the wrong column with subset `-`: