Data Wrangling with Tidyverse

Bella Ratmelia

Today’s Outline

  1. Loading our data into RStudio environment
  2. Data wrangling with dplyr and tidyr (part of the tidyverse package)

Checklist when you start RStudio

  • Load the project we created last session and open the R script file. Click on File > Recent Projects... > Select the our project from last week.
  • Make sure that Environment panel is empty (click on broom icon to clean it up).
  • Clear the Console and Plots too.
  • Re-run the library(tidyverse) and read_csv portion in the previous session (the code is also on the next slide if you missed last week’s session)

Refresher: Loading from CSV into a dataframe

Use read_csv from readr package (part of tidyverse) to load our World Values Survey data. More information about the data can be found under the Dataset tab in the course website.

# import tidyverse library
library(tidyverse)

# read the CSV and save into a dataframe called wvs_data
wvs_data <- read_csv("data/wvs-wave7-sg-ca-nz.csv")

# "peek" at the data, pay attention to the data types!
glimpse(wvs_data)

Cleaning data for analysis

  • Why do it in R? Because it’s much efficient to do so in R, especially if your data is large (e.g. millions of rows, thousands of columns) and you have repetitive clean up tasks.
  • Incorrect or inconsistent data can lead to false conclusions, so it’s important to clean and prep it correctly.
  • Having a clear understanding of the desired data shape is essential as real data often differs from what you imagine! Refer to codebook, actual questionnaire, appendix for guidance.
  • Data cleaning techniques differ based on the problems, data type, and the research questions you are trying to answer. Various methods are available, each with its own trade-offs.

About dplyr and tidyr

  • Packages from tidyverse. (click here to go to the tidyverse homepage)

  • Posit have created cheatsheets here! (you can have this open in another tab for reference for this session!)

  • Most of the time, these are the ones that you will use quite often:

    • drop_na() - remove rows with null values

    • select() - to select column(s) from a dataframe

    • filter() - to filter rows based on criteria

    • mutate() - to compute new columns or edit existing ones

    • if_else() and case_when() - to be used with mutate when we want to compute/edit columns based on multiple criteria

    • group_by() and summarize() - group data and summarize each group

Scenario: Data wrangling activities with WVS data

Scenario: We are research assistants analyzing patterns in values and satisfaction across different countries and demographic groups.

Our team has been assigned to explore and get insights on how specific factors (employment, work importance, marital status, political alignment, financial satisfaction, and religiosity) may relate to life satisfaction among different generations (Gen Z, Millennials, Gen X, and Baby Boomer), including how they may differ between the 3 countries.

To ensure the analysis quality, we were instructed to discard incomplete data.

We can break down the tasks as such:

  1. Remove all rows with missing values (NA)

  2. Check for duplicates

  3. Select only the relevant columns: demographic columns, work_importance, life_satisfaction, financial_satisfaction, religiousity, political_scale, marital_status, and employment.

  4. Filter for respondents aged 18 or older. Optionally, we can then arrange the dataset by age (oldest to youngest)

  5. Reverse-code importance variables so that higher numbers consistently represent higher levels of the measured construct. In other words, 1 = Not at all important and 4 = Very important.

  6. Create age groups for each generation: “18-28”, “29-44”, “45-60”, “61+”

Once we did all of the wrangling above, we can save this “wrangled” version into another CSV

Let’s wrangle our data!

Task #1

A strategy I’d like to recommend: briefly read over the dplyr + tidyr documentation, either the PDF or HTML version, and have them open on a separate tab so that you can refer to it quickly.

Remove all rows with empty values (NA) with drop_na()

wvs_data <- wvs_data |> 
  drop_na()

The number of observations after removing NAs:

dim(wvs_data)
[1] 6403   16

Interlude: Pipe Operator ( |> )

  • The pipe operator (|>) allows us to chain multiple operations without creating intermediate / temporary dataframes.

  • Super handy when we perform several data wrangling tasks using tidyverse in sequence.

  • Helps with readability, especially for complex operations.

  • Keyboard shortcut: Ctrl+Shift+M on Windows, Cmd+Shift+M on Mac

Notice that we have to create a “temp” dataframes called wvs_data_clean in this method.

wvs_data <- drop_na(wvs_data)
wvs_data_clean <- wvs_data_clean(wvs_data, desc(age))
write_csv(wvs_data_clean, "data-output/wvs-clean.csv")

No “temporary” dataframe needed here! :D

wvs_data |> 
    drop_na() |> 
    distinct(ID, .keep_all = TRUE) |> 
    write_csv("data-output/wvs-30plus.csv")

Task #2

Check for duplicates with distinct()

wvs_data <- wvs_data |> 
  distinct(ID, .keep_all = TRUE)

(Our data has no duplicates, but this is still a good practice to do, especially if we were combining data from multipe sources)

Task #3:

Select only the relevant columns: demographic columns, life_satisfaction, work_importance, financial_satisfaction, religiousity, political_scale, marital_status, and employment.

We can achieve this with select()!

wvs_data <- wvs_data |>
    select(ID, country, sex, birthyear, age, life_satisfaction, work_importance, financial_satisfaction, religiousity, political_scale, marital_status, employment) 

Preview of the filtered data:

Rows: 6,403
Columns: 12
$ ID                     <dbl> 124070003, 124070004, 124070005, 124070006, 124…
$ country                <chr> "CAN", "CAN", "CAN", "CAN", "CAN", "CAN", "CAN"…
$ sex                    <chr> "Female", "Male", "Female", "Male", "Male", "Ma…
$ birthyear              <dbl> 1944, 1951, 1984, 1975, 1988, 1982, 1981, 1985,…
$ age                    <dbl> 76, 69, 35, 45, 32, 38, 38, 34, 65, 31, 27, 33,…
$ life_satisfaction      <dbl> 5, 8, 9, 7, 1, 7, 9, 6, 7, 6, 7, 10, 8, 10, 6, …
$ work_importance        <dbl> 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 4, 1, 3, 3,…
$ financial_satisfaction <dbl> 8, 2, 8, 8, 9, 8, 9, 5, 6, 9, 5, 7, 9, 9, 7, 8,…
$ religiousity           <chr> "A religious person", "A religious person", "An…
$ political_scale        <dbl> 10, 5, 2, 8, 6, 6, 7, 5, 6, 4, 5, 4, 1, 7, 2, 8…
$ marital_status         <chr> "Separated", "Married", "Living together as mar…
$ employment             <chr> "Retired/pensioned", "Retired/pensioned", "Part…

Task #4:

Filter for respondents aged 18 or older. Optionally, we can then arrange the dataset by age (oldest to youngest)

wvs_data <- wvs_data |> 
    filter(age >= 18) |> 
    arrange(desc(age))

Checking the structure:

# A tibble: 6 × 12
         ID country sex    birthyear   age life_satisfaction work_importance
      <dbl> <chr>   <chr>      <dbl> <dbl>             <dbl>           <dbl>
1 124071434 CAN     Male        1927    93                 7               1
2 554070524 NZL     Female      1928    91                 5               4
3 702070265 SGP     Male        1930    90                 8               1
4 124071321 CAN     Female      1931    89                 6               3
5 124073566 CAN     Female      1930    89                 6               4
6 124074356 CAN     Male        1931    89                 8               4
# ℹ 5 more variables: financial_satisfaction <dbl>, religiousity <chr>,
#   political_scale <dbl>, marital_status <chr>, employment <chr>

Task #5

Reverse-code work_importance variables so that higher numbers consistently represent higher levels of the measured construct. In other words, 1 = Not at all important and 4 = Very important.

We can achieve this with mutate()!

wvs_data <- wvs_data |>
    mutate(work_importance_r = 5 - work_importance)

Preview of age groups:

# A tibble: 6,403 × 2
  work_importance_r work_importance
              <dbl>           <dbl>
1                 4               1
2                 1               4
3                 4               1
4                 2               3
5                 1               4
# ℹ 6,398 more rows

Task #6

Create age groups for each generation: “18-28”, “29-44”, “45-60”, “61+”

wvs_data <- wvs_data |>
    mutate(age_group = case_when(
        age <= 28 ~ "18-28",
        age <= 44 ~ "29-44",
        age <= 60 ~ "45-60",
        TRUE ~ "61+"
    ))

Preview of age groups:

# A tibble: 6,403 × 2
    age age_group
  <dbl> <chr>    
1    93 61+      
2    91 61+      
3    90 61+      
4    89 61+      
# ℹ 6,399 more rows

Checkpoint 1 - saving our hard work into a CSV file

We have done some cleaning! Let’s save this cleaned data into a separate CSV file called “wvs_cleaned_v1.csv”

wvs_data |> write_csv("data-output/wvs_cleaned_v1.csv")

Check the data output folder to make sure the CSV is created!

Group Exercise 1 (5 minutes)

Now that we have a new file, load this new wvs_cleaned_v1.csv into a new dataframe called wvs_cleaned. Filter to respondents from Singapore who are currently employed full time. Show only the Respondent ID, country, the employment status, and age. Use glimpse() or print() to check the result! (you can chain these functions at the end)

Step 1: Load the new file

Show answer
library(tidyverse)
wvs_cleaned <- read_csv("data-output/wvs_cleaned_v1.csv")

Step 2: Do the filtering and selecting, and then show result

Show answer
library(tidyverse)
wvs_cleaned <- read_csv("data-output/wvs_cleaned_v1.csv")

wvs_cleaned |> 
    filter(country == "SGP" & employment == "Full time") |> 
    select(ID, country, employment, age) |> 
    glimpse()
Rows: 857
Columns: 4
$ ID         <dbl> 702071000, 702070773, 702070363, 702070746, 702070026, 7020…
$ country    <chr> "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SG…
$ employment <chr> "Full time", "Full time", "Full time", "Full time", "Full t…
$ age        <dbl> 81, 79, 73, 70, 69, 68, 68, 68, 68, 68, 68, 67, 67, 67, 67,…

Scenario: Data wrangling activities with WVS data (continued)

Once we are done with the wrangling part, we can proceed with simple descriptive analysis!

  1. Before we proceed further, convert the appropriate categorical variables (country, religiousity, sex, marital_status, employment) to Factor

  2. Generate summary statistics of life_satisfaction grouped by country

  3. Create a new column called satisfaction_group that indicate whether each respondent has higher or lower than average life_satisfaction

  4. Reshape the data to show average satisfaction scores by country and age group

Task #7

Let’s use the wvs_cleaned dataframe for this task.

Convert the appropriate categorical variables (country, religiousity, sex, marital_status, employment) to Factor.

We can do this with mutate() and as_factor() from forcats, another sub-package within tidyverse.

wvs_cleaned <- wvs_cleaned |> 
    mutate(
        country = as_factor(country),
        religiousity = as_factor(religiousity),
        sex = as_factor(sex),
        marital_status = as_factor(marital_status),
        employment = as_factor(employment)
    )

# check conversion result
str(wvs_cleaned)

Rstudio may auto-suggest as.factor() from base R. You can use this as well, but as_factor() is preferred since we are using tidyverse approach.

Task #7

tibble [6,403 × 14] (S3: tbl_df/tbl/data.frame)
 $ ID                    : num [1:6403] 1.24e+08 5.54e+08 7.02e+08 1.24e+08 1.24e+08 ...
 $ country               : Factor w/ 3 levels "CAN","NZL","SGP": 1 2 3 1 1 1 2 1 1 2 ...
 $ sex                   : Factor w/ 2 levels "Male","Female": 1 2 1 2 2 1 2 1 2 1 ...
 $ birthyear             : num [1:6403] 1927 1928 1930 1931 1930 ...
 $ age                   : num [1:6403] 93 91 90 89 89 89 89 88 88 88 ...
 $ life_satisfaction     : num [1:6403] 7 5 8 6 6 8 9 8 8 10 ...
 $ work_importance       : num [1:6403] 1 4 1 3 4 4 2 1 4 2 ...
 $ financial_satisfaction: num [1:6403] 7 5 9 9 8 8 10 10 5 10 ...
 $ religiousity          : Factor w/ 4 levels "Not a religious person",..: 1 2 2 1 1 2 2 2 2 2 ...
 $ political_scale       : num [1:6403] 5 3 3 2 5 5 8 10 5 5 ...
 $ marital_status        : Factor w/ 6 levels "Widowed","Single",..: 1 1 1 1 2 3 1 1 1 1 ...
 $ employment            : Factor w/ 8 levels "Part time","Retired/pensioned",..: 1 2 2 2 2 2 2 2 2 2 ...
 $ work_importance_r     : num [1:6403] 4 1 4 2 1 1 3 4 1 3 ...
 $ age_group             : chr [1:6403] "61+" "61+" "61+" "61+" ...

Task #7 - the shortcut

If we have a lot of columns to convert, that might be troublesome to type! This is where across() can come in handy.

Let’s first define a character vector that contains the names of columns we plan to convert.

columns_to_convert <- c("country", "religiousity", "sex", "marital_status", "employment")

We will use this vector with mutate() and across(). We tell tidyverse to convert all of the columns with the help of all_of()

wvs_cleaned <- wvs_cleaned |> 
    mutate(across(all_of(columns_to_convert), as_factor))

# check conversion result
str(wvs_cleaned)

across() is used for applying the same function to multiple columns in a single mutate() or summarise() operation.

Task #7 - the shortcut

tibble [6,403 × 14] (S3: tbl_df/tbl/data.frame)
 $ ID                    : num [1:6403] 1.24e+08 5.54e+08 7.02e+08 1.24e+08 1.24e+08 ...
 $ country               : Factor w/ 3 levels "CAN","NZL","SGP": 1 2 3 1 1 1 2 1 1 2 ...
 $ sex                   : Factor w/ 2 levels "Male","Female": 1 2 1 2 2 1 2 1 2 1 ...
 $ birthyear             : num [1:6403] 1927 1928 1930 1931 1930 ...
 $ age                   : num [1:6403] 93 91 90 89 89 89 89 88 88 88 ...
 $ life_satisfaction     : num [1:6403] 7 5 8 6 6 8 9 8 8 10 ...
 $ work_importance       : num [1:6403] 1 4 1 3 4 4 2 1 4 2 ...
 $ financial_satisfaction: num [1:6403] 7 5 9 9 8 8 10 10 5 10 ...
 $ religiousity          : Factor w/ 4 levels "Not a religious person",..: 1 2 2 1 1 2 2 2 2 2 ...
 $ political_scale       : num [1:6403] 5 3 3 2 5 5 8 10 5 5 ...
 $ marital_status        : Factor w/ 6 levels "Widowed","Single",..: 1 1 1 1 2 3 1 1 1 1 ...
 $ employment            : Factor w/ 8 levels "Part time","Retired/pensioned",..: 1 2 2 2 2 2 2 2 2 2 ...
 $ work_importance_r     : num [1:6403] 4 1 4 2 1 1 3 4 1 3 ...
 $ age_group             : chr [1:6403] "61+" "61+" "61+" "61+" ...

Task #8

Generate summary statistics such as count (n), mean, median, and standard deviation of life_satisfaction grouped by country.

We can achieve this with group_by() and summarise()

wvs_data |>
    group_by(country, age_group) |>
    summarise(
        n = n(),
        mean_satisfaction = mean(life_satisfaction),
        median_satisfaction = median(life_satisfaction),
        sd_satisfaction = sd(life_satisfaction)
    ) |>
    arrange(country, desc(mean_satisfaction))

What if we want to save this into a CSV? What if we also want to group by country AND age_group?

Task #8

# A tibble: 12 × 6
# Groups:   country [3]
   country age_group     n mean_satisfaction median_satisfaction sd_satisfaction
   <chr>   <chr>     <int>             <dbl>               <dbl>           <dbl>
 1 CAN     61+        1013              7.55                   8            1.59
 2 CAN     29-44      1232              6.99                   7            1.80
 3 CAN     45-60      1061              6.92                   7            1.87
 4 CAN     18-28       712              6.60                   7            1.86
 5 NZL     61+         292              7.96                   8            1.66
 6 NZL     45-60       222              7.38                   8            1.90
 7 NZL     29-44       119              7.32                   8            1.69
 8 NZL     18-28        27              6.70                   7            1.98
 9 SGP     61+         418              7.17                   7            1.84
10 SGP     29-44       511              7.09                   7            1.70
11 SGP     45-60       550              7.05                   7            1.83
12 SGP     18-28       246              6.85                   7            1.68

Task #9

Create a new column called satisfaction_group that indicate whether each respondent has higher or lower than average life_satisfaction

mean_satisfaction <- mean(wvs_data$life_satisfaction, na.rm = TRUE)

wvs_data |>
    mutate(satisfaction_group = if_else(
        life_satisfaction > mean_satisfaction, # the condition to evaluate
        "higher", # if condition is fulfilled, do this
        "lower" # otherwise, do this
    )) 

Task #9

# A tibble: 6,403 × 15
          ID country sex    birthyear   age life_satisfaction work_importance
       <dbl> <chr>   <chr>      <dbl> <dbl>             <dbl>           <dbl>
 1 124071434 CAN     Male        1927    93                 7               1
 2 554070524 NZL     Female      1928    91                 5               4
 3 702070265 SGP     Male        1930    90                 8               1
 4 124071321 CAN     Female      1931    89                 6               3
 5 124073566 CAN     Female      1930    89                 6               4
 6 124074356 CAN     Male        1931    89                 8               4
 7 554070216 NZL     Female      1930    89                 9               2
 8 124070295 CAN     Male        1932    88                 8               1
 9 124072193 CAN     Female      1958    88                 8               4
10 554070038 NZL     Male        1931    88                10               2
# ℹ 6,393 more rows
# ℹ 8 more variables: financial_satisfaction <dbl>, religiousity <chr>,
#   political_scale <dbl>, marital_status <chr>, employment <chr>,
#   work_importance_r <dbl>, age_group <chr>, satisfaction_group <chr>

Task #10

Show the average satisfaction scores by country and age group in wide data format

wvs_data |>
    group_by(country, age_group) |>
    summarise(
        avg_satisfaction = mean(life_satisfaction, na.rm = TRUE),
    ) |>
    pivot_wider(
        names_from = age_group,
        values_from = avg_satisfaction
    ) 
# A tibble: 3 × 5
# Groups:   country [3]
  country `18-28` `29-44` `45-60` `61+`
  <chr>     <dbl>   <dbl>   <dbl> <dbl>
1 CAN        6.60    6.99    6.92  7.55
2 NZL        6.70    7.32    7.38  7.96
3 SGP        6.85    7.09    7.05  7.17

Long vs Wide Data

Long data:

  • Each row is a unique observation.

  • There is a separate column indicating the variable or type of measurements

  • This format is more “understandable” by R, more suitable for visualizations.

Wide data:

  • Each row is a value in variables.

  • Each column is a value in variables –> the more values you have, the “wider” is the data

  • The cells represent theThis format is more intuitive for humans!

Long vs Wide Data: Examples

Long data:

Observations (Long)
country age_group count
CAN 18-28 712
CAN 29-44 1232
CAN 45-60 1061
CAN 61+ 1013
NZL 18-28 27
NZL 29-44 119
NZL 45-60 222
NZL 61+ 292
SGP 18-28 246
SGP 29-44 511
SGP 45-60 550
SGP 61+ 418

Wide data:

Observations (Wide)
country 18-28 29-44 45-60 61+
CAN 712 1232 1061 1013
NZL 27 119 222 292
SGP 246 511 550 418

Group exercise 2 (solo attempts ok)

Time: 5 minutes

Generate a summary stats of age grouped by country and sex. The summary stats should include mean, median, max, min, std, and n (number of observations). It should look something like this:

Show answer
wvs_data |> 
    group_by(country, sex) |> 
    summarise(observation = n(), 
              mean_age = mean(age, na.rm = TRUE),
              median_age = median(age, na.rm = TRUE), 
              oldest = max(age, na.rm = TRUE),
              youngest = min(age, na.rm = TRUE),
              std_dev = sd(age, na.rm = TRUE))

Group exercise 2 (solo attempts ok)

# A tibble: 6 × 8
# Groups:   country [3]
  country sex    observation mean_age median_age oldest youngest std_dev
  <chr>   <chr>        <int>    <dbl>      <dbl>  <dbl>    <dbl>   <dbl>
1 CAN     Female        1959     44.2         42     89       18    17.0
2 CAN     Male          2059     48.8         49     93       18    16.4
3 NZL     Female         354     56.6         58     91       19    15.5
4 NZL     Male           306     57.3         59     88       20    15.4
5 SGP     Female         919     47.7         47     87       21    15.3
6 SGP     Male           806     47.9         48     90       21    16.2

Bonus: Deleting columns from dataframe

Let’s say I have this column called wrong_column that I want to remove:

wvs_data <- wvs_data |> mutate(wrong_column = "random values")
wvs_data |> select(country, wrong_column) |> print(n = 3)
# A tibble: 6,403 × 2
  country wrong_column 
  <chr>   <chr>        
1 CAN     random values
2 NZL     random values
3 SGP     random values
# ℹ 6,400 more rows

Remove the wrong column with subset -:

wvs_data <- wvs_data |> 
    select(-wrong_column)
# A tibble: 6,403 × 1
  country
  <chr>  
1 CAN    
2 NZL    
3 SGP    
# ℹ 6,400 more rows

End of Session 2!

Next session: Descriptive statistics and data visualization with ggplot2 package - we’ll create visualizations to explore patterns in life satisfaction, values, and demographics across countries!