Data Wrangling with Tidyverse

Bella Ratmelia

Today’s Outline

  1. Loading our data into RStudio environment
  2. Data wrangling with dplyr and tidyr (part of the tidyverse package)

Checklist when you start RStudio

  • Navigate to where you create your R project folder for this workshop.
  • Find a file with .Rproject extension and double-click on it. This will start RStudio and open your project.
  • Make sure that Environment panel is empty (click on broom icon to clean it up).
  • Clear the Console and Plots too.
  • Re-run the library(tidyverse) and read_csv portion in the previous session (the code is also on the next slide if you missed last week’s session)

Refresher: Loading from CSV into a dataframe

Use read_csv from readr package (part of tidyverse) to load our World Values Survey data. More information about the data can be found under the Dataset tab in the course website.

# import tidyverse library
library(tidyverse)

# read the CSV and save into a dataframe called wvs_data
wvs_data <- read_csv("data/wvs-wave7-sg-ca-nz.csv")

# "peek" at the data, pay attention to the data types!
glimpse(wvs_data)

Why do we need to clean our data?

  • Researchers spend 60-80% of their time on data preparation - investing in proper wrangling upfront saves hours of debugging and rework later.
  • Poor data quality costs projects or organizations millions annually(!), and even one misclassified data point can skew entire analyses.
  • Clean data improves the workflow for statistical tests and can reveal hidden patterns that aren’t visible in messy datasets
  • Well-documented data cleaning makes research transparent and reproducible, benefiting both future-you and collaborators.

Why do it in R? Can’t I do this on excel?

  • R code creates a permanent record of exactly what you did - Excel clicks and manual edits are hard to track and reproduce (unless you note down what you did).
  • R handles dates, times, text, and mixed data types much more reliably than Excel.
  • Excel can automatically “correct” your data in unwanted ways (like converting your dates from DDMMYY to MMDDYY)
  • Write the cleaning code once, apply it to multiple datasets or when data updates. Whereas Excel requires manual repetition of the same steps each time.
  • Complex joins, reshaping, and conditional operations that would take many Excel steps can be done in a few lines of R code.
  • R scripts can be easily shared, version controlled, and integrated into reproducible workflows

What do I need to know before we begin?

  • Always examine your data first (glimpse(), summary(), str()) before cleaning.
  • Never overwrite your raw data files
  • Budget extra time for data cleaning - it almost always takes longer than expected!
  • Having a clear understanding of the desired data shape is essential as real data often differs from what you imagine! Refer to codebook, actual questionnaire, appendix for guidance.
  • Look out for outliers!
  • Remember: Data cleaning techniques differ based on the problems, data type, and the research questions you are trying to answer. Various methods are available, each with its own trade-offs.

The tools we’re going to use: dplyr and tidyr from tidyverse

  • Packages from tidyverse. (click here to go to the tidyverse homepage)

  • Posit have created cheatsheets here! (you can have this open in another tab for reference for this session!)

  • Most of the time, these are the ones that you will use quite often:

    • drop_na() - remove rows with null values

    • select() - to select column(s) from a dataframe

    • filter() - to filter rows based on criteria

    • mutate() - to compute new columns or edit existing ones

    • if_else() and case_when() - to be used with mutate when we want to compute/edit columns based on multiple criteria

    • group_by() and summarize() - group data and summarize each group

More advanced ones which is not covered in the workshop (see appendix for examples)

  • Joining multiple dataframes’ columns: left_join(), right_join(), inner_join(), full_join()

  • Joining multiple dataframes’ rows: bind_rows()

  • Splitting or combining cells: unite(), separate_wider_delim(), separate_long_delim()

Additionally: you may need the forcats cheatsheet here: https://raw.githubusercontent.com/rstudio/cheatsheets/main/factors.pdf

forcats is also part of tidyverse, specifically used to handle factor data.

Why are we using tidyverse? Is it a must?

It’s not a must, but it’ll be make our job much easier! Key advantages:

  • More ‘English’ looking code and thus more intuitive for beginners
  • Tidyverse follow consistent syntax patterns - first argument is always the data to work on, followed by the function name i.e the things that we want to do on to the data.
  • Tidyverse gives clearer, more helpful error messages

Example:

# Base R
subset_data <- data[data$age > 18 & data$income > 50000, c("name", "age", "salary")]

# Tidyverse  
subset_data <- data |> 
  filter(age > 18, income > 50000)  |> 
  select(name, age, salary)

Scenario: Data wrangling activities with WVS data

Scenario: We are research assistants analyzing patterns in values and satisfaction across different countries and demographic groups.

Our team has been assigned to explore and get insights on how specific factors (employment, work importance, marital status, political alignment, financial satisfaction, and religiosity) may relate to life satisfaction among different generations (Gen Z, Millennials, Gen X, and Baby Boomer), including how they may differ between the 3 countries.

To ensure the analysis quality, we were instructed to discard incomplete data.

We can break down the steps as such:

  1. Remove all rows with missing values (NA)

  2. Check for duplicates

  3. Select only the relevant columns: participants’ ID, demographic columns (i.e. age, country, sex, birthyear), work_importance, life_satisfaction, financial_satisfaction, religiousity, political_scale, marital_status, and employment.

  4. Filter for respondents aged 18 or older. Optionally, we can then arrange the dataset by age (oldest to youngest)

  5. Reverse-code importance variables so that higher numbers consistently represent higher levels of the measured construct. In other words, 1 = Not at all important and 4 = Very important.

  6. Create age groups for each generation: “18-28”, “29-44”, “45-60”, “61+”

Once we did all of the wrangling above, we can save this “wrangled” version into another CSV as a “checkpoint”.

Let’s wrangle our data!

Step #1

A strategy I’d like to recommend: briefly read over the dplyr + tidyr documentation, either the PDF or HTML version, and have them open on a separate tab so that you can refer to it quickly.

Remove all rows with empty values (NA) with drop_na()

wvs_data <- wvs_data |> 
  drop_na()

The number of observations after removing NAs:

dim(wvs_data)
[1] 6403   16

Interlude: Pipe Operator ( |> )

  • The pipe operator (|>) allows us to chain multiple operations without creating intermediate / temporary dataframes.

  • Super handy when we perform several data wrangling tasks using tidyverse in sequence.

  • Helps with readability, especially for complex operations.

  • Keyboard shortcut: Ctrl+Shift+M on Windows, Cmd+Shift+M on Mac

Notice that we have to create a “temp” dataframes called wvs_data_clean in this method.

wvs_data <- drop_na(wvs_data)
wvs_data_clean <- wvs_data_clean(wvs_data, desc(age))
write_csv(wvs_data_clean, "data-output/wvs-clean.csv")

No “temporary” dataframe needed here! :D

wvs_data |> 
    drop_na() |> 
    distinct(ID, .keep_all = TRUE) |> 
    write_csv("data-output/wvs-30plus.csv")

Step #2

Check for duplicates with distinct()

wvs_data <- wvs_data |> 
  distinct(ID, .keep_all = TRUE)

(Our data has no duplicates, but this is still a good practice to do, especially if we were combining data from multipe sources)

Step #3:

Select only the relevant columns: demographic columns, life_satisfaction, work_importance, financial_satisfaction, religiousity, political_scale, marital_status, and employment.

We can achieve this with select()!

wvs_data <- wvs_data |>
    select(ID, country, sex, birthyear, age, life_satisfaction, work_importance, financial_satisfaction, religiousity, political_scale, marital_status, employment) 

Preview of the filtered data:

Rows: 6,403
Columns: 12
$ ID                     <dbl> 124070003, 124070004, 124070005, 124070006, 124…
$ country                <chr> "CAN", "CAN", "CAN", "CAN", "CAN", "CAN", "CAN"…
$ sex                    <chr> "Female", "Male", "Female", "Male", "Male", "Ma…
$ birthyear              <dbl> 1944, 1951, 1984, 1975, 1988, 1982, 1981, 1985,…
$ age                    <dbl> 76, 69, 35, 45, 32, 38, 38, 34, 65, 31, 27, 33,…
$ life_satisfaction      <dbl> 5, 8, 9, 7, 1, 7, 9, 6, 7, 6, 7, 10, 8, 10, 6, …
$ work_importance        <dbl> 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 4, 1, 3, 3,…
$ financial_satisfaction <dbl> 8, 2, 8, 8, 9, 8, 9, 5, 6, 9, 5, 7, 9, 9, 7, 8,…
$ religiousity           <chr> "A religious person", "A religious person", "An…
$ political_scale        <dbl> 10, 5, 2, 8, 6, 6, 7, 5, 6, 4, 5, 4, 1, 7, 2, 8…
$ marital_status         <chr> "Separated", "Married", "Living together as mar…
$ employment             <chr> "Retired/pensioned", "Retired/pensioned", "Part…

Step #4:

Filter for respondents aged 18 or older. Optionally, we can then arrange the dataset by age (oldest to youngest)

wvs_data <- wvs_data |> 
    filter(age >= 18) |> 
    arrange(desc(age))

Checking the structure:

# A tibble: 6 × 12
         ID country sex    birthyear   age life_satisfaction work_importance
      <dbl> <chr>   <chr>      <dbl> <dbl>             <dbl>           <dbl>
1 124071434 CAN     Male        1927    93                 7               1
2 554070524 NZL     Female      1928    91                 5               4
3 702070265 SGP     Male        1930    90                 8               1
4 124071321 CAN     Female      1931    89                 6               3
5 124073566 CAN     Female      1930    89                 6               4
6 124074356 CAN     Male        1931    89                 8               4
# ℹ 5 more variables: financial_satisfaction <dbl>, religiousity <chr>,
#   political_scale <dbl>, marital_status <chr>, employment <chr>

Step #5

Reverse-code work_importance variables so that higher numbers consistently represent higher levels of the measured construct. In other words, 1 = Not at all important and 4 = Very important.

We can achieve this with mutate()!

wvs_data <- wvs_data |>
    mutate(work_importance_r = 5 - work_importance)

Preview of age groups:

# A tibble: 6,403 × 2
  work_importance_r work_importance
              <dbl>           <dbl>
1                 4               1
2                 1               4
3                 4               1
4                 2               3
5                 1               4
# ℹ 6,398 more rows

Step #6

Create age groups for each generation: “18-28”, “29-44”, “45-60”, “61+”

wvs_data <- wvs_data |>
    mutate(age_group = case_when(
        age <= 28 ~ "18-28",
        age <= 44 ~ "29-44",
        age <= 60 ~ "45-60",
        TRUE ~ "61+"
    ))

Preview of age groups:

# A tibble: 6,403 × 2
    age age_group
  <dbl> <chr>    
1    93 61+      
2    91 61+      
3    90 61+      
4    89 61+      
# ℹ 6,399 more rows

Checkpoint 1 - saving our hard work into a CSV file

We have done some cleaning! Let’s save this cleaned data into a separate CSV file called “wvs_cleaned_v1.csv”

wvs_data |> write_csv("data-output/wvs_cleaned_v1.csv")

Check the data output folder to make sure the CSV is created!

Simple descriptive analysis

Once we are done with the wrangling part, we can proceed with simple descriptive analysis!

  • Prep: Before we proceed further, convert the appropriate categorical variables (country, religiousity, sex, marital_status, employment) to Factor

  • Analysis 1: Generate summary statistics of life_satisfaction grouped by country

  • Analysis 2: Create a new column called satisfaction_group that indicate whether each respondent has higher or lower than average life_satisfaction

  • Analysis 3: Reshape the data to show average satisfaction scores by country and age group

Prep before analysis: convert to factors

Let’s use the wvs_cleaned dataframe for this task.

Convert the appropriate categorical variables (country, religiousity, sex, marital_status, employment) to Factor.

We can do this with mutate() and as_factor() from forcats, another sub-package within tidyverse.

wvs_cleaned <- read_csv("data-output/wvs_cleaned_v1.csv")
wvs_cleaned <- wvs_cleaned |> 
    mutate(
        country = as_factor(country),
        religiousity = as_factor(religiousity),
        sex = as_factor(sex),
        marital_status = as_factor(marital_status),
        employment = as_factor(employment)
    )

# check conversion result
str(wvs_cleaned)

Rstudio may auto-suggest as.factor() from base R. You can use this as well, but as_factor() is preferred since we are using tidyverse approach.

Prep before analysis: convert to factors

tibble [6,403 × 14] (S3: tbl_df/tbl/data.frame)
 $ ID                    : num [1:6403] 1.24e+08 5.54e+08 7.02e+08 1.24e+08 1.24e+08 ...
 $ country               : Factor w/ 3 levels "CAN","NZL","SGP": 1 2 3 1 1 1 2 1 1 2 ...
 $ sex                   : Factor w/ 2 levels "Male","Female": 1 2 1 2 2 1 2 1 2 1 ...
 $ birthyear             : num [1:6403] 1927 1928 1930 1931 1930 ...
 $ age                   : num [1:6403] 93 91 90 89 89 89 89 88 88 88 ...
 $ life_satisfaction     : num [1:6403] 7 5 8 6 6 8 9 8 8 10 ...
 $ work_importance       : num [1:6403] 1 4 1 3 4 4 2 1 4 2 ...
 $ financial_satisfaction: num [1:6403] 7 5 9 9 8 8 10 10 5 10 ...
 $ religiousity          : Factor w/ 4 levels "Not a religious person",..: 1 2 2 1 1 2 2 2 2 2 ...
 $ political_scale       : num [1:6403] 5 3 3 2 5 5 8 10 5 5 ...
 $ marital_status        : Factor w/ 6 levels "Widowed","Single",..: 1 1 1 1 2 3 1 1 1 1 ...
 $ employment            : Factor w/ 8 levels "Part time","Retired/pensioned",..: 1 2 2 2 2 2 2 2 2 2 ...
 $ work_importance_r     : num [1:6403] 4 1 4 2 1 1 3 4 1 3 ...
 $ age_group             : chr [1:6403] "61+" "61+" "61+" "61+" ...

Prep before analysis: convert to factors (shortcut ver.)

If we have a lot of columns to convert, that might be troublesome to type! This is where across() can come in handy.

Let’s first define a character vector that contains the names of columns we plan to convert.

columns_to_convert <- c("country", "religiousity", "sex", "marital_status", "employment")

We will use this vector with mutate() and across(). We tell tidyverse to convert all of the columns with the help of all_of()

wvs_cleaned <- wvs_cleaned |> 
    mutate(across(all_of(columns_to_convert), as_factor))

# check conversion result
str(wvs_cleaned)

across() is used for applying the same function to multiple columns in a single mutate() or summarise() operation.

Prep before analysis: convert to factors (shortcut ver.)

tibble [6,403 × 14] (S3: tbl_df/tbl/data.frame)
 $ ID                    : num [1:6403] 1.24e+08 5.54e+08 7.02e+08 1.24e+08 1.24e+08 ...
 $ country               : Factor w/ 3 levels "CAN","NZL","SGP": 1 2 3 1 1 1 2 1 1 2 ...
 $ sex                   : Factor w/ 2 levels "Male","Female": 1 2 1 2 2 1 2 1 2 1 ...
 $ birthyear             : num [1:6403] 1927 1928 1930 1931 1930 ...
 $ age                   : num [1:6403] 93 91 90 89 89 89 89 88 88 88 ...
 $ life_satisfaction     : num [1:6403] 7 5 8 6 6 8 9 8 8 10 ...
 $ work_importance       : num [1:6403] 1 4 1 3 4 4 2 1 4 2 ...
 $ financial_satisfaction: num [1:6403] 7 5 9 9 8 8 10 10 5 10 ...
 $ religiousity          : Factor w/ 4 levels "Not a religious person",..: 1 2 2 1 1 2 2 2 2 2 ...
 $ political_scale       : num [1:6403] 5 3 3 2 5 5 8 10 5 5 ...
 $ marital_status        : Factor w/ 6 levels "Widowed","Single",..: 1 1 1 1 2 3 1 1 1 1 ...
 $ employment            : Factor w/ 8 levels "Part time","Retired/pensioned",..: 1 2 2 2 2 2 2 2 2 2 ...
 $ work_importance_r     : num [1:6403] 4 1 4 2 1 1 3 4 1 3 ...
 $ age_group             : chr [1:6403] "61+" "61+" "61+" "61+" ...

Analysis 1 - summary stats of life_satisfaction for each country

Generate summary statistics such as count (n), mean, median, and standard deviation of life_satisfaction grouped by country.

We can achieve this with group_by() and summarise()

wvs_data |>
    group_by(country, age_group) |>
    summarise(
        n = n(),
        mean_satisfaction = mean(life_satisfaction),
        median_satisfaction = median(life_satisfaction),
        sd_satisfaction = sd(life_satisfaction)
    ) |>
    arrange(country, desc(mean_satisfaction))

What if we want to save this into a CSV? What if we also want to group by country AND age_group?

Analysis 1 - summary stats of life_satisfaction for each country

# A tibble: 12 × 6
# Groups:   country [3]
   country age_group     n mean_satisfaction median_satisfaction sd_satisfaction
   <chr>   <chr>     <int>             <dbl>               <dbl>           <dbl>
 1 CAN     61+        1013              7.55                   8            1.59
 2 CAN     29-44      1232              6.99                   7            1.80
 3 CAN     45-60      1061              6.92                   7            1.87
 4 CAN     18-28       712              6.60                   7            1.86
 5 NZL     61+         292              7.96                   8            1.66
 6 NZL     45-60       222              7.38                   8            1.90
 7 NZL     29-44       119              7.32                   8            1.69
 8 NZL     18-28        27              6.70                   7            1.98
 9 SGP     61+         418              7.17                   7            1.84
10 SGP     29-44       511              7.09                   7            1.70
11 SGP     45-60       550              7.05                   7            1.83
12 SGP     18-28       246              6.85                   7            1.68

Analysis 2 - How many has below and average life_satisfaction?

Create a new column called satisfaction_group that indicate whether each respondent has higher or lower than average life_satisfaction

mean_satisfaction <- mean(wvs_data$life_satisfaction, na.rm = TRUE)

wvs_data |>
    mutate(satisfaction_group = if_else(
        life_satisfaction > mean_satisfaction, # the condition to evaluate
        "higher", # if condition is fulfilled, do this
        "lower" # otherwise, do this
    )) 

Analysis 2 - How many has below and average life_satisfaction?

# A tibble: 6,403 × 15
          ID country sex    birthyear   age life_satisfaction work_importance
       <dbl> <chr>   <chr>      <dbl> <dbl>             <dbl>           <dbl>
 1 124071434 CAN     Male        1927    93                 7               1
 2 554070524 NZL     Female      1928    91                 5               4
 3 702070265 SGP     Male        1930    90                 8               1
 4 124071321 CAN     Female      1931    89                 6               3
 5 124073566 CAN     Female      1930    89                 6               4
 6 124074356 CAN     Male        1931    89                 8               4
 7 554070216 NZL     Female      1930    89                 9               2
 8 124070295 CAN     Male        1932    88                 8               1
 9 124072193 CAN     Female      1958    88                 8               4
10 554070038 NZL     Male        1931    88                10               2
# ℹ 6,393 more rows
# ℹ 8 more variables: financial_satisfaction <dbl>, religiousity <chr>,
#   political_scale <dbl>, marital_status <chr>, employment <chr>,
#   work_importance_r <dbl>, age_group <chr>, satisfaction_group <chr>

Analysis 3 - What’s the average satisfaction scores for each country and age group?

Show the average satisfaction scores by country and age group in wide data format

wvs_data |>
    group_by(country, age_group) |>
    summarise(
        avg_satisfaction = mean(life_satisfaction, na.rm = TRUE),
    ) |>
    pivot_wider(
        names_from = age_group,
        values_from = avg_satisfaction
    ) 
# A tibble: 3 × 5
# Groups:   country [3]
  country `18-28` `29-44` `45-60` `61+`
  <chr>     <dbl>   <dbl>   <dbl> <dbl>
1 CAN        6.60    6.99    6.92  7.55
2 NZL        6.70    7.32    7.38  7.96
3 SGP        6.85    7.09    7.05  7.17

Long vs Wide Data

Long data:

  • Each row is a unique observation.

  • There is a separate column indicating the variable or type of measurements

  • This format is more “understandable” by R, more suitable for visualizations (which we’ll explore more next week!)

Wide data:

  • Each row is a value in variables.

  • Each column is a value in variables –> the more values you have, the “wider” is the data

  • This format is more intuitive for humans!

Long vs Wide Data: Examples

Long data:

Observations (Long)
country age_group count
CAN 18-28 712
CAN 29-44 1232
CAN 45-60 1061
CAN 61+ 1013
NZL 18-28 27
NZL 29-44 119
NZL 45-60 222
NZL 61+ 292
SGP 18-28 246
SGP 29-44 511
SGP 45-60 550
SGP 61+ 418

Wide data:

Observations (Wide)
country 18-28 29-44 45-60 61+
CAN 712 1232 1061 1013
NZL 27 119 222 292
SGP 246 511 550 418

Bonus: Deleting columns from dataframe

Let’s say I have this column called wrong_column that I want to remove:

wvs_data <- wvs_data |> mutate(wrong_column = "random values")
wvs_data |> select(country, wrong_column) |> print(n = 3)
# A tibble: 6,403 × 2
  country wrong_column 
  <chr>   <chr>        
1 CAN     random values
2 NZL     random values
3 SGP     random values
# ℹ 6,400 more rows

Remove the wrong column with subset -:

wvs_data <- wvs_data |> 
    select(-wrong_column)
# A tibble: 6,403 × 1
  country
  <chr>  
1 CAN    
2 NZL    
3 SGP    
# ℹ 6,400 more rows

Recap

  • Import and read data into RStudio: Load external data files (like CSV or Excel) into R using functions such as read_csv() to make the data available for analysis.

  • Data wrangling with dplyr and tidyr (part of the tidyverse package): Use tidyverse functions to tidy and reshape datasets and perform tasks like selecting, filtering, and summarizing data.

  • Remove all rows with missing values (NA) and check for duplicates: Delete rows containing missing data using na.omit() and identify or remove duplicate rows with functions like distinct().

  • Select only the relevant columns: Use select() to keep only the columns needed for the analysis, focusing on important variables.

  • Filter data based on criterion and arrange data: Apply filter() to keep rows meeting specific conditions and use arrange() to sort data by one or more columns.

  • Change factor level values: Modify the categories of a factor using factor() or recode() to rename or reorder levels for clearer analysis.

  • Create new columns: Generate or modify columns using mutate(), often by transforming or combining existing columns.

End of Session 2!

Next session: Descriptive statistics and data visualization with ggplot2 package - we’ll create visualizations to explore patterns in life satisfaction, values, and demographics across countries!

To try at home: Exercise 1

Now that we have a new file, load this new wvs_cleaned_v1.csv into a new dataframe called wvs_cleaned. Filter to respondents from Singapore who are currently employed full time. Show only the Respondent ID, country, the employment status, and age. Use glimpse() or print() to check the result! (you can chain these functions at the end)

Step 1: Load the new file

Show answer
library(tidyverse)
wvs_cleaned <- read_csv("data-output/wvs_cleaned_v1.csv")

Step 2: Do the filtering and selecting, and then show result

Show answer
library(tidyverse)
wvs_cleaned <- read_csv("data-output/wvs_cleaned_v1.csv")

wvs_cleaned |> 
    filter(country == "SGP" & employment == "Full time") |> 
    select(ID, country, employment, age) |> 
    glimpse()
Rows: 857
Columns: 4
$ ID         <dbl> 702071000, 702070773, 702070363, 702070746, 702070026, 7020…
$ country    <chr> "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SG…
$ employment <chr> "Full time", "Full time", "Full time", "Full time", "Full t…
$ age        <dbl> 81, 79, 73, 70, 69, 68, 68, 68, 68, 68, 68, 67, 67, 67, 67,…

To try at home: Exercise 2

Generate a summary stats of age grouped by country and sex. The summary stats should include mean, median, max, min, std, and n (number of observations). It should look something like this:

Show answer
wvs_data |> 
    group_by(country, sex) |> 
    summarise(observation = n(), 
              mean_age = mean(age, na.rm = TRUE),
              median_age = median(age, na.rm = TRUE), 
              oldest = max(age, na.rm = TRUE),
              youngest = min(age, na.rm = TRUE),
              std_dev = sd(age, na.rm = TRUE))

To try at home: Exercise 2

# A tibble: 6 × 8
# Groups:   country [3]
  country sex    observation mean_age median_age oldest youngest std_dev
  <chr>   <chr>        <int>    <dbl>      <dbl>  <dbl>    <dbl>   <dbl>
1 CAN     Female        1959     44.2         42     89       18    17.0
2 CAN     Male          2059     48.8         49     93       18    16.4
3 NZL     Female         354     56.6         58     91       19    15.5
4 NZL     Male           306     57.3         59     88       20    15.4
5 SGP     Female         919     47.7         47     87       21    15.3
6 SGP     Male           806     47.9         48     90       21    16.2

Appendix

Advanced tidyr and dplyr features you may need in the future.

Joining DataFrames by column

Let’s assume we have two DataFrames customers and orders

customers dataframe:

(Notice that the ID is 1, 2, and 3)

  id  name
1  1 Alice
2  2   Bob
3  3 Carol

orders dataframe

(Notice that the ID is 1, 2, and 4)

  id amount
1  1    100
2  2    200
3  4    150

left_join() to keep all rows from the left table i.e. customers

customers |> left_join(orders, by = "id")
  id  name amount
1  1 Alice    100
2  2   Bob    200
3  3 Carol     NA

right_join() to keep all rows from the right table i.e. orders

customers |> right_join(orders, by = "id")
  id  name amount
1  1 Alice    100
2  2   Bob    200
3  4  <NA>    150

inner_join() to keep only matching rows

customers |> inner_join(orders, by = "id")
  id  name amount
1  1 Alice    100
2  2   Bob    200

full_join() to keep all rows from both tables

customers |> full_join(orders, by = "id")
  id  name amount
1  1 Alice    100
2  2   Bob    200
3  3 Carol     NA
4  4  <NA>    150

bind_rows() to stack rows from two different dataframes

Let’s say we have the following dataframes

print(df1)
   name age
1 Alice  25
print(df2)
  name age
1  Bob  30

Stack them together:

df1 |> bind_rows(df2)
   name age
1 Alice  25
2   Bob  30

Splitting and combining cells

Let’s say we have the following dataframe

  first  last         full_address
1  John   Doe 123 Main St, NYC, NY
2  Jane Smith  456 Oak Ave, LA, CA

Combine first and last name columns into one

df <- df |> unite("full_name", "first", "last", sep = "_")
print(df)
   full_name         full_address
1   John_Doe 123 Main St, NYC, NY
2 Jane_Smith  456 Oak Ave, LA, CA

Split the full name column into multiple (by delimiter)

df <- df |> separate_wider_delim("full_name", delim = "_", names = c("first", "last"))
print(df)
# A tibble: 2 × 3
  first last  full_address        
  <chr> <chr> <chr>               
1 John  Doe   123 Main St, NYC, NY
2 Jane  Smith 456 Oak Ave, LA, CA 

Split one column into multiple rows (long format)

df |> separate_longer_delim(full_address, delim = ", ")
# A tibble: 6 × 3
  first last  full_address
  <chr> <chr> <chr>       
1 John  Doe   123 Main St 
2 John  Doe   NYC         
3 John  Doe   NY          
4 Jane  Smith 456 Oak Ave 
5 Jane  Smith LA          
6 Jane  Smith CA