dplyr
and tidyr
(part of the tidyverse
package)File
> Recent Projects...
> Select the our project from last week.Environment
panel is empty (click on broom icon to clean it up).Console
and Plots
too.library(tidyverse)
and read_csv
portion in the previous session (the code is also on the next slide if you missed last week’s session)Use read_csv
from readr
package (part of tidyverse
) to load our World Values Survey data. More information about the data can be found under the Dataset
tab in the course website.
Packages from tidyverse
. (click here to go to the tidyverse homepage)
Posit have created cheatsheets here! (you can have this open in another tab for reference for this session!)
dplyr cheatsheet | pdf version (I personally prefer this PDF version since it’s more visual)
Most of the time, these are the ones that you will use quite often:
drop_na()
- remove rows with null values
select()
- to select column(s) from a dataframe
filter()
- to filter rows based on criteria
mutate()
- to compute new columns or edit existing ones
if_else()
and case_when()
- to be used with mutate when we want to compute/edit columns based on multiple criteria
group_by()
and summarize()
- group data and summarize each group
Scenario: We are research assistants analyzing patterns in values and satisfaction across different countries and demographic groups.
Our team has been assigned to explore and get insights on how specific factors (employment, work importance, marital status, political alignment, financial satisfaction, and religiosity) may relate to life satisfaction among different generations (Gen Z, Millennials, Gen X, and Baby Boomer), including how they may differ between the 3 countries.
To ensure the analysis quality, we were instructed to discard incomplete data.
We can break down the tasks as such:
Remove all rows with missing values (NA)
Check for duplicates
Select only the relevant columns: demographic columns, work_importance
, life_satisfaction
, financial_satisfaction
, religiousity
, political_scale
, marital_status
, and employment
.
Filter for respondents aged 18 or older. Optionally, we can then arrange the dataset by age (oldest to youngest)
Reverse-code importance
variables so that higher numbers consistently represent higher levels of the measured construct. In other words, 1 = Not at all important and 4 = Very important.
Create age groups for each generation: “18-28”, “29-44”, “45-60”, “61+”
Once we did all of the wrangling above, we can save this “wrangled” version into another CSV
A strategy I’d like to recommend: briefly read over the
dplyr
+tidyr
documentation, either the PDF or HTML version, and have them open on a separate tab so that you can refer to it quickly.
Remove all rows with empty values (NA) with drop_na()
The number of observations after removing NAs:
The pipe operator (|>) allows us to chain multiple operations without creating intermediate / temporary dataframes.
Super handy when we perform several data wrangling tasks using tidyverse in sequence.
Helps with readability, especially for complex operations.
Keyboard shortcut: Ctrl
+Shift
+M
on Windows, Cmd
+Shift
+M
on Mac
Notice that we have to create a “temp” dataframes called wvs_data_clean
in this method.
Check for duplicates with distinct()
(Our data has no duplicates, but this is still a good practice to do, especially if we were combining data from multipe sources)
Select only the relevant columns: demographic columns, life_satisfaction
, work_importance
, financial_satisfaction
, religiousity
, political_scale
, marital_status
, and employment
.
We can achieve this with select()
!
Preview of the filtered data:
Rows: 6,403
Columns: 12
$ ID <dbl> 124070003, 124070004, 124070005, 124070006, 124…
$ country <chr> "CAN", "CAN", "CAN", "CAN", "CAN", "CAN", "CAN"…
$ sex <chr> "Female", "Male", "Female", "Male", "Male", "Ma…
$ birthyear <dbl> 1944, 1951, 1984, 1975, 1988, 1982, 1981, 1985,…
$ age <dbl> 76, 69, 35, 45, 32, 38, 38, 34, 65, 31, 27, 33,…
$ life_satisfaction <dbl> 5, 8, 9, 7, 1, 7, 9, 6, 7, 6, 7, 10, 8, 10, 6, …
$ work_importance <dbl> 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 4, 1, 3, 3,…
$ financial_satisfaction <dbl> 8, 2, 8, 8, 9, 8, 9, 5, 6, 9, 5, 7, 9, 9, 7, 8,…
$ religiousity <chr> "A religious person", "A religious person", "An…
$ political_scale <dbl> 10, 5, 2, 8, 6, 6, 7, 5, 6, 4, 5, 4, 1, 7, 2, 8…
$ marital_status <chr> "Separated", "Married", "Living together as mar…
$ employment <chr> "Retired/pensioned", "Retired/pensioned", "Part…
Filter for respondents aged 18 or older. Optionally, we can then arrange the dataset by age (oldest to youngest)
Checking the structure:
# A tibble: 6 × 12
ID country sex birthyear age life_satisfaction work_importance
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 124071434 CAN Male 1927 93 7 1
2 554070524 NZL Female 1928 91 5 4
3 702070265 SGP Male 1930 90 8 1
4 124071321 CAN Female 1931 89 6 3
5 124073566 CAN Female 1930 89 6 4
6 124074356 CAN Male 1931 89 8 4
# ℹ 5 more variables: financial_satisfaction <dbl>, religiousity <chr>,
# political_scale <dbl>, marital_status <chr>, employment <chr>
Reverse-code work_importance
variables so that higher numbers consistently represent higher levels of the measured construct. In other words, 1 = Not at all important and 4 = Very important.
We can achieve this with mutate()
!
Preview of age groups:
# A tibble: 6,403 × 2
work_importance_r work_importance
<dbl> <dbl>
1 4 1
2 1 4
3 4 1
4 2 3
5 1 4
# ℹ 6,398 more rows
Create age groups for each generation: “18-28”, “29-44”, “45-60”, “61+”
Preview of age groups:
# A tibble: 6,403 × 2
age age_group
<dbl> <chr>
1 93 61+
2 91 61+
3 90 61+
4 89 61+
# ℹ 6,399 more rows
We have done some cleaning! Let’s save this cleaned data into a separate CSV file called “wvs_cleaned_v1.csv”
Check the data output folder to make sure the CSV is created!
Now that we have a new file, load this new wvs_cleaned_v1.csv
into a new dataframe called wvs_cleaned
. Filter to respondents from Singapore who are currently employed full time. Show only the Respondent ID, country, the employment status, and age. Use glimpse()
or print()
to check the result! (you can chain these functions at the end)
Step 1: Load the new file
Step 2: Do the filtering and selecting, and then show result
Rows: 857
Columns: 4
$ ID <dbl> 702071000, 702070773, 702070363, 702070746, 702070026, 7020…
$ country <chr> "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SGP", "SG…
$ employment <chr> "Full time", "Full time", "Full time", "Full time", "Full t…
$ age <dbl> 81, 79, 73, 70, 69, 68, 68, 68, 68, 68, 68, 67, 67, 67, 67,…
Once we are done with the wrangling part, we can proceed with simple descriptive analysis!
Before we proceed further, convert the appropriate categorical variables (country, religiousity, sex, marital_status, employment) to Factor
Generate summary statistics of life_satisfaction grouped by country
Create a new column called satisfaction_group
that indicate whether each respondent has higher or lower than average life_satisfaction
Reshape the data to show average satisfaction scores by country and age group
Let’s use the wvs_cleaned
dataframe for this task.
Convert the appropriate categorical variables (country, religiousity, sex, marital_status, employment) to Factor.
We can do this with mutate()
and as_factor()
from forcats, another sub-package within tidyverse.
Rstudio may auto-suggest as.factor()
from base R. You can use this as well, but as_factor()
is preferred since we are using tidyverse approach.
tibble [6,403 × 14] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:6403] 1.24e+08 5.54e+08 7.02e+08 1.24e+08 1.24e+08 ...
$ country : Factor w/ 3 levels "CAN","NZL","SGP": 1 2 3 1 1 1 2 1 1 2 ...
$ sex : Factor w/ 2 levels "Male","Female": 1 2 1 2 2 1 2 1 2 1 ...
$ birthyear : num [1:6403] 1927 1928 1930 1931 1930 ...
$ age : num [1:6403] 93 91 90 89 89 89 89 88 88 88 ...
$ life_satisfaction : num [1:6403] 7 5 8 6 6 8 9 8 8 10 ...
$ work_importance : num [1:6403] 1 4 1 3 4 4 2 1 4 2 ...
$ financial_satisfaction: num [1:6403] 7 5 9 9 8 8 10 10 5 10 ...
$ religiousity : Factor w/ 4 levels "Not a religious person",..: 1 2 2 1 1 2 2 2 2 2 ...
$ political_scale : num [1:6403] 5 3 3 2 5 5 8 10 5 5 ...
$ marital_status : Factor w/ 6 levels "Widowed","Single",..: 1 1 1 1 2 3 1 1 1 1 ...
$ employment : Factor w/ 8 levels "Part time","Retired/pensioned",..: 1 2 2 2 2 2 2 2 2 2 ...
$ work_importance_r : num [1:6403] 4 1 4 2 1 1 3 4 1 3 ...
$ age_group : chr [1:6403] "61+" "61+" "61+" "61+" ...
If we have a lot of columns to convert, that might be troublesome to type! This is where across()
can come in handy.
Let’s first define a character vector that contains the names of columns we plan to convert.
We will use this vector with mutate()
and across()
. We tell tidyverse to convert all of the columns with the help of all_of()
across()
is used for applying the same function to multiple columns in a single mutate()
or summarise()
operation.
tibble [6,403 × 14] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:6403] 1.24e+08 5.54e+08 7.02e+08 1.24e+08 1.24e+08 ...
$ country : Factor w/ 3 levels "CAN","NZL","SGP": 1 2 3 1 1 1 2 1 1 2 ...
$ sex : Factor w/ 2 levels "Male","Female": 1 2 1 2 2 1 2 1 2 1 ...
$ birthyear : num [1:6403] 1927 1928 1930 1931 1930 ...
$ age : num [1:6403] 93 91 90 89 89 89 89 88 88 88 ...
$ life_satisfaction : num [1:6403] 7 5 8 6 6 8 9 8 8 10 ...
$ work_importance : num [1:6403] 1 4 1 3 4 4 2 1 4 2 ...
$ financial_satisfaction: num [1:6403] 7 5 9 9 8 8 10 10 5 10 ...
$ religiousity : Factor w/ 4 levels "Not a religious person",..: 1 2 2 1 1 2 2 2 2 2 ...
$ political_scale : num [1:6403] 5 3 3 2 5 5 8 10 5 5 ...
$ marital_status : Factor w/ 6 levels "Widowed","Single",..: 1 1 1 1 2 3 1 1 1 1 ...
$ employment : Factor w/ 8 levels "Part time","Retired/pensioned",..: 1 2 2 2 2 2 2 2 2 2 ...
$ work_importance_r : num [1:6403] 4 1 4 2 1 1 3 4 1 3 ...
$ age_group : chr [1:6403] "61+" "61+" "61+" "61+" ...
Generate summary statistics such as count (n), mean, median, and standard deviation of life_satisfaction grouped by country.
We can achieve this with group_by()
and summarise()
What if we want to save this into a CSV? What if we also want to group by country AND age_group?
# A tibble: 12 × 6
# Groups: country [3]
country age_group n mean_satisfaction median_satisfaction sd_satisfaction
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 CAN 61+ 1013 7.55 8 1.59
2 CAN 29-44 1232 6.99 7 1.80
3 CAN 45-60 1061 6.92 7 1.87
4 CAN 18-28 712 6.60 7 1.86
5 NZL 61+ 292 7.96 8 1.66
6 NZL 45-60 222 7.38 8 1.90
7 NZL 29-44 119 7.32 8 1.69
8 NZL 18-28 27 6.70 7 1.98
9 SGP 61+ 418 7.17 7 1.84
10 SGP 29-44 511 7.09 7 1.70
11 SGP 45-60 550 7.05 7 1.83
12 SGP 18-28 246 6.85 7 1.68
Create a new column called satisfaction_group
that indicate whether each respondent has higher or lower than average life_satisfaction
# A tibble: 6,403 × 15
ID country sex birthyear age life_satisfaction work_importance
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 124071434 CAN Male 1927 93 7 1
2 554070524 NZL Female 1928 91 5 4
3 702070265 SGP Male 1930 90 8 1
4 124071321 CAN Female 1931 89 6 3
5 124073566 CAN Female 1930 89 6 4
6 124074356 CAN Male 1931 89 8 4
7 554070216 NZL Female 1930 89 9 2
8 124070295 CAN Male 1932 88 8 1
9 124072193 CAN Female 1958 88 8 4
10 554070038 NZL Male 1931 88 10 2
# ℹ 6,393 more rows
# ℹ 8 more variables: financial_satisfaction <dbl>, religiousity <chr>,
# political_scale <dbl>, marital_status <chr>, employment <chr>,
# work_importance_r <dbl>, age_group <chr>, satisfaction_group <chr>
Show the average satisfaction scores by country and age group in wide data format
wvs_data |>
group_by(country, age_group) |>
summarise(
avg_satisfaction = mean(life_satisfaction, na.rm = TRUE),
) |>
pivot_wider(
names_from = age_group,
values_from = avg_satisfaction
)
# A tibble: 3 × 5
# Groups: country [3]
country `18-28` `29-44` `45-60` `61+`
<chr> <dbl> <dbl> <dbl> <dbl>
1 CAN 6.60 6.99 6.92 7.55
2 NZL 6.70 7.32 7.38 7.96
3 SGP 6.85 7.09 7.05 7.17
Long data:
Each row is a unique observation.
There is a separate column indicating the variable or type of measurements
This format is more “understandable” by R, more suitable for visualizations.
Wide data:
Each row is a value in variables.
Each column is a value in variables –> the more values you have, the “wider” is the data
The cells represent theThis format is more intuitive for humans!
Long data:
country | age_group | count |
---|---|---|
CAN | 18-28 | 712 |
CAN | 29-44 | 1232 |
CAN | 45-60 | 1061 |
CAN | 61+ | 1013 |
NZL | 18-28 | 27 |
NZL | 29-44 | 119 |
NZL | 45-60 | 222 |
NZL | 61+ | 292 |
SGP | 18-28 | 246 |
SGP | 29-44 | 511 |
SGP | 45-60 | 550 |
SGP | 61+ | 418 |
Wide data:
country | 18-28 | 29-44 | 45-60 | 61+ |
---|---|---|---|---|
CAN | 712 | 1232 | 1061 | 1013 |
NZL | 27 | 119 | 222 | 292 |
SGP | 246 | 511 | 550 | 418 |
Time: 5 minutes
Generate a summary stats of age
grouped by country
and sex
. The summary stats should include mean, median, max, min, std, and n (number of observations). It should look something like this:
# A tibble: 6 × 8
# Groups: country [3]
country sex observation mean_age median_age oldest youngest std_dev
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CAN Female 1959 44.2 42 89 18 17.0
2 CAN Male 2059 48.8 49 93 18 16.4
3 NZL Female 354 56.6 58 91 19 15.5
4 NZL Male 306 57.3 59 88 20 15.4
5 SGP Female 919 47.7 47 87 21 15.3
6 SGP Male 806 47.9 48 90 21 16.2
Let’s say I have this column called wrong_column
that I want to remove:
-
:# A tibble: 6,403 × 1
country
<chr>
1 CAN
2 NZL
3 SGP
# ℹ 6,400 more rows
Next session: Descriptive statistics and data visualization with ggplot2
package - we’ll create visualizations to explore patterns in life satisfaction, values, and demographics across countries!