dplyr
and tidyr
(part of the tidyverse
package)Environment
panel is empty (click on broom icon to clean it up)Console
and Plots
too.library(tidyverse)
and read_csv
portion in the previous sessionUse read_csv
from readr
package (part of tidyverse
) to load our data into a dataframe
Packages from tidyverse
. (click here to go to the tidyverse homepage)
Posit have created cheatsheets here! (you can have this open in another tab for reference for this session!)
dplyr cheatsheet | pdf version (I personally prefer this PDF version since it’s more visual)
The data is from a national survey conducted in April and May of 1988 by FLACSO/Chile, capturing voting intentions for the 1988 Chilean plebiscite. The dataset contains information about respondents’ demographic characteristics and their voting intentions. This data can also be found from carData
package! (more on this later)
Key variables in the dataset:
Variable | Description |
---|---|
region |
Region of voters: C (Central), M (Metropolitan Santiago area), N (North), S (South), SA (city of Santiago) |
population |
Population size of respondent’s community |
sex |
Sex of voters: F (female), M (male) |
age |
Age in years |
education |
Education level of voters: P (Primary), PS (Post-secondary), S (Secondary) |
income |
Monthly income, in Pesos |
statusquo |
Scale of support for the status-quo in numerical value |
vote |
Voter’s decision: A (will abstain), N (will vote no), U (undecided), Y (will vote yes) |
It’s good practice to do some preliminary checks on your data to get a better sense of it!
A few things that you can do:
Check for duplicates
Check for missing values
Check on overall distributions of the categorical data
Plot the distribution of the numerical/continuous data
Scenario: We are a junior Research Assistant (RA) in a research team currently studying about the 1988 Chilean plebiscite. The team’s primary goal is to identify any discernible and interesting patterns in the voter demographic data.
As a start, we have been asked to do the following data cleaning and processing tasks:
Remove all rows with empty values (NA)
Separate the demographic information (region, sex, age, education) into a separate CSV called chile-demographics.csv
Retrieve only respondents aged 30 or older, arrange it from oldest to youngest, and save it into a separate CSV called chile-30plus.csv
Make the character columns such as region
, sex
, education
, and vote
to be in uppercase.
Convert all the categorical columns such as region
, sex
, education
, and vote
to Factor. Make sure education
is ordered.
Create a new column called age_group
that categorizes age into groups: “18-29”, “30-44”, “45-59”, “60+”.
Create a new column called high_income
that is TRUE if income is above the median, FALSE otherwise.
Create a new column called support_level
that categorizes statusquo
into “Unsupportive” (<= 0) and “Supportive” (> 0). Save all of these changes you made from step 3 onwards into a new CSV called chile_voting_processed.csv
Generate summary stats of income
grouped by region
and education
. The summary stats should include mean, median, max, min, std, and n (number of observations).
Reshape the data to have region
as rows and vote
options as columns, with the values being the count of votes for each option in each region.
A strategy I’d like to recommend: briefly read over the
dplyr
+tidyr
documentation, either the PDF or HTML version, and have them open on a separate tab so that you can refer to it quickly.
Remove all rows with empty values (NA)
The number of observations after all rows containing NAs are removed:
The pipe operator (|>) allows us to chain multiple operations without creating intermediate dataframes.
Super handy when we perform several data wrangling tasks using tidyverse in sequence.
Helps with readability, especially for complex operations.
Keyboard shortcut: Ctrl
+Shift
+M
on Windows, Cmd
+Shift
+M
on Mac
Notice that we have to create a “temp” dataframes called age_df
and age_df_desc
in this method.
Separate the demographic information (region, sex, age, education) into a separate CSV called chile-demographics.csv
The first few rows in the CSV:
# A tibble: 2,431 × 4
region sex age education
<chr> <chr> <dbl> <chr>
1 N M 65 P
2 N M 29 PS
3 N F 38 P
# ℹ 2,428 more rows
Retrieve only respondents aged 30 or older, arrange it from oldest to youngest, and save it into a separate CSV called chile-30plus.csv
The first few rows in the resulting CSV:
# A tibble: 1,581 × 8
region population sex age education income statusquo vote
<chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 N 125000 F 70 P 15000 1.49 Y
2 N 250000 F 70 P 75000 1.40 Y
3 N 250000 M 70 P 7500 -0.366 N
# ℹ 1,578 more rows
Time: 5 minutes!
Retrieve only respondents from the Metropolitan Santiago area (M
), with Secondary education (S
), and who are undecided (U
) about their vote. Keep only their region
, education
, vote
, and income
columns in a new dataframe called santiago_undecided
in descending order of income
.
The first few rows of santiago_undecided
:
# A tibble: 5 × 4
region education vote income
<chr> <chr> <chr> <dbl>
1 M S U 35000
2 M S U 15000
3 M S U 15000
# ℹ 2 more rows
Make all character columns such as region
, sex
, education
, and vote
to be in uppercase.
The end result:
# A tibble: 2,431 × 4
region sex education vote
<chr> <chr> <chr> <chr>
1 N M P Y
2 N M PS N
3 N F P Y
# ℹ 2,428 more rows
Convert all the categorical columns such as region
, sex
, education
, and vote
to Factor. Make sure education
is ordered.
The structure of education column after the change:
tibble [2,431 × 1] (S3: tbl_df/tbl/data.frame)
$ education: Ord.factor w/ 3 levels "P"<"S"<"PS": 1 3 1 1 2 1 3 2 1 1 ...
Add a new column to chile_data
called age_group
that categorizes age into groups: “18-29”, “30-44”, “45-59”, “60+”.
The new age_group column should look something like this:
# A tibble: 2,431 × 2
age age_group
<dbl> <chr>
1 65 60+
2 29 18-29
3 38 30-44
4 49 45-59
# ℹ 2,427 more rows
Create a new column called high_income
that is TRUE if income is above the median, FALSE otherwise.
hint: you need to calculate the median income first.
The original income column and the new high_income column:
# A tibble: 3 × 2
income high_income
<dbl> <lgl>
1 15000 FALSE
2 15000 FALSE
3 15000 FALSE
Create a new column called support_level
that categorizes statusquo
into “Unsupportive” (<= 0) and “Supportive” (> 0).
Save all of these changes you made from step 3 onwards into a new CSV called chile_voting_processed.csv
The new support_level column:
# A tibble: 2,431 × 2
statusquo support_level
<dbl> <chr>
1 1.01 Supportive
2 -1.30 Unsupportive
3 1.23 Supportive
# ℹ 2,428 more rows
Generate summary stats of income
grouped by region
and education
. The summary stats should include mean, median, max, min, std, and n (number of observations).
# A tibble: 15 × 8
# Groups: region [5]
region education observation mean_income median_income highest lowest std_dev
<fct> <ord> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C P 238 16450. 15000 200000 2500 18483.
2 C S 234 35887. 35000 200000 2500 36540.
3 C PS 76 64375 35000 200000 7500 50705.
4 M P 43 21163. 15000 125000 2500 21967.
5 M S 26 29038. 25000 75000 7500 19787.
6 M PS 6 60000 55000 125000 15000 39875.
7 N P 133 19173. 15000 75000 2500 15823.
8 N S 111 31937. 35000 125000 2500 26351.
9 N PS 61 53934. 35000 200000 7500 41772.
10 S P 310 15677. 7500 200000 2500 19074.
11 S S 250 29800 15000 200000 2500 30450.
12 S PS 95 56658. 35000 200000 2500 52879.
13 SA P 278 19191. 15000 125000 2500 15796.
14 SA S 389 41292. 35000 200000 2500 41052.
15 SA PS 181 82859. 75000 200000 7500 66268.
Reshape the data to have region
as rows and vote
options as columns, with the values being the count of votes for each option in each region.
To achieve this clean table look, we can use gt
package or knitr
package!
Long data:
Each row is a unique observation.
There is a separate column indicating the variable or type of measurements
This format is more “understandable” by R, more suitable for visualizations.
Wide data:
Each row is a unique observation.
Each column is a variable –> the more variables you have, the “wider” is the data
This format is more intuitive for humans!
Long data:
# A tibble: 20 × 3
region vote count
<fct> <fct> <int>
1 C A 44
2 C N 209
3 C U 130
4 C Y 165
5 M A 2
6 M N 17
7 M U 19
8 M Y 37
9 N A 29
10 N N 98
11 N U 46
12 N Y 132
13 S A 39
14 S N 209
15 S U 140
16 S Y 267
17 SA A 63
18 SA N 334
19 SA U 216
20 SA Y 235
Wide data:
# A tibble: 5 × 5
region A N U Y
<fct> <int> <int> <int> <int>
1 C 44 209 130 165
2 M 2 17 19 37
3 N 29 98 46 132
4 S 39 209 140 267
5 SA 63 334 216 235
Time: 5 minutes
Generate a summary stats of age
grouped by region
and sex
. The summary stats should include mean, median, max, min, std, and n (number of observations). It should look something like this:
# A tibble: 10 × 8
# Groups: region [5]
region sex observation mean_age median_age oldest youngest std_dev
<fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C F 269 37.7 36 70 18 14.3
2 C M 279 39.1 36 70 18 15.8
3 M F 43 37.0 36 68 18 14.1
4 M M 32 36 33.5 65 18 13.5
5 N F 159 38.7 36 70 18 13.8
6 N M 146 38.7 35 70 18 16.6
7 S F 330 36.7 35 70 18 13.9
8 S M 325 38.0 36 70 18 14.9
9 SA F 449 38.9 38 70 18 14.0
10 SA M 399 39.0 37 70 18 15.1
Let’s say I have this column called wrong_column
that I want to remove:
-
:# A tibble: 2,431 × 1
region
<fct>
1 N
2 N
3 N
# ℹ 2,428 more rows
Next session: Descriptive stats and data visualization with ggplot2
package.