library(tidyverse)
library(car)
Post-workshop exercises
Optional exercises to test your knowledge and reinforce your learning
Prep
Have the dplyr, tidyr, and ggplot cheatsheets ready open and ready.
Load packages
the code below will load the packages you need:
Load datasets
The code below will load and print the first few rows of the Duncan
dataset. To find out more about this dataset, type ?Duncan
in your RStudio console to open the vignette.
<- as_tibble(Duncan)
duncan print(duncan)
# A tibble: 45 × 4
type income education prestige
<fct> <int> <int> <int>
1 prof 62 86 82
2 prof 72 76 83
3 prof 75 92 90
4 prof 55 90 76
5 prof 64 86 90
6 prof 21 84 87
7 prof 64 93 93
8 prof 80 100 90
9 wc 67 87 52
10 prof 72 86 88
# ℹ 35 more rows
The code below will load and print the first few rows of the WVS
dataset. To find out more about this dataset, type ?WVS
in your RStudio console to open the vignette.
<- as_tibble(WVS)
wvs print(wvs)
# A tibble: 5,381 × 6
poverty religion degree country age gender
<ord> <fct> <fct> <fct> <int> <fct>
1 Too Little yes no USA 44 male
2 About Right yes no USA 40 female
3 Too Little yes no USA 36 female
4 Too Much yes yes USA 25 female
5 Too Little yes yes USA 39 male
6 About Right yes no USA 80 female
7 Too Much yes no USA 48 female
8 Too Little yes no USA 32 male
9 Too Little yes no USA 74 female
10 Too Little yes no USA 30 male
# ℹ 5,371 more rows
Question 1
Using wvs
dataset, filter the age
column to include values that is more than 29. Then, select the columns age
, degree
, religion
and poverty
. Save the result to a new dataframe called wvs_filtered
.
Show Answer
<- wvs |>
wvs_filtered filter(age > 29) |>
select(age, degree, religion, poverty)
print(wvs_filtered)
# A tibble: 4,228 × 4
age degree religion poverty
<int> <fct> <fct> <ord>
1 44 no yes Too Little
2 40 no yes About Right
3 36 no yes Too Little
4 39 yes yes Too Little
5 80 no yes About Right
6 48 no yes Too Much
7 32 no yes Too Little
8 74 no yes Too Little
9 30 no yes Too Little
10 32 yes yes Too Little
# ℹ 4,218 more rows
Question 2
Update the wvs
dataset by creating a dummy-coded version of the gender
variable, where male = 0 and female = 1. Save the results in a new column named gender_coded
.
Show Answer
<- wvs |>
wvs mutate(gender_coded = if_else(gender == "male", 0, 1))
print(wvs)
# A tibble: 5,381 × 7
poverty religion degree country age gender gender_coded
<ord> <fct> <fct> <fct> <int> <fct> <dbl>
1 Too Little yes no USA 44 male 0
2 About Right yes no USA 40 female 1
3 Too Little yes no USA 36 female 1
4 Too Much yes yes USA 25 female 1
5 Too Little yes yes USA 39 male 0
6 About Right yes no USA 80 female 1
7 Too Much yes no USA 48 female 1
8 Too Little yes no USA 32 male 0
9 Too Little yes no USA 74 female 1
10 Too Little yes no USA 30 male 0
# ℹ 5,371 more rows
Question 3
Create a summary of the wvs
dataset showing the number of observations for each country.
Show Answer
|>
wvs count(country)
# A tibble: 4 × 2
country n
<fct> <int>
1 Australia 1874
2 Norway 1127
3 Sweden 1003
4 USA 1377
Question 4
Using wvs
dataset, calculate the average age
for each combination of gender and degree status.
Show Answer
|>
wvs group_by(gender, degree) |>
summarise(avg_age = mean(age, na.rm = TRUE))
`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
# A tibble: 4 × 3
# Groups: gender [2]
gender degree avg_age
<fct> <fct> <dbl>
1 female no 45.6
2 female yes 41.0
3 male no 46.0
4 male yes 43.3
Question 5
Using the wvs
dataset, create a summary statistics for each country and religion (yes/no). Calculate the age
mean, age
median, number of observation.
Show Answer
|>
wvs group_by(country, religion) %>%
summarise(
avg_age = mean(age, na.rm = TRUE),
median_age = median(age, na.rm = TRUE),
n_observations = n()
)
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
# A tibble: 8 × 5
# Groups: country [4]
country religion avg_age median_age n_observations
<fct> <fct> <dbl> <dbl> <int>
1 Australia no 39.9 37 375
2 Australia yes 45.7 43 1499
3 Norway no 40.6 38 109
4 Norway yes 43.6 42 1018
5 Sweden no 43.7 42 15
6 Sweden yes 43.9 43 988
7 USA no 44.6 42 287
8 USA yes 48.9 46 1090
Question 6
Using the wvs
dataset, select the 10 oldest respondents from the USA. (hint: arrange()
and slice()
)
Show Answer
|>
wvs filter(country == "USA", age > 50) |>
arrange(desc(age)) |>
slice(1:10)
# A tibble: 10 × 7
poverty religion degree country age gender gender_coded
<ord> <fct> <fct> <fct> <int> <fct> <dbl>
1 Too Much no no USA 91 male 0
2 Too Little yes no USA 91 male 0
3 Too Much yes no USA 88 female 1
4 About Right yes no USA 88 male 0
5 Too Little yes yes USA 87 female 1
6 Too Much yes no USA 87 female 1
7 Too Little yes no USA 87 male 0
8 About Right yes no USA 87 male 0
9 About Right yes no USA 86 female 1
10 Too Much yes no USA 86 female 1
Question 7
Update the wvs
dataset by adding a new column called age_category
that categorises each respondents based on the following criteria:
- below 18 = “youth” category
- between 18 to 34 = “young adult” category
- between 35 to 49 = “adult” category
- between 50 to 69 = “senior” category
- more than 70 = “elderly” category
Show Answer
<- wvs |>
wvs mutate(age_category = case_when(
< 18 ~ "youth",
age >= 18 & age < 35 ~ "young adult",
age >= 35 & age < 50 ~ "adult",
age >= 50 & age < 70 ~ "senior",
age >= 70 ~ "elderly"
age
))
print(wvs)
# A tibble: 5,381 × 8
poverty religion degree country age gender gender_coded age_category
<ord> <fct> <fct> <fct> <int> <fct> <dbl> <chr>
1 Too Little yes no USA 44 male 0 adult
2 About Right yes no USA 40 female 1 adult
3 Too Little yes no USA 36 female 1 adult
4 Too Much yes yes USA 25 female 1 young adult
5 Too Little yes yes USA 39 male 0 adult
6 About Right yes no USA 80 female 1 elderly
7 Too Much yes no USA 48 female 1 adult
8 Too Little yes no USA 32 male 0 young adult
9 Too Little yes no USA 74 female 1 elderly
10 Too Little yes no USA 30 male 0 young adult
# ℹ 5,371 more rows
Question 8
Recreate the following visualization:
Show Answer
|> ggplot(aes(x = country, fill = poverty)) +
wvs geom_bar(position = "dodge") +
labs(title = "Distribution of Poverty Opinions Across Countries",
x = "Country",
y = "Count",
fill = "Poverty Opinion") +
theme_minimal()
Question 9
Recreate the following visualization:
Show Answer
|>
wvs ggplot(aes(x = country, fill = degree)) +
geom_bar(position = "fill") +
labs(title = "Proportion of Degree Status by Country",
x = "Country", y = "Count", fill = "Degree Status") +
theme_minimal()
Question 10
Recreate the following visualization:
Show Answer
|> ggplot(aes(x = country, y = age)) +
wvs geom_boxplot() +
labs(title = "Age Distribution by Country", x = "Country", y = "Age") +
theme_minimal() +
theme(legend.position = "none")
Question 11
Recreate the following visualization:
Show Answer
|>
wvs ggplot(aes(x = country, y = age)) +
geom_boxplot() +
facet_wrap(~ religion) +
labs(title = "Age Distribution by Country and Religion",
x = "Country", y = "Age") +
theme_minimal()
Question 12
Recreate the following visualization using duncan
dataset:
Show Answer
|>
duncan ggplot(aes(x = income, y = prestige, color = type)) +
geom_jitter() +
labs(title = "Occupation Prestige vs. Income",
x = "Income",
y = "Prestige Score",
color = "Occupation Type") +
theme_minimal()
Question 13
Recreate the following visualization using duncan
dataset:
Show Answer
|>
duncan ggplot(aes(x = type, y = income)) +
geom_boxplot() +
labs(title = "Income Distribution by Occupation Type",
x = "Occupation Type",
y = "Income") +
theme_minimal()
Question 14
Recreate the following visualization using duncan
dataset:
Show Answer
|>
duncan ggplot(aes(x = prestige, y = income, color = type)) +
geom_jitter() +
geom_smooth(method = "lm") +
labs(
title = "Relationship between Prestige and Income",
subtitle = "Grouped by Occupation Type",
x = "Prestige Score",
y = "Income",
color = "Occupation Type"
+
) theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Question 15
Using duncan
dataset:
- Examine the correlation between
prestige
scores andeducation
. - Analyze the relationship between
income
andprestige
scores.
Show Answer
cor.test(duncan$prestige, duncan$education)
Pearson's product-moment correlation
data: duncan$prestige and duncan$education
t = 10.668, df = 43, p-value = 1.171e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7445746 0.9163112
sample estimates:
cor
0.8519156
Show Answer
cor.test(duncan$prestige, duncan$income)
Pearson's product-moment correlation
data: duncan$prestige and duncan$income
t = 10.062, df = 43, p-value = 7.144e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7217665 0.9080298
sample estimates:
cor
0.8378014
Question 16
Using duncan
dataset, compare prestige scores across different occupational categories using ANOVA.
Show Answer
<- aov(prestige ~ type, data=duncan)
duncan_anova summary(duncan_anova)
Df Sum Sq Mean Sq F value Pr(>F)
type 2 33090 16545 65.57 1.21e-13 ***
Residuals 42 10598 252
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question 17
Using duncan
dataset, create a regression model that predicts income
based on occupational prestige
and education
scores.
library(huxtable)
Show Answer
<- lm(income ~ prestige + education, data = duncan)
duncan_model huxreg("income" = duncan_model)
income | |
---|---|
(Intercept) | 10.426 * |
(4.164) | |
prestige | 0.624 *** |
(0.125) | |
education | 0.032 |
(0.132) | |
N | 45 |
R2 | 0.702 |
logLik | -179.902 |
AIC | 367.805 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Question 18
Using the wvs
dataset, examine the relationship between religion
and poverty
perception.
Show Answer
<- chisq.test(table(wvs$religion, wvs$poverty))
wvs_chisq print(wvs_chisq)
Pearson's Chi-squared test
data: table(wvs$religion, wvs$poverty)
X-squared = 0.083005, df = 2, p-value = 0.9593
Question 19
Using the wvs dataset, examine whether there is a significant difference in the average age between individuals with and without a university degree.
Show Answer
<- t.test(age ~ degree, data = wvs)
wvs_ttest print(wvs_ttest)
Welch Two Sample t-test
data: age by degree
t = 7.0571, df = 2029, p-value = 2.325e-12
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
2.674321 4.732708
sample estimates:
mean in group no mean in group yes
45.82775 42.12423
Question 20
Using wvs
dataset, compare the mean of age across different countries and age categories (see Question 7 to create age_category), and investigate if there is any significant difference. Conduct the post-hoc test if needed.
Show Answer
<- aov(age ~ country + age_category, data = wvs)
wvs_anova summary(wvs_anova)
Df Sum Sq Mean Sq F value Pr(>F)
country 3 17399 5800 235.9 <2e-16 ***
age_category 3 1424858 474953 19319.9 <2e-16 ***
Residuals 5374 132113 25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1