Data Visualization in ggplot using World Values Survey data, including guidelines on how to choose the appropriate visualization.
Checklist when you start RStudio
Load the project we created last session and open the R script file.
Make sure that the Environment panel is empty (click the broom icon to clear it).
Clear the Console and Plots too.
Re-run the library(tidyverse) and read_csv portion from the previous session.
Refresher: Loading from CSV into a dataframe
Use read_csv() from the readr package (part of the tidyverse) to load our data into a dataframe.
# import tidyverse library
library(tidyverse)

# read the CSV with WVS data
wvs_cleaned <- read_csv("data-output/wvs_cleaned_v1.csv")

# Convert categorical variables to factors
columns_to_convert <- c("country", "religiousity", "sex", "marital_status", "employment")
wvs_cleaned <- wvs_cleaned |>
  mutate(across(all_of(columns_to_convert), as_factor))

# peek at the data, pay attention to the data types!
glimpse(wvs_cleaned)
Recap: Descriptive Statistics
Univariate (i.e. single variable) Descriptive Stats
Measures of central tendency: mean(), median(), Mode()
Covariance - describe how two variables vary together
Correlation - describe relationship strength and direction in a sample. (If we want to use this to infer about a population from a sample, it would fall under inferential stats)
Visualizations e.g. Scatterplots, side-by-side boxplots, stacked bar charts, etc.
From last week: Basic R Functions for Descriptive Stats
Last week, we explored some basic R functions for descriptive statistics.
mean(): arithmetic average
median(): middle value
sd(): standard deviation
var(): variance
range(): range of values
IQR(): interquartile range
summary(): provides a summary of descriptive statistics
Mode() function from the DescTools package: the most frequently occurring value
Univariate Descriptive Stats
Measures of Central Tendency
Let’s start by examining the age variable in our dataset.
cat("Most frequently occurring age:", mode_age, "\n") # this is here just for demo purposes
Most frequently occurring age: 54
How we can interpret / report this:
“The age distribution of this sample is fairly symmetrical, as indicated by the near-identical mean (48 years) and median (48 years). The mode of 54 years points to a cluster of participants in their mid-50s.”
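As a minimal sketch of how values such as mean_age, median_age, and mode_age could be computed, the block below uses a made-up vector of ages rather than the WVS data. The mode is found with a base-R tabulation as a stand-in for Mode() from DescTools, so the example runs without extra packages.

```r
# Hypothetical ages, made up for illustration (not the WVS sample)
ages <- c(18, 25, 32, 48, 48, 54, 54, 54, 70, 93)

mean_age <- mean(ages)      # arithmetic average
median_age <- median(ages)  # middle value of the sorted ages

# Base-R stand-in for DescTools::Mode(): tabulate the values,
# then pick the value with the highest count
age_counts <- table(ages)
mode_age <- as.numeric(names(age_counts)[which.max(age_counts)])

cat("Mean age:", mean_age, "\n")
cat("Median age:", median_age, "\n")
cat("Most frequently occurring age:", mode_age, "\n")
```

With the real data, the same calls would take wvs_cleaned$age in place of the toy vector.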
cat("Range of age:", range_age[1], "to", range_age[2], "\n")
Range of age: 18 to 93
cat("Interquartile range of age:", iqr_age, "\n")
Interquartile range of age: 28
How we can interpret / report this:
“The ages in this sample are fairly widely spread. The standard deviation of 16.72 years indicates that individuals’ ages typically deviate from the mean by about 16.72 years. The ages range from 18 to 93 years, covering a wide range of age groups within the sample. The interquartile range (IQR) of 28 years, which spans the middle 50% of the data, indicates a moderately wide distribution of ages in the central portion of the dataset.”
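The spread measures reported above could be computed along these lines. This is a sketch on a made-up vector of ages, not the WVS sample; the names range_age and iqr_age follow the cat() calls above, while sd_age is an assumed name.

```r
# Hypothetical ages, made up for illustration (not the WVS sample)
ages <- c(18, 25, 32, 48, 48, 54, 54, 54, 70, 93)

sd_age <- sd(ages)        # sample standard deviation
range_age <- range(ages)  # vector of length 2: minimum and maximum
iqr_age <- IQR(ages)      # distance between the 1st and 3rd quartiles

cat("Standard deviation of age:", sd_age, "\n")
cat("Range of age:", range_age[1], "to", range_age[2], "\n")
cat("Interquartile range of age:", iqr_age, "\n")
```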
Distribution Shape
The functions skewness() and kurtosis() are available through an R package called moments. You may need to install it first with install.packages("moments") before loading the library and calling its functions.
“The age distribution has a very slight right skew (skewness = 0.10), meaning the tail extends slightly further toward older ages, but the skew is minimal since values between -0.5 and 0.5 are considered approximately symmetric. The kurtosis of 2.02 is lower than a normal distribution’s kurtosis of 3, indicating this distribution is platykurtic: it has lighter tails and is more uniform or “flatter” than a normal distribution.”
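To make the two shape measures concrete, here is a sketch that computes them by hand from central moments, using a made-up vector of ages. The helper names (central_moment, skewness_manual, kurtosis_manual) are hypothetical, and the formulas are the standard moment-based definitions that skewness() and kurtosis() from moments are assumed to follow. With the package installed, moments::skewness(ages) and moments::kurtosis(ages) would be the usual route.

```r
# Hypothetical ages, made up for illustration (not the WVS sample)
ages <- c(18, 25, 32, 48, 48, 54, 54, 54, 70, 93)

# k-th central moment: average of (x - mean)^k, dividing by n
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the variance^(3/2)
skewness_manual <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# Kurtosis (not excess kurtosis): a normal distribution scores 3
kurtosis_manual <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

cat("Skewness of age:", skewness_manual(ages), "\n")
cat("Kurtosis of age:", kurtosis_manual(ages), "\n")
```

A perfectly symmetric vector such as c(1, 2, 3, 4, 5) gives a skewness of 0 under this definition.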
Visualizing with ggplot
Describing the spread and shape of a distribution in words alone is hard to follow, so the description is typically accompanied by a visualization.
ggplot2 is a plotting package that is included in the tidyverse.
It works best with data in the long format, i.e., a column for all the dimensions/measures and another column for the value for each dimension/measure.
wvs_cleaned |>
  ggplot(aes(x = age)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "navy") +
  labs(title = "Age distribution of respondents",
       x = "Age",
       y = "Count") +
  theme_minimal()
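To illustrate what "long format" means, the sketch below reshapes a small made-up wide table with pivot_longer() from tidyr (part of the tidyverse). The table wide_scores and the column names measure and score are hypothetical; only the reshaping pattern is the point.

```r
library(tidyr)

# Hypothetical wide table: one column per measure (values are made up)
wide_scores <- data.frame(
  id = c(1, 2, 3),
  life_satisfaction = c(7, 8, 6),
  financial_satisfaction = c(5, 9, 6)
)

# Long format: one column naming the measure, one column holding its value
long_scores <- pivot_longer(
  wide_scores,
  cols = c(life_satisfaction, financial_satisfaction),
  names_to = "measure",
  values_to = "score"
)

print(long_scores)
```

In the long table, measure can be mapped to an aesthetic such as fill, and score to an axis.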
Anatomy of ggplot code
Charts built with ggplot are assembled from the components below. The data, the aesthetic mappings, and at least one geom are needed to draw anything; labels and themes are optional.
wvs_cleaned |>                                                        # (1)
  ggplot(aes(x = age)) +                                              # (2)
  geom_histogram(binwidth = 1, fill = "lightblue", color = "navy") +  # (3)
  labs(title = "Age distribution of respondents",                     # (4)
       x = "Age",
       y = "Count") +
  theme_minimal()                                                     # (5)
(1) Data - the dataframe/tibble to visualize.
(2) Aesthetic mappings (aes) - describe which variables are mapped to the x and y axes, alpha (transparency), and other visual aesthetics.
(3) Geometric objects (geom) - describe how values are rendered: as bars, points, lines, etc.
(4) Labels (labs) - provide titles and axis labels for your graph.
(5) Theme - (optional) apply a theme/look to your graph.
Tip: open the ggplot cheatsheet
A strategy I’d like to recommend: briefly read over the ggplot2 documentation and keep it open in a separate tab. Figure out the type of variables you need to visualize (discrete or continuous) to quickly identify which visualization would make sense.
Going back to our univariate descriptive stats on age variable
Let’s visualize the variability with a boxplot to get a better sense of the spread.
wvs_cleaned |>
  ggplot(aes(x = age)) +
  geom_boxplot(fill = "lightblue", color = "navy") +
  labs(title = "Age distribution of respondents",
       x = "Age") +
  theme_minimal()
Categorical Data - Frequency Distribution
The age variable is numerical / continuous data. We can’t apply mean(), median() and similar numeric summaries to categorical data such as age_group or employment. We can, however, visualize them.
When dealing with categorical data, first decide whether you want to visualize the proportion or the frequency distribution.
Let’s visualize the frequency distribution of survey participants by country:
wvs_cleaned |>
  ggplot(aes(x = country, fill = country)) +
  geom_bar() +
  labs(title = "Participants by Country",
       x = "Country",
       y = "Participants") +
  theme_minimal()
Categorical Data - Proportion
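When we want to show proportion (i.e. in terms of “parts of whole”), we must first calculate it: count the participants in each group, then divide each count by the total. Here we count with group_by() and summarize(); dplyr’s count() is a shortcut for the same counting step.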
Let’s create a new dataframe called wvs_country_proportion to hold this data.
wvs_country_proportion <- wvs_cleaned |>
  group_by(country) |>
  summarize(n = n()) |>             # count the number of participants in each country
  mutate(proportion = n / sum(n))   # calculate proportion

print(wvs_country_proportion)
# A tibble: 3 × 3
country n proportion
<fct> <int> <dbl>
1 CAN 4018 0.628
2 NZL 660 0.103
3 SGP 1725 0.269
Categorical Data - Proportion (cont’d)
We then use this proportion table to create a pie chart. In aes(), map x to an empty string and y to proportion; in geom_bar(), set stat = "identity" so the bar heights come straight from the data; then add a coord_polar() layer after geom_bar().
wvs_country_proportion |>
  ggplot(aes(x = "", y = proportion, fill = country)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Proportion of Participants by Country") +
  theme_minimal()
Learning Check 1A
Using the wvs_cleaned dataset:
Create a histogram that visualizes the distribution of financial_satisfaction
Show answer
wvs_cleaned |>
  ggplot(aes(x = financial_satisfaction)) +
  geom_histogram(fill = "steelblue", color = "white", binwidth = 1) +
  labs(title = "Distribution of Financial Satisfaction",
       x = "Financial Satisfaction",
       y = "Count") +
  theme_minimal()
Learning Check 1B
Create a bar chart that visualizes the frequency of religiousity
Show answer
wvs_cleaned |>
  ggplot(aes(x = religiousity, fill = religiousity)) +
  geom_bar() +
  labs(title = "Frequency of Religiosity",
       x = "Religiosity",
       y = "Count") +
  theme_minimal()
Bivariate Descriptive Stats
Three Combinations in Bivariate Descriptive Stats
Bivariate descriptive statistics describe and summarize relationships between two variables in your dataset without making inferences about a larger population. They include numeric measures like correlation or covariance, and visualizations like scatterplots, side-by-side boxplots, or contingency tables.
Think of them as taking a snapshot of how two variables relate to each other in your current data.
Since data can be continuous or categorical, there can be three combinations when we deal with bivariate descriptive stats:
Both categorical (e.g. age_group and country)
Both continuous (e.g. financial_satisfaction and life_satisfaction)
One continuous, one categorical (e.g. country and life_satisfaction)
Both categorical
Examine relationships between categorical variables
Look at joint distributions and proportions
Compare group compositions
First, let’s create a contingency table of age_group and country!
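One way to build such a contingency table is base R’s table(), sketched here on a small made-up dataframe. The name toy and its values are hypothetical; with the real data, wvs_cleaned$country and wvs_cleaned$age_group would take their place.

```r
# Hypothetical respondents, made up for illustration (not the WVS sample)
toy <- data.frame(
  country   = c("CAN", "CAN", "NZL", "SGP", "SGP", "CAN"),
  age_group = c("18-28", "29-44", "61+", "18-28", "18-28", "29-44")
)

# Rows are countries, columns are age groups, cells are counts
contingency <- table(toy$country, toy$age_group)

print(contingency)
```

Each cell holds the number of respondents who fall into that combination of country and age group.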
We can also create a proportion table just like we did earlier
wvs_cleaned |>
  group_by(country, age_group) |>
  summarise(n = n()) |>        # count the frequency of participants by age group and country
  mutate(prop = n / sum(n))    # calculate proportion
# A tibble: 12 × 4
# Groups: country [3]
country age_group n prop
<fct> <chr> <int> <dbl>
1 CAN 18-28 712 0.177
2 CAN 29-44 1232 0.307
3 CAN 45-60 1061 0.264
4 CAN 61+ 1013 0.252
5 NZL 18-28 27 0.0409
6 NZL 29-44 119 0.180
7 NZL 45-60 222 0.336
8 NZL 61+ 292 0.442
9 SGP 18-28 246 0.143
10 SGP 29-44 511 0.296
11 SGP 45-60 550 0.319
12 SGP 61+ 418 0.242
Both categorical (cont’d)
For categorical data like this, we can use a bar chart to visualize the frequency distribution. A stacked bar chart can be used to visualize proportion.
wvs_cleaned |>
  ggplot(aes(x = country, fill = age_group)) +
  geom_bar(position = "dodge") +
  labs(y = "Count", title = "Age Groups by Country") +
  theme_minimal()
Change position = "dodge" to position = "stack" to stack the bar chart
Both categorical (cont’d)
To get a better sense of the proportion for each country, we can use a percent stacked bar chart.
The code is similar to previous bar charts; we just have to change the position argument to position = "fill"
wvs_cleaned |>
  ggplot(aes(x = country, fill = age_group)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", title = "Age Groups by Country") +
  theme_minimal()
Both continuous
Examine linear relationships
Look for patterns and trends
Identify potential outliers
Let’s first examine the correlation between financial_satisfaction and life_satisfaction
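A correlation like this can be computed with base R’s cor(), sketched below on made-up scores. The dataframe toy and its values are hypothetical; on the real data, the two wvs_cleaned columns would be passed instead, possibly with use = "complete.obs" if there are missing values.

```r
# Hypothetical satisfaction scores, made up for illustration (not the WVS sample)
toy <- data.frame(
  financial_satisfaction = c(3, 5, 6, 7, 9),
  life_satisfaction      = c(4, 5, 7, 7, 9)
)

# Pearson correlation (the default method of cor())
r <- cor(toy$financial_satisfaction, toy$life_satisfaction)

cat("Correlation between financial and life satisfaction:", r, "\n")
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little linear relationship.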
Let’s visualize the two variables together with a jitter / scatterplot!
wvs_cleaned |>
  ggplot(aes(x = financial_satisfaction, y = life_satisfaction)) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(method = "lm") +   # layer with geom_smooth
  labs(title = "Financial vs Life Satisfaction") +
  theme_minimal()
Both continuous (cont’d)
Correlation Plot
When there are more than two continuous variables to explore, a correlation map is sometimes used. We can achieve this with ggplot, but it’s much easier to use the corrplot() function from the corrplot package.
Let’s visualize the correlation map for these three variables.
library(corrplot)

# select all the columns for correlation calculation, save it to columns_for_corr
columns_for_corr <- wvs_cleaned |>
  select(financial_satisfaction, life_satisfaction, age)

# pass the columns_for_corr to cor() function, and save the result to cor_matrix
cor_matrix <- cor(columns_for_corr)

# visualize the cor_matrix with corrplot()!
corrplot(cor_matrix,
         method = "shade",        # show the correlation strength as color shades
         addCoef.col = "black",   # label the coefficients
         tl.col = "black")        # color of the variable labels
Correlation Plot - shorter code
We can shorten the code in the previous slide using the native R pipe |> like so:
wvs_cleaned |>
  select(financial_satisfaction, life_satisfaction, age) |>
  cor() |>
  corrplot(method = "shade",        # show the correlation strength as color shades
           addCoef.col = "black",
           tl.col = "black")
Refresher:
Notice that we don’t have to pass the intermediate objects (columns_for_corr and cor_matrix) to the cor() and corrplot() functions. This is because the native pipe |> acts as a “conveyor belt” that takes the output from one step and immediately feeds it to the next step as its first argument.
One continuous, one categorical
Compare distributions across groups
Identify group differences
Examine spread within groups
Let’s do a recap from last week and get the summary stats for life_satisfaction for each country
# A tibble: 3 × 4
country mean_satisfaction median_satisfaction sd_satisfaction
<fct> <dbl> <dbl> <dbl>
1 CAN 7.04 7 1.81
2 NZL 7.60 8 1.79
3 SGP 7.06 7 1.78
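A summary table of this shape could be produced with group_by() and summarise() from dplyr. The sketch below runs on a small made-up dataframe (toy and its values are hypothetical); the output column names follow the table above.

```r
library(dplyr)

# Hypothetical scores, made up for illustration (not the WVS sample)
toy <- data.frame(
  country           = c("CAN", "CAN", "CAN", "NZL", "NZL", "NZL"),
  life_satisfaction = c(6, 7, 8, 7, 8, 9)
)

# One row per country, with the mean, median, and standard deviation
toy_summary <- toy |>
  group_by(country) |>
  summarise(
    mean_satisfaction   = mean(life_satisfaction),
    median_satisfaction = median(life_satisfaction),
    sd_satisfaction     = sd(life_satisfaction)
  )

print(toy_summary)
```

On the real data, wvs_cleaned would replace toy at the start of the pipe.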
One continuous, one categorical (cont’d)
To get a better sense of how the data varies and spreads within each country, let’s visualize it with side-by-side boxplots.
wvs_cleaned |>
  ggplot(aes(x = country, y = life_satisfaction)) +
  geom_boxplot() +
  labs(title = "Life Satisfaction by Country") +
  theme_minimal()
One continuous, one categorical (cont’d)
We could also layer our boxplots with violin plots to get a better sense of the distribution of each group.
wvs_cleaned |>
  ggplot(aes(x = country, y = life_satisfaction)) +
  geom_violin(fill = "lightblue", alpha = 0.5) +
  geom_boxplot(width = 0.1, fill = "white") +
  labs(title = "Life Satisfaction Distribution by Country") +
  theme_minimal()
How to save your images - ggsave
There are two ways to do this:
Via ggsave
The point-and-click way in RStudio
Below is the ggsave way:
# save the chart into an object instead of viewing it like we have been doing
boxplot_obj <- wvs_cleaned |>
  ggplot(aes(x = age)) +
  geom_boxplot(fill = "lightblue", color = "navy") +
  labs(title = "Age distribution of respondents",
       x = "Age") +
  theme_minimal()

# pass the saved chart into ggsave and give it a filename
ggsave("fig-output/boxplot_1.jpg", boxplot_obj)
How to save your images - point-and-click
Learning Check #2
Create side-by-side boxplots that visualize political_scale for each sex.
Show answer
wvs_cleaned |>
  ggplot(aes(x = political_scale, y = sex)) +
  geom_violin(fill = "lightblue", alpha = 0.5) +
  geom_boxplot(width = 0.1, fill = "white") +
  labs(title = "Political scale Distribution by Sex") +
  theme_minimal()
Using Facets for more complex visuals
Compare patterns across multiple subgroups
Identify interaction effects
Maintain visual clarity with complex relationships
Facet grids are useful when we have more than two variables to visualize. However, if used excessively, they may become too complex.
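As a rough sketch of faceting, the block below splits a scatterplot into one panel per country with facet_wrap(). The dataframe toy and its values are made up for illustration; with the real data, wvs_cleaned would be used and geom_jitter() might suit the many overlapping points better.

```r
library(ggplot2)

# Hypothetical scores, made up for illustration (not the WVS sample)
toy <- data.frame(
  country = rep(c("CAN", "NZL", "SGP"), each = 4),
  financial_satisfaction = c(3, 5, 7, 9, 4, 6, 7, 8, 2, 5, 6, 9),
  life_satisfaction      = c(4, 5, 7, 8, 5, 6, 8, 8, 3, 6, 6, 9)
)

# facet_wrap(~ country) draws one panel per country with shared axes
facet_plot <- ggplot(toy, aes(x = financial_satisfaction, y = life_satisfaction)) +
  geom_point() +
  facet_wrap(~ country) +
  labs(title = "Financial vs Life Satisfaction by Country") +
  theme_minimal()

print(facet_plot)
```

Faceting by a third variable keeps each panel simple while still allowing comparison across groups.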