# import tidyverse librarylibrary(tidyverse)library(gtsummary)library(huxtable)# read the CSV with Chile voting datachile_data <-read_csv("data-output/chile_voting_processed.csv")chile_data <- chile_data |>mutate(across(c("region", "sex", "education", "vote", "age_group", "support_level"), as_factor))#reorderingchile_data <- chile_data |>mutate(education =factor(education, levels =c("P", "S", "PS"), ordered =TRUE))# peek at the data, pay attention to the data types!glimpse(chile_data)
Simple Linear Regression
Simple Linear Regression: what is it?
Linear regression is a statistical method used to model the relationship between a dependent variable (outcome) and one or more independent variables (predictors) by fitting a linear equation to the observed data. The math formula looks like this:
\[
Y = \beta_0 + \beta_1X + \varepsilon
\]
\(Y\) - the dependent variable; must be continuous
\(X\) - the independent variable (if there are more than one, there will be \(X_1\) , \(X_2\) , and so on. This can be ordinal, nominal, or continuous
\(\beta_0\) - the y-intercept. Represents the expected value of independent variable \(Y\) when independent variable(s) \(X\) are set to zero.
\(\beta_1\) - the slope / coefficient for independent variable
\(\varepsilon\) - the error term. (In some examples you might see this omitted from the formula).
Examples:
Does a person’s age affect their income?
Do a person’s age and education level affect their income?
Linear Regression: One continuous predictor
Research Question: Does a person’s age influence their support to incumbent in Chile?
The outcome/DV (\(Y\)): statusquo
The predictor/IV (\(X\)): age
statsquo_model1 <-lm(statusquo ~ age, data = chile_data)summary(statsquo_model1) #summarize the result
Call: the formula
Residuals: overview on the distribution of residuals (expected value minus observed value) – we can plot this to check for homoscedasticity -
Coefficients: shows the intercept, the regression coefficients for the predictor variables, and their statistical significance
Residual standard error: the average difference between observed and expected outcome by the model. Generally the lower, the better.
R-squared & Adjusted R-squared: indicates the proportion of variation in the outcome that can be explained by the model (i.e. goodness of fit).
F-statistics: indicates whether the model as a whole is statistically significant and whether it explains more variance than just the baseline (intercept-only) model.
Linear Regression: One continuous predictor
Call:
lm(formula = statusquo ~ age, data = chile_data)
Residuals:
Min 1Q Median 3Q Max
-1.58205 -0.97443 -0.07783 0.97579 1.87278
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.304966 0.056620 -5.386 7.89e-08 ***
age 0.007670 0.001381 5.555 3.09e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9984 on 2429 degrees of freedom
Multiple R-squared: 0.01254, Adjusted R-squared: 0.01214
F-statistic: 30.85 on 1 and 2429 DF, p-value: 3.086e-08
Narrating the results
Here is one possible way to narrate your result:
A linear regression analysis was conducted to assess the influence of age on support for the incumbent (statusquo) in Chile. The coefficient for age was 0.008 (SE = 0.001), indicating that for each additional year of age, support for the incumbent increased by 0.008 units. This effect was statistically significant at p < 0.001.
The model explained a small portion of the variance in support for the statusquo (R² = 0.013, Adjusted R² = 0.012). The F-statistic was 30.85 (p < 0.001), further suggesting that age is a significant predictor of support for the incumbent.
Present the regression tables!
huxreg("statusquo"= statsquo_model1)
statusquo
(Intercept)
-0.305 ***
(0.057)
age
0.008 ***
(0.001)
N
2431
R2
0.013
logLik
-3444.614
AIC
6895.227
*** p < 0.001; ** p < 0.01; * p < 0.05.
Linear Regression: Multiple continuous predictors
Research Question: Do a person’s age and their income have an impact on their support for the incumbent?
The outcome/DV (\(Y\)): statusquo
The predictor/IV (\(X\)): age and income
# let's divide income by 1000 to make the scale bit more similarchile_data$income_downscaled <- chile_data$income /10000statsquo_model2 <-lm(statusquo ~ age + income_downscaled, data = chile_data)summary(statsquo_model2)
Linear Regression: Multiple continuous predictors
Call:
lm(formula = statusquo ~ age + income_downscaled, data = chile_data)
Residuals:
Min 1Q Median 3Q Max
-1.58170 -0.97166 -0.06689 0.97791 1.88982
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.335698 0.059682 -5.625 2.07e-08 ***
age 0.007738 0.001381 5.603 2.35e-08 ***
income_downscaled 0.008276 0.005097 1.624 0.105
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9981 on 2428 degrees of freedom
Multiple R-squared: 0.01361, Adjusted R-squared: 0.0128
F-statistic: 16.76 on 2 and 2428 DF, p-value: 5.93e-08
Alternative package to present regression result
statsquo_model2 |>tbl_regression() |>bold_p()
Characteristic
Beta
95% CI1
p-value
age
0.01
0.01, 0.01
<0.001
income_downscaled
0.01
0.00, 0.02
0.10
1 CI = Confidence Interval
Reporting result: A sample regression table
You might encounter different table formats when reporting regression results, but there are some key elements that should generally be included.
These are: the number of observations (\(N\)), the \(\beta\) coefficients, standard errors (SE), confidence intervals (95% CI), and p-values. Other metrics to include are the \(R^2\) and \(F\) statistics.
Presenting your models
Don’t forget to install the library by running this line in your terminal: install.packages("huxtable")
Caution! When doing regression-type of tests, watch out for multicollinearity.
Multicollinearity is a situation in which two or more predictor variables are highly correlated with each other. This makes it difficult to determine the specific contribution of each predictor variable to the relationship.
One way to check for it:
Assess the correlation between your predictor variables in your model using Variance Inflation Factor (VIF)
If they seem to be highly correlated (> 5 or so), one of the easiest (and somewhat acceptable) way is to simply remove the less significant predictor from your model :D
car::vif(statsquo_model2)
age income_downscaled
1.0009 1.0009
Linear Regression: One categorical predictor
Research question: Explore the difference of support for statusquo between different sex.
The outcome/DV (\(Y\)): statusquo
The predictor/IV (\(X\)): sex
Note
Before proceeding with analysis, ensure that all the categorical variables involved are cast as factors!
statusquo_model3 <-lm(statusquo ~ sex, data = chile_data)summary(statusquo_model3)
Call:
lm(formula = statusquo ~ sex, data = chile_data)
Residuals:
Min 1Q Median 3Q Max
-1.77948 -0.98619 -0.06878 0.96289 1.79341
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.07986 0.02917 -2.737 0.00624 **
sexF 0.13339 0.04068 3.279 0.00106 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.003 on 2429 degrees of freedom
Multiple R-squared: 0.004406, Adjusted R-squared: 0.003997
F-statistic: 10.75 on 1 and 2429 DF, p-value: 0.001057
When interpreting a categorical predictor in regression, one category is treated as the reference category, which serves as the baseline for comparison. In this case, the reference category corresponds to the intercept.
By default, the first category in the data is used as the reference category, unless specified otherwise.
Narrating the result
Here is one possible way to narrate your result:
A linear regression analysis was conducted to explore the relationship between gender (male vs. female) of voters and support for the status quo in Chile (statusquo). The model includes gender as the only predictor. Results revealed that the intercept, representing the mean statusquo score for male voters, was -0.080 (SE = 0.030). This was statistically significant at p < 0.01, indicating that on average, male voters showed a negative inclination towards the status quo.
The coefficient for female voters was 0.134 (SE = 0.041), indicating that being female was associated with an increase of 0.134 units in support for the status quo compared to male voters. This effect was statistically significant p < 0.01
The model explain a small portion of the variance in support for the status quo (R^2 = 0.004, Adjusted R^2 = 0.004). The F-statistic (10.75, p < 0.01) suggests that gender is a significant predictor of support for the status quo.
Previously, we could use the relevel() function to change the reference category for ordered factors. However, in recent R versions, this no longer works for ordered factors, so we now use the method shown in the code above. The relevel() function still works for unordered factors.
Let’s try this Linear Regression exercise! (5 mins)
Create a regression model called statusquo_model4 that predicts the statusquo score based on region. Make sure the reference category is region “M”.
It can also be written like below, in which the \(logit(P)\) part is expanded:
\[
P = \frac{1}{1 + e^{-(\beta_0 + \beta_1X)}}
\]
Binary Logistic Regression Examples
Does a person’s age and education level influence whether they will vote Yes or No in a referendum?
Does the number of hours spent studying impact a student’s likelihood of passing a module? (pass/fail outcome)
In essence, the goal of binary logistic regression is to estimate the probability of a specific event happening when there are only two possible outcomes (hence the term “binary”).
Binary Logistic Regression: One Continuous Predictor
Research Question: Does age affect the likelihood of voting “Yes” in the plebiscite?
The outcome/DV (\(Y\)): vote - there are 4 outcomes here, but for the purpose of this workshop practice, let’s define the outcome as “Yes” and “not Yes” outcome
The predictor/IV (\(X\)): age
Dummy-coding dependent variable
Before we proceed with the calculations, we need to dummy code the dependent variable into 1 and 0, with 1 = Yes and 0 = Not Yes. More info on dummy coding here
# First, we need to create a binary outcomechile_data <- chile_data |>mutate(vote_yes =if_else(vote =="Y", 1, 0))# the if_else is from dplyr package (from session 2)
# A tibble: 2,431 × 2
vote vote_yes
<fct> <dbl>
1 Y 1
2 N 0
3 Y 1
# ℹ 2,428 more rows
Conduct the analysis
Let’s conduct the analysis!
vote_age <-glm(vote_yes ~ age,family = binomial, data = chile_data)summary(vote_age)
Call:
glm(formula = vote_yes ~ age, family = binomial, data = chile_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.166859 0.121266 -9.622 < 2e-16 ***
age 0.013451 0.002898 4.641 3.47e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3129.1 on 2430 degrees of freedom
Residual deviance: 3107.6 on 2429 degrees of freedom
AIC: 3111.6
Number of Fisher Scoring iterations: 4
Exponentiate the coefficients
If you recall the formula, the results are expressed in Logit Probability. As we typically report the result in terms of Odd Ratios (OR), perform exponentiation on the coefficients.
(The intercept isn’t typically interpreted as an odds ratio, so we’ll ignore that for now)
The result above indicates that like one unit increase of age will increase the odds of voting “Yes” by a factor of 1.0135. i.e. for each year increase in age, there is a 1.35% increase in the odds of voting “Yes”. \(((1.0135 - 1) * 100 = 1.35)\) . This positive effect is very small.
For reporting purposes: get the Rsquared
In the report, the model’s Chi square and Rsquared will be reported as well to indicate the model fit. However, since the resulting item does not have this info, we will call upon DescTools’ PseudoR2() function to help!
# by default, McFadden will be usedDescTools::PseudoR2(vote_age)
Below is a sample of how you may want to narrate your result. Note the resulting values mentioned in the paragraph below.
In a nutshell, you will most likely have to mention the p-value, the \(\beta\) coefficients (for linear regressions), the Odd Ratios (for logistic regression) with the confidence intervals, and the adjusted R-squared. You should also include these information in your regression table.
Binary Logistic Regression: One Categorical Predictor
Research Question: Does region affect the likelihood of voting “Yes” in the plebiscite?
The outcome/DV (\(Y\)): vote - there are 4 outcomes here, but for the purpose of this workshop practice, let’s define the outcome as “Yes” and “not Yes” outcome
Binary Logistic Regression: One Categorical Predictor
Call:
glm(formula = vote_yes ~ region, family = "binomial", data = chile_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.02667 0.23096 -0.115 0.908075
regionN -0.24382 0.25826 -0.944 0.345125
regionC -0.81542 0.24903 -3.274 0.001059 **
regionS -0.34709 0.24427 -1.421 0.155331
regionSA -0.93211 0.24337 -3.830 0.000128 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3129.1 on 2430 degrees of freedom
Residual deviance: 3078.3 on 2426 degrees of freedom
AIC: 3088.3
Number of Fisher Scoring iterations: 4
Exponentiate the coefficients
If you recall the formula, the results are expressed in Logit Probability. As we typically report the result in terms of Odd Ratios (OR), perform exponentiation on the coefficients.
(The intercept isn’t typically interpreted as an odds ratio, so we’ll ignore that for now)
The result above indicates that the odds for people in region C to vote “Yes” are about 0.56 times the odds of people in region N (the reference category). i.e., There is a 43.54% decrease in odds for region C residents to vote “Yes” compared to region N residents. \(((0.5646 - 1) * 100 = -43.54)\)
Get the Rsquared
Let’s retrieve the Rsquared value:
# the default is McFaddenDescTools::PseudoR2(vote_region)
McFadden
0.01622922
DescTools::PseudoR2(vote_region, which ="CoxSnell")
Let’s try this Logistic Regression exercise! (5 mins)
Create a regression model called vote_model that predicts the likelihood of voting “Yes” based on sex.
Code
vote_model <-glm(vote_yes ~ sex,family ="binomial", data = chile_data) summary(vote_model)exp(coefficients(vote_model))
What is Quarto? What is Markdown?
Markdown (Specifically, R Markdown)
Markdown is a lightweight markup language that provides a simple and readable way to write formatted text without using complex HTML or LaTeX. It is designed to make authoring content easy for everyone!
Markdown files can be converted into HTML or other formats.
Generic Markdown file will have .md extension.
R Markdown is an extension of Markdown that incorporates R code chunks and allows you to create dynamic documents that integrate text, code, and output (such as tables and plots).
RMarkdown file will have .Rmd extension.
RMarkdown in action
How it all works
Illustration by Allison Horst (www.allisonhorst.com)
Quarto
Quarto is a multi-language, next-generation version of R Markdown from Posit and includes dozens of new features and capabilities while at the same being able to render most existing Rmd files without modification.
Illustration by Allison Horst (www.allisonhorst.com)
Quarto in action: R scripts + Markdown
R Scripts vs Quarto
R Scripts
Great for quick debugging, experiment
Preferred format if you are archiving your code to GitHub or data repository
More suitable for “production” tasks e.g. automating your data cleaning and processing, custom functions, etc.
Quarto
Great for report and presentation to showcase your research insights/process as it integrates code, narrative text, visualizations, and results.
Very handy when you need your report in multiple format, e.g. in Word and PPT.
Let’s create our first Quarto document!
Go to File > New File > Quarto Document
Put any title you like, and put your name as the author
Check HTML as the end result for now
Click on Create!
Optional: collapse the console and environment tab (bottom left and top right) to make it easier to view the quarto document and the output.
Quarto Tour + Hands On! (Open this cheatsheet on another tab if you’d like!). We will explore how to:
Add narrative text
Add code chunks
Add math formulas with LaTeX
Add citations (you need to have Zotero installed)
Rendering Quarto to HTML, Word, and PDF
You can change the final result of rendering in the YAML section of your document.
Rendering to HTML is the default option.
You can also render as presentation (fun fact: my slides is made from Quarto!)
Rendering to Word: You have to have MS Word installed in your laptop
Rendering to PDF: If you encounter an error when converting your result to PDF, the faster (and easier) alternative is to render your doc to Word, and save to PDF from there.
Best Practices + More Resources
R Best Practices
Use <- for assigning values to objects.
Only use = when passing values to a function parameter.
Do not alter your raw data; save your wrangled/cleaned data into a new file and keep it separate from the raw data.
Make use of R projects to organize your data and make it easier to send over to your collaborators.
Having said that, when it comes to coding project, the best way to collaborate is using GitHub or similar platforms.
Whenever possible and makes sense for your project, follow the common convention when naming your objects, scripts, and functions. One guide that you can follow is Hadley Wickham’s tidyverse style guide.
Thank you for your participation 😄
All the best for your studies and academic journey! (manifesting excellent grades for everyone who attended the workshop)
Need help with R or Quarto? Please don’t hesitate to contact me at bellar@smu.edu.sg
Post-workshop survey
Please scan this QR code or click on the link below to fill in the post-workshop survey. It should not take more than 2-3 minutes.