Regression and Introduction to Quarto

Bella Ratmelia

Today’s Outline

Simple Linear Regression in R
Binary Logistic Regression in R
Create your manuscript with Quarto

Load our data for today!

# import tidyverse library
library(tidyverse)
library(gtsummary)
library(huxtable)

# read the CSV with Chile voting data
chile_data <- read_csv("data-output/chile_voting_processed.csv")

chile_data <- chile_data |> 
    mutate(across(c("region", "sex", "education", "vote", "age_group", "support_level"), as_factor))

#reordering
chile_data <- chile_data |> 
    mutate(education = factor(education, 
                         levels = c("P", "S", "PS"), 
                         ordered = TRUE))

# peek at the data, pay attention to the data types!
glimpse(chile_data)

Simple Linear Regression

Simple Linear Regression: what is it?

Linear regression is a statistical method used to model the relationship between a dependent variable (outcome) and one or more independent variables (predictors) by fitting a linear equation to the observed data. The math formula looks like this:

\[ Y = \beta_0 + \beta_1X + \varepsilon \]

\(Y\) - the dependent variable; must be continuous
\(X\) - the independent variable (if there are more than one, there will be \(X_1\) , \(X_2\) , and so on. This can be ordinal, nominal, or continuous
\(\beta_0\) - the y-intercept. Represents the expected value of independent variable \(Y\) when independent variable(s) \(X\) are set to zero.
\(\beta_1\) - the slope / coefficient for independent variable
\(\varepsilon\) - the error term. (In some examples you might see this omitted from the formula).

Examples:

Does a person’s age affect their income?
Do a person’s age and education level affect their income?

Linear Regression: One continuous predictor

Research Question: Does a person’s age influence their support to incumbent in Chile?

The outcome/DV (\(Y\)): statusquo
The predictor/IV (\(X\)): age

statsquo_model1 <- lm(statusquo ~ age, data = chile_data)
summary(statsquo_model1) #summarize the result

Call: the formula
Residuals: overview on the distribution of residuals (expected value minus observed value) – we can plot this to check for homoscedasticity -
Coefficients: shows the intercept, the regression coefficients for the predictor variables, and their statistical significance
Residual standard error: the average difference between observed and expected outcome by the model. Generally the lower, the better.
R-squared & Adjusted R-squared: indicates the proportion of variation in the outcome that can be explained by the model (i.e. goodness of fit).
F-statistics: indicates whether the model as a whole is statistically significant and whether it explains more variance than just the baseline (intercept-only) model.

Linear Regression: One continuous predictor


Call:
lm(formula = statusquo ~ age, data = chile_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.58205 -0.97443 -0.07783  0.97579  1.87278 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.304966   0.056620  -5.386 7.89e-08 ***
age          0.007670   0.001381   5.555 3.09e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9984 on 2429 degrees of freedom
Multiple R-squared:  0.01254,   Adjusted R-squared:  0.01214 
F-statistic: 30.85 on 1 and 2429 DF,  p-value: 3.086e-08

Narrating the results

Here is one possible way to narrate your result:

A linear regression analysis was conducted to assess the influence of age on support for the incumbent (statusquo) in Chile. The coefficient for age was 0.008 (SE = 0.001), indicating that for each additional year of age, support for the incumbent increased by 0.008 units. This effect was statistically significant at p < 0.001.

The model explained a small portion of the variance in support for the statusquo (R² = 0.013, Adjusted R² = 0.012). The F-statistic was 30.85 (p < 0.001), further suggesting that age is a significant predictor of support for the incumbent.

Present the regression tables!

huxreg("statusquo" = statsquo_model1)

	statusquo
(Intercept)	-0.305 ***
	(0.057)
age	0.008 ***
	(0.001)
N	2431
R2	0.013
logLik	-3444.614
AIC	6895.227
* p < 0.001; p < 0.01; * p < 0.05.

Linear Regression: Multiple continuous predictors

Research Question: Do a person’s age and their income have an impact on their support for the incumbent?

The outcome/DV (\(Y\)): statusquo
The predictor/IV (\(X\)): age and income

# let's divide income by 1000 to make the scale bit more similar
chile_data$income_downscaled <- chile_data$income / 10000

statsquo_model2 <- lm(statusquo ~ age + income_downscaled, data = chile_data)
summary(statsquo_model2)

Linear Regression: Multiple continuous predictors


Call:
lm(formula = statusquo ~ age + income_downscaled, data = chile_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.58170 -0.97166 -0.06689  0.97791  1.88982 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -0.335698   0.059682  -5.625 2.07e-08 ***
age                0.007738   0.001381   5.603 2.35e-08 ***
income_downscaled  0.008276   0.005097   1.624    0.105    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9981 on 2428 degrees of freedom
Multiple R-squared:  0.01361,   Adjusted R-squared:  0.0128 
F-statistic: 16.76 on 2 and 2428 DF,  p-value: 5.93e-08

Alternative package to present regression result

statsquo_model2 |> tbl_regression() |> bold_p()

Characteristic	Beta	95% CI¹	p-value
age	0.01	0.01, 0.01	<0.001
income_downscaled	0.01	0.00, 0.02	0.10
¹ CI = Confidence Interval

Reporting result: A sample regression table

You might encounter different table formats when reporting regression results, but there are some key elements that should generally be included.

These are: the number of observations (\(N\)), the \(\beta\) coefficients, standard errors (SE), confidence intervals (95% CI), and p-values. Other metrics to include are the \(R^2\) and \(F\) statistics.

Presenting your models

Don’t forget to install the library by running this line in your terminal: install.packages("huxtable")

For more info about huxreg, go here: https://cran.r-project.org/web/packages/huxtable/vignettes/huxreg.html

library(huxtable)
huxreg("statusquo (model1)" = statsquo_model1, 
       "statusquo (model2)" = statsquo_model2,
       number_format = 4,
       bold_signif = 0.05,
       statistics = c( "R squared" = "r.squared", "N" = "nobs", "F" = "statistic",
      "P value" = "p.value"))

	statusquo (model1)	statusquo (model2)
(Intercept)	-0.3050 ***	-0.3357 ***
	(0.0566)	(0.0597)
age	0.0077 ***	0.0077 ***
	(0.0014)	(0.0014)
income_downscaled		0.0083
		(0.0051)
R squared	0.0125	0.0136
N	2431	2431
F	30.8532	16.7552
P value	0.0000	0.0000
* p < 0.001; p < 0.01; * p < 0.05.

FYI: Multicollinearity

Caution! When doing regression-type of tests, watch out for multicollinearity.

Multicollinearity is a situation in which two or more predictor variables are highly correlated with each other. This makes it difficult to determine the specific contribution of each predictor variable to the relationship.

One way to check for it:

Assess the correlation between your predictor variables in your model using Variance Inflation Factor (VIF)
If they seem to be highly correlated (> 5 or so), one of the easiest (and somewhat acceptable) way is to simply remove the less significant predictor from your model :D

car::vif(statsquo_model2)

              age income_downscaled 
           1.0009            1.0009

Linear Regression: One categorical predictor

Research question: Explore the difference of support for statusquo between different sex.

The outcome/DV (\(Y\)): statusquo
The predictor/IV (\(X\)): sex

Note

Before proceeding with analysis, ensure that all the categorical variables involved are cast as factors!

str(chile_data$statusquo)

 num [1:2431] 1.01 -1.3 1.23 -1.03 -1.1 ...

str(chile_data$sex)

 Factor w/ 2 levels "M","F": 1 1 2 2 2 2 1 2 2 1 ...

Continuing the analysis

The analysis summary should look like this:

statusquo_model3 <- lm(statusquo ~ sex, data = chile_data)
summary(statusquo_model3)


Call:
lm(formula = statusquo ~ sex, data = chile_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.77948 -0.98619 -0.06878  0.96289  1.79341 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -0.07986    0.02917  -2.737  0.00624 **
sexF         0.13339    0.04068   3.279  0.00106 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.003 on 2429 degrees of freedom
Multiple R-squared:  0.004406,  Adjusted R-squared:  0.003997 
F-statistic: 10.75 on 1 and 2429 DF,  p-value: 0.001057

When interpreting a categorical predictor in regression, one category is treated as the reference category, which serves as the baseline for comparison. In this case, the reference category corresponds to the intercept.

By default, the first category in the data is used as the reference category, unless specified otherwise.

Narrating the result

Here is one possible way to narrate your result:

A linear regression analysis was conducted to explore the relationship between gender (male vs. female) of voters and support for the status quo in Chile (statusquo). The model includes gender as the only predictor. Results revealed that the intercept, representing the mean statusquo score for male voters, was -0.080 (SE = 0.030). This was statistically significant at p < 0.01, indicating that on average, male voters showed a negative inclination towards the status quo.

The coefficient for female voters was 0.134 (SE = 0.041), indicating that being female was associated with an increase of 0.134 units in support for the status quo compared to male voters. This effect was statistically significant p < 0.01

The model explain a small portion of the variance in support for the status quo (R^2 = 0.004, Adjusted R^2 = 0.004). The F-statistic (10.75, p < 0.01) suggests that gender is a significant predictor of support for the status quo.

Presenting the regression table

For more info about huxreg, go here: https://cran.r-project.org/web/packages/huxtable/vignettes/huxreg.html

library(huxtable)
huxreg("statusquo (model3)" = statusquo_model3,
       number_format = 4,
       bold_signif = 0.05,
       statistics = c( "R squared" = "r.squared", "N" = "nobs", "F" = "statistic",
      "P value" = "p.value"))

	statusquo (model3)
(Intercept)	-0.0799 **
	(0.0292)
sexF	0.1334 **
	(0.0407)
R squared	0.0044
N	2431
F	10.7507
P value	0.0011
* p < 0.001; p < 0.01; * p < 0.05.

Categorical predictor: changing the reference

Let’s change the reference category for sex variable to “F”.

chile_data <- chile_data |> 
    mutate(sex = relevel(sex, ref = "F"))

str(chile_data$sex)

 Factor w/ 2 levels "F","M": 2 2 1 1 1 1 2 1 1 2 ...

Re-run the analysis with the new reference category

statsquo_model3 <- lm(statusquo ~ sex, data = chile_data)
summary(statsquo_model3)

Categorical predictor: changing the reference


Call:
lm(formula = statusquo ~ sex, data = chile_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.77948 -0.98619 -0.06878  0.96289  1.79341 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.05354    0.02836   1.888  0.05915 . 
sexM        -0.13339    0.04068  -3.279  0.00106 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.003 on 2429 degrees of freedom
Multiple R-squared:  0.004406,  Adjusted R-squared:  0.003997 
F-statistic: 10.75 on 1 and 2429 DF,  p-value: 0.001057

Changing baseline category for ordered factor

chile_data <- chile_data |> 
    mutate(education = factor(education, 
                         levels = c("S", "P", "PS"), 
                         ordered = TRUE))

Releveling ordered factor

Previously, we could use the relevel() function to change the reference category for ordered factors. However, in recent R versions, this no longer works for ordered factors, so we now use the method shown in the code above. The relevel() function still works for unordered factors.

Let’s try this Linear Regression exercise! (5 mins)

Create a regression model called statusquo_model4 that predicts the statusquo score based on region. Make sure the reference category is region “M”.

Code

chile_data <- chile_data |> 
    mutate(region = relevel(region, ref = "M"))

statusquo_model4 <- lm(statusquo ~ region, data = chile_data)
summary(statusquo_model4)

Binary Logistic Regression

Binary Logistic Regression - what is it?

Also known as simply logistic regression, it is used to model the relationship between a set of independent variables and a binary outcome.

These independent variables can be either categorical or continuous.

Binary Logistics Regression formula:

\[ logit(P) = \beta_0 + \beta_1X_1 + \beta_2X_2 + … + \beta_nX_n \]

It can also be written like below, in which the \(logit(P)\) part is expanded:

\[ P = \frac{1}{1 + e^{-(\beta_0 + \beta_1X)}} \]

Binary Logistic Regression Examples

Does a person’s age and education level influence whether they will vote Yes or No in a referendum?
Does the number of hours spent studying impact a student’s likelihood of passing a module? (pass/fail outcome)

In essence, the goal of binary logistic regression is to estimate the probability of a specific event happening when there are only two possible outcomes (hence the term “binary”).

Binary Logistic Regression: One Continuous Predictor

Research Question: Does age affect the likelihood of voting “Yes” in the plebiscite?

The outcome/DV (\(Y\)): vote - there are 4 outcomes here, but for the purpose of this workshop practice, let’s define the outcome as “Yes” and “not Yes” outcome
The predictor/IV (\(X\)): age

Dummy-coding dependent variable

Before we proceed with the calculations, we need to dummy code the dependent variable into 1 and 0, with 1 = Yes and 0 = Not Yes. More info on dummy coding here

# First, we need to create a binary outcome
chile_data <- chile_data |>
    mutate(vote_yes = if_else(vote == "Y", 1, 0))
# the if_else is from dplyr package (from session 2)

# A tibble: 2,431 × 2
  vote  vote_yes
  <fct>    <dbl>
1 Y            1
2 N            0
3 Y            1
# ℹ 2,428 more rows

Conduct the analysis

Let’s conduct the analysis!

vote_age <- glm(vote_yes ~ age,
                family = binomial, 
                data = chile_data)

summary(vote_age)


Call:
glm(formula = vote_yes ~ age, family = binomial, data = chile_data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.166859   0.121266  -9.622  < 2e-16 ***
age          0.013451   0.002898   4.641 3.47e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3129.1  on 2430  degrees of freedom
Residual deviance: 3107.6  on 2429  degrees of freedom
AIC: 3111.6

Number of Fisher Scoring iterations: 4

Exponentiate the coefficients

If you recall the formula, the results are expressed in Logit Probability. As we typically report the result in terms of Odd Ratios (OR), perform exponentiation on the coefficients.

library(gtsummary)
vote_age |> tbl_regression(exponentiate = TRUE) |> 
    bold_p()

Characteristic	OR¹	95% CI¹	p-value
age	1.01	1.01, 1.02	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Exponentiate the coefficients: interpretation

Possible interpretation:

(The intercept isn’t typically interpreted as an odds ratio, so we’ll ignore that for now)

The result above indicates that like one unit increase of age will increase the odds of voting “Yes” by a factor of 1.0135. i.e. for each year increase in age, there is a 1.35% increase in the odds of voting “Yes”. \(((1.0135 - 1) * 100 = 1.35)\) . This positive effect is very small.

For reporting purposes: get the Rsquared

In the report, the model’s Chi square and Rsquared will be reported as well to indicate the model fit. However, since the resulting item does not have this info, we will call upon DescTools’ PseudoR2() function to help!

# by default, McFadden will be used
DescTools::PseudoR2(vote_age)

   McFadden 
0.006886506

DescTools::PseudoR2(vote_age, 
                    which = "CoxSnell")

   CoxSnell 
0.008824905

DescTools::PseudoR2(vote_age, 
                    which = "all")

       McFadden     McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson 
   6.886506e-03    5.608183e-03    8.824905e-03    1.218997e-02    8.786194e-03 
VeallZimmermann           Efron McKelveyZavoina            Tjur             AIC 
   1.561219e-02    8.994005e-03    1.168768e-02    8.961863e-03    3.111552e+03 
            BIC          logLik         logLik0              G2 
   3.123144e+03   -1.553776e+03   -1.564550e+03    2.154857e+01

Reporting result: in-text narration

Below is a sample of how you may want to narrate your result. Note the resulting values mentioned in the paragraph below.

In a nutshell, you will most likely have to mention the p-value, the \(\beta\) coefficients (for linear regressions), the Odd Ratios (for logistic regression) with the confidence intervals, and the adjusted R-squared. You should also include these information in your regression table.

Reporting result: regression table

huxreg("Vote 'Yes'" = vote_age, statistics = c(R2 = "r.squared", "logLik", "AIC"))

	Vote 'Yes'
(Intercept)	-1.167 ***
	(0.121)
age	0.013 ***
	(0.003)
logLik	-1553.776
AIC	3111.552
* p < 0.001; p < 0.01; * p < 0.05.

Binary Logistic Regression: One Categorical Predictor

Research Question: Does region affect the likelihood of voting “Yes” in the plebiscite?

The outcome/DV (\(Y\)): vote - there are 4 outcomes here, but for the purpose of this workshop practice, let’s define the outcome as “Yes” and “not Yes” outcome
The predictor/IV (\(X\)): region

vote_region <- glm(vote_yes ~ region,
                      family = "binomial",
                      data = chile_data)

summary(vote_region)

Binary Logistic Regression: One Categorical Predictor


Call:
glm(formula = vote_yes ~ region, family = "binomial", data = chile_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.02667    0.23096  -0.115 0.908075    
regionN     -0.24382    0.25826  -0.944 0.345125    
regionC     -0.81542    0.24903  -3.274 0.001059 ** 
regionS     -0.34709    0.24427  -1.421 0.155331    
regionSA    -0.93211    0.24337  -3.830 0.000128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3129.1  on 2430  degrees of freedom
Residual deviance: 3078.3  on 2426  degrees of freedom
AIC: 3088.3

Number of Fisher Scoring iterations: 4

Exponentiate the coefficients

If you recall the formula, the results are expressed in Logit Probability. As we typically report the result in terms of Odd Ratios (OR), perform exponentiation on the coefficients.

library(gtsummary)
vote_region |> tbl_regression(exponentiate = TRUE) |>  
    bold_p()

Characteristic	OR¹	95% CI¹	p-value
region
M	—	—
N	0.78	0.47, 1.30	0.3
C	0.44	0.27, 0.72	0.001
S	0.71	0.44, 1.14	0.2
SA	0.39	0.24, 0.64	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Narrating the results

Possible interpretation for region C:

(The intercept isn’t typically interpreted as an odds ratio, so we’ll ignore that for now)

The result above indicates that the odds for people in region C to vote “Yes” are about 0.56 times the odds of people in region N (the reference category). i.e., There is a 43.54% decrease in odds for region C residents to vote “Yes” compared to region N residents. \(((0.5646 - 1) * 100 = -43.54)\)

Get the Rsquared

Let’s retrieve the Rsquared value:

# the default is McFadden
DescTools::PseudoR2(vote_region)

  McFadden 
0.01622922

DescTools::PseudoR2(vote_region, 
                    which = "CoxSnell")

  CoxSnell 
0.02067302

DescTools::PseudoR2(vote_region, 
                    which = "all")

       McFadden     McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson 
   1.622922e-02    1.303341e-02    2.067302e-02    2.855594e-02    2.046225e-02 
VeallZimmermann           Efron McKelveyZavoina            Tjur             AIC 
   3.635938e-02    2.102186e-02    2.723262e-02    2.102186e-02    3.088317e+03 
            BIC          logLik         logLik0              G2 
   3.117298e+03   -1.539159e+03   -1.564550e+03    5.078285e+01

Present both models in a table

Note that the R-squared is missing here, so be sure to put that in when you paste this table to your document.

library(huxtable)
huxreg("vote 'Yes' (model1)" = vote_age, "vote 'Yes' (model2)" = vote_region)

	vote 'Yes' (model1)	vote 'Yes' (model2)
(Intercept)	-1.167 ***	-0.027
	(0.121)	(0.231)
age	0.013 ***
	(0.003)
regionN		-0.244
		(0.258)
regionC		-0.815 **
		(0.249)
regionS		-0.347
		(0.244)
regionSA		-0.932 ***
		(0.243)
N	2431	2431
logLik	-1553.776	-1539.159
AIC	3111.552	3088.317
* p < 0.001; p < 0.01; * p < 0.05.

Let’s try this Logistic Regression exercise! (5 mins)

Create a regression model called vote_model that predicts the likelihood of voting “Yes” based on sex.

Code

vote_model <- glm(vote_yes ~ sex,
                  family = "binomial", 
                  data = chile_data) 

summary(vote_model)
exp(coefficients(vote_model))

What is Quarto? What is Markdown?

Markdown (Specifically, R Markdown)

Markdown is a lightweight markup language that provides a simple and readable way to write formatted text without using complex HTML or LaTeX. It is designed to make authoring content easy for everyone!
- Markdown files can be converted into HTML or other formats.
- Generic Markdown file will have .md extension.
R Markdown is an extension of Markdown that incorporates R code chunks and allows you to create dynamic documents that integrate text, code, and output (such as tables and plots).
- RMarkdown file will have .Rmd extension.

RMarkdown in action

How it all works

Illustration by Allison Horst (www.allisonhorst.com)

Quarto

Quarto is a multi-language, next-generation version of R Markdown from Posit and includes dozens of new features and capabilities while at the same being able to render most existing Rmd files without modification.

Illustration by Allison Horst (www.allisonhorst.com)

Quarto in action: R scripts + Markdown

R Scripts vs Quarto

R Scripts

Great for quick debugging, experiment
Preferred format if you are archiving your code to GitHub or data repository
More suitable for “production” tasks e.g. automating your data cleaning and processing, custom functions, etc.

Quarto

Great for report and presentation to showcase your research insights/process as it integrates code, narrative text, visualizations, and results.
Very handy when you need your report in multiple format, e.g. in Word and PPT.

Let’s create our first Quarto document!

Go to File > New File > Quarto Document
Put any title you like, and put your name as the author
Check HTML as the end result for now
Click on Create!
Optional: collapse the console and environment tab (bottom left and top right) to make it easier to view the quarto document and the output.
Quarto Tour + Hands On! (Open this cheatsheet on another tab if you’d like!). We will explore how to:
1. Add narrative text
2. Add code chunks
3. Add math formulas with LaTeX
4. Add citations (you need to have Zotero installed)

Rendering Quarto to HTML, Word, and PDF

You can change the final result of rendering in the YAML section of your document.

Rendering to HTML is the default option.
- You can also render as presentation (fun fact: my slides is made from Quarto!)
Rendering to Word: You have to have MS Word installed in your laptop
Rendering to PDF: If you encounter an error when converting your result to PDF, the faster (and easier) alternative is to render your doc to Word, and save to PDF from there.

Best Practices + More Resources

R Best Practices

Use <- for assigning values to objects.
- Only use = when passing values to a function parameter.
Do not alter your raw data; save your wrangled/cleaned data into a new file and keep it separate from the raw data.
Make use of R projects to organize your data and make it easier to send over to your collaborators.
- Having said that, when it comes to coding project, the best way to collaborate is using GitHub or similar platforms.
Whenever possible and makes sense for your project, follow the common convention when naming your objects, scripts, and functions. One guide that you can follow is Hadley Wickham’s tidyverse style guide.

Thank you for your participation 😄

All the best for your studies and academic journey! (manifesting excellent grades for everyone who attended the workshop)

Need help with R or Quarto? Please don’t hesitate to contact me at bellar@smu.edu.sg

Post-workshop survey

Please scan this QR code or click on the link below to fill in the post-workshop survey. It should not take more than 2-3 minutes.

Survey link: https://smusg.asia.qualtrics.com/jfe/form/SV_1TYE6ntFBi3lwbQ