About me:
Senior Librarian, Research & Data Services team, SMU Libraries.
Bachelor in Info Tech (IT), MSc in Info Studies from NTU.
Have been with SMU since the pandemic era (2021).
Have been doing this workshop since Aug 2023.
About this workshop:
Live-coding format; code along with me!
Goal of workshop: to give you enough fundamentals (at least to the point that ChatGPT can’t bluff you so easily) and confidence to explore R on your own.
Don’t be afraid to ask for help! We are all here to learn.
The workshops are structured to follow this workflow when dealing with data
R: The programming language and the software that interprets the R script
RStudio: An IDE (Integrated Development Environment) that we use to interact more easily with R language and scripts.
You will need to install both for this workshop. Go to https://posit.co/download/rstudio-desktop to download and install both if you have not done so.
Check out the course website for a step-by-step guide.
Working directory -> where R will look for files (scripts, data, etc).
By default, it will be on your Desktop
Best practice is to use R Project to organize your files and data into projects.
When using R Project, the working directory = project folder.
Go to File > New project. Choose New directory, then New project.
Enter intro-r-socsci as the name for this new folder (or “directory”) and choose where you want to put this folder, e.g. Desktop or Documents if you are on Windows. This will be your working directory for the rest of the workshop!
Next, let’s create 3 folders inside our working directory:
data - we will save our raw data here. It’s best practice to keep the data here untouched.
data-output - if we need to modify raw data, store the modified version here.
fig-output - we will save all the graphics we created here!
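If you prefer to create the folders from the R console instead of your file explorer, here is a minimal sketch. It assumes your working directory is already the project folder; the folder names are the three listed above.

```r
# Show the current working directory (the project folder when using an R Project)
getwd()

# Folder names taken from the list above
folders <- c("data", "data-output", "fig-output")

# Create each folder only if it does not exist yet
for (folder in folders) {
  if (!dir.exists(folder)) {
    dir.create(folder)
  }
}

# Check that all three folders are now present
dir.exists(folders)
```

Running it a second time is harmless, because existing folders are skipped.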
Warning
Don’t put your R projects inside your OneDrive folder as that may cause issues sometimes.
Create a new R script - File > New File > R script.
Note: RStudio does not autosave your progress, so remember to save from time to time!
In this line of code:
country_name <- "Singapore"
"Singapore" is a value. This can be either a character, numeric, or boolean data type (more on this soon).
country_name is the object where we store this value, so that we can use it later.
<- is the assignment operator, which assigns the value to the object.
You can also use =, but generally in R, <- is the convention.
Keyboard shortcut for <-: Alt + - in Windows (Option + - in Mac).
Non-Continuous Data
Nominal/Categorical: Non-ordered, non-numerical data, used to represent qualitative attributes.
Ordinal: Ordered non-numerical data.
Discrete: Numerical data that can only take specific values (usually integers).
Binary: Nominal data with only two possible outcomes.
Continuous Data
Interval: Numerical data that can take any value within a range. It does not have a “true zero”.
Ratio: Numerical data that can take any value within a range. It has a “true zero”.
The four basic data types are characters, numeric, boolean, and integer. Let’s look at examples using our WVS survey variables:
You can use str or typeof to check the data type of an R object.
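Here is a small sketch of the four data types and how to check them. The object names and values are made up for illustration and are not taken from the WVS dataset. Note that R calls the boolean type "logical", and that plain numbers are stored as "double" unless you add the L suffix.

```r
# Hypothetical survey-style values, one per data type
country_name <- "Singapore"   # character
life_satisfaction <- 7.5      # numeric (stored as double)
is_employed <- TRUE           # boolean (R calls this "logical")
birth_year <- 1990L           # integer (note the L suffix)

# typeof() returns the name of the data type
typeof(country_name)       # "character"
typeof(life_satisfaction)  # "double"
typeof(is_employed)        # "logical"
typeof(birth_year)         # "integer"

# str() prints the type together with the value
str(country_name)
str(birth_year)
```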
You can do arithmetic operations in R. For example, let’s calculate average satisfaction scores:
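As a minimal sketch, here is an average computed by hand with arithmetic operators. The three scores are hypothetical values, not real survey responses.

```r
# Hypothetical satisfaction scores from three respondents
score_1 <- 8
score_2 <- 6
score_3 <- 7

# Add the scores, then divide by the number of respondents
total_score <- score_1 + score_2 + score_3
average_score <- total_score / 3
print(average_score)  # 7

# Other common operators
10 - 4   # subtraction
3 * 2    # multiplication
2 ^ 3    # exponent
7 %% 2   # remainder after division
```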
Boolean operations in R are useful for filtering survey data:
AND operations (both conditions must be TRUE)
# Check if someone is both highly satisfied (>8) AND from Singapore
(life_satisfaction > 8) & (country_code == "SGP")
OR operations (at least one condition must be TRUE)
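The sketch below shows AND next to OR so you can compare the results. The values assigned to life_satisfaction and country_code are assumptions chosen for illustration.

```r
# Hypothetical respondent: highly satisfied, but not from Singapore
life_satisfaction <- 9
country_code <- "CAN"

# AND: TRUE only if both conditions are TRUE
both_conditions <- (life_satisfaction > 8) & (country_code == "SGP")
print(both_conditions)   # FALSE

# OR: TRUE if at least one condition is TRUE
either_condition <- (life_satisfaction > 8) | (country_code == "SGP")
print(either_condition)  # TRUE

# NOT: flips TRUE to FALSE and vice versa
not_singapore <- !(country_code == "SGP")
print(not_singapore)     # TRUE
```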
Functions take inputs (arguments/parameters), process them, and return a result. For example, calculating the mean satisfaction score:
Saving the result to an object:
In a call such as round(123.456, digits = 2), round is the function, while 123.456 and digits = 2 are the arguments/parameters.
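A short sketch of calling functions and saving their results. The scores are hypothetical; mean and round are both base R functions.

```r
# Hypothetical satisfaction scores
satisfaction_scores <- c(8, 7, 9, 6, 8)

# Call a function: the result is printed but not kept
mean(satisfaction_scores)

# Save the result to an object so it can be reused later
mean_satisfaction <- mean(satisfaction_scores)
print(mean_satisfaction)  # 7.6

# round() takes the number to round and the number of decimal places
rounded_value <- round(123.456, digits = 2)
print(rounded_value)      # 123.46

# Functions can use saved objects as arguments
rounded_mean <- round(mean_satisfaction, digits = 0)
print(rounded_mean)       # 8
```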
You can open the help page for a function by prepending ? to the function name.
E.g. if you want to find out more about the round function, you can run ?round in your R console (bottom left panel).
Packages are collections of R functions, datasets, etc. Packages extend the functionality of R.
Popular packages: tidyverse, caret, shiny, etc.
Installation (you only need to do this once): install.packages("package name")
Loading packages (you need to run this every time you restart RStudio): library(package name) - let’s try to load tidyverse!
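Here is one way the install-then-load steps could look for tidyverse. The requireNamespace check is an addition for this sketch (it is not part of the original slides); it lets the script print a friendly message instead of stopping with an error if the package is missing.

```r
# Step 1: install once per computer (needs an internet connection).
# Uncomment the next line the first time you use the package.
# install.packages("tidyverse")

# Step 2: load the package in every new R session.
# requireNamespace() only checks whether the package is installed.
tidyverse_installed <- requireNamespace("tidyverse", quietly = TRUE)

if (tidyverse_installed) {
  library(tidyverse)
} else {
  message("tidyverse is not installed yet; run install.packages(\"tidyverse\") first.")
}
```

Note the quotation marks: install.packages needs the package name in quotes, while library accepts it without.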
Vectors can store multiple values. Let’s create vectors using our survey data:
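A minimal sketch of creating vectors with c(). The satisfaction scores are assumed values, picked so that the third and fifth scores are 9 and 8; the country codes reuse the ones that appear in the dataframe example later in this session.

```r
# Numeric vector of hypothetical satisfaction scores
satisfaction_scores <- c(7, 6, 9, 5, 8)

# Character vector of country codes
country_codes <- c("SGP", "CAN", "NZL", "SGP", "CAN")

# How many items does the vector hold?
length(satisfaction_scores)        # 5

# Pick items by position (R starts counting at 1)
first_score <- satisfaction_scores[1]
print(first_score)                 # 7
satisfaction_scores[2:3]           # 6 9

# Functions work on the whole vector at once
average_score <- mean(satisfaction_scores)
print(average_score)               # 7
```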
Let’s find high satisfaction scores (above 7):
# Create boolean vector for our condition
high_satisfaction <- satisfaction_scores > 7
print(high_satisfaction)
[1] FALSE FALSE TRUE FALSE TRUE
# Keep only the scores where the condition is TRUE
satisfaction_scores[high_satisfaction]
[1] 9 8
Shortened version:
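The shortened form puts the condition directly inside the square brackets, so no intermediate boolean vector is needed. A sketch, using assumed scores that agree with the output shown above:

```r
# Hypothetical scores (third and fifth are above 7)
satisfaction_scores <- c(7, 6, 9, 5, 8)

# Condition written directly inside the square brackets
high_scores <- satisfaction_scores[satisfaction_scores > 7]
print(high_scores)  # 9 8
```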
NA values indicate missing values, i.e. the absence of a value (0 is still a value!)
Summary functions like mean need you to specify in the arguments how NA values should be handled.
Survey data often contains missing values (NA):
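A small sketch of how NA affects a summary function, using made-up scores. The na.rm argument of mean tells R to drop missing values before calculating.

```r
# Hypothetical scores where two respondents skipped the question
scores_with_na <- c(8, NA, 9, 6, NA)

# Without handling NA, the result is NA
mean_with_na <- mean(scores_with_na)
print(mean_with_na)     # NA

# na.rm = TRUE removes NA values before calculating
mean_without_na <- mean(scores_with_na, na.rm = TRUE)
print(mean_without_na)  # (8 + 9 + 6) / 3

# is.na() flags the missing values; sum() counts the TRUEs
is.na(scores_with_na)
n_missing <- sum(is.na(scores_with_na))
print(n_missing)        # 2
```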
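Several ways to add items to a vector:
# Add one item to the end
satisfaction_scores <- c(satisfaction_scores, 7)
# Add several items to the end
satisfaction_scores <- c(satisfaction_scores, 8, 9, 10)
# Add an item to the beginning
satisfaction_scores <- c(8, satisfaction_scores)
# Insert 9 after the 2nd position
satisfaction_scores <- append(satisfaction_scores, 9, after = 2)
Several ways to remove items from a vector:
# Remove the 2nd and 4th items
satisfaction_scores <- satisfaction_scores[-c(2, 4)]
# Keep only scores of 7 or below
satisfaction_scores <- satisfaction_scores[satisfaction_scores <= 7]
# Drop NA values
satisfaction_scores <- na.omit(satisfaction_scores)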
Factors are perfect for categorical survey variables:
Unordered (Nominal):
employment_factor <- factor(c("Full time", "Part time", "Student", "Retired", "Student"))
str(employment_factor)
Factor w/ 4 levels "Full time","Part time",..: 1 2 4 3 4
Ordered (Ordinal):
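A sketch of an ordered factor. The variable name and the Low/Medium/High categories are hypothetical; the point is that levels sets the order and ordered = TRUE tells R that the order is meaningful.

```r
# Hypothetical ordinal responses
satisfaction_level <- factor(
  c("Low", "High", "Medium", "High", "Low"),
  levels = c("Low", "Medium", "High"),  # from lowest to highest
  ordered = TRUE
)

# str() shows the levels joined by "<" for ordered factors
str(satisfaction_level)

# Ordered factors can be compared against a level
above_low <- satisfaction_level > "Low"
print(above_low)  # FALSE TRUE TRUE TRUE FALSE
```

Without the levels argument, R would sort the categories alphabetically (High, Low, Medium), which is rarely the order you want.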
Create a small dataframe with survey responses:
survey_data <- data.frame(
  country = c("SGP", "CAN", "NZL", "SGP", "CAN"),
  life_satisfaction = c(8, 7, 9, 6, 8),
  employment = c("Full time", "Student", "Part time", "Retired", "Full time")
)
print(survey_data)
  country life_satisfaction employment
1     SGP                 8  Full time
2     CAN                 7    Student
3     NZL                 9  Part time
4     SGP                 6    Retired
5     CAN                 8  Full time
For this workshop, we will try loading a dataset from a file.
Go to the course website and go to the ‘Dataset’ tab to download the data file and information about this WVS data
Download this CSV and save it under your data folder in your R project!
Let’s load our actual World Values Survey dataset:
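Below is a sketch of reading a CSV file into a dataframe. It makes two assumptions: it uses base R's read.csv so it runs without extra packages (read_csv from the tidyverse works in a similar way), and it writes a tiny stand-in CSV to a temporary file so the example runs on its own. In the workshop, you would point to the CSV you saved in your data folder instead.

```r
# Stand-in CSV written to a temporary file, for illustration only.
# In your project, use the path to the downloaded file in your data folder.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "country,life_satisfaction,employment",
    "SGP,8,Full time",
    "CAN,7,Student",
    "NZL,9,Part time"
  ),
  csv_path
)

# Read the CSV into a dataframe; the first row becomes the column names
wvs_data <- read.csv(csv_path)

# Quick check that the data loaded as expected
print(wvs_data)
```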
Some basic dataframe functions before we move on to data wrangling next week:
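A sketch of some commonly used base R functions for inspecting a dataframe. It reuses the small survey_data dataframe from earlier in this session so that it runs on its own; the same functions work on the full WVS dataset.

```r
# Small dataframe from earlier in this session
survey_data <- data.frame(
  country = c("SGP", "CAN", "NZL", "SGP", "CAN"),
  life_satisfaction = c(8, 7, 9, 6, 8),
  employment = c("Full time", "Student", "Part time", "Retired", "Full time")
)

head(survey_data, n = 3)   # first 3 rows
dim(survey_data)           # number of rows and columns: 5 3
nrow(survey_data)          # number of rows
ncol(survey_data)          # number of columns
names(survey_data)         # column names
str(survey_data)           # structure: type of each column
summary(survey_data)       # summary statistics per column

# Use $ to pull out one column as a vector
mean_satisfaction <- mean(survey_data$life_satisfaction)
print(mean_satisfaction)   # 7.6
```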
Next Session: Data wrangling with dplyr and tidyr packages - we’ll learn how to:
Filter survey responses by country
Calculate average satisfaction scores by demographic groups
Create new variables from existing ones
Handle missing values in survey data
And much more!