About Bella:
Senior Librarian, Research & Data Services team, SMU Libraries.
Bachelor in Info Tech (IT), MSc in Info Studies from NTU.
Have been with SMU since the pandemic era (2021).
Have been doing this workshop since Aug 2023.
About Wei:
Principal Librarian, Instruction and Learning Services, SMU Libraries.
Have been with SMU over 16 years. This is my first R workshop!
About this workshop:
Live-coding format; code along with me!
Goal of workshop: to give you enough fundamentals (at least to the point that ChatGPT can’t bluff you so easily) and confidence to explore R on your own.
Don’t be afraid to ask for help! We are all here to learn.
The workshops are structured to follow this workflow when dealing with data
R: R is an open-source programming language that was developed for statistical analysis and visualization. The R community shares code and creates packages that work as shortcuts for common tasks.
RStudio: A software environment where we can interact more easily with the R language and scripts.
You will need to install both for this workshop. Go to https://posit.co/download/rstudio-desktop to download and install both if you have not done so. Remember: install R first, then RStudio.
Check out the course website for a step-by-step guide.
First, we need to be comfortable with the RStudio interface. We will use RStudio to write code, navigate the files on our computer, inspect the variables we create, and visualize the plots we generate.
RStudio layout
Working directory -> where R will look for files (scripts, data, etc). Stay organized with all files and folders related to a project stored in one place.
By default, it will be on your Desktop.
Best practice is to use R Project to organize your files and data into projects.
When using R Project, the working directory = project folder.
Go to File > New Project. Choose New Directory, then New Project.
Enter intro-r-socsci as the name for this new folder (or “directory”) and choose where you want to put this folder, e.g. Desktop or Documents if you are on Windows. This will be your working directory for the rest of the workshop!
data - we will save our raw data here. It’s best practice to keep the data here untouched.
data-output - if we need to modify raw data, store the modified version here.
fig-output - we will save all the graphics we create here!
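These folders can be created from the Files pane in RStudio. As an alternative, here is a minimal sketch that creates them from the console, assuming the working directory is already the project folder:

```r
# Confirm where R is currently looking for files
getwd()

# Create the three project folders (no warning if they already exist)
folders <- c("data", "data-output", "fig-output")
for (folder in folders) {
  dir.create(folder, showWarnings = FALSE)
}

# Check that all three folders are now present
dir.exists(folders)
```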
Don’t:
Don’t put your R projects inside your OneDrive folder as that may cause issues sometimes.
Folder/file names should avoid spaces, symbols, and special characters. Prefer folder/file names that are all in lower case.
Do:
Create a new R script: File > New File > R Script.
RStudio allows you to execute commands directly from the script editor by using the Ctrl + Enter shortcut (on Mac, Cmd + Return will work).
You can type commands directly into the console and press Enter to execute those commands, but they will be forgotten when you close the session. It is better to type the commands in the script editor and save the script. This way, you have a complete record and can easily replicate the results.
Note: RStudio does not autosave your progress, so remember to save from time to time!
In this line of code:
country_name <- "Singapore"
"Singapore" is a value. This can be a character, numeric, or boolean data type (more on this soon).
country_name is the object where we store this value, so that we can keep it to be used later.
<- is the assignment operator, which assigns the value to the object.
= also works for assignment, but generally in R, <- is the convention.
Keyboard shortcut for <-: Alt + - on Windows (Option + - on Mac).
Object name rules:
Can’t start with numbers
Case sensitive
No spaces
Avoid reserved words and the names of existing R functions
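A short sketch of assignment and the naming rules in practice. Apart from country_name, the object names and values here are made up for illustration:

```r
# Store a value in an object, then print it by typing its name
country_name <- "Singapore"
country_name

# Names are case sensitive: these are two different objects
survey_year <- 2021
Survey_Year <- 2022
survey_year == Survey_Year   # FALSE, because they hold different values

# Use underscores instead of spaces, and start the name with a letter
life_satisfaction_sgp <- 8
```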
Non-Continuous Data
Nominal/Categorical: Non-ordered, non-numerical data, used to represent qualitative attributes.
Ordinal: Ordered non-numerical data.
Discrete: Numerical data that can only take specific values (usually integers).
Binary: Nominal data with only two possible outcomes.
Continuous Data
Interval: Numerical data that can take any value within a range. It does not have a “true zero”.
Ratio: Numerical data that can take any value within a range. It has a “true zero”.
The four basic data types are characters, numeric, boolean, and integer. Let’s look at examples using our WVS survey variables:
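A minimal sketch of the four types. The variable names echo the survey columns, but the values (and the is_employed object) are assumptions for illustration:

```r
country_code <- "SGP"        # character: text, always wrapped in quotes
life_satisfaction <- 8.5     # numeric: numbers, with or without decimals
is_employed <- TRUE          # boolean (R calls this "logical"): TRUE or FALSE
birthyear <- 1990L           # integer: a whole number, marked with the L suffix
```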
To include comments in the code, use the # character.
Anything to the right of the # sign and up to the end of the line is treated as a comment and is ignored by R when executing.
It is good practice to use comments to make notes and explain your code.
You can use str() or typeof() to check the data type of an R object.
str() returns both the data type and the value.
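For instance, a small sketch with two objects whose values are assumed for illustration:

```r
life_satisfaction <- 9
country_code <- "SGP"

typeof(life_satisfaction)   # "double", which is how R stores numeric values
typeof(country_code)        # "character"

str(life_satisfaction)      # prints: num 9
str(country_code)           # prints: chr "SGP"
```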
You can do arithmetic operations in R. For example, let’s calculate average satisfaction scores:
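A minimal sketch of such a calculation. The two scores are assumed values, not taken from the dataset:

```r
life_satisfaction <- 9
financial_satisfaction <- 6

# Average of the two scores: the brackets make sure the sum happens first
average_satisfaction <- (life_satisfaction + financial_satisfaction) / 2
average_satisfaction   # 7.5
```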
Boolean operations in R are useful for filtering survey data. Before that, let’s look at how R evaluates simple TRUE/FALSE statements:
Is life_satisfaction greater than 8?
[1] TRUE
Is the country Singapore?
Is the country NOT Singapore?
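These three questions could be written as follows. This is a sketch that assumes country_code holds "SGP" and life_satisfaction holds 9:

```r
country_code <- "SGP"
life_satisfaction <- 9

life_satisfaction > 8    # greater than: TRUE
country_code == "SGP"    # equal to (two equal signs): TRUE
country_code != "SGP"    # not equal to: FALSE
```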
Sometimes, we may have multiple statements to evaluate. This is where the Boolean operators come in handy.
AND operations (both conditions must be TRUE). In R, it is represented by the ampersand &
Is the country New Zealand AND is the life satisfaction more than 8?
country_code == "NZL" is FALSE while life_satisfaction > 8 is TRUE.
The whole statement will return FALSE because not all conditions are TRUE.
OR operations (at least one condition must be TRUE). In R, it is represented by the pipe symbol |
Is the country New Zealand OR is the life satisfaction more than 8?
As long as one condition is met, this will be TRUE.
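Both operators side by side, as a sketch that again assumes country_code is "SGP" and life_satisfaction is 9:

```r
country_code <- "SGP"
life_satisfaction <- 9

# AND: FALSE, because the country is not New Zealand
country_code == "NZL" & life_satisfaction > 8

# OR: TRUE, because the satisfaction condition is met
country_code == "NZL" | life_satisfaction > 8
```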
A function is like a recipe in cooking.
It takes some ingredients (inputs) and uses a set of instructions to produce a result (output).
In R, a function is a pre-written set of instructions (a recipe) that performs a specific task. A function name is always followed by round brackets ().
Example: the round() function in R rounds a number to the nearest whole number by default.
round() is the “recipe”, while 3.1415926 is the “ingredient”.
Saving the result to an object:
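A minimal sketch of calling round() and keeping its result. The object name rounded_pi is a made-up name for illustration:

```r
# Call the function directly: the result is printed but not kept
round(3.1415926)   # 3

# Save the result to an object so it can be reused later
rounded_pi <- round(3.1415926)
rounded_pi         # 3
```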
Following the recipe analogy, arguments are the ingredients you provide to a function. The general form is function_name(arguments).
Some arguments are required, while others are optional (they have default values).
Each argument tells the function what to use or how to perform the task.
Example: Think of a bubble tea order as a function. The possible arguments/ingredients here are:
Tea - required ingredient
Milk - optional, the default is to include
Toppings - optional, the default choice is “pearls”
In R:
3.1415926 is the required argument (if this is not provided, the function will not run).
digits is an optional argument specifying how many decimal places to round to (the default is 0).
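The difference between leaving digits at its default and setting it can be sketched like this:

```r
# Only the required argument: digits falls back to its default of 0
round(3.1415926)               # 3

# Naming the optional argument to keep two decimal places
round(3.1415926, digits = 2)   # 3.14

# Arguments can also be matched by position, without the name
round(3.1415926, 2)            # 3.14
```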
You can open the help page for a function in R by putting ? in front of the function name.
E.g. if you want to find out more about the round function, you can run ?round in your R console (bottom left panel).
We will start with exploring 3 basic types of data structures in R:
Vector - can hold multiple values in a single variable/object.
Factor - Special data structure in R to handle categorical variables.
Data frame - De facto data structure for tabular data in R, and what we use for data processing, plotting, and statistics.
Other types (but we will not be covering these in this workshop):
Lists: A type of recursive vector
Matrices: A collection of elements of the same type in rows and columns
Think of vectors as a column in a dataset. It is the most common and basic data structure in R. (pretty much the workhorse of R!)
A vector can only contain 1 data type.
A vector can be created with the c() function.
Can add, remove, or change values in a vector.
Inspect vectors: typeof(), str(), length()
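A short sketch of creating and inspecting vectors. The values mirror the small hand-built survey example used in this workshop, and the object name mixed is made up for illustration:

```r
# Create two vectors with c()
countries <- c("SGP", "CAN", "NZL", "SGP", "CAN")
satisfaction_scores <- c(8, 7, 9, 6, 8)

# Inspect them
typeof(countries)             # "character"
str(satisfaction_scores)      # prints: num [1:5] 8 7 9 6 8
length(satisfaction_scores)   # 5

# A vector holds one data type only: mixed input is converted to a single type
mixed <- c(1, "two", TRUE)
typeof(mixed)                 # "character"
```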
Retrieve the first country in the vector
Retrieves the first three satisfaction scores
Retrieves the first and the third satisfaction scores
Update the first satisfaction score
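These four operations could look like the sketch below. The vector values and the replacement score of 7 are assumptions for illustration:

```r
countries <- c("SGP", "CAN", "NZL", "SGP", "CAN")
satisfaction_scores <- c(8, 7, 9, 6, 8)

countries[1]                   # first country: "SGP"
satisfaction_scores[1:3]       # first three scores: 8 7 9
satisfaction_scores[c(1, 3)]   # first and third scores: 8 9

# Update the first score by assigning to that position
satisfaction_scores[1] <- 7
satisfaction_scores            # 7 7 9 6 8
```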
Round brackets () are for running functions, like using a tool: mean() or sum().
Square brackets [] are for accessing specific parts of your data, where we pass the index number(s) of the element(s) we want.
Let’s find high satisfaction scores (above 7)!
We first create a boolean vector called criteria that keeps track of whether each item in satisfaction_scores fulfils our condition.
The condition is “value must be > 7”, e.g. if item 1 fulfils our condition, then item 1 is ‘marked’ as TRUE. Otherwise, it will be FALSE.
[1] FALSE FALSE TRUE FALSE TRUE
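A minimal sketch of this filtering pattern. The scores are assumed values chosen so that the boolean vector matches the output shown here:

```r
satisfaction_scores <- c(7, 7, 9, 6, 8)

# Step 1: mark each item TRUE or FALSE depending on the condition
criteria <- satisfaction_scores > 7
criteria                                       # FALSE FALSE TRUE FALSE TRUE

# Step 2: use the boolean vector inside square brackets to keep the TRUE items
satisfaction_scores[criteria]                  # 9 8

# The two steps can also be combined into one line
satisfaction_scores[satisfaction_scores > 7]   # 9 8
```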
Next, we pass criteria to satisfaction_scores inside square brackets to retrieve only the items that fulfil the condition, i.e. only the items marked TRUE in the criteria vector.
NA values indicate null values, or the absence of a value (0 is still a value!)
Summary functions like mean() need you to include an argument called na.rm to say how missing values should be handled.
Survey data often contains missing values (NA):
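A small sketch of how NA affects a summary function. The vector name satisfaction_with_na and its values are made up for illustration:

```r
satisfaction_with_na <- c(8, NA, 9, 6, NA)

# Without na.rm, a single NA makes the whole result NA
mean(satisfaction_with_na)                 # NA

# na.rm = TRUE removes the NA values before calculating
mean(satisfaction_with_na, na.rm = TRUE)   # 7.666667

# Count how many values are missing
sum(is.na(satisfaction_with_na))           # 2
```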
De facto data structure for tabular data in R, and what we use for data processing, plotting, and statistics.
Similar to spreadsheets - a rectangular collection of variables (columns) and observations (rows)!
You can create it by hand like so (though more commonly you will load data from an external source):
survey_data <- data.frame(
  country = c("SGP", "CAN", "NZL", "SGP", "CAN"),
  life_satisfaction = c(8, 7, 9, 6, 8),
  employment = c("Full time", "Student", "Part time", "Retired", "Full time")
)
print(survey_data)
  country life_satisfaction employment
1     SGP                 8  Full time
2     CAN                 7    Student
3     NZL                 9  Part time
4     SGP                 6    Retired
5     CAN                 8  Full time
For this workshop, we will try loading a dataset from a file into a dataframe!
Go to the course website and go to the ‘Dataset’ tab to download the data file and information about this WVS data
Download this CSV and save it under your data folder in your R project!
Packages are collections of R functions, datasets, etc. Packages extend the functionality of R.
Popular packages: tidyverse, caret, shiny, etc.
Installation (you only need to do this once): install.packages("package name")
Loading packages (you need to run this every time you restart RStudio): library(package name) - let’s try to load tidyverse as we will use this to load our CSV into a dataframe!
Let’s load our actual World Values Survey dataset using the read_csv function.
Before using read_csv, install and load the tidyverse package. You only need to install the package once, but you need to load it in every session before you use it.
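A hedged sketch of the loading step. The file name data/wvs_data.csv is a placeholder (use the name of the CSV you downloaded). The surrounding check is only there so the snippet does not fail when the package or file is missing. In the workshop you would simply run the lines inside the braces:

```r
# install.packages("tidyverse")   # run once, then leave commented out

# Placeholder path: replace with the name of the file in your data folder
csv_path <- "data/wvs_data.csv"

if (requireNamespace("readr", quietly = TRUE) && file.exists(csv_path)) {
  library(readr)                   # readr is part of tidyverse; library(tidyverse) loads it too
  wvs_data <- read_csv(csv_path)   # read the CSV into a data frame (a tibble)
  head(wvs_data)                   # preview the first 6 rows
}
```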
# A tibble: 6 × 16
  country        ID family_importance friends_importance leisure_importance
  <chr>       <dbl>             <dbl>              <dbl>              <dbl>
1 CAN     124070003                 1                  1                  1
2 CAN     124070004                 1                  1                  2
3 CAN     124070005                 2                  2                  2
4 CAN     124070006                 1                  2                  2
5 CAN     124070008                 1                  2                  2
6 CAN     124070009                 1                  2                  2
# ℹ 11 more variables: work_importance <dbl>, freedom <dbl>,
#   life_satisfaction <dbl>, financial_satisfaction <dbl>, religiousity <chr>,
#   political_scale <dbl>, sex <chr>, birthyear <dbl>, age <dbl>,
#   marital_status <chr>, employment <chr>
Make sure to save the CSV file in your data folder! (No auto save)
Some basic dataframe functions before we move on to data wrangling next week:
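A sketch of a few commonly used functions, shown on the small hand-built survey_data so that it runs without the CSV. The same calls work on a data frame loaded from a file:

```r
survey_data <- data.frame(
  country = c("SGP", "CAN", "NZL", "SGP", "CAN"),
  life_satisfaction = c(8, 7, 9, 6, 8),
  employment = c("Full time", "Student", "Part time", "Retired", "Full time")
)

head(survey_data, 3)                     # first 3 rows
dim(survey_data)                         # number of rows and columns: 5 3
names(survey_data)                       # column names
str(survey_data)                         # structure: type and preview of each column
summary(survey_data$life_satisfaction)   # min, quartiles, mean, and max of one column

# The $ sign pulls out a single column as a vector
mean(survey_data$life_satisfaction)      # 7.6
```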
Special data structure in R to deal with categorical data.
Look (and often behave) like character vectors, but are actually stored as integer vectors.
Can only contain a predefined set of values, known as levels. By default, R sorts levels in alphabetical order.
Can be ordered (ordinal) or unordered (nominal).
May look like a normal vector at first glance, so use str() to check.
Most common application: when we want to specify a column in our dataframe as a column of categories
In our example, country should be a factor!
Since there is no inherent order within country names, our categories will be unordered:
[1] "CAN" "NZL" "SGP"
If we check the structure, the country column should be a factor instead of characters now:
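A minimal sketch of the conversion, applied to a small hand-built survey_data as a stand-in for the full dataset, so that it runs on its own:

```r
survey_data <- data.frame(
  country = c("SGP", "CAN", "NZL", "SGP", "CAN"),
  life_satisfaction = c(8, 7, 9, 6, 8)
)

# Convert the character column into an unordered factor
survey_data$country <- factor(survey_data$country)

# Levels are the unique categories, sorted alphabetically by default
levels(survey_data$country)   # "CAN" "NZL" "SGP"

# The structure now reports a factor, stored as integer codes behind the labels
str(survey_data$country)      # prints: Factor w/ 3 levels "CAN","NZL","SGP": 3 1 2 3 1
```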
Technically, in our data set, there are no variables that should be ordered. But for example purposes, let’s say that we want to order the categorical values in religiousity from non-believer to believer.
wvs_data$religiousity <- factor(wvs_data$religiousity,
                                levels = c("An atheist", "Not a religious person",
                                           "A religious person", "Don't know"),
                                ordered = TRUE)
# check the category levels and structure
levels(wvs_data$religiousity)
[1] "An atheist" "Not a religious person" "A religious person"
[4] "Don't know"
str(wvs_data$religiousity)
Ord.factor w/ 4 levels "An atheist"<"Not a religious person"<..: 3 3 1 3 2 1 1 3 2 3 ...
Types of messages
Error: Fatal error in your code that prevented it from being run through successfully. Need to fix it for the code to run.
Warning: Non-fatal errors (don’t stop the code from running, but this is a potential problem that you should know about).
Message: Helpful information about the code you just ran (you can usually ignore these messages).
Check. Did you…
Set your working directory?
Check for missing commas (,), parentheses ()?
Check your spelling?
Spelling : Punctuation <- meaan(c(1, 2, 3, 4)) or print(Punctuatioon)
Punctuation : sum(10?20) or Punctuation <- sum(c(10, 20)))
Capitalization : sUm(c(5, 10, 15))
In text indicators : X + Y or #taking notes or “name” vs name
R & RStudio: R is a programming language for statistical computing and graphics, while RStudio is an environment that makes it easier to write, run, and manage R code.
Working Directory: A folder on your computer where R reads and saves files.
Data Type: Types of value an object can hold, such as numeric, character, and Boolean.
Data Structure: Data structures organize and store data in R, such as vectors, factors, and data frames.
Functions: Reusable code that performs a specific task. Functions are called with arguments.
Packages: Collections of functions and data to perform tasks. Need to install and load packages.
Next Session: Data wrangling with dplyr and tidyr packages - we’ll learn how to:
Filter survey responses by country
Calculate average satisfaction scores by demographic groups
Create new variables from existing ones
Handle missing values in survey data
And much more!
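Several ways to add items to a vector:
satisfaction_scores <- c(satisfaction_scores, 7)                   # add one item at the end
satisfaction_scores <- c(satisfaction_scores, 8, 9, 10)            # add several items at the end
satisfaction_scores <- c(8, satisfaction_scores)                   # add an item at the beginning
satisfaction_scores <- append(satisfaction_scores, 9, after = 2)   # insert an item after the 2nd position
Several ways to remove items from a vector:
satisfaction_scores <- satisfaction_scores[-c(2, 4)]                   # drop the 2nd and 4th items
satisfaction_scores <- satisfaction_scores[satisfaction_scores <= 7]   # keep only scores of 7 or below
satisfaction_scores <- na.omit(satisfaction_scores)                    # drop missing values (NA)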