Introduction to R and RStudio

Bella Ratmelia

Welcome!

Preamble

  • About me:

    • Senior Librarian, Research & Data Services team, SMU Libraries.

    • Bachelor in Info Tech (IT), MSc in Info Studies from NTU.

    • Have been with SMU since the pandemic era (2021).

    • Have been doing this workshop since Aug 2023.

  • About this workshop:

    • Live-coding format; code along with me!

    • Goal of workshop: to give you enough fundamentals (at least to the point that ChatGPT can’t bluff you so easily) and confidence to explore R on your own.

    • Don’t be afraid to ask for help! We are all here to learn.

The outline for these workshops

The workshops are structured to follow this workflow when dealing with data

The outline for these workshops (explained)

  1. Import data into R, which means taking data (stored in a file, accessed via an API, etc.) and loading it into a dataframe in R.
  2. Tidy the imported data.
    • Tidy = storing it in a consistent form that matches the semantics of the dataset.
    • Tidy data = each column is a variable, each row is an observation
  3. Once the data is tidy, we can transform it. Transformation includes:
    • narrowing in on observations of interest (like all people in one city or all data from the last year)
    • creating new variables that are functions of existing variables (like computing speed from distance and time)
    • calculating a set of summary statistics (like counts or means).
  4. Once we have tidy data with the info we need, we can visualize it and model it.
  5. Communicate the result. It doesn’t matter how well your models and visualizations have led you to understand the data unless you can also communicate your results to others.
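
As a rough preview of the transform step, here is a minimal base R sketch. The `trips` dataframe and its column names are made up for illustration; they are not part of the workshop dataset.

```r
# Hypothetical mini-dataset standing in for imported, tidy data:
# each column is a variable, each row is an observation
trips <- data.frame(
  city = c("Singapore", "Singapore", "Auckland", "Toronto"),
  distance_km = c(10, 24, 15, 30),
  time_h = c(0.5, 1.0, 0.5, 1.5)
)

# Narrow in on observations of interest (all trips in one city)
sg_trips <- trips[trips$city == "Singapore", ]

# Create a new variable from existing ones (speed from distance and time)
trips$speed_kmh <- trips$distance_km / trips$time_h

# Calculate summary statistics (a mean and a count per city)
mean_speed <- mean(trips$speed_kmh)
trips_per_city <- table(trips$city)

print(mean_speed)
print(trips_per_city)
```

We will do the same kind of steps with dedicated packages in later sessions; base R is used here only to keep the sketch self-contained.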

What is R? What is RStudio?

R: The programming language and the software that interprets the R script

RStudio: An IDE (Integrated Development Environment) that we use to interact more easily with R language and scripts.

You will need to install both for this workshop. Go to https://posit.co/download/rstudio-desktop to download and install both if you have not done so.

Check out the course website for a step-by-step guide.

A Tour of RStudio

RStudio layout

Working Directory

  • Working directory -> where R will look for files (scripts, data, etc).

    • By default, it will be on your Desktop

    • Best practice is to use R Project to organize your files and data into projects.

    • When using R Project, the working directory = project folder.
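
A small sketch of how to check the working directory from the console. The file name `my-file.csv` is a hypothetical placeholder, not a file from this workshop.

```r
# Print the current working directory
print(getwd())

# Build a path relative to the working directory
# ("my-file.csv" is a made-up file name for illustration)
data_path <- file.path("data", "my-file.csv")
print(data_path)

# Check whether the file exists before trying to read it
print(file.exists(data_path))
```

Because the path is relative, the same script keeps working if the whole project folder is moved to another location or computer.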

Creating the project for this workshop

  1. Go to File > New project. Choose New directory, then New project

  2. Enter intro-r-socsci as the name for this new folder (or “directory”) and choose where you want to put this folder, e.g. Desktop or Documents if you are on Windows. This will be your working directory for the rest of the workshop!

  3. Next, let’s create 3 folders inside our working directory:

    • data - we will save our raw data here. It’s best practice to keep the data here untouched.

    • data-output - if we need to modify raw data, store the modified version here.

    • fig-output - we will save all the graphics we created here!
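
If you prefer the console over the Files pane, the three folders can also be created with code. This is just one possible way, assuming the working directory is already the project folder.

```r
# Folder names used in this workshop
folders <- c("data", "data-output", "fig-output")

# Create each folder inside the current working directory;
# showWarnings = FALSE keeps this quiet if a folder already exists
for (folder in folders) {
  dir.create(folder, showWarnings = FALSE)
}

# Confirm that all three folders now exist
print(dir.exists(folders))
```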

Warning

Don’t put your R projects inside your OneDrive folder as that may cause issues sometimes.

Let’s Code!

Create a new R script - File > New File > R script.

Note: RStudio does not autosave your progress, so remember to save from time to time!

R Objects and Values

In this line of code:

country_name <- "Singapore"
  • "Singapore" is a value. This can be either a character, numeric, or boolean data type. (more on this soon)
  • country_name is the object where we store this value. This is so that we can keep this value to be used later.
  • <- is the assignment operator to assign the value to the object.
    • You can also use =, but generally in R, <- is the convention.
    • Keyboard shortcut: Alt + - in Windows (Option + - in Mac)

Refresher: Quantitative Data Types

  • Non-Continuous Data

    • Nominal/Categorical: Non-ordered, non-numerical data, used to represent qualitative attributes.

      • Example: nationality, neighborhood, employment status
    • Ordinal: Ordered non-numerical data.

      • Example: Nutri-grade ratings, frequency of exercise (daily, weekly, bi-weekly)
    • Discrete: Numerical data that can only take specific values (usually integers).

      • Example: Shoe size, clothing size
    • Binary: Nominal data with only two possible outcomes.

      • Example: pass/fail, yes/no, survive/not survive
  • Continuous Data

    • Interval: Numerical data that can take any value within a range. It does not have a “true zero”.

      • Example: Celsius scale. Temperature of 0 C does not represent absence of heat.
    • Ratio: Numerical data that can take any value within a range. It has a “true zero”.

      • Example: Annual income. An annual income of 0 represents no income.

Data Types in R

The four basic data types are character, numeric, boolean (logical), and integer. Let’s look at examples using our WVS survey variables:

country_code <- "SGP" # Character
life_satisfaction <- 8.5 # Numeric (also sometimes called Double)
is_religious <- TRUE # Boolean/Logical (true/false)
birth_year <- 1990L # Integer (whole numbers)

Checking data type of a variable

You can use str or typeof to check the data type of an R object.

typeof(country_code)
[1] "character"
str(is_religious) 
 logi TRUE
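
A few more checks you can try, as a small sketch. `class()` and `is.numeric()` are base R functions that were not shown above.

```r
life_satisfaction <- 8.5
birth_year <- 1990L

type_dbl <- typeof(life_satisfaction)  # "double"
type_int <- typeof(birth_year)         # "integer"

# class() gives the more general label: doubles are reported as "numeric"
class_dbl <- class(life_satisfaction)  # "numeric"

# is.numeric() is TRUE for both doubles and integers
both_numeric <- is.numeric(life_satisfaction) & is.numeric(birth_year)  # TRUE

# Without the L suffix, a whole number is still stored as a double
type_plain <- typeof(1990)             # "double"

print(c(type_dbl, type_int, class_dbl, type_plain))
print(both_numeric)
```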

Arithmetic operations in R

You can do arithmetic operations in R. For example, let’s calculate average satisfaction scores:

(8 + 7 + 9) / 3  # Average of three satisfaction scores
[1] 8
2025 - 1990  # Calculate age from birth year
[1] 35
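
Besides `+`, `-`, and `/`, R has a few other arithmetic operators worth knowing. The numbers below are arbitrary examples.

```r
power     <- 2^3               # exponent: 8
quotient  <- 17 %/% 5          # integer division: 3
remainder <- 17 %% 5           # remainder (modulo): 2
decades   <- (2025 - 1990) / 10  # parentheses are evaluated first: 3.5

print(c(power, quotient, remainder, decades))
```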

Boolean operations in R

Boolean operations in R are useful for filtering survey data:

AND operations (both conditions must be TRUE)

# Check if someone is both highly satisfied (>8) AND from Singapore
(life_satisfaction > 8) & (country_code == "SGP")

OR operations (at least one condition must be TRUE)

# Check if someone is either married OR living together as married
marital_status <- "Married"  # example value so the line below can run
marital_status == "Married" | marital_status == "Living together as married"
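
Here is a self-contained sketch that you can run as-is. The value given to `marital_status` is a made-up example, and `!` (NOT) and `%in%` are base R operators not covered above.

```r
life_satisfaction <- 8.5
country_code <- "SGP"
marital_status <- "Single"  # hypothetical example value

# AND: both conditions must be TRUE
and_result <- (life_satisfaction > 8) & (country_code == "SGP")  # TRUE

# OR: at least one condition must be TRUE
or_result <- marital_status == "Married" |
  marital_status == "Living together as married"                 # FALSE

# NOT: flips TRUE to FALSE and vice versa
not_result <- !(country_code == "SGP")                            # FALSE

# %in% is a shorter way to write several OR comparisons
in_result <- marital_status %in% c("Married", "Living together as married")  # FALSE

print(c(and_result, or_result, not_result, in_result))
```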

Functions in R

Functions take inputs (arguments/parameters), process them, and return a result. For example, calculating the mean satisfaction score:

satisfaction_scores <- c(7.5, 8.0, 6.5, 9.0)
round(mean(satisfaction_scores), digits = 1)
[1] 7.8

Saving the result to an object:

avg_satisfaction <- round(mean(satisfaction_scores), digits = 1)
print(avg_satisfaction)
[1] 7.8

In the example above, round is the function. mean(satisfaction_scores) and digits = 1 are the arguments/parameters. Note that mean is itself a function, and its result is passed on to round.
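
To see how arguments work, here is a small sketch using round on a made-up number, 123.456.

```r
# Argument passed by name
by_name <- round(123.456, digits = 2)   # 123.46

# Argument passed by position (digits is the second argument of round)
by_position <- round(123.456, 2)        # 123.46

# If digits is left out, round uses its default of 0
default_digits <- round(123.456)        # 123

print(c(by_name, by_position, default_digits))
```

Naming the argument is a bit more typing, but it makes the code easier to read later.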

How do I find out more about a particular function?

You can open the help page for a function by prepending ? to the function name.

E.g. if you want to find out more about the round function, you can run ?round in your R console (bottom left panel)

Packages in R

  • Packages are collections of R functions, datasets, etc. Packages extend the functionality of R.

    • (The closest analogy I can think of is browser add-ons.)
  • Popular packages: tidyverse, caret, shiny, etc.

  • Installation (you only need to do this once): install.packages("package name")

  • Loading packages (you need to run this every time you restart RStudio): library(package name) - let’s try to load tidyverse!

Data Structures in R: Vectors

Vectors can store multiple values. Let’s create vectors using our survey data:

countries <- c("CAN", "NZL", "SGP", "CAN", "SGP")
satisfaction_scores <- c(8, 7, 9, 6, 8)
employment_status <- c("Full time", "Student", "Part time", "Retired", "Full time")
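
A few things worth trying on vectors, as a small sketch. `length()` and `unique()` are base R functions not shown above.

```r
countries <- c("CAN", "NZL", "SGP", "CAN", "SGP")
satisfaction_scores <- c(8, 7, 9, 6, 8)

# How many items are in the vector?
n_countries <- length(countries)               # 5

# Which distinct values appear?
distinct_countries <- unique(countries)        # "CAN" "NZL" "SGP"

# Arithmetic is applied to every item at once
scores_plus_one <- satisfaction_scores + 1     # 9 8 10 7 9

# A vector holds only one data type: mixing numbers and text
# turns everything into character
mixed <- c(8, "seven", 9)
mixed_type <- typeof(mixed)                    # "character"

print(n_countries)
print(distinct_countries)
print(scores_plus_one)
print(mixed_type)
```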

Vector Manipulations: Retrieve and update items

# retrieve the first country in the vector
countries[1]
[1] "CAN"
# retrieves the first three satisfaction scores
satisfaction_scores[1:3]
[1] 8 7 9
# update the first satisfaction score
satisfaction_scores[1] <- 7
print(satisfaction_scores)
[1] 7 7 9 6 8

Vector Manipulations: Retrieve items based on criteria

Let’s find high satisfaction scores (above 7):

# Create boolean vector for our condition
high_satisfaction <- satisfaction_scores > 7
print(high_satisfaction)
[1] FALSE FALSE  TRUE FALSE  TRUE
# Use the boolean vector to filter satisfaction scores
satisfaction_scores[high_satisfaction]
[1] 9 8

Shortened version:

satisfaction_scores[satisfaction_scores > 7]
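
The same idea works across vectors of the same length: a condition on one vector can filter another. A minimal sketch, assuming the two vectors describe the same five respondents in the same order:

```r
countries <- c("CAN", "NZL", "SGP", "CAN", "SGP")
satisfaction_scores <- c(7, 7, 9, 6, 8)

# Countries of the respondents with a score above 7
high_countries <- countries[satisfaction_scores > 7]     # "SGP" "SGP"

# Scores of the respondents from Singapore
sgp_scores <- satisfaction_scores[countries == "SGP"]    # 9 8

# which() gives the positions where the condition is TRUE
high_positions <- which(satisfaction_scores > 7)         # 3 5

# TRUE counts as 1, so sum() counts how many items meet the condition
n_high <- sum(satisfaction_scores > 7)                   # 2

print(high_countries)
print(sgp_scores)
print(high_positions)
print(n_high)
```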

Vector Manipulations: Handling NA values

  • NA values indicate missing values, i.e. the absence of a value (0 is still a value!)

  • Summary functions like mean need you to specify in the arguments how NA values should be handled.

Survey data often contains missing values (NA):

financial_satisfaction <- c(8, 7, NA, 6, 9, NA, 7)

# By default, mean() will return NA if there are any NA values
mean(financial_satisfaction)
[1] NA
# Remove NA values before calculating mean
mean(financial_satisfaction, na.rm = TRUE)
[1] 7.4
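
Before deciding how to handle missing values, it helps to find out how many there are. A small sketch using is.na(), a base R function not shown above:

```r
financial_satisfaction <- c(8, 7, NA, 6, 9, NA, 7)

# TRUE wherever a value is missing
missing_flags <- is.na(financial_satisfaction)

# Count the missing values
n_missing <- sum(missing_flags)                            # 2

# Keep only the values that are not missing
observed <- financial_satisfaction[!missing_flags]         # 8 7 6 9 7

# Comparisons with NA give NA, because the answer is unknown
unknown <- NA > 5                                          # NA

print(missing_flags)
print(n_missing)
print(observed)
print(unknown)
```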

Vector Manipulations: Adding items

Several ways to add items to a vector

# Add a single score to the end of the vector using c()
satisfaction_scores <- c(satisfaction_scores, 7)
# Add multiple scores to the end
satisfaction_scores <- c(satisfaction_scores, 8, 9, 10)
# Add a score to the beginning
satisfaction_scores <- c(8, satisfaction_scores)
# Insert a score at a specific position using append()
satisfaction_scores <- append(satisfaction_scores, 9, after = 2)

Vector Manipulations: Removing items

# Remove elements by index using “negative indexing”
satisfaction_scores <- satisfaction_scores[-c(2, 4)]
# Remove elements based on a condition using logical indexing
satisfaction_scores <- satisfaction_scores[satisfaction_scores <= 7]
# Remove NA values from the vector
satisfaction_scores <- na.omit(satisfaction_scores)

Data Structures in R: Factors

Factors are perfect for categorical survey variables:

Unordered (Nominal):

employment_factor <- factor(c("Full time", "Part time", "Student", "Retired", "Student"))
str(employment_factor)
 Factor w/ 4 levels "Full time","Part time",..: 1 2 4 3 4

Ordered (Ordinal):

importance_factor <- factor(
    c("Very important", "Important", "Not very important", "Not at all important"),
    ordered = TRUE,
    levels = c("Not at all important", "Not very important", "Important", "Very important")
)
str(importance_factor)
 Ord.factor w/ 4 levels "Not at all important"<..: 4 3 2 1
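
Two things factors make easy are counting categories and comparing ordered levels. A minimal sketch, reusing the same example values:

```r
employment_factor <- factor(
  c("Full time", "Part time", "Student", "Retired", "Student")
)

# List the categories (sorted alphabetically by default)
employment_levels <- levels(employment_factor)

# Count how many responses fall into each category
employment_counts <- table(employment_factor)

importance_factor <- factor(
  c("Very important", "Important", "Not very important", "Not at all important"),
  ordered = TRUE,
  levels = c("Not at all important", "Not very important", "Important", "Very important")
)

# Ordered factors can be compared against one of their levels
above_not_very <- importance_factor > "Not very important"  # TRUE TRUE FALSE FALSE

print(employment_levels)
print(employment_counts)
print(above_not_very)
```

The comparison only works because the factor is ordered; on an unordered factor, `>` is not meaningful.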

Data Structures in R: Dataframe

Create a small dataframe with survey responses:

survey_data <- data.frame(
    country = c("SGP", "CAN", "NZL", "SGP", "CAN"),
    life_satisfaction = c(8, 7, 9, 6, 8),
    employment = c("Full time", "Student", "Part time", "Retired", "Full time")
)
print(survey_data)
  country life_satisfaction employment
1     SGP                 8  Full time
2     CAN                 7    Student
3     NZL                 9  Part time
4     SGP                 6    Retired
5     CAN                 8  Full time
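
A few basic things to try on this small dataframe, as a sketch. The new column name `high_satisfaction` is just an illustrative choice.

```r
survey_data <- data.frame(
  country = c("SGP", "CAN", "NZL", "SGP", "CAN"),
  life_satisfaction = c(8, 7, 9, 6, 8),
  employment = c("Full time", "Student", "Part time", "Retired", "Full time")
)

# Number of rows (observations) and columns (variables)
n_rows <- nrow(survey_data)   # 5
n_cols <- ncol(survey_data)   # 3

# Each column is a vector, so vector functions work on it
avg_life_satisfaction <- mean(survey_data$life_satisfaction)  # 7.6

# Add a new column computed from an existing one
survey_data$high_satisfaction <- survey_data$life_satisfaction > 7

# Keep only the rows where country is "SGP"
sgp_rows <- survey_data[survey_data$country == "SGP", ]

print(avg_life_satisfaction)
print(sgp_rows)
```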

Downloading the World Values Survey (WVS) Dataset

For this workshop, we will try loading a dataset from a file.

Go to the course website and go to the ‘Dataset’ tab to download the data file and information about this WVS data

Download this CSV and save it under your data folder in your R project!

Loading the WVS Dataset

Let’s load our actual World Values Survey dataset:

library(tidyverse)

wvs_data <- read_csv("data/wvs-wave7-sg-ca-nz.csv") # Make sure to save the WVS data in your data folder
head(wvs_data)
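
If you want to rehearse the import step before downloading the real file, here is a sketch that writes a tiny made-up CSV to a temporary file and reads it back. It uses base R's write.csv() and read.csv() instead of read_csv(), and the `demo` data is invented for illustration.

```r
# A tiny made-up dataset (not the real WVS data)
demo <- data.frame(
  country = c("SGP", "CAN"),
  life_satisfaction = c(8, 7)
)

# Write it to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)

# Check that the file exists, then load it into a dataframe
print(file.exists(csv_path))
demo_loaded <- read.csv(csv_path)
str(demo_loaded)
```

If reading the real dataset fails with a "file does not exist" style of error, the usual causes are a different working directory or a file saved outside the data folder.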

Exploring the WVS Dataset

# Return a vector with the number of rows and columns
dim(wvs_data)
# Inspect the column names
names(wvs_data)
# Inspect the structure
str(wvs_data)
# Print the summary stats of the entire dataframe
summary(wvs_data)
# View the first 5 rows
head(wvs_data, n = 5)
# View the last 5 rows
tail(wvs_data, n = 5)

Basic dataframe manipulations: Retrieving values

Some basic dataframe functions before we move on to data wrangling next week:

# Retrieve a column by name (returns a tibble/dataframe)
wvs_data["country"]
# Another way to retrieve a column by name (returns a vector)
wvs_data$country
# Get an entire column by index
wvs_data[3]
# Get a cell at this row, column coordinate
wvs_data[1, 4]
# Get an entire row
wvs_data[3, ]
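
The same retrieval patterns can be tried on a small made-up dataframe without loading the WVS file. This sketch uses a base R data.frame; note that with a tibble (which is what read_csv returns), picking a single cell with [row, column] gives back a 1x1 tibble rather than a bare value.

```r
survey_data <- data.frame(
  country = c("SGP", "CAN", "NZL", "SGP", "CAN"),
  life_satisfaction = c(8, 7, 9, 6, 8),
  employment = c("Full time", "Student", "Part time", "Retired", "Full time")
)

# Single brackets with a name: still a dataframe (with one column)
col_as_df <- survey_data["country"]

# $ and double brackets: a plain vector
col_as_vec <- survey_data$country
col_as_vec2 <- survey_data[["country"]]

# One cell by row number and column name
one_cell <- survey_data[2, "life_satisfaction"]   # 7

# Several rows and columns at once
subset_df <- survey_data[c(1, 3), c("country", "employment")]

print(class(col_as_df))
print(class(col_as_vec))
print(one_cell)
print(subset_df)
```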

End of Session 1!

Next Session: Data wrangling with dplyr and tidyr packages - we’ll learn how to:

  • Filter survey responses by country

  • Calculate average satisfaction scores by demographic groups

  • Create new variables from existing ones

  • Handle missing values in survey data

  • And much more!