About me:
Librarian, Research & Data Services team, SMU Libraries.
Bachelor of IT, MSc in Info Studies.
Have been with SMU since the pandemic era (2021).
About this workshop:
Live-coding format; code along with me!
Goal of workshop: to give you enough fundamentals (at least to the point that ChatGPT can’t bluff you so easily) and confidence to explore R on your own.
Don’t be afraid to ask for help! We are all here to learn.
R: The programming language and the software that interprets the R script
RStudio: An IDE (Integrated Development Environment) that we use to interact more easily with R language and scripts.
You will need to install both for this workshop. Go to https://posit.co/download/rstudio-desktop to download and install both if you have not done so.
R is free, open-source, and cross-platform.
R does not involve lots of pointing and clicking - you don’t have to remember a complicated sequence of clicks to re-run your analysis.
R code is great for reproducibility - when someone else (including your future self) can obtain the same results from the same dataset and same analysis.
R is interdisciplinary and extensible
R is scalable and works on data of all shapes and sizes (though admittedly, it is not the best choice in some scenarios, where other languages such as Python would be preferred).
R produces high-quality and publication-ready graphics
R has a large and welcoming community - which means there is lots of help available!
Working directory -> where R will look for files (scripts, data, etc).
By default, it will be on your Desktop
Best practice is to use R Project to organize your files and data into projects.
When using R Project, the working directory = project folder.
Go to File > New project. Choose New directory, then New project.
Enter intro-r-socsci as the name for this new folder (or “directory”) and choose where you want to put this folder, e.g. Desktop or Documents if you are on Windows. This will be your working directory for the rest of the workshop!
Next, let’s create 3 folders inside our working directory:
data - we will save our raw data here. It’s best practice to keep the data here untouched.
data-output - if we need to modify raw data, store the modified version here.
fig-output - we will save all the graphics we create here!
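The three folders can also be created from the R console instead of the Files pane. This is a minimal sketch, assuming the working directory is already the project folder; dir.create() and dir.exists() are base R functions.

```r
# Folder names taken from the list above
folders <- c("data", "data-output", "fig-output")

for (folder in folders) {
  # Only create the folder if it does not exist yet
  if (!dir.exists(folder)) {
    dir.create(folder)
  }
}

# Check that all three folders now exist (one TRUE per folder)
dir.exists(folders)
```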
Warning
Don’t put your R projects inside your OneDrive folder as that may cause issues sometimes.
Create a new R script - File > New File > R script.
Note: RStudio does not autosave your progress, so remember to save from time to time!
In this line of code:
"Anya Forger" is a value. This can be either a character, numeric, or boolean data type (more on this soon).
name is the object where we store this value, so that we can use it later.
<- is the assignment operator, which assigns the value to the object.
You can also use =, but <- is the convention in R.
Keyboard shortcut: Alt + - in Windows (Option + - in Mac)
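The line of code being described is not shown in this text. Based on the description of its three parts, it presumably looks like the sketch below.

```r
# Store the value "Anya Forger" in an object called name
name <- "Anya Forger"

# Typing the object's name prints the stored value
name
```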
Non-Continuous Data
Nominal/Categorical: Non-ordered, non-numerical data, used to represent qualitative attributes.
Ordinal: Ordered non-numerical data.
Discrete: Numerical data that can only take specific values (usually integers).
Binary: Nominal data with only two possible outcomes.
Continuous Data
Interval: Numerical data that can take any value within a range. It does not have a “true zero” (e.g. temperature in °C).
Ratio: Numerical data that can take any value within a range. It has a “true zero” (e.g. height or income).
The four basic data types are character, numeric, boolean (called logical in R), and integer.
You can use str or typeof to check the data type of an R object.
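A short sketch of checking each of the four data types. The object names and values are made up for illustration. Note that typeof reports numeric values as "double" and boolean values as "logical".

```r
# Hypothetical objects, one per basic data type
name <- "Anya Forger"   # character
height <- 99.5          # numeric
is_student <- TRUE      # boolean (logical)
age <- 6L               # integer (the L suffix marks an integer)

typeof(name)        # "character"
typeof(height)      # "double"
typeof(is_student)  # "logical"
typeof(age)         # "integer"

# str shows the type together with the value
str(height)
```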
You can do arithmetic operations in R, like so:
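A few illustrative examples using R's built-in arithmetic operators (the numbers are arbitrary):

```r
7 + 2    # addition: 9
7 - 2    # subtraction: 5
7 * 2    # multiplication: 14
7 / 2    # division: 3.5
7 ^ 2    # exponent: 49
7 %% 2   # remainder (modulo): 1
7 %/% 2  # integer division: 3

# Results can be saved into an object, just like any other value
total <- 7 + 2
total    # 9
```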
Boolean operations in R (will be handy for later):
AND operations (all sides need to be TRUE for the result to be TRUE)
OR operations (only one side needs to be TRUE for the result to be TRUE)
NOT operations, which flip TRUE to FALSE and vice versa
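The three operations above use the operators &, | and ! in R. A minimal sketch follows; the score object is a hypothetical value for illustration.

```r
# AND: TRUE only when both sides are TRUE
TRUE & TRUE     # TRUE
TRUE & FALSE    # FALSE

# OR: TRUE when at least one side is TRUE
TRUE | FALSE    # TRUE
FALSE | FALSE   # FALSE

# NOT: flips the value
!TRUE           # FALSE

# Comparisons also return TRUE/FALSE, so they can be combined
score <- 80                   # hypothetical value
in_range <- score > 75 & score < 90
in_range                      # TRUE
```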
A function is a block of reusable code designed to do a specific task. Functions take inputs (a.k.a. arguments or parameters), do their thing, and then return a result. (This result can either be printed out, or saved into an object!)
Saving the result to an object:
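A minimal sketch of calling the round function with the arguments discussed in this section, first printing the result and then saving it. The object name rounded_number is an assumption for illustration.

```r
# Call the function and print the result directly
round(123.456, digits = 2)   # 123.46

# Call the function and save the result into an object
rounded_number <- round(123.456, digits = 2)
rounded_number               # 123.46
```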
In the example above, round is the function. 123.456 and digits = 2 are the arguments/parameters.
You can open the help page for a function in R by prepending ? to the function name.
E.g. if you want to find out more about the round function, you can run ?round in your R console (bottom left panel).
Packages are collections of R functions, datasets, etc. Packages extend the functionality of R.
Popular packages: tidyverse, caret, shiny, etc.
Installation (you only need to do this once): install.packages("package name")
Loading packages (you need to run this every time you restart RStudio): library(package name) - let’s try to load tidyverse!
Basic objects in R can only contain one value. But quite often you may want to group a bunch of values together and save them in a single object.
A vector is a data structure that can do this. It is the most common and basic data structure in R. (pretty much the workhorse of R!)
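A vector is created with the c() function. The sketch below builds a character vector matching the output shown next. The object name course_codes is an assumption, since the original name does not appear in this text.

```r
# Combine five course codes into one character vector
course_codes <- c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103")

str(course_codes)   # check the data type and length
course_codes        # print the content
```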
chr [1:5] "IDIS110" "IDIS100" "PLE100" "PSYC111" "PSYC103"
[1] "IDIS110" "IDIS100" "PLE100" "PSYC111" "PSYC103"
Example of numeric vector:
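A minimal sketch of a numeric vector. The name t1_grades is the one used in the filtering example that follows. The values are an assumption, taken from the grade column of the dataframe output shown later.

```r
# Assumed values (taken from the grade column shown later)
t1_grades <- c(65, 70, 80, 95, 77)

str(t1_grades)   # check the data type and length
t1_grades[1]     # retrieve the first item: 65 (R starts counting at 1)
t1_grades[2:3]   # retrieve items 2 to 3: 70 80
```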
Let’s say we want to retrieve items that are larger than 75.
We first create a boolean vector called criteria that keeps track of whether each item inside t1_grades fulfils our condition.
The condition is “value must be > 75”. E.g. if item 1 fulfils our condition, then item 1 is ‘marked’ as TRUE. Otherwise, it is marked as FALSE.
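The filtering can be sketched in two steps: build the boolean vector, then use it to pick items. The values in t1_grades are an assumption, taken from the grade column shown later.

```r
t1_grades <- c(65, 70, 80, 95, 77)   # assumed values

# Step 1: mark each item as TRUE or FALSE depending on the condition
criteria <- t1_grades > 75
criteria              # FALSE FALSE TRUE TRUE TRUE

# Step 2: put the boolean vector inside [ ] to keep only the TRUE positions
t1_grades[criteria]   # 80 95 77
```

Both steps can also be written in one line as t1_grades[t1_grades > 75].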
We then apply criteria to t1_grades and retrieve only the items that fulfil the condition, i.e. items whose position is marked as TRUE by the criteria vector.
NA values indicate missing values, i.e. the absence of a value (0 is still a value!)
Summary functions like mean need you to specify in the arguments how NA values should be handled.
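A minimal sketch of how NA affects mean, using a made-up vector. The na.rm = TRUE argument tells mean to drop NA values before calculating.

```r
# Hypothetical grades with one missing value
grades_with_na <- c(65, NA, 80, 95)

mean(grades_with_na)                 # NA, because one value is missing
mean(grades_with_na, na.rm = TRUE)   # 80, NA is removed first

# is.na marks which items are missing
is.na(grades_with_na)                # FALSE TRUE FALSE FALSE
```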
Several ways to add items to a vector
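Three common base R approaches are sketched below. The starting values and the added values are assumptions for illustration.

```r
t1_grades <- c(65, 70, 80, 95, 77)   # assumed starting values

# Option 1: combine the existing vector with a new value using c()
t1_grades <- c(t1_grades, 88)

# Option 2: append(), which also lets you choose the position
t1_grades <- append(t1_grades, 50, after = 0)   # after = 0 adds to the front

# Option 3: assign a value to the next free position
t1_grades[length(t1_grades) + 1] <- 91

t1_grades   # 50 65 70 80 95 77 88 91
```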
Special data structure in R to deal with categorical data.
Can be ordered (ordinal) or unordered (nominal).
May look like a normal vector at first glance, so use str() to check.
Unordered (Nominal):
Factor w/ 5 levels "CIS","SCIS","SOA",..: 3 4 2 1 5
Ordered (Ordinal):
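A minimal sketch of both kinds of factor. The school codes and satisfaction levels are hypothetical values chosen for illustration, not values recovered from the original slides.

```r
# Unordered (nominal): levels are sorted alphabetically by default
schools <- factor(c("SOA", "SOE", "SCIS", "CIS", "SOSS"))
str(schools)
levels(schools)

# Ordered (ordinal): state the order of the levels explicitly
satisfaction <- factor(
  c("Low", "High", "Medium", "High"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)
str(satisfaction)

# Ordered factors support comparisons between levels
satisfaction < "High"   # TRUE FALSE TRUE FALSE
```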
De facto data structure for tabular data in R, and what we use for data processing, plotting, and statistics.
Similar to spreadsheets!
You can create it by hand like so:
Alternatively, here is how to create one using the two vectors that we created earlier:
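A minimal sketch of both approaches using base R's data.frame(). The students dataframe and the object names course_codes and grade_df are assumptions for illustration. The column names course_code and grade match the output below.

```r
# By hand: each argument becomes one column (hypothetical data)
students <- data.frame(
  name = c("Student A", "Student B", "Student C"),
  age = c(20, 21, 19)
)
students

# From two existing vectors of the same length
course_codes <- c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103")
t1_grades <- c(65, 70, 80, 95, 77)

grade_df <- data.frame(course_code = course_codes, grade = t1_grades)
grade_df
```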
  course_code grade
1     IDIS110    65
2     IDIS100    70
3      PLE100    80
4     PSYC111    95
5     PSYC103    77
Most of the time, our dataframe will be created by loading an external data file, such as a CSV, SAV, or XLSX file.
Let’s try loading one from a CSV!
What is a CSV?
A CSV (Comma-Separated Values) file is a type of file that stores data in a plain text format. Each line in the file represents a row of data, and within each row, individual pieces of data (like numbers or words) are separated by commas. This format is commonly used for storing and transferring data, especially in spreadsheets and databases.
You can open CSV files in Excel, Google Sheets, or even Notepad!
Download and save chile_voting.csv from this URL
Save the CSV file into your data folder.
Check out the data dictionary/explanatory notes to learn more about the data, including the column names and the data type of each column, here.
We need to use the readr package, which is part of the tidyverse package. So please install tidyverse first if you have not done so.
Load the CSV and save the content into a tibble/dataframe called chile_data:
Load the tidyverse library (make sure to have it installed first!)
Read the CSV with the read_csv function, and save it into the chile_data dataframe.
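The two steps above can be sketched as follows. The path data/chile_voting.csv is an assumption: it presumes the file was saved under its original name in the data folder of the project, and that the project folder is the working directory.

```r
# Load the tidyverse (this also loads readr, which provides read_csv)
library(tidyverse)

# Read the CSV from the data folder and save it into chile_data
# (assumed path: the data folder inside the project folder)
chile_data <- read_csv("data/chile_voting.csv")
```

read_csv returns a tibble, which is the tidyverse's version of a dataframe.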
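Some basic dataframe functions before we move on to data wrangling next week:
dim(chile_data)           # number of rows and columns
names(chile_data)         # column names
str(chile_data)           # structure: data type and first few values of each column
summary(chile_data)       # summary statistics for each column
head(chile_data, n = 5)   # first 5 rows
tail(chile_data, n = 5)   # last 5 rows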
Next Session: Data wrangling with dplyr and tidyr packages