A video of this tutorial can be found here: video link.
Please try to download and install R and RStudio prior to the workshop sessions with the following steps:
You should also try to download the toy dataset that we will be using to demonstrate some concepts in these workshops: Heart_Disease_100sampled.csv.
(Note: this spreadsheet is a sub-sampled dataset from Kaggle.)
In this workshop, we’ll introduce you to some R fundamentals:
We will do all of our work in RStudio. RStudio is an integrated development and analysis environment for R that brings a number of conveniences over using R in a terminal or other editing environments.
When you start RStudio, you will see something like the following window appear:
Notice that the window is divided into four “panes”:
Console (the left bottom): This is your view into the R engine. You can type in R commands here and see the output printed by R. (To make it easier to tell them apart, your input and the resulting output are printed in different colors.) There are several editing conveniences available: use up and down arrow keys to go back to previously entered commands, which can then be edited and re-run; TAB for completing the name before the cursor; see more in online docs.
Environment/History (tabbed in upper right): View current user-defined objects and previously-entered commands, respectively. The instructor may have additional tabs like Connection, Git, etc. These are not relevant to this workshop series.
Files/Plots/Packages/Help (tabbed in lower right): As their names suggest, these are used to view the contents of the current directory, graphics created by the user, install packages, and view the built-in help pages.
R script (left top): this is where you can save your code in a .R script file for future use. R script files are the primary way in which R facilitates reproducible research. R scripts maintain a record of everything that is done to the raw data to reach the final result. That way, it is very easy to write up and communicate your methods because you have a document listing the precise steps you used to conduct your analyses.
To change the look of RStudio, you can go to Tools -> Global Options -> Appearance and select colors, font size, etc. If you plan to be working for longer periods, we suggest choosing a dark background color scheme to save your computer battery and your eyes.
Generally, if you are testing an operation (e.g. what would my data look like if I applied a log-transformation to it?), you should do it in the console (bottom-left pane of RStudio). If you are committing a step to your analysis (e.g. I want to apply a log-transformation to my data and then conduct the rest of my analyses on the log-transformed data), you should add it to your R script so that it is saved for future use.
Important: You should annotate your R scripts with
comments. In each line of code, any text preceded by the #
symbol will not execute. Comments can be useful to remind yourself and
to tell other readers what a specific chunk of code does. In my
experience, there can never be too much commenting.
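As a minimal sketch of what an annotated script can look like (the variable names and the height value are made up for illustration):

```r
# Convert a height from centimetres to metres
height_cm <- 172              # a comment can also follow code on the same line
height_m <- height_cm / 100

# print(height_cm)            # this line is "commented out", so it will not run
print(height_m)
```

Putting a # in front of a line is also a handy way to switch a step off temporarily without deleting it.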
Let’s create an R script
(File > New File > R Script) and save it as
introR_live_notes.R in your main project directory. If you
again look to the project directory on your computer, you will see
introR_live_notes.R is now saved there. We will work
together to create and populate the introR_live_notes.R
script throughout this workshop.
With R open, you will see a prompt ‘>’ in the Console area (bottom left). Enter anything after this prompt, and R will try to evaluate it. R can evaluate 3 types of data:
Calculator operations work with numbers.
# We are also assigning the answer to a variable we are calling a
a <- 1 + 1
# calling on a will give back the answer
print(a)
## [1] 2
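Beyond addition, the other calculator operations work the same way. The short sketch below uses a made-up variable b to try a few of them:

```r
b <- 7
b - 2      # subtraction: 5
b * 3      # multiplication: 21
b / 2      # division: 3.5
b ^ 2      # exponentiation: 49
b %% 3     # remainder after division (modulo): 1
sqrt(16)   # built-in functions behave like calculator buttons: 4
```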
R calls conditionals “logicals”. We can use operations that test whether something is TRUE or FALSE, which will output conditionals.
3 < 5
## [1] TRUE
4 < 2
## [1] FALSE
#To test equality, you have to use two equals signs. You can use parentheses for operations
(3 + 3) == (2 + 4)
## [1] TRUE
# To test for non-equality
4 != 3
## [1] TRUE
Note: when you want to compare 2 conditionals, you can test if both are true with the AND operator “&”, and you can test if either one of the conditionals is true with the OR operator “|”.
(3 < 5) & (4 < 2) #AND
## [1] FALSE
(3 < 5) | (4 < 2) #OR
## [1] TRUE
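Logicals can also be stored in variables and flipped with the NOT operator “!”. Here is a small sketch; the variables age and smoker are invented for this example:

```r
age <- 51
smoker <- FALSE

age > 40 & !smoker                # TRUE: older than 40 AND not a smoker
is_at_risk <- age > 60 | smoker   # store the result of a test in a variable
print(is_at_risk)                 # FALSE: neither condition holds
```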
R can work with text in the form of strings. Note that R will register anything put between “” or ’’ as a string, even numbers. But the operations R can perform on strings differ from those it can perform on numbers.
"Hello"
## [1] "Hello"
"Hello" == "Hello"
## [1] TRUE
"Hello" == "Hello "
## [1] FALSE
"Hello" == "hello"
## [1] FALSE
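Because stray spaces and capital letters make otherwise identical strings unequal, it often helps to clean strings before comparing them. The sketch below uses a few base R string functions; the variable greeting is just an example:

```r
greeting <- "Hello "
nchar(greeting)                # number of characters, counting the trailing space
trimws(greeting) == "Hello"    # trimws() strips leading and trailing whitespace
tolower("Hello") == "hello"    # compare while ignoring capitalization
paste("Hello", "world")        # join strings, separated by a space

# "1" + 1 would stop with an error, because "1" is a string
as.numeric("1") + 1            # convert the string to a number first
```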
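Lists are groupings of multiple entries of data, created with c(). (R’s own name for this structure is a vector.) The entries can be of any data type, but all entries in one list must share the same type.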
numlist = c(1,3,4,9)
numlist + 2
## [1] 3 5 6 11
numlist > 3
## [1] FALSE FALSE TRUE TRUE
# a shortcut
numrange = 1:4
print(numrange)
## [1] 1 2 3 4
# a list of strings
wordlist = c('blue','red','white','green')
# can test if something is in a list:
'red' %in% wordlist
## [1] TRUE
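A few built-in functions summarize a whole collection made with c() at once. The sketch below also shows what happens if you mix data types; the variable mixed is made up for illustration:

```r
numlist <- c(1, 3, 4, 9)
length(numlist)   # how many entries: 4
sum(numlist)      # 17
mean(numlist)     # 4.25
class(numlist)    # "numeric"

# Mixing types inside c() converts every entry to a single common type
mixed <- c(1, "red", TRUE)
class(mixed)      # "character": the number and the logical became strings
```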
You can tell R to extract/subset specific elements of a list using square brackets
numlist[2]
## [1] 3
numlist[c(2,4)]
## [1] 3 9
Importantly, you can subset based on conditionals, and embed conditional operations.
print(numlist)
## [1] 1 3 4 9
logiclist = c(TRUE,FALSE,TRUE,FALSE)
numlist[logiclist]
## [1] 1 4
numlist[numlist < 5]
## [1] 1 3 4
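A few more subsetting idioms are worth knowing. This is only a sketch using the same numlist values as above:

```r
numlist <- c(1, 3, 4, 9)
numlist[-1]                          # a negative index drops that entry: 3 4 9
which(numlist < 5)                   # positions where the condition is TRUE: 1 2 3
numlist[numlist > 1 & numlist < 9]   # combine conditions with &: 3 4
sum(numlist < 5)                     # TRUE counts as 1, so this counts matches: 3
```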
Many people have developed analysis tools for R that are not available in the base software. Instead, we can access these tools by downloading and installing packages.
Installations only need to happen once.
#install.packages("tidyverse")
But you need to load the package every time you open R to access the tools provided in the package:
library(tidyverse)
## Warning: package 'readr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
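If you are not sure whether a package is already installed, you can check before loading it. The snippet below is one possible way to do that; it only prints a message and does not install anything:

```r
pkg <- "tidyverse"
installed <- requireNamespace(pkg, quietly = TRUE)   # TRUE if the package is available

if (installed) {
  message(pkg, " is already installed; load it with library(", pkg, ")")
} else {
  message(pkg, " is missing; install it once with install.packages(\"", pkg, "\")")
}
```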
We will use the tidyverse function read_csv() to import
our spreadsheet into R as a tibble.
You will first need to navigate to the directory where you have
downloaded our toy dataset, which can be done with the dropdown menu:
Session > Set Working Directory, or with the R command
setwd().
data = read_csv('Heart_Disease_100sampled.csv')
## Rows: 100 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): ptID, Gender, Exercise Habits, Smoking, Family Heart Disease, Diab...
## dbl (9): Age, Blood Pressure, Cholesterol Level, BMI, Sleep Hours, Triglyce...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
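If reading the file fails, the most common cause is that R is looking in a different folder than you expect. The sketch below shows how to check. So that it runs anywhere, it writes a tiny made-up CSV to a temporary folder and reads it with base R's read.csv() as a stand-in for read_csv(); the file name and values are invented:

```r
# Write a tiny made-up CSV to a temporary folder so the example is self-contained
toy_path <- file.path(tempdir(), "toy_patients.csv")
writeLines(c("ptID,Age,Blood Pressure",
             "pt 1,51,139",
             "pt 2,63,142"), toy_path)

getwd()                 # shows the current working directory
file.exists(toy_path)   # check the file is where you think it is before reading

# check.names = FALSE keeps column names with spaces exactly as in the file
toy <- read.csv(toy_path, check.names = FALSE)
print(toy)
```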
The spreadsheet dimensions can be queried with dim(),
nrow(), and ncol():
# tells number of rows (1st) and columns (2nd)
dim(data)
## [1] 100 22
# nrow tells how many rows
nrow(data)
## [1] 100
#ncol tells how many columns
ncol(data)
## [1] 22
RStudio: you can view the data as a spreadsheet with
View().
You can also see a summary of the data’s structure with str():
#View(data)
# viewing the structure: each column's name, type, and first few values
str(data)
## spc_tbl_ [100 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ptID : chr [1:100] "pt 7186" "pt 7572" "pt 2570" "pt 8253" ...
## $ Age : num [1:100] 51 63 77 35 29 73 67 56 38 29 ...
## $ Gender : chr [1:100] "Male" "Female" "Male" "Female" ...
## $ Blood Pressure : num [1:100] 139 142 140 172 165 180 129 132 122 130 ...
## $ Cholesterol Level : num [1:100] 166 175 284 162 251 262 231 187 183 237 ...
## $ Exercise Habits : chr [1:100] "Low" "Low" "High" "Medium" ...
## $ Smoking : chr [1:100] "Yes" "Yes" "Yes" "No" ...
## $ Family Heart Disease: chr [1:100] "No" "Yes" "Yes" "No" ...
## $ Diabetes : chr [1:100] "Yes" "Yes" "No" "Yes" ...
## $ BMI : num [1:100] 33.3 22.5 34.1 37.5 30.9 ...
## $ High Blood Pressure : chr [1:100] "No" "No" "Yes" "Yes" ...
## $ Low HDL Cholesterol : chr [1:100] "Yes" "No" "Yes" "Yes" ...
## $ High LDL Cholesterol: chr [1:100] "Yes" "Yes" "No" "Yes" ...
## $ Alcohol Consumption : chr [1:100] "None" "None" "Low" "Medium" ...
## $ Stress Level : chr [1:100] "Low" "High" "Low" "Medium" ...
## $ Sleep Hours : num [1:100] 7.6 8.25 4.6 5.82 5.43 ...
## $ Sugar Consumption : chr [1:100] "Medium" "Low" "High" "Medium" ...
## $ Triglyceride Level : num [1:100] 225 270 242 375 202 100 359 178 259 318 ...
## $ Fasting Blood Sugar : num [1:100] 89 111 156 100 118 110 109 151 124 97 ...
## $ CRP Level : num [1:100] 6.463 0.116 4.467 9.815 7.312 ...
## $ Homocysteine Level : num [1:100] 14.55 8.98 17.45 8.22 11.84 ...
## $ Heart Disease Status: chr [1:100] "No" "No" "No" "Yes" ...
## - attr(*, "spec")=
## .. cols(
## .. ptID = col_character(),
## .. Age = col_double(),
## .. Gender = col_character(),
## .. `Blood Pressure` = col_double(),
## .. `Cholesterol Level` = col_double(),
## .. `Exercise Habits` = col_character(),
## .. Smoking = col_character(),
## .. `Family Heart Disease` = col_character(),
## .. Diabetes = col_character(),
## .. BMI = col_double(),
## .. `High Blood Pressure` = col_character(),
## .. `Low HDL Cholesterol` = col_character(),
## .. `High LDL Cholesterol` = col_character(),
## .. `Alcohol Consumption` = col_character(),
## .. `Stress Level` = col_character(),
## .. `Sleep Hours` = col_double(),
## .. `Sugar Consumption` = col_character(),
## .. `Triglyceride Level` = col_double(),
## .. `Fasting Blood Sugar` = col_double(),
## .. `CRP Level` = col_double(),
## .. `Homocysteine Level` = col_double(),
## .. `Heart Disease Status` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
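When str() prints more than you need, a few other base R functions give a quicker look. The sketch below uses a small made-up table called toy rather than the workshop dataset:

```r
toy <- data.frame(
  ptID = c("pt 1", "pt 2", "pt 3", "pt 4"),
  Age  = c(51, 63, 77, 35),
  BMI  = c(33.3, 22.5, 34.1, 37.5)
)

head(toy, 2)       # only the first 2 rows
names(toy)         # the column names
summary(toy$Age)   # minimum, quartiles, mean, and maximum of one column
```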
To select rows and columns together, use single brackets and two indices: [row, column].
# Return the value at the first row, fourth column
data[1,4]
## # A tibble: 1 × 1
## `Blood Pressure`
## <dbl>
## 1 139
# You can also use lists:
# First 3 rows, first 2 columns
data[1:3, 1:2]
## # A tibble: 3 × 2
## ptID Age
## <chr> <dbl>
## 1 pt 7186 51
## 2 pt 7572 63
## 3 pt 2570 77
# all columns in row 3
data[3,]
## # A tibble: 1 × 22
## ptID Age Gender `Blood Pressure` `Cholesterol Level` `Exercise Habits`
## <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 pt 2570 77 Male 140 284 High
## # ℹ 16 more variables: Smoking <chr>, `Family Heart Disease` <chr>,
## # Diabetes <chr>, BMI <dbl>, `High Blood Pressure` <chr>,
## # `Low HDL Cholesterol` <chr>, `High LDL Cholesterol` <chr>,
## # `Alcohol Consumption` <chr>, `Stress Level` <chr>, `Sleep Hours` <dbl>,
## # `Sugar Consumption` <chr>, `Triglyceride Level` <dbl>,
## # `Fasting Blood Sugar` <dbl>, `CRP Level` <dbl>, `Homocysteine Level` <dbl>,
## # `Heart Disease Status` <chr>
# all rows in column 1
data[,1]
## # A tibble: 100 × 1
## ptID
## <chr>
## 1 pt 7186
## 2 pt 7572
## 3 pt 2570
## 4 pt 8253
## 5 pt 5216
## 6 pt 3775
## 7 pt 7529
## 8 pt 8049
## 9 pt 1862
## 10 pt 8058
## # ℹ 90 more rows
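The same [row, column] pattern works on any table. Here is a short sketch on a made-up base R data frame called toy; note the one difference from a tibble pointed out in the comments:

```r
toy <- data.frame(
  ptID = c("pt 1", "pt 2", "pt 3"),
  Age  = c(51, 63, 77),
  BMI  = c(33.3, 22.5, 34.1)
)

toy[2, 3]                  # row 2, column 3: a single value
toy[c(1, 3), ]             # rows 1 and 3, all columns
toy[, c("ptID", "BMI")]    # all rows, two columns chosen by name
toy[-1, ]                  # everything except row 1

# A base data.frame simplifies a single column to a plain vector, whereas a
# tibble (as returned by read_csv) keeps it as a one-column table.
# drop = FALSE asks a data.frame to keep the table shape as well.
toy[, "Age", drop = FALSE]
```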
You can also call on specific columns by name.
data$BMI
## [1] 33.34947 22.50586 34.11497 37.50158 30.91354 24.66731 30.19360 34.40913
## [9] 29.21565 35.12688 18.67460 32.21532 33.41252 35.67601 35.10529 22.65325
## [17] 32.45816 30.97571 30.58054 18.22979 19.48234 20.58284 22.24916 32.96815
## [25] 38.83295 27.15335 37.69030 28.74740 37.61561 18.75536 34.07053 22.70867
## [33] 21.29821 36.10898 28.03114 29.62960 27.24986 36.52917 37.01109 33.72941
## [41] 36.82030 19.21318 23.65275 33.78002 33.76812 30.06387 30.15157 27.33479
## [49] 21.49262 19.14408 24.00014 19.21833 21.54476 39.75723 34.37931 27.41723
## [57] 26.90195 20.88075 21.56784 25.14706 34.67351 29.97450 25.59749 22.95519
## [65] 29.77110 28.65349 27.71551 20.94324 20.54307 29.00538 37.69582 24.45473
## [73] 19.27043 32.05950 34.98625 38.29421 24.41339 35.16458 33.51402 28.19963
## [81] 28.40824 36.67524 34.28726 26.11600 35.16782 37.03448 27.61815 31.82638
## [89] 18.93055 34.75350 29.40367 24.87329 30.42090 25.54680 34.49717 31.46275
## [97] 19.53528 26.20360 31.10930 29.22324
data[,'BMI']
## # A tibble: 100 × 1
## BMI
## <dbl>
## 1 33.3
## 2 22.5
## 3 34.1
## 4 37.5
## 5 30.9
## 6 24.7
## 7 30.2
## 8 34.4
## 9 29.2
## 10 35.1
## # ℹ 90 more rows
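Many columns in our dataset have spaces in their names, which need backticks when used with $. The sketch below demonstrates this on a small made-up table (toy), so the values are illustrative only:

```r
toy <- data.frame(
  BMI = c(33.3, 22.5, 34.1),
  `Blood Pressure` = c(139, 142, 140),
  check.names = FALSE   # keep the space in the column name
)

toy$BMI                      # a plain vector of values
toy[["BMI"]]                 # the same vector, using the name as a string
toy$`Blood Pressure`         # backticks are needed when a name contains spaces
mean(toy$`Blood Pressure`)   # a vector can be passed straight to other functions
```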
Importantly, you can also subset based on conditionals
HeartDiseaseRows = data$`Heart Disease Status` == 'Yes'
print(HeartDiseaseRows)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## [13] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [61] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## [85] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE
data[HeartDiseaseRows,'BMI']
## # A tibble: 24 × 1
## BMI
## <dbl>
## 1 37.5
## 2 34.4
## 3 35.1
## 4 35.7
## 5 31.0
## 6 20.6
## 7 33.0
## 8 21.3
## 9 36.8
## 10 33.8
## # ℹ 14 more rows
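Once you have a vector of TRUE/FALSE values, you can count with it, negate it, and combine it with other conditions. A minimal sketch on a made-up table (toy, with invented values):

```r
toy <- data.frame(
  BMI    = c(37.5, 22.5, 34.4, 30.9, 24.7),
  Status = c("Yes", "No", "Yes", "No", "No")
)

has_hd <- toy$Status == "Yes"
sum(has_hd)                    # number of rows where the condition holds
mean(toy$BMI[has_hd])          # mean BMI in the "Yes" group
mean(toy$BMI[!has_hd])         # mean BMI in everyone else
toy[has_hd & toy$BMI > 35, ]   # two conditions combined with &
table(toy$Status)              # counts per category
```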
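This kind of subsetting enables you to perform comparative analyses.
As an example, we will perform a t-test, comparing the distribution
of BMI in patients with and without heart disease. The R command for
performing a t-test is t.test(). By default it runs Welch’s t-test, which does not assume equal variances in the two groups; setting var.equal = TRUE gives the standard Student’s t-test: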
t.test(data[data$`Heart Disease Status` == 'Yes','BMI'],data[data$`Heart Disease Status` == 'No','BMI'])
##
## Welch Two Sample t-test
##
## data: data[data$`Heart Disease Status` == "Yes", "BMI"] and data[data$`Heart Disease Status` == "No", "BMI"]
## t = 1.2262, df = 40.821, p-value = 0.2271
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.071546 4.383034
## sample estimates:
## mean of x mean of y
## 30.25474 28.59900
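The result of t.test() can be stored in a variable so that individual values can be pulled out with $. The sketch below uses the mtcars dataset that ships with R (not our heart disease data) and compares the default Welch test with the equal-variance Student's test:

```r
# Welch's test (the default): does not assume equal variances
welch <- t.test(mpg ~ am, data = mtcars)

# Classic Student's test: assumes the two groups have equal variances
student <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

welch$method       # name of the test that was run
student$method
student$p.value    # the p-value on its own
student$conf.int   # the 95 percent confidence interval
```

The formula mpg ~ am reads as "mpg grouped by am" and avoids subsetting the two groups by hand.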
To get more information on a function, use the ? operator. This brings up a help page describing how the function works. These pages can be pretty dense and technical, even for longtime R users, so don’t be discouraged if they are hard to understand.
?t.test
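If you only want a quick reminder of which arguments a function takes, you can list them without opening the full help page. A small sketch using lm() as the example function:

```r
args(lm)             # prints the arguments lm() accepts and their defaults
names(formals(lm))   # just the argument names, as a character vector

# help("t.test")     # same as ?t.test
# ??"t-test"         # searches all installed help pages for a phrase
```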
To see an example of how a function is used, try the ‘example’ function. Not every function has examples though!
example(t.test)
##
## t.test> ## Two-sample t-test
## t.test> t.test(1:10, y = c(7:20)) # P = .00001855
##
## Welch Two Sample t-test
##
## data: 1:10 and c(7:20)
## t = -5.4349, df = 21.982, p-value = 1.855e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.052802 -4.947198
## sample estimates:
## mean of x mean of y
## 5.5 13.5
##
##
## t.test> t.test(1:10, y = c(7:20, 200)) # P = .1245 -- NOT significant anymore
##
## Welch Two Sample t-test
##
## data: 1:10 and c(7:20, 200)
## t = -1.6329, df = 14.165, p-value = 0.1245
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -47.242900 6.376233
## sample estimates:
## mean of x mean of y
## 5.50000 25.93333
##
##
## t.test> ## Traditional interface
## t.test> with(mtcars, t.test(mpg[am == 0], mpg[am == 1]))
##
## Welch Two Sample t-test
##
## data: mpg[am == 0] and mpg[am == 1]
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
##
##
## t.test> ## Formula interface
## t.test> t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
##
##
## t.test> ## One-sample t-test
## t.test> ## Traditional interface
## t.test> t.test(sleep$extra)
##
## One Sample t-test
##
## data: sleep$extra
## t = 3.413, df = 19, p-value = 0.002918
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.5955845 2.4844155
## sample estimates:
## mean of x
## 1.54
##
##
## t.test> ## Formula interface
## t.test> t.test(extra ~ 1, data = sleep)
##
## One Sample t-test
##
## data: extra
## t = 3.413, df = 19, p-value = 0.002918
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.5955845 2.4844155
## sample estimates:
## mean of x
## 1.54
##
##
## t.test> ## Paired t-test
## t.test> ## The sleep data is actually paired, so could have been in wide format:
## t.test> sleep2 <- reshape(sleep, direction = "wide",
## t.test+ idvar = "ID", timevar = "group")
##
## t.test> ## Traditional interface
## t.test> t.test(sleep2$extra.1, sleep2$extra.2, paired = TRUE)
##
## Paired t-test
##
## data: sleep2$extra.1 and sleep2$extra.2
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean difference
## -1.58
##
##
## t.test> ## Formula interface
## t.test> t.test(Pair(extra.1, extra.2) ~ 1, data = sleep2)
##
## Paired t-test
##
## data: Pair(extra.1, extra.2)
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean difference
## -1.58
Note: Searching with a specific question will also return many useful tutorials online.
In this example, we’re going to apply linear regression analysis to test the statistical association of BMI with several other patient factors.
Clinical data are often analyzed by linear regression analyses. In R,
the command to perform an ordinary least squares linear model is
lm(). The summary() function reports several
result outputs from the linear model.
# Note: despite its name, IndepVar holds BMI, the response (dependent) variable of the model
IndepVar = data$BMI
FactorGender = as.factor(data$Gender)
FactorSmoking = as.factor(data$Smoking)
FactorExercise = as.factor(data$`Exercise Habits`)
BMImodel = lm(IndepVar ~ FactorGender + FactorSmoking + FactorExercise)
summary(BMImodel)
##
## Call:
## lm(formula = IndepVar ~ FactorGender + FactorSmoking + FactorExercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5329 -5.4201 0.4577 5.0570 10.1465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.0192 1.3219 21.953 <2e-16 ***
## FactorGenderMale -1.0615 1.2075 -0.879 0.382
## FactorSmokingYes 1.7193 1.2379 1.389 0.168
## FactorExerciseLow -0.9906 1.4822 -0.668 0.506
## FactorExerciseMedium -0.4433 1.4861 -0.298 0.766
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.002 on 95 degrees of freedom
## Multiple R-squared: 0.03109, Adjusted R-squared: -0.009709
## F-statistic: 0.762 on 4 and 95 DF, p-value: 0.5526
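Each coefficient for a factor is estimated relative to that factor's first level (alphabetical by default), which is why there is no row for Female, non-smokers, or High exercise in the table above. As a sketch of how to pull individual results out of a fitted model, the example below uses the built-in mtcars dataset rather than our heart disease data, and passes columns through the data argument instead of creating separate variables:

```r
# mtcars ships with R; mpg is the response, wt and am are the predictors
fit <- lm(mpg ~ wt + factor(am), data = mtcars)

coef(fit)                          # estimated coefficients only
coef_table <- coef(summary(fit))   # estimates, std. errors, t values, p-values
coef_table[, "Pr(>|t|)"]           # just the p-values
summary(fit)$r.squared             # proportion of variance explained
levels(factor(mtcars$am))          # the first level is the reference category
```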
To see more functionality associated with the lm
function, we can run its built-in examples:
example(lm)
##
## lm> require(graphics)
##
## lm> ## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## lm> ## Page 9: Plant Weight Data.
## lm> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
##
## lm> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
##
## lm> group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
##
## lm> weight <- c(ctl, trt)
##
## lm> lm.D9 <- lm(weight ~ group)
##
## lm> lm.D90 <- lm(weight ~ group - 1) # omitting intercept
##
## lm> ## No test:
## lm> ##D anova(lm.D9)
## lm> ##D summary(lm.D90)
## lm> ## End(No test)
## lm> opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
##
## lm> plot(lm.D9, las = 1) # Residuals, Fitted, ...
##
## lm> par(opar)
##
## lm> ## Don't show:
## lm> ## model frame :
## lm> stopifnot(identical(lm(weight ~ group, method = "model.frame"),
## lm+ model.frame(lm.D9)))
##
## lm> ## End(Don't show)
## lm> ### less simple examples in "See Also" above
## lm>
## lm>
## lm>