In this session, we will introduce you to data manipulation and
visualization using tidyverse, a collection of R packages
for data science. We will cover the following topics:
If you would like to see an overview of commands we are discussing in this workshop, this cheatsheet provides them in a concise view.
tidyverse is not loaded in R by default, so we will load it
first. If the package has not been installed yet, we will
need to install it first with the install.packages()
function.
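A minimal sketch of the install-if-missing step described above, assuming an internet connection and a configured CRAN mirror. requireNamespace() only checks whether the package is available; it does not attach it.

```r
# Install tidyverse only when it is not already available on this machine.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
```

Installation is needed once per machine, whereas the package has to be loaded in every new R session.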
library(tidyverse)
## Warning: package 'readr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We will use the toy heart disease dataset to demonstrate the use of tidyverse for data manipulation and visualization in this session. Here is the link to download: Heart_Disease_100sampled.csv
(Note: this spreadsheet is a sub-sampled dataset from Kaggle.)
To make things a bit easier, we can set our working directory to the folder that contains the dataset and the R script. To do so in RStudio, navigate to Session -> Set Working Directory -> Choose Directory, and then select that folder on your computer.
data = read_csv("Heart_Disease_100sampled.csv")
## Rows: 100 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): ptID, Gender, Exercise Habits, Smoking, Family Heart Disease, Diab...
## dbl (9): Age, Blood Pressure, Cholesterol Level, BMI, Sleep Hours, Triglyce...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
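The message above suggests either specifying the column types or setting show_col_types = FALSE. Here is a small sketch of both options. It writes a hypothetical three-row file to a temporary location so that it runs without the workshop CSV; the file contents and the names toy_file, toy, and toy_quiet are made up for illustration.

```r
library(readr)

# Hypothetical three-row file standing in for Heart_Disease_100sampled.csv
toy_file = tempfile(fileext = ".csv")
writeLines(
  c("ptID,Age,Smoking", "pt 1,54,Yes", "pt 2,61,No", "pt 3,47,Yes"),
  toy_file
)

# Option 1: declare the column types up front. This silences the message
# and guards against a column being guessed as the wrong type.
toy = read_csv(
  toy_file,
  col_types = cols(
    ptID = col_character(),
    Age = col_double(),
    Smoking = col_character()
  )
)

# Option 2: keep the guessed types and only silence the message.
toy_quiet = read_csv(toy_file, show_col_types = FALSE)
```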
Let’s first check the dimensions of the data (number of rows and columns).
dim(data)
## [1] 100 22
We can extract and display the full column specification for the data.
spec(data)
## cols(
## ptID = col_character(),
## Age = col_double(),
## Gender = col_character(),
## `Blood Pressure` = col_double(),
## `Cholesterol Level` = col_double(),
## `Exercise Habits` = col_character(),
## Smoking = col_character(),
## `Family Heart Disease` = col_character(),
## Diabetes = col_character(),
## BMI = col_double(),
## `High Blood Pressure` = col_character(),
## `Low HDL Cholesterol` = col_character(),
## `High LDL Cholesterol` = col_character(),
## `Alcohol Consumption` = col_character(),
## `Stress Level` = col_character(),
## `Sleep Hours` = col_double(),
## `Sugar Consumption` = col_character(),
## `Triglyceride Level` = col_double(),
## `Fasting Blood Sugar` = col_double(),
## `CRP Level` = col_double(),
## `Homocysteine Level` = col_double(),
## `Heart Disease Status` = col_character()
## )
This tells you the data type (e.g., character, double, integer, logical) of each column in the data frame.
If you’d like to have a better idea about the structure or content of the data:
View(data)
This will open a new tab to display a preview of the data (maximum 1,000 rows and 50 columns) in a table format, where you can browse, sort, or search within the data, but you cannot edit the data (read-only) through this interface.
We will work on data frame columns first. To pick specific columns
from a data frame, you can use the select() function from the
dplyr package and refer to columns by name, position, or
specific criteria.
To select a single column by name:
data_ptID = select(data, ptID)
To select multiple columns by name:
data_multipleCols = select(data, ptID, BMI, `CRP Level`)
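The examples above select by name. As a sketch of the other two options (position and criteria), the code below uses a small hypothetical tibble named toy rather than the workshop dataset; its values are made up.

```r
library(dplyr)

# Hypothetical toy tibble standing in for the workshop dataset
toy = tibble(
  ptID = c("pt 1", "pt 2", "pt 3"),
  Age = c(54, 61, 47),
  BMI = c(24.1, 31.5, 28.0),
  `CRP Level` = c(2.3, 9.8, 5.1)
)

toy_byPosition = select(toy, 1, 3)               # first and third columns
toy_numeric = select(toy, where(is.numeric))     # columns that hold numbers
toy_byPattern = select(toy, ends_with("Level"))  # names ending in "Level"
```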
Column names can be changed as well (e.g., to remove spaces or
to use a more descriptive name). You can use the rename()
function from the dplyr package to do so, with the new name on the
left of the = sign and the old name on the right.
data_renameAge = rename(data, `Age (Year)` = Age)
This will add the unit “Year” for the “Age” column.
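For the other case mentioned above, removing spaces, a possible sketch is shown here on a hypothetical tibble named toy. rename() handles one column at a time, while rename_with() applies a function to every column name.

```r
library(dplyr)

# Hypothetical toy tibble with spaces in two column names
toy = tibble(
  ptID = c("pt 1", "pt 2"),
  `CRP Level` = c(2.3, 9.8),
  `Sleep Hours` = c(7, 6)
)

# Rename a single column: new name = old name
toy_oneCol = rename(toy, CRP_Level = `CRP Level`)

# Replace spaces with underscores in all column names at once
toy_allCols = rename_with(toy, ~ gsub(" ", "_", .x))
```

Names without spaces no longer need backticks when you refer to them in later code.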
You can also create new columns or modify existing columns in a data
frame using the mutate() function from the dplyr
package. For instance, to add a new column that converts age in
years to age in months:
data_addCol = mutate(data_renameAge, `Age (Month)` = `Age (Year)` * 12)
This will add a new column named “Age (Month)” to the end of the original data frame.
If you’d like to move this new column right after the “Age (Year)”
column, you can use the relocate() function from the
dplyr package.
data_addCol_rearrange = relocate(data_addCol, `Age (Month)`, .after = `Age (Year)`)
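The two steps above can also be combined, since mutate() accepts its own .after argument. A brief sketch on a hypothetical tibble named toy:

```r
library(dplyr)

# Hypothetical toy tibble; the values are made up
toy = tibble(
  ptID = c("pt 1", "pt 2"),
  `Age (Year)` = c(54, 61),
  BMI = c(24.1, 31.5)
)

# Create the new column and place it right after "Age (Year)" in one call
toy_addCol = mutate(toy, `Age (Month)` = `Age (Year)` * 12, .after = `Age (Year)`)
```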
Let’s work on subsetting data frame rows using functions from the dplyr package. Similar to column manipulations, you can select specific rows from a data frame by position, value of a variable, or specific criteria.
For example, if you want to extract all the measurements for patients
who smoke, you can use the filter() function to keep the rows
in the data frame that match this criterion:
data_smoke = filter(data, Smoking == "Yes")
You can also add more criteria to the selection using
& (and) or | (or) operators. For instance,
if you want to extract data from patients who not only smoke but also do
not exercise regularly, simply add that condition into the argument like
this:
data_smoke_noExercise = filter(data, Smoking == "Yes" & `Exercise Habits` == "Low")
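As a sketch of the other ways to combine conditions, the code below uses a hypothetical tibble named toy. The "Medium" and "High" values for `Exercise Habits` are assumptions made for illustration; check the actual levels in the dataset before relying on them.

```r
library(dplyr)

# Hypothetical toy tibble; "Medium" and "High" are assumed category labels
toy = tibble(
  ptID = c("pt 1", "pt 2", "pt 3", "pt 4"),
  Smoking = c("Yes", "No", "Yes", "No"),
  `Exercise Habits` = c("Low", "High", "Medium", "Low")
)

# OR: keep patients who smoke or who exercise little (or both)
toy_either = filter(toy, Smoking == "Yes" | `Exercise Habits` == "Low")

# Conditions separated by commas are combined with AND, just like &
toy_both = filter(toy, Smoking == "Yes", `Exercise Habits` == "Low")

# %in% matches any value in a set
toy_active = filter(toy, `Exercise Habits` %in% c("Medium", "High"))
```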
You can also subset rows by specifying their row indexes in a data
frame using the slice() function from the dplyr
package.
data_subsetRows = slice(data, 20:50)
This will give you all the rows between and including row 20 and row 50 in the example dataset.
You can also reorder the rows of the entire data frame based on
the values of one or more columns using the arrange() function
from the dplyr package. The rows are sorted in ascending order by
default, but you can use the desc() helper function to sort
in descending order.
data_asc = arrange(data, Age) # ascending order
data_desc = arrange(data, desc(Age)) # descending order
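To illustrate sorting by more than one column, here is a short sketch on a hypothetical tibble named toy. Later columns break ties in the earlier ones.

```r
library(dplyr)

# Hypothetical toy tibble; the values are made up
toy = tibble(
  ptID = c("pt 1", "pt 2", "pt 3", "pt 4"),
  Gender = c("Male", "Female", "Male", "Female"),
  Age = c(54, 61, 47, 39)
)

# Sort by Gender (ascending), then by Age from oldest to youngest
# within each gender
toy_sorted = arrange(toy, Gender, desc(Age))
```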
To calculate summary statistics for the data, you can use the
summarise() function from the dplyr package, which
reduces the values in a data frame to one or more
summary statistics.
summarise(data, mean(BMI), median(BMI)) # mean and median
## # A tibble: 1 × 2
## `mean(BMI)` `median(BMI)`
## <dbl> <dbl>
## 1 29.0 29.5
summarise(data, sd(BMI), var(BMI)) # standard deviation and variance
## # A tibble: 1 × 2
## `sd(BMI)` `var(BMI)`
## <dbl> <dbl>
## 1 5.97 35.7
summarise(data, min(BMI), max(BMI)) # minimum and maximum
## # A tibble: 1 × 2
## `min(BMI)` `max(BMI)`
## <dbl> <dbl>
## 1 18.2 39.8
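The output columns above are named after the expressions themselves, such as `mean(BMI)`. A possible refinement, sketched on a hypothetical tibble named toy, is to name each summary explicitly. The same call can also be applied per group with group_by(), which is not otherwise covered in this session.

```r
library(dplyr)

# Hypothetical toy tibble; the values are made up
toy = tibble(
  Smoking = c("Yes", "No", "Yes", "No"),
  BMI = c(24, 32, 28, 30)
)

# Named summaries give cleaner column names; n() counts the rows
toy_overall = summarise(toy, mean_BMI = mean(BMI), n_patients = n())

# The same summary computed separately for smokers and non-smokers
toy_bySmoking = summarise(group_by(toy, Smoking), mean_BMI = mean(BMI))
```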
If you want to learn more about the dplyr package,
please refer to this “Introduction to dplyr” guide, which provides a
comprehensive and detailed description of the package (including
examples).
vignette("dplyr")
## starting httpd help server ... done
To select all columns between two specific columns in a data frame, you can specify the names of the first and the last columns that you’d like to include.
data_colsRange = select(data, ptID:BMI)
This will give you all the columns in the example dataset between (and including) columns “ptID” and “BMI”.
You can also use the select() function to exclude a
particular column from a data frame. For instance, to retrieve
all the measurements in the example dataset except sleep hours,
put the ! operator in front of its column name:
data_excludeCol = select(data, !`Sleep Hours`)
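To exclude several columns at once, one option is to wrap them in c() and negate the whole set. A minimal sketch on a hypothetical tibble named toy:

```r
library(dplyr)

# Hypothetical toy tibble; the values are made up
toy = tibble(
  ptID = c("pt 1", "pt 2"),
  BMI = c(24.1, 31.5),
  `Sleep Hours` = c(7, 6),
  `Stress Level` = c("Low", "High")
)

# Drop both columns in one call
toy_excludeCols = select(toy, !c(`Sleep Hours`, `Stress Level`))
```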
If you want to get the first or the last few rows of a data
frame, use the slice_head() or slice_tail()
function, respectively, and specify the number of rows you’d like to
keep with the n argument.
data_headRows = slice_head(data, n=5)
data_tailRows = slice_tail(data, n=5)
You can also randomly select rows from a data frame. This can be done
with the slice_sample() function from the dplyr
package. For instance, to get data from 10 patients randomly
sampled from the 100 patients in the example dataset:
data_randomRows = slice_sample(data, n = 10)
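Because the sample is random, each run returns different rows. If you need the same rows every time, one common approach is to call set.seed() right before sampling. The sketch below uses a hypothetical tibble named toy, and the seed value 42 is arbitrary.

```r
library(dplyr)

# Hypothetical toy tibble with 20 made-up patient IDs
toy = tibble(ptID = paste("pt", 1:20))

set.seed(42)                           # fix the random number generator state
toy_sampleA = slice_sample(toy, n = 5)

set.seed(42)                           # same seed, so the same rows are drawn
toy_sampleB = slice_sample(toy, n = 5)
```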
Row selections can also be done based on the values of a particular
variable in a data frame. For example, you can use the
slice_min() function to extract data from the five patients
with the lowest BMI:
data_lowBMI = slice_min(data, order_by = BMI, n=5)
Or if you want to get data from the five oldest patients, use the
slice_max() function:
data_oldPt = slice_max(data, order_by = Age, n=5)
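One detail worth knowing: slice_min() and slice_max() keep tied rows by default, so they can return more than n rows. A short sketch on a hypothetical tibble named toy shows the difference when with_ties = FALSE is set.

```r
library(dplyr)

# Hypothetical toy tibble in which two patients share the lowest BMI
toy = tibble(
  ptID = c("pt 1", "pt 2", "pt 3", "pt 4"),
  BMI = c(22, 19, 19, 30)
)

toy_withTies = slice_min(toy, order_by = BMI, n = 1)                   # both tied rows
toy_noTies = slice_min(toy, order_by = BMI, n = 1, with_ties = FALSE)  # exactly one row
```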
Oftentimes your code contains multiple nested parentheses, which
make it hard to read and understand. The pipe operator
%>% passes the result on its left as the first argument of the
function on its right. This lets you turn complex nested operations
into a linear sequence of operations, significantly increasing the
readability of the code.
Another potential advantage is that if you use a pipe to perform a series of operations, then you also won’t need to store intermediate outputs as separate variables. This might make it easier to keep track of which output variables you actually care about.
For example, to select a single column from the data frame using a pipe:
data %>%
  select(ptID)
## # A tibble: 100 × 1
## ptID
## <chr>
## 1 pt 7186
## 2 pt 7572
## 3 pt 2570
## 4 pt 8253
## 5 pt 5216
## 6 pt 3775
## 7 pt 7529
## 8 pt 8049
## 9 pt 1862
## 10 pt 8058
## # ℹ 90 more rows
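The readability gain is clearer with several steps. The sketch below, on a hypothetical tibble named toy, writes the same three operations once as nested calls and once as a pipeline, and both give the same result.

```r
library(dplyr)

# Hypothetical toy tibble; the values are made up
toy = tibble(
  ptID = c("pt 1", "pt 2", "pt 3", "pt 4"),
  Smoking = c("Yes", "No", "Yes", "Yes"),
  BMI = c(24, 32, 28, 35)
)

# Nested form: has to be read from the inside out
toy_nested = arrange(select(filter(toy, Smoking == "Yes"), ptID, BMI), desc(BMI))

# Piped form: reads top to bottom, with no intermediate variables
toy_piped = toy %>%
  filter(Smoking == "Yes") %>%
  select(ptID, BMI) %>%
  arrange(desc(BMI))
```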
We will write the code in piped format in the rest of this workshop.
We will make some simple plots from the toy dataset using
ggplot2, a widely used R package for data visualization.
ggplot2 allows you to build complex plots layer by layer
through the + operator. You will first use
ggplot() to define the plot and aesthetic mappings (e.g.,
axes, color, size). You will then add layers to define the geometry
(e.g., points, lines, bars) of the plot. You can also add layers to
create subplots, add title and axis labels, and save the plot.
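The layering described above can be sketched as follows, using a hypothetical tibble named toy. The column names, the labels, and the output file name are made up for illustration, and the plot is saved as a PDF in a temporary folder so that the example does not write into your working directory.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy tibble; the values are made up
toy = tibble(
  dose = c(1, 2, 3, 4, 1, 2, 3, 4),
  response = c(2, 4, 5, 8, 1, 3, 6, 7),
  group = rep(c("A", "B"), each = 4)
)

toy_plot = toy %>%
  ggplot(aes(x = dose, y = response)) +  # define the plot and aesthetic mappings
  geom_point() +                         # geometry layer: points
  facet_wrap(~ group) +                  # one subplot per group
  labs(x = "Dose", y = "Response", title = "Toy example")

# Save the plot; ggsave() picks the file format from the extension
out_file = file.path(tempdir(), "toy_plot.pdf")
ggsave(out_file, plot = toy_plot, width = 6, height = 4)
```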
We will start with scatter plots. We will make a basic scatter plot of “CRP Level” (y axis) against “Fasting Blood Sugar” (x axis) using the default settings.
data %>%
  ggplot(aes(x = `Fasting Blood Sugar`, y = `CRP Level`)) +
  geom_point()
You can color code the data points by gender by adding a
color argument to aes(). You can also modify the appearance of the
plot. For instance, we can change it to a white background with gray
grid lines using the theme_bw() function. To set the axis
labels and the title, use labs(); to style them (size, color,
alignment), use theme().
data %>%
  ggplot(aes(x = `Fasting Blood Sugar`, y = `CRP Level`, color = Gender)) +
  geom_point() +
  theme_bw() +
  labs(x = "Fasting Blood Sugar [mg/dL]", y = "CRP Level [mg/L]", title = "Fasting blood sugar VS CRP level") +
  theme(
    axis.title.x = element_text(size = 14, color = "red"),
    axis.title.y = element_text(size = 14, color = "blue"),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 18, color = "black")
  )
To help visualize patterns and trends in the data, you can use the
geom_smooth() function to fit a model to the data and
overlay a trend line on the plot. By default it also displays a shaded
95% confidence interval, which you can hide with se = FALSE or resize
with the level argument. You can specify the fitting (smoothing)
method via the method argument.
For example, if you want to add linear regression lines to the plot
above, you can add a geom_smooth() layer and set
method = "lm".
data %>%
  ggplot(aes(x = `Fasting Blood Sugar`, y = `CRP Level`, color = Gender)) +
  geom_point() +
  theme_bw() +
  labs(x = "Fasting Blood Sugar [mg/dL]", y = "CRP Level [mg/L]", title = "Fasting blood sugar VS CRP level") +
  theme(
    axis.title.x = element_text(size = 14, color = "red"),
    axis.title.y = element_text(size = 14, color = "blue"),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 18, color = "black")
  ) +
  geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'
Because color is mapped to Gender, the data are grouped by gender, so a separate line is fitted for female and male patients.
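As a sketch of how the confidence band can be hidden or adjusted, the code below uses a hypothetical tibble named toy with made-up values; se and level are arguments of geom_smooth().

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy tibble; the values are made up
toy = tibble(
  `Fasting Blood Sugar` = c(80, 95, 110, 130, 85, 100, 120, 140),
  `CRP Level` = c(2.1, 3.4, 5.0, 7.2, 1.5, 4.1, 4.8, 8.0),
  Gender = rep(c("Female", "Male"), each = 4)
)

toy_base = toy %>%
  ggplot(aes(x = `Fasting Blood Sugar`, y = `CRP Level`, color = Gender)) +
  geom_point() +
  theme_bw()

# Regression lines only, without the shaded band
toy_noBand = toy_base + geom_smooth(method = "lm", se = FALSE)

# Regression lines with a 90% (instead of 95%) confidence band
toy_band90 = toy_base + geom_smooth(method = "lm", level = 0.90)
```

Typing toy_noBand or toy_band90 in the console displays the corresponding plot.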
We will make some box-and-whisker plots to visualize the distribution
of data between groups. We will start with a basic box plot of
triglyceride level for patients with or without heart disease. We will
use geom_boxplot() to define the box plot. To hide the outlier points
drawn by geom_boxplot() (the y-axis range stays the same), set
outlier.shape = NA. This avoids drawing the same points twice when you
use the geom_jitter() function to display individual data points.
geom_jitter() adds a small amount of random noise to the position of
each point for easier visualization, by default in both directions;
set height = 0 to keep the y values exact.
data %>%
  ggplot(aes(x = `Heart Disease Status`, y = `Triglyceride Level`)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter() +
  theme_bw()
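A possible variation, sketched on a hypothetical tibble named toy, restricts the jitter to the horizontal direction so that each point stays at its measured y value. The width of 0.2 is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy tibble; the values are made up
toy = tibble(
  `Heart Disease Status` = rep(c("No", "Yes"), each = 4),
  `Triglyceride Level` = c(120, 150, 180, 210, 160, 200, 240, 300)
)

toy_boxplot = toy %>%
  ggplot(aes(x = `Heart Disease Status`, y = `Triglyceride Level`)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, height = 0) +  # spread points sideways only
  theme_bw()
```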
By default, geom_boxplot() does not include error bars
(or whisker end-bars) in the plot. You can call the
stat_boxplot() function to add horizontal lines at the end
of whiskers.
data %>%
  ggplot(aes(x = `Heart Disease Status`, y = `Triglyceride Level`)) +
  geom_boxplot(outlier.shape = NA) +
  stat_boxplot(geom = "errorbar", width = 0.2) +
  geom_jitter() +
  theme_bw()
You can also color code the boxes, bars, and individual data points by
group. To do so, simply add a color argument in the aesthetics
function aes().
data %>%
  ggplot(aes(x = `Heart Disease Status`, y = `Triglyceride Level`, color = `Heart Disease Status`)) +
  geom_boxplot(outlier.shape = NA) +
  stat_boxplot(geom = "errorbar", width = 0.2) +
  geom_jitter() +
  theme_bw()
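Note that color controls outlines and points, while fill controls the inside of the boxes. The sketch below, on a hypothetical tibble named toy, maps fill for the boxes only and hides the legend, which is redundant here because the groups are already labeled on the x axis.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy tibble; the values are made up
toy = tibble(
  `Heart Disease Status` = rep(c("No", "Yes"), each = 4),
  `Triglyceride Level` = c(120, 150, 180, 210, 160, 200, 240, 300)
)

toy_filled = toy %>%
  ggplot(aes(x = `Heart Disease Status`, y = `Triglyceride Level`)) +
  geom_boxplot(aes(fill = `Heart Disease Status`), outlier.shape = NA) +  # filled boxes
  geom_jitter(width = 0.2, height = 0) +                                   # points stay black
  theme_bw() +
  theme(legend.position = "none")                                          # hide the legend
```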