Overview

In this session, we will introduce you to data manipulation and visualization using tidyverse, a collection of R packages for data science. We will cover the following topics:

  • retrieving data frame specifications
  • chaining sequences of operations (pipes)
  • manipulating data frames (select, add, rearrange columns/rows)
  • calculating statistics for the data
  • data visualization (scatter plot, box plot)

If you would like to see an overview of commands we are discussing in this workshop, this cheatsheet provides them in a concise view.

Data specifications

tidyverse is not loaded in R by default, so we load it first. If the package has not been installed yet, install it first with install.packages("tidyverse").
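One way to handle the not-yet-installed case is to check for the package and install it only when it is missing. This is a minimal sketch; it assumes an internet connection and a usable CRAN mirror.

```r
# Install tidyverse only if it is not already available on this machine.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
```

Installation is needed once per R installation, whereas library(tidyverse) has to be run in every new R session.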

library(tidyverse)
## Warning: package 'readr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We will use the toy heart disease dataset to demonstrate the use of tidyverse for data manipulation and visualization in this session. Here is the link to download: Heart_Disease_100sampled.csv

(Note: this spreadsheet is a sub-sampled dataset from kaggle.)

To make things a bit easier, we can set our working directory to where the dataset and the R script are. To do so, navigate to Session -> Set Working Directory -> Choose Directory, and then select its location on your computer.

data = read_csv("Heart_Disease_100sampled.csv")
## Rows: 100 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): ptID, Gender, Exercise Habits, Smoking, Family Heart Disease, Diab...
## dbl  (9): Age, Blood Pressure, Cholesterol Level, BMI, Sleep Hours, Triglyce...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
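The last two lines of the message suggest two ways to silence it. The sketch below illustrates both on a small stand-in file; the file contents and the names toy_path, toy_quiet, and toy_typed are made up for illustration and are not part of the workshop dataset.

```r
library(readr)

# Hypothetical stand-in for the workshop file, written to a temporary location.
toy_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(ptID = c("pt 1", "pt 2"), Age = c(54, 61), Smoking = c("Yes", "No")),
  toy_path
)

# Option 1: let readr guess the column types but suppress the message.
toy_quiet <- read_csv(toy_path, show_col_types = FALSE)

# Option 2: declare the column types explicitly.
toy_typed <- read_csv(
  toy_path,
  col_types = cols(
    ptID = col_character(),
    Age = col_double(),
    Smoking = col_character()
  )
)
```

Declaring the types explicitly can also catch surprises early, for example a numeric column that contains stray text.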

Let’s first check the dimension of the data.

dim(data)
## [1] 100  22

We can extract and display the full column specification for the data.

spec(data)
## cols(
##   ptID = col_character(),
##   Age = col_double(),
##   Gender = col_character(),
##   `Blood Pressure` = col_double(),
##   `Cholesterol Level` = col_double(),
##   `Exercise Habits` = col_character(),
##   Smoking = col_character(),
##   `Family Heart Disease` = col_character(),
##   Diabetes = col_character(),
##   BMI = col_double(),
##   `High Blood Pressure` = col_character(),
##   `Low HDL Cholesterol` = col_character(),
##   `High LDL Cholesterol` = col_character(),
##   `Alcohol Consumption` = col_character(),
##   `Stress Level` = col_character(),
##   `Sleep Hours` = col_double(),
##   `Sugar Consumption` = col_character(),
##   `Triglyceride Level` = col_double(),
##   `Fasting Blood Sugar` = col_double(),
##   `CRP Level` = col_double(),
##   `Homocysteine Level` = col_double(),
##   `Heart Disease Status` = col_character()
## )

This will tell you the data types (e.g., character, double, integer, logical) for each column in the data frame.
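spec() relies on information stored by read_csv(), so a data frame built some other way may need a different tool. As a hedged alternative, glimpse() from dplyr prints each column's type next to its first few values. The toy tibble below is invented for illustration.

```r
library(dplyr)
library(tibble)

# Toy data frame with made-up values, not the workshop dataset.
toy <- tibble(
  ptID = c("pt 1", "pt 2", "pt 3"),
  Age = c(54, 61, 47),
  Smoking = c("Yes", "No", "Yes")
)

# Print one line per column: name, type, and leading values.
glimpse(toy)

# The same type information as a named character vector.
col_classes <- sapply(toy, class)
```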

If you’d like to have a better idea about the structure or content of the data:

View(data)

This will open a new tab to display a preview of the data (maximum 1,000 rows and 50 columns) in a table format, where you can browse, sort, or search within the data, but you cannot edit the data (read-only) through this interface.

Columns

We will work on data frame columns first. To pick specific columns from a data frame, use the select() function in the dplyr package and refer to columns by name, position, or specific criteria.

To select a single column by name:

data_ptID = select(data, ptID)

To select multiple columns by name:

data_multipleCols = select(data, ptID, BMI, `CRP Level`)
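The examples above select by name. Selecting by position or by criteria could look like the following sketch, which uses an invented toy tibble so that it runs without the workshop file.

```r
library(dplyr)
library(tibble)

# Toy data frame with made-up values.
toy <- tibble(
  ptID = c("pt 1", "pt 2"),
  Age = c(54, 61),
  BMI = c(27.1, 31.4),
  `CRP Level` = c(2.3, 8.9)
)

# By position: the first and third columns.
by_position <- select(toy, 1, 3)

# By name pattern: columns whose names start with "CRP".
by_prefix <- select(toy, starts_with("CRP"))

# By type: every numeric column.
numeric_only <- select(toy, where(is.numeric))
```

Selecting by name tends to be more robust than selecting by position, because positions change whenever columns are added or reordered.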

Column names can be changed as well (e.g., to remove spaces or to use a more descriptive name) with the rename() function in the dplyr package, written as new_name = old_name.

data_renameAge = rename(data, `Age (Year)` = Age)

This will add the unit “Year” for the “Age” column.

You can also create new columns or modify existing columns in a data frame using the mutate() function in the dplyr package. For instance, to add a new column that converts age in years to age in months:

data_addCol = mutate(data_renameAge, `Age (Month)` = `Age (Year)` * 12)

This returns a new data frame with a column named “Age (Month)” appended as the last column; data_renameAge itself is left unchanged.

If you’d like to move this new column right after the “Age (Year)” column, you can use the relocate() function in the dplyr package.

data_addCol_rearrange = relocate(data_addCol, `Age (Month)`, .after = `Age (Year)`)
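As a possible shortcut, mutate() accepts an .after argument, so the new column can be created and positioned in one call. The sketch below uses an invented toy tibble.

```r
library(dplyr)
library(tibble)

# Toy data frame with made-up values.
toy <- tibble(ptID = c("pt 1", "pt 2"), Age = c(54, 61), BMI = c(27.1, 31.4))

# Rename first, as in the steps above.
toy_renamed <- rename(toy, `Age (Year)` = Age)

# Create the month column and place it right after the year column.
toy_months <- mutate(
  toy_renamed,
  `Age (Month)` = `Age (Year)` * 12,
  .after = `Age (Year)`
)
```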

Rows

Let’s work on subsetting data frame rows using functions from the dplyr package. Similar to column manipulations, you can select specific rows from a data frame by position, value of a variable, or specific criteria.

For example, if you want to extract all the measurements for patients who smoke, you can use the filter() function to keep the rows in the data frame that match this criterion:

data_smoke = filter(data, Smoking == "Yes")

You can also add more criteria to the selection using & (and) or | (or) operators. For instance, if you want to extract data from patients who not only smoke but also do not exercise regularly, simply add that condition into the argument like this:

data_smoke_noExercise = filter(data, Smoking == "Yes" & `Exercise Habits` == "Low")
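The | operator and the %in% operator work the same way inside filter(). The sketch below uses an invented toy tibble; the "Medium" and "High" values are assumptions for illustration and may not match the levels in the workshop dataset.

```r
library(dplyr)
library(tibble)

# Toy data frame with made-up values.
toy <- tibble(
  ptID = c("pt 1", "pt 2", "pt 3", "pt 4"),
  Smoking = c("Yes", "No", "Yes", "No"),
  `Exercise Habits` = c("Low", "High", "Medium", "Medium")
)

# Keep rows where at least one of the two conditions holds.
smoke_or_low <- filter(toy, Smoking == "Yes" | `Exercise Habits` == "Low")

# Keep rows whose value matches any entry in a set.
low_or_medium <- filter(toy, `Exercise Habits` %in% c("Low", "Medium"))
```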

You can also subset rows by specifying their row indexes in a data frame using the slice() function in the dplyr package.

data_subsetRows = slice(data, 20:50)

This will give you all the rows between and including row 20 and row 50 in the example dataset.

You can also reorder the rows of the entire data frame based on the values of one or more columns using the arrange() function in the dplyr package. The rows are sorted in ascending order by default, but you can use the desc() helper function to sort in descending order.

data_asc = arrange(data, Age) # ascending order
data_desc = arrange(data, desc(Age)) # descending order
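Sorting by more than one column could look like the sketch below: later columns break ties in earlier ones. The toy tibble is invented for illustration.

```r
library(dplyr)
library(tibble)

# Toy data frame with made-up values.
toy <- tibble(
  ptID = c("a", "b", "c", "d"),
  Gender = c("Male", "Female", "Male", "Female"),
  Age = c(60, 45, 52, 70)
)

# Sort by gender first, then from oldest to youngest within each gender.
toy_sorted <- arrange(toy, Gender, desc(Age))
```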

Statistics

To calculate summary statistics for the data, you can use the summarise() function in the dplyr package, which reduces the values in a column to one or more summary statistics.

summarise(data, mean(BMI), median(BMI)) # mean and median
## # A tibble: 1 × 2
##   `mean(BMI)` `median(BMI)`
##         <dbl>         <dbl>
## 1        29.0          29.5
summarise(data, sd(BMI), var(BMI)) # standard deviation and variance
## # A tibble: 1 × 2
##   `sd(BMI)` `var(BMI)`
##       <dbl>      <dbl>
## 1      5.97       35.7
summarise(data, min(BMI), max(BMI)) # minimum and maximum
## # A tibble: 1 × 2
##   `min(BMI)` `max(BMI)`
##        <dbl>      <dbl>
## 1       18.2       39.8
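The output columns above are named after the expressions that produced them. As a sketch, summary columns can be given explicit names, and group_by() can be combined with summarise() to get one row per group. The toy tibble and the names mean_bmi, sd_bmi, and n_patients are invented for illustration.

```r
library(dplyr)
library(tibble)

# Toy data frame with made-up values.
toy <- tibble(
  Gender = c("Female", "Male", "Female", "Male"),
  BMI = c(20, 30, 25, 35)
)

# Named summary columns for the whole data frame.
overall <- summarise(toy, mean_bmi = mean(BMI), sd_bmi = sd(BMI), n_patients = n())

# The same idea computed separately for each gender.
by_gender <- summarise(group_by(toy, Gender), mean_bmi = mean(BMI), n_patients = n())
```

If a column contains missing values, functions such as mean() return NA unless na.rm = TRUE is passed.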

Getting help

If you want to learn more about the dplyr package, please refer to this “Introduction to dplyr” guide, which provides a comprehensive and detailed description of the package (including examples).

vignette("dplyr")
## starting httpd help server ... done

Bonus

To select all columns between two specific columns in a data frame, you can specify the names of the first and the last columns that you’d like to include.

data_colsRange = select(data, ptID:BMI)

This will give you all the columns in the example dataset between (and including) columns “ptID” and “BMI”.

You can also use the select() function to exclude one particular column from a data frame. For instance, if you want to retrieve all the measurements in the example dataset except for sleep hours, prefix its column name with the ! operator:

data_excludeCol = select(data, !`Sleep Hours`)

If you want to get the first or last few rows of a data frame, use the slice_head() or slice_tail() function, respectively, and specify the number of rows you’d like to keep.

data_headRows = slice_head(data, n=5)
data_tailRows = slice_tail(data, n=5)

You can also randomly select rows from a data frame. This can be done with the slice_sample() function in the dplyr package. For instance, if you want to get data from 10 patients that are randomly sampled from the 100 patients in the example dataset:

data_randomRows = slice_sample(data, n = 10)
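Because the rows are drawn at random, each run returns a different subset. A common way to make the draw repeatable is to call set.seed() beforehand, as in this sketch; the toy tibble and the seed value 42 are arbitrary.

```r
library(dplyr)
library(tibble)

# Toy data frame with made-up values.
toy <- tibble(ptID = paste("pt", 1:20), BMI = seq(20, 39))

# Fixing the seed before each draw yields the same random rows.
set.seed(42)
first_draw <- slice_sample(toy, n = 5)

set.seed(42)
second_draw <- slice_sample(toy, n = 5)
```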

Row selections can also be done based on the values of a particular variable in a data frame. For example, you can use the slice_min() function to extract data from the five patients with the lowest BMI:

data_lowBMI = slice_min(data, order_by = BMI, n=5)

Or if you want to get data from the five oldest patients, use the slice_max() function:

data_oldPt = slice_max(data, order_by = Age, n=5)
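Note that slice_min() and slice_max() keep tied rows by default, so the result can contain more than n rows. The sketch below shows the difference with with_ties = FALSE on an invented toy tibble.

```r
library(dplyr)
library(tibble)

# Toy data frame in which three patients share the lowest BMI.
toy <- tibble(
  ptID = c("a", "b", "c", "d", "e", "f"),
  BMI = c(18, 18, 18, 22, 25, 30)
)

# Default behaviour: all rows tied for the lowest values are kept.
lowest_with_ties <- slice_min(toy, order_by = BMI, n = 2)

# Exactly n rows, with ties broken by row order.
lowest_exact <- slice_min(toy, order_by = BMI, n = 2, with_ties = FALSE)
```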

Pipes

Oftentimes your code contains multiple nested parentheses, which makes it hard to read and understand. The pipe operator %>% allows you to transform complex nested operations into a linear sequence of operations, significantly increasing the readability of the code.

Another potential advantage is that if you use a pipe to perform a series of operations, then you also won’t need to store intermediate outputs as separate variables. This might make it easier to keep track of which output variables you actually care about.

For example, to select a single column from the data frame using a pipe:

data %>%
  select(ptID)
## # A tibble: 100 × 1
##    ptID   
##    <chr>  
##  1 pt 7186
##  2 pt 7572
##  3 pt 2570
##  4 pt 8253
##  5 pt 5216
##  6 pt 3775
##  7 pt 7529
##  8 pt 8049
##  9 pt 1862
## 10 pt 8058
## # ℹ 90 more rows
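The readability gain is easier to see with several steps. The sketch below writes the same three operations once as nested calls and once as a pipe, using an invented toy tibble.

```r
library(dplyr)
library(tibble)

# Toy data frame with made-up values.
toy <- tibble(
  ptID = c("pt 1", "pt 2", "pt 3", "pt 4"),
  Age = c(54, 61, 47, 70),
  Smoking = c("Yes", "No", "Yes", "Yes")
)

# Nested form: read from the inside out.
nested <- arrange(select(filter(toy, Smoking == "Yes"), ptID, Age), desc(Age))

# Piped form: read from top to bottom, with no intermediate variables.
piped <- toy %>%
  filter(Smoking == "Yes") %>%
  select(ptID, Age) %>%
  arrange(desc(Age))
```

Both forms return the same result. R 4.1 and later also provide a built-in pipe, |>, which behaves similarly in simple cases like this one.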

We will write the code in piped format in the rest of this workshop.

Data visualization

We will make some simple plots from the toy dataset using ggplot2, a widely used R package for data visualization. ggplot2 allows you to build complex plots layer by layer through the + operator. You will first use ggplot() to define the plot and aesthetic mappings (e.g., axes, color, size). You will then add layers to define the geometry (e.g., points, lines, bars) of the plot. You can also add layers to create subplots, add title and axis labels, and save the plot.
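The layer-by-layer idea can be sketched by building a plot in separate steps and then saving it with ggsave(). The toy data, the labels, and the output location (a temporary PDF file) are placeholders chosen for illustration.

```r
library(ggplot2)

# Toy data with made-up values.
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))

# Step 1: define the data and the aesthetic mappings.
p <- ggplot(toy, aes(x = x, y = y))

# Step 2: add a geometry layer.
p <- p + geom_point()

# Step 3: add a title and axis labels.
p <- p + labs(x = "x value", y = "y value", title = "Toy scatter plot")

# Step 4: save the plot; the size is given in inches by default.
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, width = 5, height = 4)
```

In practice the steps are usually chained with + in a single expression, as in the examples that follow.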

scatter plot

We will start with scatter plots. We will make a basic scatter plot of “Fasting Blood Sugar” against “CRP Level” using the default settings.

data %>%
  ggplot(aes(x = `Fasting Blood Sugar`, y = `CRP Level`)) +
  geom_point()

You can color code the data points by gender by adding a color argument. You can also modify the appearance of the plot. For instance, we can change it to a white background with gray grid lines using the theme_bw() function. To adjust axis labels or to add a title to the plot, you can use the labs() and theme() functions.

data %>%
  ggplot(aes(x = `Fasting Blood Sugar`, y = `CRP Level`, color = Gender)) +
  geom_point() +
  theme_bw() +
  labs(x = "Fasting Blood Sugar [mg/dL]", y = "CRP Level [mg/L]", title = "Fasting blood sugar VS CRP level") +
  theme(
    axis.title.x = element_text(size = 14, color = "red"),
    axis.title.y = element_text(size = 14, color = "blue"),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 18, color = "black")
    )

To help visualize patterns and trends in the data, you can use the geom_smooth() function to fit a model to the data and overlay a trend line on the plot. It also displays a shaded 95% confidence interval by default (you can hide it with se = FALSE or change the level with the level argument). You can specify the function used for fitting the model (the smoothing function) via the method argument.

For example, if you want to add linear regression lines to the plot above, you can add a geom_smooth() layer and set method = "lm".

data %>%
  ggplot(aes(x = `Fasting Blood Sugar`, y = `CRP Level`, color = Gender)) +
  geom_point() +
  theme_bw() +
  labs(x = "Fasting Blood Sugar [mg/dL]", y = "CRP Level [mg/L]", title = "Fasting blood sugar VS CRP level") +
  theme(
    axis.title.x = element_text(size = 14, color = "red"),
    axis.title.y = element_text(size = 14, color = "blue"),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 18, color = "black")
    ) +
  geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'

Because color is mapped to Gender in aes(), ggplot2 groups the data by gender, so a separate line is fitted to the female and male patient data.
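If a single trend line across all patients is preferred, one option is to map color only inside geom_point(), so that geom_smooth() sees ungrouped data. This is a sketch on invented toy data; the column names sugar and crp are placeholders.

```r
library(ggplot2)

# Toy data with made-up values.
toy <- data.frame(
  sugar = c(80, 95, 110, 125, 140, 155),
  crp = c(1.2, 2.0, 2.4, 3.9, 4.1, 5.5),
  Gender = c("Female", "Male", "Female", "Male", "Female", "Male")
)

# Color is mapped only in the point layer, so the smoother fits one line overall.
p_single <- ggplot(toy, aes(x = sugar, y = crp)) +
  geom_point(aes(color = Gender)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```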

box-and-whisker plot

We will make some box-and-whisker plots to compare the distribution of a variable between groups. We will start with a basic box plot of triglyceride level for patients with and without heart disease, using geom_boxplot() to draw the boxes. Setting outlier.shape = NA hides the outlier points drawn by geom_boxplot(); the outliers are not removed, so the y axis still covers the full data range. Hiding them avoids drawing the same points twice once the individual data points are overlaid with geom_jitter(), which adds a small amount of random noise to the position of each point so that overlapping points are easier to see. By default the noise is applied both horizontally and vertically; set height = 0 if the measured values must be shown exactly.

data %>%
  ggplot(aes(x = `Heart Disease Status`, y = `Triglyceride Level`)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter() +
  theme_bw()
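The amount of jitter can be controlled explicitly. The sketch below restricts the noise to the horizontal direction and fixes a seed so the plot looks the same on every run; the toy data, the width of 0.2, and the seed value are arbitrary choices for illustration.

```r
library(ggplot2)

# Toy data with made-up values.
toy <- data.frame(
  status = rep(c("No", "Yes"), each = 5),
  trig = c(120, 150, 135, 160, 410, 180, 210, 195, 230, 205)
)

# width sets the horizontal spread, height = 0 leaves the y values untouched,
# and seed makes the random offsets reproducible.
p_jitter <- ggplot(toy, aes(x = status, y = trig)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(position = position_jitter(width = 0.2, height = 0, seed = 1)) +
  theme_bw()
```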

By default, geom_boxplot() does not include error bars (or whisker end-bars) in the plot. You can call the stat_boxplot() function to add horizontal lines at the end of whiskers.

data %>%
  ggplot(aes(x = `Heart Disease Status`, y = `Triglyceride Level`)) +
  geom_boxplot(outlier.shape = NA) +
  stat_boxplot(geom = "errorbar", width = 0.2) +
  geom_jitter() +
  theme_bw()

You can also color code the boxes, bars, and individual data points by groups. To do so, simply add an argument color in the aesthetics function aes().

data %>%
  ggplot(aes(x = `Heart Disease Status`, y = `Triglyceride Level`, color = `Heart Disease Status`)) +
  geom_boxplot(outlier.shape = NA) +
  stat_boxplot(geom = "errorbar", width = 0.2) +
  geom_jitter() +
  theme_bw()