ggplot
hillali@ohsu.edu
@apreshill
@apreshill
OHSU Center for Spoken Language Understanding
This workshop was presented at OHSU on June 24, 2016. The same dataset and plots were made for a parallel breakout session using Python.
For this workshop, we’ll recreate the graphics presented in this article:
Graphics and statistics for cardiology: comparing categorical and continuous variables
Sometimes we’ll offer a challenge for you to try on your own (either during or after this workshop), which will look like this:
Other times, we’ll encourage you to pair up with a partner to solve a problem, which will look like this:
Your pre-work included:
Do this once per machine.
Do this once per R session.
Use the readr
package to import the nhanes_large.csv with the read_csv
function.
read_csv()
parentheses is the url to the datasetna = "."
, specifies that missing data in this dataset is denoted by a period<-
and call that object something simple, like heart
.It is good practice to say HLO to your new dataset by doing a High-Level Overview using some built-in R functions.
Use the head()
function in your console and check with your partner to see if you got the same output. How many rows of data do you see?
?head
in your console and figure out how to reveal the first 10 rows. Search online to find out the R function that returns the last n rows of a dataframe.
Let’s look at the variables in our heart
dataframe:
[1] "BPXSAR" "BPXDAR" "BPXDI1" "BPXDI2" "race_ethc"
[6] "gender" "DR1TFOLA" "RIAGENDR" "BMXBMI" "RIDAGEYR"
You may be wondering what DR1TFOLA
and BPXSAR
are. The data dictionary online defines the variable names as follows:
BPXSAR
: systolic blood pressure (mmHg)BPXDAR
: diastolic blood pressure (mmHg)BPXDI1
, BPXDI2
: two diastolic blood pressure readingsrace_ethc
: race/ethnicity, coded as Hispanic, White non-Hispanic, Black non-Hispanic and Othergender
: sex, coded as Male/FemaleDR1TFOLA
: folate intake (μg/day)RIAGENDR
: sex, coded as 1/2BMXBMI
: body mass index (kg/m2)RIDAGEY
: age (years)We just used 2 different functions in R: head()
and names()
. All functions take arguments, which typically go inside parentheses. Order matters here, unless you explicitly label the arguments; for example:
head(6, heart) # this will not work
head(n = 6, x = heart) # this works, but you type more
head(heart, 6) # this works and is succinct
For both head()
and names()
, the first argument was our dataframe, which we named heart
. ggplot2
is a package that essentially contains a bunch of functions, each taking different arguments, which we’ll explore now…
Let’s focus first on univariate plots by examining folate intake (a continuous variable). The first function we need is ggplot()
: this function initializes a new ggplot object.
We’ll start with 1 argument: your dataframe- we called ours heart
.
The code above should produce an empty plot (a gray rectangle).
ggplot
makes a plot by layering. So let’s take this empty plot and add a layer to it. This layer is a geometric object, or geom
for short. Let’s start with the geom_histogram()
.
Each geom
also takes arguments, and you can always find out what arguments go along with a specific geom using the help function in the console:
We’ll take that empty plot and add a geom_histogram()
layer using the +
sign. This geom requires at least one argument, and it is special: it is an aesthestic mapping (aes()
), meaning that what goes in the parentheses here must be a variable that is in our dataframe. Here, we map folate intake onto the x-axis.
Small tweak #1: change colour outline of bars
One of the arguments that geom_histogram()
takes is colour
- we’ll change it to white. Note that we are outside of the aes()
- this is because the color “white” is not a variable.
Small tweak #2: change colour fill of bars
You might be surprised to see that colour
does not in fact relate to the colour inside the bars (see above). To change that, we use another argument for geom_histogram()
- fill
. How colour
versus fill
works depends on which geom
you are using. For our histogram, we’ll change the the colour inside to orchid.
royalblue
bars and a hotpink
outline.
Small tweak #3: change number of bins
You may have noticed that all of our histograms have printed this warning in your console:
#stat_bin() using bins = 30. Pick better value with binwidth.
This warning tells us that bins = 30
is the default that ggplot
chose for us, based on our data. Let’s play around with this!
Make a histogram with 10 bins, and one with 50 bins. Color/fill any way you like!
Small tweak #3: update x-axis title
This is done by adding a labs
layer, short for labels. We want to relabel the x-axis. Let’s also save our ggplot
object now, and call it myhist
.
myhist <- ggplot(heart, aes(x = DR1TFOLA)) +
geom_histogram(colour = "white", fill = "orchid") +
labs(x = "folate intake")
I’m pretty happy with this one, so I’m going to save this plot so I can stop typing the same thing over and over again, and I can add layers using this one as my base plot.
OK, let’s add another variable to the mix. We’ll keep looking at folate intake as the continuous variable, with gender
as the categorical variable. There are a few ways to do this in ggplot2
.
About facets (from facet_wrap
documentation):
“Most displays are roughly rectangular, so if you have a categorical variable with many levels, it doesn’t make sense to try and display them all in one row (or one column). To solve this dilemma, facet_wrap wraps a 1d sequence of panels into 2d, making best use of screen real estate.”
Let’s add + facet_wrap(~variable)
to split the histograms based on gender
Just as a reminder, the below code would create the exact same plot- just with more repetitive typing.
ggplot(heart, aes(x = DR1TFOLA)) +
geom_histogram(colour = "white", fill = "orchid", bins = 50) +
labs(x = "folate intake (μg/day)") +
facet_wrap(~gender)
geom_density
) instead of geom_histogram
.
So far, when we played with colours, we set fill/colour without mapping them onto another variable (i.e., (colour = "white", fill = "orchid")
). Now, we want fill/colour to vary based on the value of a specific variable. Here we want colour to depend on gender
, so we map the colour aesthetic (aes(colour = variable)
) onto the variable gender
.
Notice moving the colour into aesthetics parentheses.
ggplot(heart, aes(x = DR1TFOLA)) +
geom_density(aes(colour = gender)) +
labs(x = "folate intake (μg/day)")
The below code will produce the exact same plot. Try it! I promise it is identical. Play around with changing global aesthetics (like below) versus geom-specific aesthetics (like above). It won’t matter when you only have one layer, but once you start adding geoms
and stats
, this can be a powerful way to change your visualization.
Now let’s make some side-by-side univariate plots, specifically dotcharts or stripcharts, to visualize systolic blood pressure by gender. We’ll start with a new base plot, and save it to an object called side
that we can then add geoms to.
If you type side
into your console, what will the plot look like? Try adding a geom_point()
- is this plot useful?
Let’s take this plot and tweak it a bit.
Small tweak #2: make points more transparent
Alpha works on a scale from 0 (transparent) to 1 (opaque).
Small tweak #3: try jittering the points
To do this, instead of geom_point()
, we will switch to using geom_jitter()
, which automatically adds both vertical and horizontal space (noise) to your datapoints.
Small tweak #4: control the jitter
Sometimes you may only want to change the width of jitter but not the height.
Now let’s try a different geom in our side-by-side plot, which is available through the beeswarm
package you already loaded.
Small tweak #1: add alpha again
Small tweak #2: add statistics
We’ll include the mean plus 95% CI
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_beeswarm(alpha = .2) +
stat_summary(fun.y = "mean", geom = "point", colour = "orange") +
stat_summary(fun.data = mean_cl_boot, geom = "linerange", colour = "orange") +
labs(x = "", y = "Systolic BP (mmHg)")
Small tweak #3: add another geom layer
Try adding geom_boxplot
on top of your side-by-side beeswarm plot. Also try re-ordering the geoms to see what changes.
New geom time- this time use geom_violin
.
Small tweak #1: add statistics
Let’s include the sample mean and median
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_violin(alpha = .2) +
stat_summary(fun.y = "mean", geom = "point", colour = "orange") +
stat_summary(fun.y = "median", geom = "point", colour = "blue") +
labs(x = "", y = "Systolic BP (mmHg)")
Small tweak #2: add another geom layer
Add another layer to your violin plot: try geom_boxplot
Now we’ll create some bivariate plots to look at the association between age and systolic blood pressure, both of which are continuous variables.
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_point() +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
If you have big n, try hexbin plot
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_hex() +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
Make colors make more sense
library(viridis)
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_hex() +
labs(x = "Age (years)", y = "Systolic BP (mmHg)") +
scale_fill_gradientn(colours = viridis(256, begin = 1, end = 0))
library(ggalt)
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
stat_bkde2d(aes(fill = ..level.., alpha = ..level..), geom = "polygon", bandwidth = c(2,2)) +
labs(x = "Age (years)", y = "Systolic BP (mmHg)") +
scale_fill_gradientn(colours = viridis(256, begin = 1, end = 0))
Add linear regression line with SE
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
Default is loess line
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_point() +
geom_smooth() +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
Add splines
We’ll finish up by creating some multivariable plots that help us visualize how the association between body mass index & systolic BP varies by gender and age (so 4 variables!).
Just copy this code to create a categorical age variable:
Recreate theirs first
ggplot(heart2, aes(x = BMXBMI, y = BPXSAR)) +
geom_point() +
stat_smooth(aes(colour = gender), method = "lm") +
facet_wrap(~age_cat) +
labs(x = "Body Mass Index"~(kg/m^2), y = "Systolic BP (mmHg)")
Try with facet grid, update labels
grid <- ggplot(heart2, aes(x = BMXBMI, y = BPXSAR)) +
stat_smooth(aes(colour = gender), method = "lm") +
facet_grid(gender~age_cat) +
labs(x = "Body Mass Index"~(kg/m^2), y = "Systolic BP (mmHg)")
grid + geom_point()
Play with colors!
grid +
geom_point(aes(colour = gender), alpha = .5) +
theme_minimal() +
scale_color_manual(values = c("#B47CC7", "#D65F5F"), guide = FALSE)
my_colors <- c("#C4AD66", "#77BEDB")
grid +
theme_light() +
scale_color_manual(values = my_colors, guide = FALSE)
Since R Markdown use the bootstrap framework under the hood. It is possible to benefit its powerful grid system. Basically, you can consider that your row is divided in 12 subunits of same width. You can then choose to use only a few of this subunits.
Here, I use 3 subunits of size 4 (4x3=12). The last column is used for a plot. You can read more about the grid system here. I got this result showing the following code in my R Markdown document.