Introduction

This vignette is Part 2 of 3 for an R workshop created for BIOL 548L, a graduate-level course on data visualization taught at the University of British Columbia.

When the workshop runs, we split students into three groups with successively increasing levels of difficulty. To make sense of what follows, we recommend working through (or at least looking at) Part 1’s vignette.

The goal of Part 2’s vignette is to learn how to structure data to produce single plots ready for presentations and publications.

Part 3’s vignette (which we recommend going through only after completing this page) can be found here.

All code and contents of this vignette were written together by Vikram B. Baliga, Andrea Gaede and Shreeram Senthivasan.

Learning Objectives:

  1. Articulate the advantages of literate programming
  2. Employ piping to restructure data for tidyverse
  3. Describe the grammar of graphics that underlies plotting in ggplot
  4. Manipulate plots to improve clarity and maximize the data:ink ratio

Load or install & load packages

We’ll first load the packages we need. The package.check() function below first checks whether each package is already installed; any that aren’t are installed from CRAN, and then all of the packages are loaded.

The code block below is modified from this blog post.
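The idea is roughly the following (a sketch of the approach, not the vignette’s exact code; for illustration we use the always-available base packages `stats` and `utils` — in the vignette the vector would list `tidyverse` and friends):

```r
# Packages to check; in the vignette this would be c("tidyverse", ...).
# Base packages are used here so the example runs without installing anything.
packages <- c("stats", "utils")

package.check <- lapply(packages, FUN = function(x) {
  if (!require(x, character.only = TRUE)) {
    # not installed yet: grab it from CRAN, then load it
    install.packages(x)
    library(x, character.only = TRUE)
  }
})
```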

Data sets

You can get all data used in this vignette (and the other two!) by downloading this zip file.

What is literate programming?

Literate programming is a coding paradigm that rethinks the “grammar” of traditional code to more closely resemble writing in a human language, like English.

The primary goal of literate programming is to move away from coding “for computers” (i.e., code that simplifies compilation) and instead to write in a way that better resembles how we think.

In addition to making for much more readable code even with fewer comments, a good literate coding language should make it easier for you to translate ideas into code.

In the tidyverse, we will largely use the analogy of variables as nouns and functions as verbs that operate on the nouns.

The main way we will make our code literate, however, is with the pipe: %>%

What is a pipe?

A pipe, or %>%, is an operator that takes everything on the left and plugs it into the function on the right.

In short: x %>% f(y) is equivalent to f(x, y)
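For example (using the %>% from magrittr, which the tidyverse attaches for you):

```r
library(magrittr)  # provides %>%; loading the tidyverse attaches it too

# x %>% f(y) is just f(x, y):
c(1, 4, 9) %>% sum(10)           # same as sum(c(1, 4, 9), 10), i.e. 24

# pipes chain naturally, reading left to right:
c(1, 4, 9) %>% sqrt() %>% sum()  # same as sum(sqrt(c(1, 4, 9))), i.e. 6
```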

In RStudio, you can use the shortcut CTRL/CMD + SHIFT + M to insert a pipe.

Why bother with pipes?

To see how pipes can make code more readable, let’s translate this simple cake recipe into pseudo R code:

  1. Cream together sugar and butter
  2. Beat in eggs and vanilla
  3. Mix in flour and baking powder
  4. Stir in milk
  5. Bake
  6. Frost

Saving Intermediate Steps:
One way you might approach coding this is by defining a lot of variables:
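Something like this (the cooking “verbs” here are hypothetical functions, not real R):

```r
batter_1   <- cream(sugar, butter)
batter_2   <- beat(batter_1, eggs, vanilla)
batter_3   <- mix(batter_2, flour, baking_powder)
batter_4   <- stir(batter_3, milk)
cake       <- bake(batter_4)
final_cake <- frost(cake)
```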

The info is all there, but there’s a lot of repetition and opportunity for typos. You also have to jump back and forth between lines to make sure you’re calling the right variables.

A similar approach would be to keep overwriting a single variable:
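Again with the same hypothetical verbs:

```r
cake <- cream(sugar, butter)
cake <- beat(cake, eggs, vanilla)
cake <- mix(cake, flour, baking_powder)
cake <- stir(cake, milk)
cake <- bake(cake)
cake <- frost(cake)
```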

But it’s still not as clear as the original recipe… and if anything goes wrong, you have to re-run the whole pipeline.

Function Composition:
If we don’t want to save intermediates, we can just do all the steps in one line:
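With the same hypothetical verbs, that looks like:

```r
cake <- frost(bake(stir(mix(beat(cream(sugar, butter), eggs, vanilla), flour, baking_powder), milk)))
```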

Or with better indentation:
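```r
cake <- frost(
  bake(
    stir(
      mix(
        beat(
          cream(sugar, butter),
          eggs, vanilla
        ),
        flour, baking_powder
      ),
      milk
    )
  )
)
```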

It’s pretty clear why this is no good.

Piping

Now for the piping approach:
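Using the same hypothetical verbs once more:

```r
cake <- cream(sugar, butter) %>%
  beat(eggs, vanilla) %>%
  mix(flour, baking_powder) %>%
  stir(milk) %>%
  bake() %>%
  frost()
```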

When you are reading code, you can replace pipes with the phrase “and then.”

And remember, pipes just take everything on the left and plug it into the function on the right. If you step through this code, this chunk is exactly the same as the function composition example above; it’s just written for people to read.

When you chain together multiple manipulations, piping makes it easy to focus on what’s important at every step. One caveat is that you need the first argument of every function in the pipe to be the object you are manipulating. Tidyverse functions all follow this convention, as do many other functions.

Where do you find new verbs?

The RStudio cheat sheets are a whole lot more useful once you start piping – here are a few:

Data transformation

Data import + restructuring

Other cheat sheets

Some prep work for the coding part of the lesson:

In this section our main goal will be to reproduce as closely as possible all the individual plots that make up Figure 3 in the Gaede et al. 2017 paper.

The one exception to that will be the legends and icons, which are easier to align and arrange when organizing the plots into a single multipanel figure, which is left for Group 3.

Let’s start by opening up the paper and setting up some variables we will be using throughout the plotting.

Colours for the different birds:
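For example (the hex codes below are placeholders — substitute the colours actually used in the paper’s figure):

```r
# one colour per species; hex values here are illustrative placeholders
col_hb <- "#ED0080"  # hummingbird
col_zb <- "#F48D00"  # zebra finch
col_pg <- "#4AA4F2"  # pigeon
```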

Anatomy of a ggplot

ggplot is built around the idea of a “Grammar of Graphics” – a way of describing (almost) any graphic.

There are three key components to any graphic under the Grammar of Graphics:

  • Data: the information to be conveyed
  • Geometries: the things you actually draw on the plot (lines, points, polygons, etc…)
  • Aesthetic mapping: the connection between relevant parts of the data and the aesthetics (size, colour, position, etc…) of the geometries

Any ggplot you make will at the very minimum require these three things and will usually look something like this:
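As a concrete sketch, using the built-in mtcars data (your own data, aesthetics, and geom slot in the same way):

```r
library(ggplot2)

ggplot(data = mtcars,                     # 1. the data
       mapping = aes(x = wt, y = mpg)) +  # 2. the aesthetic mapping
  geom_point()                            # 3. a geometry
```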

Or, equivalently:
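```r
library(ggplot2)

# the mapping can live in the geom instead of the ggplot() call
ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg))
```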

But how do you know what geometries are available or what aesthetic mappings you can use for each?
Use the plotting with ggplot2 cheat sheet.

Other possibly relevant components include:

  • Coordinate systems (e.g., cartesian vs polar plots)
  • Scales (e.g., mapping specific colours to groups)
  • Facets (subpanels of similar plots)

Build a boxplot with data superimposed (Figure 3 Panel D):

Let’s make a ggplot!
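As a self-contained sketch of the approach (the toy data below stands in for the Panel D data set — the real columns and values come from the downloaded files):

```r
library(ggplot2)

# toy stand-in: three species, ten made-up observations each
set.seed(548)
data_D_toy <- data.frame(
  species = rep(c("hb", "zb", "pg"), each = 10),
  rate    = c(rnorm(10, 12, 3), rnorm(10, 8, 2), rnorm(10, 6, 2))
)

ggplot(data_D_toy, aes(x = species, y = rate, fill = species)) +
  geom_boxplot(outlier.shape = NA) +   # hide boxplot outliers; jittered points show them
  geom_jitter(width = 0.2, alpha = 0.7)
```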

Add points:

Separate points:

Add summary statistics:

Fix x axis:

Re-plot with new x:

Swap colour to fill:

Fix colours:

De-emphasize boxplots, remove outliers:

Tune jitter:

Titles and theming and save:

Let’s create a theme:
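A theme can be built once and reused across plots; the settings below are illustrative, not the exact ones used in the figure:

```r
library(ggplot2)

# start from a minimal built-in theme and adjust a few elements
my_theme <- theme_classic() +
  theme(
    axis.text       = element_text(size = 10, colour = "black"),
    axis.title      = element_text(size = 12),
    legend.position = "none"
  )
```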

Let’s see what it looks like:

Let’s update our theme:

Draw in x and y axes:

Build a grouped barplot (Figure 3 Panel E):

Visualize the data:

Restructure data for plotting:

# Add the missing info
read_csv("./Fig3E_data.csv") %>%
  mutate(species = c("hb", "zb", "pg"))
#> Parsed with column specification:
#> cols(
#>   `0.1` = col_double(),
#>   `0.24` = col_double(),
#>   `0.5` = col_double(),
#>   `1` = col_double(),
#>   `1.5` = col_double(),
#>   `2` = col_double(),
#>   `4` = col_double(),
#>   `8` = col_double(),
#>   `16` = col_double(),
#>   `24` = col_double(),
#>   `32` = col_double(),
#>   `48` = col_double(),
#>   `64` = col_double(),
#>   `80` = col_double(),
#>   `128` = col_double(),
#>   `256` = col_double()
#> )
#> # A tibble: 3 x 17
#>   `0.1` `0.24`  `0.5`    `1`  `1.5`     `2`     `4`    `8`   `16`   `24`
#>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
#> 1     0 0.0357 0.0179 0      0      0.107   0.0357  0.0357 0.0536 0.0893
#> 2     0 0      0      0.0187 0      0.00935 0.00935 0.0935 0.168  0.121 
#> 3     0 0      0.0947 0.0526 0.0105 0.126   0.126   0.0842 0.147  0     
#> # ... with 7 more variables: `32` <dbl>, `48` <dbl>, `64` <dbl>,
#> #   `80` <dbl>, `128` <dbl>, `256` <dbl>, species <chr>

# Pull the y-values down from the header
data_E <- read_csv("./Fig3E_data.csv") %>%
  mutate(species = c("hb", "zb", "pg")) %>%
  gather(key = speed, value = prop, -species)
#> Parsed with column specification:
#> cols(
#>   `0.1` = col_double(),
#>   `0.24` = col_double(),
#>   `0.5` = col_double(),
#>   `1` = col_double(),
#>   `1.5` = col_double(),
#>   `2` = col_double(),
#>   `4` = col_double(),
#>   `8` = col_double(),
#>   `16` = col_double(),
#>   `24` = col_double(),
#>   `32` = col_double(),
#>   `48` = col_double(),
#>   `64` = col_double(),
#>   `80` = col_double(),
#>   `128` = col_double(),
#>   `256` = col_double()
#> )
data_E
#> # A tibble: 48 x 3
#>    species speed   prop
#>    <chr>   <chr>  <dbl>
#>  1 hb      0.1   0     
#>  2 zb      0.1   0     
#>  3 pg      0.1   0     
#>  4 hb      0.24  0.0357
#>  5 zb      0.24  0     
#>  6 pg      0.24  0     
#>  7 hb      0.5   0.0179
#>  8 zb      0.5   0     
#>  9 pg      0.5   0.0947
#> 10 hb      1     0     
#> # ... with 38 more rows

Plot data:
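With data_E in long format, a dodged geom_col() does the work. (The toy rows below mirror data_E’s structure so this sketch is self-contained; in the vignette you would pass data_E itself.)

```r
library(ggplot2)

# minimal rebuild of data_E's structure: species, speed, proportion
data_E_toy <- data.frame(
  species = rep(c("hb", "zb", "pg"), times = 2),
  speed   = rep(c("0.24", "0.5"), each = 3),
  prop    = c(0.0357, 0, 0, 0.0179, 0, 0.0947)
)

ggplot(data_E_toy, aes(x = speed, y = prop, fill = species)) +
  geom_col(position = "dodge")  # side-by-side bars within each speed
```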

Plot data again!

Pick colours, add labels, add theme:

Fix x axis, add y axis line:

Build a line plot with error bars and other graphical features (Figure 3 Panel B):

Visualize data:

Gather all the bin data into one column:

Group the data by trial, direction, and speed:

Calculate firing rate for each trial:

Finally let’s ungroup and store this unnormalized data:

Now we need to normalize the data:

# First calculate a baseline
tail(data_B_raw)
#> # A tibble: 6 x 4
#>   trial direction    speed raw_firing_rate
#>   <dbl>     <dbl>    <dbl>           <dbl>
#> 1   234       NaN NaN                0.400
#> 2   235       315  -0.0867           4.00 
#> 3   236       NaN NaN                0    
#> 4   237       135  -0.696            0.600
#> 5   238       NaN NaN                0.600
#> 6   239       315   0.821            0.200

data_B_raw %>%
  filter(direction == 'NaN')
#> # A tibble: 119 x 4
#>    trial direction speed raw_firing_rate
#>    <dbl>     <dbl> <dbl>           <dbl>
#>  1     2       NaN   NaN           0.400
#>  2     4       NaN   NaN           0.400
#>  3     6       NaN   NaN           0    
#>  4     8       NaN   NaN           0.200
#>  5    10       NaN   NaN           1.20 
#>  6    12       NaN   NaN           0.600
#>  7    14       NaN   NaN           0.800
#>  8    16       NaN   NaN           0.6  
#>  9    18       NaN   NaN           0.200
#> 10    20       NaN   NaN           0.400
#> # ... with 109 more rows

data_B_raw %>%
  filter(direction == 'NaN') %>%
  pull(raw_firing_rate)
#>   [1]  0.400010  0.400010  0.000000  0.200005  1.200015  0.600015  0.800020
#>   [8]  0.600000  0.200005  0.400010  0.200005  1.000020  4.600060  1.800045
#>  [15]  2.200045  0.000000 11.000040  1.200030  1.400035  0.400010  1.600010
#>  [22]  0.600030  0.000000  0.000000  0.000000  0.200005  0.000000  0.000000
#>  [29]  0.200005  0.400010  0.200005  0.000000  0.200005  0.200005  0.600015
#>  [36]  0.000000  0.000000  1.600035  3.600060  0.200005  0.200000  0.200005
#>  [43]  0.800020  0.000000  0.400010  0.200005  0.800020  0.800020  0.800020
#>  [50]  0.200005  0.200000  0.600015  0.200005  0.600015  0.600015  0.000000
#>  [57]  0.200005  0.000000  0.200005  0.000000  0.000000  0.000000  0.000000
#>  [64]  0.000000  0.200005  0.400005  1.400020  0.000000  0.000000  0.000000
#>  [71]  0.000000 12.800070  0.000000  0.000000  0.000000  0.000000  1.600070
#>  [78]  0.400010  4.200115  1.200015  0.000000  0.000000  0.000000  0.200005
#>  [85]  0.000000  0.400010  0.400010  0.400010  0.200005  1.400035  0.400010
#>  [92]  0.200005  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
#>  [99]  4.600050  0.200005  0.000000  0.400010  1.800045  0.600015  0.200005
#> [106]  0.000000  0.200005  0.200005  0.000000  0.000000  0.200005  0.000000
#> [113]  0.000000  0.200005  0.000000  0.400005  0.400010  0.000000  0.600015

baseline <- data_B_raw %>%
  filter(direction == 'NaN') %>%
  pull(raw_firing_rate) %>%
  mean
baseline
#> [1] 0.6806833

# Use the baseline to normalize the data
data_B_raw %>%
  filter(direction != 'NaN') %>%
  mutate(
    norm_firing_rate = raw_firing_rate - baseline,
    norm_firing_rate = norm_firing_rate / max(abs(norm_firing_rate))
  )
#> # A tibble: 120 x 5
#>    trial direction   speed raw_firing_rate norm_firing_rate
#>    <dbl>     <dbl>   <dbl>           <dbl>            <dbl>
#>  1     1       135  1.90             2.00           0.0676 
#>  2     3       315 -0.696            0.800          0.00611
#>  3     5       135 -0.0867           0.600         -0.00413
#>  4     7       135  0.821           11.4            0.549  
#>  5     9       315  1.63             0             -0.0349 
#>  6    11       135 -0.401            0.200         -0.0246 
#>  7    13       315  1.90             0             -0.0349 
#>  8    15       135  1.63            20.2            1      
#>  9    17       315 -0.0867           0.200         -0.0246 
#> 10    19       135  1.13            13.2            0.641  
#> # ... with 110 more rows

Next let’s find the mean and SE of firing rate across like trials:
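A sketch of this summary step with dplyr (the toy rows stand in for the normalized data, and the name data_B_norm is ours, not the vignette’s):

```r
library(dplyr)

# toy stand-in for the normalized Panel B data
data_B_norm <- data.frame(
  direction        = c(135, 135, 315, 315),
  speed            = c(1.90, 1.90, 1.90, 1.90),
  norm_firing_rate = c(0.07, 0.55, -0.03, 0.01)
)

# mean and standard error within each direction-by-speed group
data_B_norm %>%
  group_by(direction, speed) %>%
  summarise(
    mean_rate = mean(norm_firing_rate),
    se_rate   = sd(norm_firing_rate) / sqrt(n())
  )
```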

Finally ungroup the data and convert direction to a factor:

Plot the data!

Add error bars and threshold line:

Add a line segment indicating points over threshold:

Set colours, shapes, labels, and themes:

Make the diamonds bigger, reorder layers to emphasize points, remove x axis title:

Add axis lines:

Note that the line lengths, y limits, and x breaks are all wrong.
To fudge the lines, let’s make another dataset:

Build another grouped barplot (Figure 3 Panel F):

Visualize data:

Remove any rows without data (dir1.area is NA):

Remove unnecessary columns:

Rename columns using speeds_F, change species to words:

Reshape data for plotting:

Summarize proportions by species and speed:

Finally fix the factor levels as before:

Plot data as before:

Build another line plot (Figure 3 Panel C):

Reorganize raw data:

Calculate baseline:

Normalize data and calculate mean, SE:

Plot data: