Chapter 4: Data Manipulation with tidyverse

Goals:
Introduce tidyverse for filtering, mutating, summarizing and pivoting data.
  • Filtering rows (filter()), selecting columns (select()).
  • Adding new columns (mutate()).
  • Grouping and summarizing (group_by() + summarize()).
  • Pivot longer and Pivot wider

Read time: 10 Minutes

Why use tidyverse

If you wanted to just analyse a portion of your data, you could just subset the data frame.

Code
abdo <- data[data$tissue == "Abdomen",] # select only the rows where tissue contains Abdomen
mean(abdo$CCR5_MFI)
## [1] 4605.51
Code
abdo_epi <- abdo[abdo$layer == "Epithelium",]
abdo_um <- abdo[abdo$layer == "Underlying",]
mean(abdo_epi$CCR5_MFI)
## [1] 5298.62
Code
mean(abdo_um$CCR5_MFI)
## [1] 3880.896

However, you might have already noticed that your environment is filling up with objects and variables that we keep assigning.

With more complicated data this can get pretty annoying.

We can remove things from the environment using rm(abdo, abdo_epi, abdo_um).

We could instead write long conditionals within the square brackets.

Code
mean(data[data$tissue == "Abdomen" & data$layer == "Epithelium", "CCR5_MFI"])
## [1] 5298.62
Code
mean(data[data$tissue == "Abdomen" & data$layer == "Underlying", "CCR5_MFI"])
## [1] 3880.896

However, you’ll find that this gets very long and confusing and annoying as you incorporate these into other functions, like ggplot.

This is where dplyr becomes advantageous. dplyr allows us to manipulate an object without changing it permanently, using very efficient functions and a concept called pipes.

dplyr is a package within the tidyverse ecosystem, which contains dplyr, tidyr, ggplot2, tibble, stringr, and others. So, we can install and load tidyverse for this chapter.

Filtering rows and columns

Pipes (%>%) are an important concept and effectively stand for “using this”.

From the example above, we could use dplyr instead, without assigning any new vairables, in a nice and tidy and follwable layout. When spoken, it might read like…

Code
data %>% # using this object, I want to...
  filter(tissue == "Abdomen") %>% # then "using this" new filtered object
  filter(layer == "Epithelium") %>% 
  select(CCR5_MFI) %>% # using this object, select CCR5_MFI column
  unlist() %>% # unlist to make it a vector
  mean() # and get the mean
## [1] 5298.62

Now, this seems like a lot for determining one value, but we’ll see what group_by() and summarize() do shortly, which demonstrate why you might want to learn dplyr asap.

Adding a column using mutate()

Similarly, in previous chapters, we have added columns to the data for ease of use. For example:

Code
# assign a new column
data$percent_CD4 <- 100*data$CD4/data$CD3
# then use dplyr to determine the mean value of Abdomen Epithelium
data %>% 
  filter(tissue == "Abdomen") %>% 
  filter(layer == "Epithelium") %>% 
  select(percent_CD4) %>% 
  unlist() %>% 
  mean()
## [1] 43.32365

However, there might be hundreds of extra columns I want to make. And that means making 100 unique and descriptive column names, that then just gets more and more confusing as you add to the data.

So, we can use mutate() instead, so that we don’t have to alter our original data.

Code
data %>%
  mutate(percent_CD4 = 100*CD4/CD3)%>% # Add a new column
  filter(tissue == "Abdomen") %>% 
  filter(layer == "Epithelium") %>% 
  select(percent_CD4) %>% 
  unlist() %>% 
  mean() %>%
  round(2)
## [1] 43.32

group_by() and summarize()

Here we’ve been writing 8 lines of code to get the mean values of one tissue type, which is useful if you wanted to do something with that exact single value. Alternatively, we can also use group_by() or summarize() to get values across multiple variables.

Code
data %>% 
  mutate(percent_CD4 = 100*CD4/CD3) %>%
  group_by(tissue, layer) %>% # almost like 'treat the data like n-data frames per combinations of tissue and layer
  summarize(mean_percent_CD4 = mean(percent_CD4)) # summarise in a new table
## `summarise()` has grouped output by 'tissue'. You can override using the `.groups` argument.

Pivot

Packages and functions, e.g. ggplot, will sometimes require that data is organised in the correct way. pivot_longer() and pivot_wider() can be effective tools for rearranging your data to suit.

Here we’ll just use a short subset of the data to easily visualise what is happening.

Code
data2 <-data[c(1:10,25:34, 41:50),colnames(data) %in% c("donor", "tissue", "layer", 'group', "CD3", "CD4", "CD8")]

Currently, the data is in what we might consider wide format. We have one row per tissue layer and columns for our CD3, CD4, CD8 cell counts. We’ll start by mutating extra columns to get the CD4 and Cd8 percentages and then making it long, so that we get one column called “percent” with CD4 and CD8 in a different single column called “subset”. This is a format then compatible with ggplot.

Code
data2 %>% 
  mutate(percent_CD4 = 100*CD4/CD3, percent_CD8 = 100*CD8/CD3) %>%
  pivot_longer(cols = c(percent_CD4, percent_CD8), #which columns to choose for elongating
               names_to = "subset", # name of new column containing names
               values_to = "percent") %>% # name of new column containing values
  print()

You’ll notice we have duplicate entries for CD3, CD8 and CD4, but our column subset has alternating percent_CD4/CD8 and the value for that donor is in the percent column. This can now be used in a boxlpot in ggplot.

Code
data2 %>% 
  mutate(percent_CD4 = 100*CD4/CD3, percent_CD8 = 100*CD8/CD3) %>%
  pivot_longer(cols = c(percent_CD4, percent_CD8), 
               names_to = "subset", 
               values_to = "percent") %>%
  ggplot(aes(x=group, y=percent, fill=subset)) + 
  geom_boxplot(outliers=F)+theme_classic()

Imaging having a dataframe tens of thousands long, and you were to pivot longer and then store this data. You can imagine how inefficient your analysis workflow would become in such circumstances.

Ok. Lets say that we wanted to see if there was a correlation between the percent of CD4+ T cells in epithelium versus underlying mucosa. To do this, we calculate the percent of CD4, but we need an x and y vector to give to ggplot. Currently, the data object looks like a long version of what we want. So we can pivot wider to make our graph.

Code
data2 %>% 
  mutate(percent_CD4 = 100*CD4/CD3) %>%
  select(-c(group, CD3,CD4,CD8))%>% # using a - with select lets use remove, instead of select a given column. 
  group_by(tissue)%>%
  pivot_wider(names_from = layer, values_from = percent_CD4) %>%
  ggplot(aes(x=Epithelium, y=Underlying, color=tissue))+
  geom_point()+ 
  stat_ellipse()+
  theme_classic()

Hopefully it is clear why tidyverse is a useful tool.

The next two chapters will continue to build your skills up to generating a range of different figures using ggplot.

Next chapter →