Chapter 2: Data Types and Structures

Goals:
Learn about data types and structures in R.
  • Basic data types: numeric, character, logical.
  • Data structures: vectors, matrices, data frames, lists.
  • Assessing and querying data types and structures.
  • Importing data (RDS, Excel, CSV files).

Read time: 15 Minutes

Data Types

There are a few data types you should know:

  • character: In R, characters or strings are letters, words, sentences, numbers, symbols… They’re defined using quotation marks.

  • numeric: Data can be stored as a number. R recognises numbers without the quotation marks

  • logical: TRUE / FALSE / T / F / 1 / 0. Logicals are an important data type that you’ll notice along the way.

R treats 1 as TRUE and 0 as FALSE

Some functions or operations require data to be of the correct type. For example, mathematical operations can’t be performed on a character.

Code
# character
char = "Hello World"
typeof(char)
## [1] "character"
Code
# numeric
num = "1"
typeof(num) # "1" is enclosed in quotes, so it is a character string.
## [1] "character"
Code
num = 1
typeof(num) # without the quotes
## [1] "double"
Code
# logical 
log = TRUE
typeof(log)
## [1] "logical"

A good example of using logicals is to check a condition and execute a script. I’ll use if here, which we cover in the next chapter.

Examples:
Code
# TRUE
run_script = TRUE
if(run_script) {
  print(paste0("run_script is ", run_script))
}
## [1] "run_script is TRUE"
Code
run_script = T
if(run_script) {
  print(paste0("run_script is ", run_script))
}
## [1] "run_script is TRUE"
Code
run_script = 1
if(run_script) {
  print(paste0("run_script is ", run_script))
}
## [1] "run_script is 1"
Code
# FALSE
run_script = FALSE
if(run_script) {
  print(paste0("run_script is ", run_script))
}

run_script = F
if(run_script) {
  print(paste0("run_script is ", run_script))
}

run_script = 0
if(run_script) {
  print(paste0("run_script is ", run_script))
}

Above, we set the variables deliberately to demonstrate how they can be used. However, they don’t need to be deliberately set to be useful.

Code
a = 3
typeof(a)
## [1] "double"
Code
is.numeric(a)
## [1] TRUE
Code
typeof(is.numeric(a))
## [1] "logical"
Code
if(is.numeric(a)){
  print(paste0("The variable a (", a,") is numeric."))
}
## [1] "The variable a (3) is numeric."
Code
a="3"
if(is.numeric(a)){
  print(paste0("The variable a (", a,") is numeric. Continuing the analysis."))
} else {
  print(paste0("The variable a (", a,") is ", typeof(a), ". Not continuing the analysis."))
}
## [1] "The variable a (3) is character. Not continuing the analysis."

Data Structures

Data types can be organised into structures. These are:

  • vectors: A vector is a collection of data, such as c(1990, 1990, 1991, 1989) or c(“Rou”, “Chris”, “Rory”, “Rob”)

  • matrix: A matrix is data organised into rows and columns. Data in a matrix is only of one data type.

  • data frames: A data frame is similar to a matrix, but can contain different data types.

  • lists: A list is dynamic and can hold all different types of data structures and data types.

Code
yr <- c(1990, 1990, 1991, 1989)
name <- c("Rou", "Chris", "Rory", "Rob")
rating <- c(4,4,4,5)

df <- data.frame(year = yr, 
                 names = name, 
                 ratings = rating)


df
Code
dim(df)
## [1] 4 3

We can see this dataframe is 4 rows tall and 3 columns wide.

We can access information in a dataframe using: df[row, col]

Code
df[1,1]
## [1] 1990

With data frames, we can also use $ to query a specific column.

Code
df$names
## [1] "Rou"   "Chris" "Rory"  "Rob"
More examples:
Code
# show all rows in certain columns
df[,1]
## [1] 1990 1990 1991 1989
Code
df[,"names"]
## [1] "Rou"   "Chris" "Rory"  "Rob"
Code
# show all columns in one row
df[1,]

# show select columns by providing a vector of specific indices
df[,c(1,3)]

# provide a range of indices quickly using colon between numbers. e.g. 1 to 2
df[,c(1:2)] 

# instead of using indices to pull information, we can use logic
df[c(T,T,F,T),]

We can do a lot with a data frame if we combine everything we know about data types and structures

Code
# store a row or column as a new object
patient_1 = df[1,]
patient_1

# return a logical vector based on a query
df$names == "Rou"
## [1]  TRUE FALSE FALSE FALSE
Code
# and add this to a new column
df$is_rou <- df$names == "Rou"
df
# With a large dataframe, we might not know the specific indices we want to see or store, 
#  so we could cut a dataframe or pull information based on certain logic
df[df$names=="Rou",]

df[df$ratings<5,]

When working with enormous data structures, like single cell sequencing or proteomics data, knowing how to query your data frame or matrix will be very useful!

Finally, we can make a list out of all of these different data types and structures,

Code
list <- list("number" = 1, 
             "string" = "string",
             "logical" = TRUE,
             "vector" = name, 
             "dataframe" = df)
list
## $number
## [1] 1
## 
## $string
## [1] "string"
## 
## $logical
## [1] TRUE
## 
## $vector
## [1] "Rou"   "Chris" "Rory"  "Rob"  
## 
## $dataframe
##   year names ratings is_rou
## 1 1990   Rou       4   TRUE
## 2 1990 Chris       4  FALSE
## 3 1991  Rory       4  FALSE
## 4 1989   Rob       5  FALSE
Ways to query your list
Code
list[[1]]
## [1] 1
Code
list[["vector"]]
## [1] "Rou"   "Chris" "Rory"  "Rob"
Code
list$string
## [1] "string"

Its also worth noting that lists can contain other lists.

Code
list$nested_list <- list(a = 1, b = c(2, 3, 4))

Importing and saving files in R

Most file types can be read into R. If you’re not working with xlsx or csv, you might be working with tsv, txt, h5ad, or xml. Some are easier than others, and you’ll learn them as you go.

Another file type is RDS which is R Data Serialization which is the easiest and most efficient way to read and write data once you’ve got it in R.

RDS is more efficient than csv because saving and writing a csv is not as standard as you think. For example, a column name might change, rownames don’t stick unless directly specified when reading it back in.

On the other hand, saveRDS() & readRDS() requires no extra information. Your data will be identical when read back in.

Most commonly, you’ll use either

data <- read.csv()  # For CSV files
data <- readRDS()   # For RDS files

To read in excel files, you’ll need to install a package (try the packages xlsx for read.xlsx(), or readxl for read_excel() - I can’t recall why, but sometimes one of these won’t work).

Throughout these tutorials, you’ll use read.csv() to read in the example data we downloaded in Chapter 1. We’ll then finish each tutorial with saveRDS() and begin the next tutorial with readRDS(). This is a nice and organised way that you may use yourself when analysing data. This is effectively having stop and start points.

Next chapter →