Chapter 1: Getting Started in R and RStudio

Your first steps…

Goals:
Set up R and RStudio, and get familiar with the interface.
  • Understand what a programming language is.
  • Know the difference between R, RStudio and python.
  • Install R and RStudio.
  • Exploring the interface – Console, Environment, Scripts.
  • Installing and loading packages.
  • Start your first project.

By the end of this chapter, you’ll have installed R and RStudio, know how to navigate the RStudio UI and will know how to get started on a new project.

To start with…

Create a folder that you’ll work out of for this tutorial. For example, I have a folder analysis-user-group on my Desktop.

Read time: 30 Minutes

R as a programming language

Simply: A programming language is a way to communicate with a computer.

There are many languages, each with their own advantages and limitations. Some, such as SQL are used in database management, and others like R and python are used for analytic programming. R is designed specifically for statistical computing, data analysis and visualisation, and is considered one of the more user-friendly languages. Python is more of a general-purpose language used in data science, but also for web development, machine learning, and more.

The language itself needs an interface for you to communicate with the computer. You can run R in Terminal (Mac) or Command Prompt (Windows) if you wanted to. You could store your scripts in Word and then copy and paste the lines into the commandline window.

By now, you’ve heard of and seen RStudio. RStudio is what’s called an Integrated Development Environment (IDE) and makes working with R easier by organizing scripts, plots, and files.

Installing R and RStudio

Install R first. Go to CRAN, download the latest version of R and follow the prompts to install on your computer.

You don’t need to do anything else with the R application!

Next install RStudio. Go to the download page and follow the prompts to install on your computer. Once installed, open RStudio—it should automatically connect to your R installation.


Projects

Projects in RStudio help organize your files and workflows.

Create a Project:

  • Go to File > New Project.

  • Select New Directory or link to an Existing Directory.

  • Name your project and choose a folder location.

  • Click Create Project.

This sets up a workspace, saving all files and settings in one place.

Use projects to keep analyses organized and reproducible.

Using R

It might feel like we’re moving too quickly through some of this. Don’t worry! The next few chapters will go into variables, data structures, functions and so on in more details using mock data.

chunk: I’ll refer to the blocks of code you see throughout these lessons as a chunk.
function: A function is a set of statements that are run at the same time.
argument: An argument is a value/variable used in a function.
package: A package is simply a collection of functions.


R as a programming language is a set of basic functions. Some of these include:

Arithmetic
Code
# sums
1+1
## [1] 2
Code
# subtraction
3-1
## [1] 2
Code
# multiplication
2*2
## [1] 4
Code
# division
40/pi
## [1] 12.7324
Code
# exponents
2^2
## [1] 4

Print and Paste
Code
# print a sentence
print("This is a sentence.")
## [1] "This is a sentence."
Code
# print two things 'pasted' together
print(paste("This is pi:", pi))
## [1] "This is pi: 3.14159265358979"

print() and paste() are examples of base R functions. DoHeatmap() and NormalizeData() are examples of functions that are derived from the Seurat package, used for scRNA seq analysis

We might combine a series of these base R functions to create unique and specific output. For example, lets say I was constantly calculating cell percentages based off of cell numbers, and want to output a sentence summary. I might write it like this each time:

Code
num_cd45 = 45000 # number of CD45+ cells
num_cd3 = 22000 # number of CD3+ cells
per_cd3 = num_cd3/num_cd45 # calculate the fraction
per_cd3 = per_cd3*100 # multiply by 100 to get a percentage
per_cd3 = round(per_cd3, digits=2) # round the data off - here is an example of an argument - I can specify how many digits I want to include
paste("There are ", num_cd45, " CD45+ cells and ", num_cd3, " CD3+ cells (", per_cd3, "%)", sep = "") # output the statement
## [1] "There are 45000 CD45+ cells and 22000 CD3+ cells (48.89%)"

I would have several samples to calculate these for, and I don’t want to repeat these lines of code n times! We could write this into our own function (I will demonstrate how to do this in a future lesson) or hopefully someone has already done this.

Lets pretend…

…someone has already solved this issue. They’ve written the functions that we want and published them as package called simple_math. We would install this package and have access to the functions.

Code
# install the package
install.packages("simple_math")

A package only needs to installed once!

You can manually search for your packages in RStudio in the Packages tab on the right. Or simply try to load it using…

Code
# load the package
library(simple_math)

You will need to load packages into your R sessions every time you restart R

Now that simple_math is loaded into our session, we could use the functions within the package. This is what the function might look like:

Code
calculate_percent(
  parent = "CD45",
  parent_num = 45000,
  subset = "CD3", 
  subset_num = 22000,
  digits = 1
)
## [1] "There are 45000 CD45+ cells and 22000 CD3+ cells (48.9%)"

This is a super reductionist explanation of functions and packages.

Not only do functions and packages use base R functions, they themselves require functions from other packages. E.g. the Seurat package uses ggplot functions to generate graphs. As you perform more and more complex analyses, you will be installing and loading more and more packages necessary.

There are some very common packages that you’ll likely use a lot such as:

  • ggplot2 for visualisation

  • dplyr for data manipulation

  • Seurat for single cell analysis

  • BiocManager for installing additonal bioinformatic packages

… etc, etc

Commenting code:

You might have noticed that within the R chunks. Comments are ways of adding details and descriptions to your otherwise dry code. They are prefaced by a #

The can be used in two ways:

On their own line
    # this is a comment
    1+1
    [1] 2

After code
    1+1 # this will add 1 to 1
    [1] 2

It is incredibly useful to comment your code. Some examples may be:

  • You’ve written a script or custom function and want to remember what each line does.

  • You’ve written a script or custom function and want to inform others what each line does.

  • You’ve changed a script, but want to keep a copy of the older version - you could comment out each line.

  • etc etc

Hello World

It’s customary that your first script is helloworld.R. So! Open a new script and save it in your analysis-user-group folder as helloworld.R. In this script add the line:

Code
print("Hello World")

Click on this line in your script editor and press:

  • Mac: Command-Enter

  • Windows: Control-Enter

This sends the line you’re on to the console and runs it. You’ll get the output

[1] "Hello World"



Congratulations. You wrote your first program.You’re a genius.Well done.You’re on your way to becoming a superstar programmer.

Optional Exercise: Write your first second script

Practice:
  Create an R script.
         Send a line to the console. (See Hello World above)
         Set your directory.
         Install the ggplot2 package.
         Load in sthe ggplot2 packages.
         Download the dataset used for these tutorials into my directory.
         Load in the dataset.
         Inspect the data.
         Create a simple plot of the data.
         Save the plot in your directory.

The data: In this exercise, and throughout these tutorials, we’ll be using flow cytometry data I synthesised representing mock cell frequencies and expressions of T cells dervied from human tissues (n=60). The data dictionary below provides a description for each column

Data Dictionary:
Column Description
date Date of acquisition
experiment Experiment number
donor Donor Number
age Age of patient
tissue Tissue type: abdomen, labia, vagina
layer Tissue layer: epithelium, underlying mucosa
group Summarised sample group (e.g. A_E = abdomen epithelium)
CD3 CD3+ cells
CD8 CD8+ cells
CD4 CD4+ cells
HLADR CD4+HLADR+ cells
CCR5 CD4+CCR5+ cells
HLADR_MFI gMFI of HLADR on CD4+ cells
CCR5_MFI gMFI of CCR5 on CD4+ cells
CD28 CD4+CD28+ cells

gMFI:geometric mean fluorescent intensity.

Exercise: Write a script using the following instructions (run the lines as you write to see if its working for you)

  • Open a new R script and save it to a folder of your choice.

  • At the top of each of your scripts, you might want to add some comments about what the script is for.

  • Set your working directory using setwd(). One way to quickly find the directory is to copy the folder and paste it into the script itself.

  • Install ggplot2

  • Load the ggplot2 package

  • Download the example data using:

Code
download.file(url="https://raw.githubusercontent.com/DrThomasOneil/analysis-user-group/refs/heads/main/docs/r-tutorial/assets/synthetic_data.csv", destfile="synthetic_data.csv")
Don’t see the file?

If you have set your directory properly, you will see a new file in your folder called synthetic_data.csv.

  • Load the data into R using data <- read.csv("synthetic_data,csv")

  • Inspect the data

    • colnames() will show you what columns are present

    • dim(data) will show you the dimensions of the data as rows x columns.

    • head(data) will show you the first 6 rows of the data. Conversely, you could use tail().

    • summary(data) will give you detailed output of the data.

  • Assign to a variable plot a graph of % CD4+ T cells in each group using the ggplot() function.

Here you can copy and paste my code, as we don’t cover ggplot until Chapter 5

In Chapter 5 & 6, we will learn how to use ggplot and how to make publication-worthy graphs

Code
plot = ggplot(data, aes(x=group, y=100*(CD4/CD3), color=tissue))+geom_boxplot()

  • Output the plot by writing plot in the console.

  • Save the plot to your directory using ggsave().

Use ?ggsave in the console to bring up the help menu for this function! Here you will find descriptions of the function, the arguments you can use and examples of it’s use.


Solution

Here is one possible solution.

Did you comment your code?

Code
#* This is my first script
#* I will set my directory, install my first package, load it in,  and save a variable to my directory
#* 20250107

# set your directory
setwd("~/Desktop/analysis-user-group") #THIS WONT WORK FOR YOU - YOU'LL NEED TO SET YOUR OWN DIRECTORY

# uncomment and install the package once
# install.packages("ggplot2")

# load the package
library(ggplot2)

# download the example data
download.file(url="https://raw.githubusercontent.com/DrThomasOneil/analysis-user-group/refs/heads/main/docs/r-tutorial/assets/synthetic_data.csv", destfile = "synthetic_data.csv")

# read in synthetic data
data <- read.csv("synthetic_data.csv")

# inspect the data
colnames(data)
dim(data)
head(data)
summary(data)

# create a simple plot of percentage of CD4+ cells per group
plot <- ggplot(data, aes(x=group, y=100*(CD4/CD3), color=tissue))+geom_boxplot()

# output the plot
plot

# save plot
ggsave("myfirstplot.png",plot = plot)

You may have seen me use = and <- interchangeably. That’s because they are.

Next Chapter

Congratulations! You’ve completed Chapter 1. You now have R and RStudio installed, have installed a package and started to write scripts.

In the next two chapters, I will use this data to explain more about the fundamentals of R.

Next Chapter →