
Course Description

Before we can apply rigorous statistical tools to research data, we often need to approach the data intuitively and look for meaningful associations, surprising patterns, or irregularities in order to formulate hypotheses. This is the focus of Analysis using R (AUR). This workshop introduces the essential tools and strategies available for AUR through the free statistical workbench R. The steps covered are broadly relevant to many areas of modern quantitative biology, such as flow cytometry, expression profile analysis, function prediction, and more.

Course Objectives

Participants will gain the practical experience and skills needed to use R to visualize and investigate patterns in their data.

Target Audience

Graduates, postgraduates, and PIs who design and execute strategies for data analysis and who are using the R statistical workbench.

Prerequisites

You are expected to be a regular user of R. If you do not regularly use R, please begin by taking the Introduction to R workshop.

You will require your own laptop computer. Minimum requirements: 1024×768 screen resolution, 1.5 GHz CPU, 2 GB RAM, 10 GB free disk space, and a recent version of Windows, Mac OS X, or Linux (most computers purchased in the past 3-4 years likely meet these requirements). If you do not have access to your own computer, you may borrow one from the CBW. Please contact support@bioinformatics.ca for more information.

This workshop requires participants to complete pre-workshop tasks and readings.

Course Outline

Module 1: Exploratory data analysis overview & clustering

Knowing your data: An overall workflow for exploratory data analysis
- Understand the difference between response variables, explanatory variables, biological variation, technical variation, and batch effects
- Missing data: understand how to identify structured versus unstructured missingness, and the role of imputation
- Finding unwanted sources of variation: surrogate variable analysis and RUVSeq

Knowing your data’s structure: calculating “distance” between (high-dimensional) data points
- What distance metrics represent
- Different kinds of distance metrics and when to use them
Clustering principles & methods
- Why cluster?
- A survey of clustering methods
- Choose the clustering method that is right for your data
Assessing the quality of clustering results
- Metrics for identifying the optimal number of clusters
- Existential questions introduced by clustering
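
As a rough illustration of the Module 1 topics, the sketch below computes two distance metrics, clusters the samples hierarchically, and uses the within-cluster sum of squares as one simple guide to the number of clusters. It uses base R only. The built-in `iris` data set, the choice of average linkage, and the range of k values are assumptions made for illustration and are not taken from the workshop material.

```r
# Illustrative sketch only: iris stands in for a real data set.
x <- scale(iris[, 1:4])                 # standardize the four numeric columns

# Two different distance metrics on the same data
d_euclid    <- dist(x, method = "euclidean")
d_manhattan <- dist(x, method = "manhattan")

# Hierarchical clustering on the Euclidean distances (average linkage assumed)
hc       <- hclust(d_euclid, method = "average")
clusters <- cutree(hc, k = 3)           # cut the tree into three groups

# Compare cluster membership with the known species labels
print(table(clusters, iris$Species))

# Total within-cluster sum of squares from k-means for k = 1..6;
# an "elbow" in this curve is one heuristic for choosing k
set.seed(1)
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
print(round(wss, 1))
```

Swapping `d_euclid` for `d_manhattan` in the `hclust()` call is a quick way to see how the choice of metric changes the resulting groups.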

Module 2: Dimensionality reduction

What dimensionality reduction is, and its common applications in bioinformatics
Dimensionality reduction with Principal Components Analysis (PCA)
- Conduct PCA on different types of data
- Get information out of PCA objects in R
Some practical uses of PCA
- Plot and learn from PCA output
- Use PCs as control variables in your analysis
- Use PCs as variables of interest in your analysis
Other types of dimensionality reduction
- t-distributed stochastic neighbor embedding (t-SNE)
- uniform manifold approximation and projection (UMAP)
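
To give a flavour of the PCA steps in Module 2, here is a minimal sketch using the base R function `prcomp()`. The built-in `iris` data set and the decision to centre and scale the variables are assumptions made for illustration, not the workshop's own data or settings.

```r
# Illustrative sketch only: iris stands in for a real data set.
measurements <- iris[, 1:4]             # four numeric variables

# Centre and scale so that no single variable dominates the components
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Information stored in the PCA object
scores   <- pca$x                       # samples projected onto the PCs
loadings <- pca$rotation                # contribution of each variable to each PC

# Plot the first two PCs, coloured by species
plot(scores[, 1], scores[, 2],
     col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2")
```

The columns of `scores` can then be added to a model as covariates, which is one way PCs are used as control variables.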

Module 3: Fitting generalized linear models

Read different data files into R
Merge data and handle missing values
Use ggplot2 to create and modify publication-quality R plots
Plot and fit a linear model for a continuous-valued outcome and a logistic model for a dichotomous outcome
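
The two model types named in Module 3 can be sketched with the base R functions `lm()` and `glm()`. The built-in `mtcars` data set and the chosen variables are assumptions made for illustration. Base graphics are used instead of ggplot2 only to keep the example free of extra packages.

```r
# Illustrative sketch only: mtcars stands in for a real data set.

# Linear model: continuous outcome (fuel economy) against car weight
fit_lm <- lm(mpg ~ wt, data = mtcars)
print(coef(summary(fit_lm)))

# Logistic model: dichotomous outcome (am: 0 = automatic, 1 = manual)
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(summary(fit_glm)))

# Predicted probability of a manual transmission for a car with wt = 3
prob_manual <- predict(fit_glm, newdata = data.frame(wt = 3), type = "response")
print(prob_manual)

# Scatter plot of the data with the fitted regression line
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit_lm)
```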

Module 4: Differential expression analysis

Manually conduct many parallel statistical tests
- Different types of statistical tests
- Evaluate and plot output
- Extract output for tables
- Visualize p-values from multiple testing: Q-Q plot, volcano plot
- Correct for multiple statistical tests: Bonferroni, false discovery rate
Using Bioconductor for analysis
- Perform differential expression analysis
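
The manual testing and multiple-testing correction steps listed in Module 4 might look roughly like the sketch below. It runs one t-test per gene and then adjusts the p-values with the base R function `p.adjust()`. The expression matrix is simulated, and the group sizes, effect size, and 0.05 threshold are assumptions made for illustration. This is not a Bioconductor differential expression workflow.

```r
# Illustrative sketch only: all data here are simulated.
set.seed(1)
n_genes     <- 200
n_per_group <- 5
group <- factor(rep(c("control", "treated"), each = n_per_group))

# Rows are genes, columns are samples; the first 20 genes get a shift
# in the treated group so that some true differences exist
expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
expr[1:20, group == "treated"] <- expr[1:20, group == "treated"] + 3

# One Welch t-test per gene
p_values <- apply(expr, 1, function(gene) t.test(gene ~ group)$p.value)

# Correct for multiple testing
p_bonferroni <- p.adjust(p_values, method = "bonferroni")
p_fdr        <- p.adjust(p_values, method = "BH")   # Benjamini-Hochberg FDR

# Number of genes called significant at 0.05 under each approach
n_sig <- c(raw        = sum(p_values < 0.05),
           bonferroni = sum(p_bonferroni < 0.05),
           fdr        = sum(p_fdr < 0.05))
print(n_sig)
```

Bonferroni controls the chance of any false positive and is therefore the stricter of the two, while the Benjamini-Hochberg adjustment controls the expected proportion of false positives among the genes called significant.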