Course Description

This is a specialized course designed for bioinformatics professionals and researchers dealing with biological datasets who are unsure about which statistical analysis they should implement. This course focuses on empowering participants to conduct end-to-end data analyses. Learn the art of selecting appropriate statistical methods aligned with research questions, and develop a decision tree guiding data cleaning, exploration, analysis, and interpretation. Engage in hands-on coding sessions in R with real-world datasets, building the skills needed to confidently navigate and communicate insights throughout the analytical process. Particular emphasis will be placed on developing an understanding of the statistical methods used, when to apply them, and how to interpret them. By the course’s conclusion, participants will be equipped with a concise and powerful toolkit for unlocking the full potential of complex biological datasets.

Course Objectives

Participants will gain practical experience and skills to be able to:

Understand the challenges and opportunities presented by complex biological data.
Apply fundamental techniques for exploring and summarizing biological datasets.
Address missing values through sophisticated imputation methodologies.
Analyze relationships within high-dimensional data, both between variables and samples, employing regression methods, regularization techniques, and clustering methods.
Integrate various statistical methods into a cohesive workflow to tackle a variety of common problems in bioinformatics from data exploration to interpretation.

Target Audience

The target audience for this course is graduate students, post-doctoral fellows, and researchers in industry or academia who are already familiar with conducting basic statistics in R and want to learn how to conduct an analysis from start to finish using intermediate or advanced statistical techniques.

Prerequisites

Participants are expected to have completed the ‘Introduction to R’ and ‘Analysis Using R’ courses available through Bioinformatics.ca or possess equivalent proficiency in R. Previous years’ R workshop materials are available open-access. In particular, we expect participants to feel comfortable with the following skills in R, as our course will build upon them:

Reading data into R
Data types and classes
Manipulating data in R with the tidyverse (e.g., select, mutate, filter)
Making plots in ggplot2
Writing custom functions
Estimating a correlation matrix
Hierarchical and K-means clustering algorithms
Fitting a linear regression model with the lm function
Fitting a logistic regression model with the glm function

You will also require your own laptop computer. Minimum requirements: 1024×768 screen resolution, 1.5GHz CPU, 2GB RAM, 10GB free disk space, and a recent version of Windows, Mac OS X, or Linux (most computers purchased in the past 3-4 years likely meet these requirements).

This workshop requires participants to complete pre-workshop tasks and readings.

Course Outline

Module 1: Data cleaning and exploration review

Introduction to end-to-end analysis (overview of the entire analysis process)
Discussion of different data types (continuous, count, binary, multi-categorical)
Discussion of the importance of “tidy data” when performing statistical analyses
Introduction to advanced data cleaning methods
- Discussion of methods from the perspective of finding patterns in order to plan out your analyses and pre-processing steps

Module 1 Lab:

Data wrangling 101: cleaning and structuring messy datasets
Exploratory analysis and pre-processing, including visualization, outlier detection, and dimensionality reduction
Identifying variables to be used in future analyses

Module 2: Dealing with Missingness

Introduction to different kinds of missingness
Discussion of how missingness impacts statistical analyses
Review of complete-case analysis and its limitations
Introduction to common types of imputation methods, including multiple imputation methods
Discussion of when to use each type of method and the advantages and disadvantages of each

Module 2 Lab:

Explore and visualize missingness to understand how it influences the analysis if not dealt with properly
Impute missing data with single and multiple imputation methods (e.g., mean replacement, k-nearest neighbour, missForest, MICE) and use the imputed dataset(s) in analyses

Module 3: Modeling part 1

Overview of regression-based statistical modeling methods:
- Regression (multiple linear regression models, generalized linear models for binary outcomes, count outcomes, binomial outcomes, etc.)
- Variable selection (lasso, elastic net)
Model evaluation and selection
Model visualization

Module 3 Lab:

Identify the proper model given a particular question of interest, with examples including:
- Multiple linear regression analyses
- Generalized linear models for a diverse range of outcomes (e.g., binary, count, binomial)
Conduct variable selection using regularization methods such as lasso and elastic net.

Module 4: Modeling part 2

Overview of non-regression-based statistical modeling methods:
- Classification (e.g., Random Forest, K-nearest neighbours)
- Clustering (e.g., hierarchical clustering)
Model evaluation and selection
Model visualization

Module 4 Lab:

Identify appropriate classification or clustering techniques given a particular question of interest, with examples including:
- Logistic regression
- Random forest
- K-nearest neighbours
- Hierarchical clustering
Evaluate and visualize chosen models.

Module 5: Putting it all together

Brief review of first four modules
Finalizing a decision tree to be used as a tool when planning out your analysis
Examples of identifying which model to use based on different datasets using the decision tree

Module 5 Lab:

Discuss when the decision tree may not be sufficient for identifying a model, and review resources for continued learning
End-to-end analysis of real-world data, from data cleaning and exploration through analysis and interpretation
Select from a list of curated datasets and address predefined research questions using the learned methods

Module 6: Introduction to causal inference

Causation vs. correlation
Understanding the potential outcome framework
Estimating propensity scores with methods from earlier modules and using them to estimate the average treatment effect

Module 6 Lab:

Assess the balance of covariates between treated and untreated groups
Propensity score estimation via classification methods such as logistic regression and random forest
Estimation of the Average Treatment Effect (ATE) using propensity score matching and inverse probability weighting

Workshop Details:

Duration: 3 days

Start: Aug 19, 2024

End: Aug 21, 2024

Status: Application Open

Offers:

CAD $695 for applications received between February 7, 2024 and July 26, 2024

CAD $895 for applications received between July 27, 2024 and August 5, 2024

Limited to: 30 participants