This specialized course is designed for bioinformatics professionals and researchers who work with biological datasets and are unsure which statistical analyses to apply. The course focuses on empowering participants to conduct end-to-end data analyses: selecting statistical methods aligned with the research question, and developing a decision tree that guides data cleaning, exploration, analysis, and interpretation. Hands-on coding sessions in R with real-world datasets build the skills needed to confidently navigate and communicate insights throughout the analytical process. Particular emphasis is placed on understanding the statistical methods used, when to apply them, and how to interpret their results. By the course’s conclusion, participants will be equipped with a concise and powerful toolkit for unlocking the full potential of complex biological datasets.
Participants will gain the practical experience and skills to:
- Understand the challenges and opportunities presented by complex biological data.
- Apply fundamental techniques for exploring and summarizing biological datasets.
- Address missing values through sophisticated imputation methodologies.
- Analyze relationships within high-dimensional data, both between variables and samples, employing regression methods, regularization techniques, and clustering methods.
- Integrate various statistical methods into a cohesive workflow to tackle a variety of common problems in bioinformatics from data exploration to interpretation.
The target audience for this course is graduate students, post-doctoral fellows, and researchers in industry or academia who are already familiar with conducting basic statistics in R and want to learn how to carry an analysis from start to finish using intermediate and advanced statistical techniques.
Participants are expected to have completed the ‘Introduction to R’ and ‘Analysis Using R’ courses available through Bioinformatics.ca, or to possess equivalent proficiency in R. Previous years’ R workshop materials are available open access. In particular, we expect participants to be comfortable with the following skills in R, as the course will build upon them:
- Reading data into R
- Data types and classes
- Manipulating data in R with Tidyverse (e.g., select, mutate, filter, etc.)
- Making plots in ggplot2
- Writing custom functions
- Estimating a correlation matrix
- Hierarchical and K-means clustering algorithms
- Fitting a linear regression model with the lm function
- Fitting a logistic regression model with the glm function
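As a self-check on the last two prerequisites, the following minimal sketch fits both model types on the built-in `mtcars` dataset (the dataset and variables are illustrative only, not course material):

```r
# Linear regression: miles-per-gallon as a function of weight and horsepower
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_lm)

# Logistic regression: transmission type (0/1) as a function of weight,
# fit with glm() and a binomial family
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit_glm)
```

If reading and interpreting the `summary()` output of both models feels routine, you meet the modeling prerequisites.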
You will also require your own laptop computer. Minimum requirements: 1024×768 screen resolution, 1.5 GHz CPU, 2 GB RAM, 10 GB free disk space, and a recent version of Windows, macOS, or Linux (most computers purchased in the past 3-4 years likely meet these requirements).
This workshop requires participants to complete pre-workshop tasks and readings.
Module 1: Data cleaning and exploration review
- Introduction to end-to-end analysis (overview of the entire analysis process)
- Discussion of different data types (continuous, count, binary, multi-categorical)
- Discussion of the importance of “tidy data” when performing statistical analyses
- Introduction to advanced data cleaning methods
- Discussion of exploratory methods for finding patterns that inform your analysis plan and pre-processing steps
Module 1 Lab:
- Data wrangling 101: cleaning and structuring messy datasets
- Exploratory analysis and pre-processing, including visualization, outlier detection, and dimensionality reduction
- Identifying variables to be used in future analyses
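The lab steps above can be sketched as a short R workflow; the dataset (`iris`), outlier threshold, and column choices here are illustrative assumptions, not the lab's actual materials:

```r
library(dplyr)

# Tidy wrangling: drop incomplete rows, standardize numeric columns
clean <- iris %>%
  filter(!is.na(Sepal.Length)) %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))

# Flag potential outliers more than 3 SDs from the mean (illustrative cutoff)
outliers <- clean %>% filter(abs(Sepal.Width) > 3)

# Dimensionality reduction via principal component analysis
pca <- prcomp(clean[, 1:4])
summary(pca)   # variance explained per component guides variable selection
```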
Module 2: Dealing with Missingness
- Introduction to different kinds of missingness
- Discussion of how missingness impacts statistical analyses
- Review of complete-case analysis and its limitations
- Introduction to common types of imputation methods, including multiple imputation methods
- Discussion of when to use each type of method and the advantages and disadvantages of each
Module 2 Lab:
- Explore and visualize missingness to understand how it can bias analyses if not handled properly
- Impute missing data with single and multiple imputation methods (e.g., mean replacement, k-nearest neighbours, missForest, MICE) and use the imputed dataset(s) in downstream analyses
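A minimal sketch of the multiple-imputation workflow, using the `mice` package and its bundled `nhanes` example dataset (the model formula is illustrative; the lab's actual datasets will differ):

```r
library(mice)

data(nhanes)   # small example dataset with missing values, shipped with mice

# Create m = 5 imputed datasets with predictive mean matching
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Fit the analysis model within each imputed dataset, then pool the results
fits <- with(imp, lm(chl ~ age + bmi))
pool(fits)     # combined estimates and variances across imputations
```

Pooling (rather than analyzing a single imputed dataset) is what lets multiple imputation propagate the uncertainty due to missingness into the final standard errors.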
Module 3: Modeling part 1
- Overview of regression-based statistical modeling methods:
- Regression (multiple linear regression models, generalized linear models for binary outcomes, count outcomes, binomial outcomes, etc.)
- Variable selection (lasso, elastic net)
- Model evaluation and selection
- Model visualization
Module 3 Lab:
- Identify the proper model given a particular question of interest, with examples including:
- Multiple linear regression analyses
- Generalized linear models for a diverse range of outcomes (e.g., binary, count, binomial)
- Conduct variable selection using regularization methods such as lasso and elastic net.
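The regularization step can be sketched with the `glmnet` package; the predictor columns below are illustrative placeholders (glmnet expects a numeric predictor matrix and a response vector):

```r
library(glmnet)

x <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat", "qsec")])
y <- mtcars$mpg

# alpha = 1 gives the lasso; 0 < alpha < 1 gives the elastic net
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the cross-validated lambda; variables shrunk exactly
# to zero have been "deselected"
coef(cv_fit, s = "lambda.min")
```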
Module 4: Modeling part 2
- Overview of non-regression-based statistical modeling methods:
- Classification (e.g., Random Forest)
- Clustering (e.g., K-means, hierarchical clustering)
- Model evaluation and selection
- Model visualization
Module 4 Lab:
- Identify appropriate classification or clustering techniques given a particular question of interest, with examples including:
- Logistic regression
- Random forest
- K-nearest neighbours
- Hierarchical clustering
- Evaluate and visualize chosen models.
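A hedged sketch of the supervised and unsupervised methods listed above, on the built-in `iris` data (assumes the `randomForest` package is installed; the lab's datasets will differ):

```r
library(randomForest)

# Classification: random forest with out-of-bag error estimation
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
rf$confusion   # OOB confusion matrix for model evaluation

# Clustering on the numeric columns only
km <- kmeans(iris[, 1:4], centers = 3)
hc <- hclust(dist(scale(iris[, 1:4])))

# Compare cluster assignments against the known labels
table(km$cluster, iris$Species)
```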
Module 5: Putting it all together
- Brief review of first four modules
- Finalizing a decision tree to be used as a tool when planning out your analysis
- Examples of identifying which model to use based on different datasets using the decision tree
Module 5 Lab:
- Discuss when the decision tree may not be sufficient for identifying a model, and identify resources for continued learning
- End-to-end analysis of real-world data starting with data cleaning, exploration, analysis, and interpretation
- Select from a list of curated datasets and address predefined research questions using the learned methods
Module 6: Introduction to causal inference
- Causation vs. correlation
- Understanding the potential outcomes framework
- How to estimate propensity scores with methods learned and use them for estimating the average treatment effect
Module 6 Lab:
- Assess the balance of covariates between treated and untreated groups
- Propensity score estimation via classification methods such as logistic regression and random forest
- Estimation of the average treatment effect (ATE) using propensity score matching and inverse probability weighting
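The lab's propensity-score pipeline can be sketched on simulated data; everything below (variable names, effect sizes, sample size) is a made-up illustration with a known true ATE of 2:

```r
set.seed(42)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)                       # observed covariates
tr <- rbinom(n, 1, plogis(0.5 * x1 - 0.3 * x2))      # confounded treatment
y  <- 2 * tr + x1 + x2 + rnorm(n)                    # outcome; true ATE = 2

# Step 1: estimate propensity scores with logistic regression
ps <- fitted(glm(tr ~ x1 + x2, family = binomial))

# Step 2: inverse-probability weights
w <- ifelse(tr == 1, 1 / ps, 1 / (1 - ps))

# Step 3: weighted difference in means estimates the ATE
ate <- weighted.mean(y[tr == 1], w[tr == 1]) -
       weighted.mean(y[tr == 0], w[tr == 0])
ate   # estimate of the true ATE (2)
```

In the lab, the logistic regression in step 1 can be swapped for a random forest, as listed above.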
Duration: 3 days
Start: Aug 19, 2024
End: Aug 21, 2024
Status: Registration Closed
Canadian Bioinformatics Workshops promotes open access. Past workshop content is available under a Creative Commons License.