Before we can begin to apply rigorous statistical tools to research data, we often need to approach our data intuitively, and look for meaningful associations, surprising patterns, or irregularities, to formulate hypotheses. This is Exploratory Data Analysis (EDA). This workshop introduces the essential tools and strategies that are available for EDA through the free statistical workbench R. Steps covered in this workshop are broadly relevant for many areas of modern, quantitative biology such as flow cytometry, expression profile analysis, function prediction and more.
Participants will gain practical experience and skills to be able to:
- Use R and its analysis tools, read and modify code, and explore protocols that can be adapted for their own research tasks.
- Write R functions and analysis scripts.
- Plot and visualize data, from the elementary built-in routines and their (sometimes bewildering) array of parameters to sophisticated, publication-ready presentations.
This workshop is intended for graduates, postgraduates, and PIs who design and execute strategies for data analysis and who have some familiarity with the R statistical workbench.
You will require your own laptop computer. Minimum requirements: 1024×768 screen resolution, 1.5 GHz CPU, 2 GB RAM, 10 GB free disk space, and a recent version of Windows, Mac OS X, or Linux (most computers purchased in the past 3-4 years likely meet these requirements). If you do not have access to your own computer, please contact email@example.com for other possible options.
This workshop requires participants to complete pre-workshop tasks and readings.
Module 1: Exploratory data analysis for biological data (EDA)
- Exploratory data analysis principles
- Reading and writing data from common biological file-formats, including numeric data, sequences, annotations, and networks
- Regular expressions
- Descriptive statistics: mean/median and variance, quantiles, outliers
- Plotting in R: basics, advanced options, special packages and best practices
Module 2: Regression
- Types of models for regression analysis in R
- Calculating linear regressions and plotting residuals
- Non-linear regression with arbitrary functions
- Maximum Information Coefficient
Module 3: Dimension reduction
- Visualizing multi-dimensional data
- Dimensionality reduction with Principal Components Analysis
- Using explicit models for data reduction
- t-Stochastic Neighbour Embedding
- Principal component analysis of high-dimensional data
- An integrated tutorial (until 8pm)
Module 4: Clustering
- Calculating ‘distance’ between (high-dimensional) data points
- Clustering principles & methods
- Assessing the quality of clustering results
- Evaluation and comparison of different clustering techniques
Module 5: Hypothesis testing for EDA
- Statistical models, hypotheses and how to test them
- Quantifying quality: p-values, distributions, Z-scores and “significance”
- Nonparametric approaches
- Bootstrap and resampling techniques
- Multiple testing corrections: Bonferroni, family-wise error rate, false discovery rate
- Simulation testing
Duration: 2 days
Start: May 15, 2019
End: May 16, 2019
Course Mode: Onsite
Status: Registration Closed
Open Access Content:
Canadian Bioinformatics Workshops promotes open access. Past workshop content is available under a Creative Commons License.