Bioinformatics for Cancer Genomics

Course Objectives
Cancer research has rapidly embraced high-throughput technologies and Cloud computing. Large amounts of data are being generated by various microarray, tissue array, and next-generation sequencing platforms. Dedicated compute clouds such as the Cancer Genome Collaboratory [http://cancercollaboratory.org/] facilitate complex analyses of large cancer data sets from projects that host their data in the Cloud, such as the ICGC and PCAWG. Now more than ever, it is critical to have informatics skills, to know which bioinformatics resources are specific to cancer, and to know how to access and use the data sets available in the Cloud.
This 6-day workshop will cover the key bioinformatics concepts and tools required to analyze cancer genomic data sets and access and work with data sets in the Cloud.
Participants will gain practical experience and skills to:
- Visualize genomic data;
- Analyze cancer -omic data for gene expression, genome rearrangement, somatic mutations, and copy number variation;
- Conduct pathway analysis on the resulting cancer gene list;
- Integrate clinical data;
- Launch, configure, customize, and scale virtual machines (VM);
- Navigate and work with data sets from Cloud repositories; and
- Follow best practices in data and workflow management.
Target Audience
This workshop is intended for clinical researchers, research scientists, post-doctoral fellows, and graduate students with cancer genomics research projects.
Prerequisites: Familiarity with UNIX and R is required and can be gained through online activities. You should be familiar with these UNIX concepts (tutorials 1-3) [http://www.ee.surrey.ac.uk/Teaching/Unix/] and these R concepts (chapters 1-5) [http://www.cyclismo.org/tutorial/R/]. A useful hands-on tool for getting started in R is Swirl [http://swirlstats.com/students.html].
You will also require your own laptop computer. Minimum requirements: 1024x768 screen resolution, 1.5GHz CPU, 2GB RAM, 10GB free disk space, and a recent version of Windows, Mac OS X, or Linux (most computers purchased in the past 3-4 years likely meet these requirements). If you do not have access to your own computer, you may borrow one from the CBW. Please contact course_info@bioinformatics.ca for more information.
Pre-work and pre-readings can be found at https://bioinformaticsdotca.github.io/bicg_2018.
Course Material
Module 1: Introduction to Cancer Genomics (Instructor: Trevor Pugh)
Content:
- Overview of cancer genomics field
- Common applications of HT technologies in cancer genomics
- Concepts and case studies of cancer genomics from the literature:
  - Cancer genetics
  - Pharmacogenomics
  - Diagnostic vs. prognostic markers and druggable targets
Module 2: Ethics of Data Usage and Security (Instructor: Mark Phillips)
Content:
- Introduction to Cloud computing and virtual machines (VMs)
- Ethical conduct when using genomic data
- Security of big data sets in the Cloud
- VM security and usage best practices: SSH keys, ports, snapshotting a VM without privileged data, and shutting down the VM when not in use
Module 3: Databases and Visualization Tools (Francis Ouellette)
Content:
- Overview of cancer-specific databases, as well as genome browsing and cancer genome browsing
- The databases:
  - Collaboratory, ICGC portal, TCGA, etc.
  - COSMIC, dbSNP, etc.
- The browser tools:
  - IGV
  - UCSC
Lab Practical
- Logging into the Cloud (Zhibin Lu)
Module 4: Genome Alignment (Jared Simpson)
Content:
- What is involved in mapping reads to a reference genome
- What are the FASTQ and SAM/BAM file formats
- Some common terminology used to describe alignments
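As an illustration of the FASTQ format named above, here is a minimal Python sketch that reads four-line FASTQ records and decodes their base qualities. It assumes the common Phred+33 encoding and unwrapped (single-line) sequences; the names `FastqRecord` and `parse_fastq` are hypothetical and are not taken from the course material.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no data we need
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# A single made-up record, used only to exercise the parser.
example = "@read1\nACGT\n+\nII5!\n"
records = list(parse_fastq(io.StringIO(example)))
print(records[0])
```

Aligners consume records like these and write SAM/BAM output, where the same qualities are carried alongside the mapped position of each read.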
Lab Practical
Module 5: Genome Assembly (Jared Simpson)
Content:
- Fundamentals of de novo assembly
- Data structures used by assemblers (de Bruijn graphs and overlap graphs)
- Common steps that assemblers perform
- Overview of commonly used software
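To make the de Bruijn graph idea listed above concrete, the following Python sketch builds one from a handful of reads. It is a toy under stated assumptions: nodes are (k-1)-mers, each distinct k-mer contributes one edge, and sequencing errors, reverse complements, and coverage are ignored. The function name `build_de_bruijn` and the example reads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it links to."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping made-up reads drawn from the sequence ACGTCA.
graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the single path AC, CG, GT, TC, CA through this graph spells out ACGTCA, which is the intuition behind how assemblers recover contigs from overlapping reads.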
Lab Practical
Module 6: Somatic Copy Number Changes (Sorana Morrissy)
Content:
- Importance of copy number alterations in cancer
- Methods for detecting copy number alterations
- Tools for evaluating CNAs in HT-seq data
Lab Practical
Module 7: Somatic Mutations and Annotations (Sorana Morrissy)
Content:
- Relevance of detecting somatic mutations in cancer genomics
- Strategies for detection of somatic mutations and factors considered by SNP callers
- Binomial mixture models to model allelic counts
- Simultaneous analysis of tumor and normal data
- Sources of artifacts and false positives
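The binomial mixture idea listed above can be sketched in a few lines of Python. This is a simplified illustration, not the method of any particular caller: it assumes three genotype classes (AA, AB, BB), a fixed per-base error rate, and uniform priors, all of which are assumptions chosen for the example. Real somatic callers typically also account for factors such as the matched normal sample and tumour purity.

```python
from math import comb
from typing import Dict, Optional


def binomial_pmf(successes: int, trials: int, prob: float) -> float:
    """Probability of exactly `successes` in `trials` Bernoulli draws."""
    return comb(trials, successes) * prob**successes * (1 - prob) ** (trials - successes)


def genotype_posteriors(
    alt_count: int,
    depth: int,
    error_rate: float = 0.01,  # assumed value, for illustration only
    priors: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Posterior over genotypes given the number of variant-supporting reads."""
    # Expected fraction of variant reads under each mixture component.
    alt_fraction = {"AA": error_rate, "AB": 0.5, "BB": 1 - error_rate}
    if priors is None:
        priors = {genotype: 1 / 3 for genotype in alt_fraction}  # assumed uniform
    joint = {
        genotype: priors[genotype] * binomial_pmf(alt_count, depth, fraction)
        for genotype, fraction in alt_fraction.items()
    }
    total = sum(joint.values())
    return {genotype: value / total for genotype, value in joint.items()}


# 9 of 20 reads support the variant allele: the heterozygous class dominates.
posteriors = genotype_posteriors(alt_count=9, depth=20)
print(max(posteriors, key=posteriors.get))
```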
Lab Practical
Module 8: Gene Expression Profiling (Fouad Yousif)
Content:
- The Technology Platform: high-throughput sequencing
  - Variety of platforms and their differences
  - Experimental design considerations
  - Limitations of experiments
- The Analysis Tools:
  - Outline of an RNA-Seq analysis pipeline
  - Tools for analysis of RNA-Seq data
Lab Practical
Module 9: Gene Fusion Discovery and Genomic Rearrangements (Brian Haas)
Content:
- Genomic rearrangement and its effect on the transcriptome
- Biological relevance of gene fusions in cancer biology
- The technology platform: RNA-Seq
- Overview of the RNA-Seq protocol
- Experimental design considerations
- The analysis tools:
  - Alignment based fusion discovery
    - Paired end RNA-Seq alignments and gene fusion evidence
    - Discerning true fusions from artifacts
  - Assembly based fusion discovery
    - Benefits/drawbacks of assembly methods
    - Identifying gene fusions from RNA-Seq assemblies
- Clinical applications
Lab Practical
Module 10: Sharing and Scaling a VM (Instructor: George Mihaiescu)
Content:
- Considerations when snapshotting a VM
- How to snapshot a VM
- How to share a snapshot with other cloud tenants
- Launching a new VM from a snapshot
- Scaling out your VM fleet to meet your analysis needs
Lab Practical
Within the Collaboratory environment:
- Set up and launch a VM
- Install Docker Engine
- Snapshot your VM and share it with other cloud tenants
- Launch a new VM from a shared image
- For a given data set analysis task (e.g. VCF of 100 normal-tumour samples), determine the size of your job and your compute needs
- Scale out your VMs to meet the needs of this task
Module 11: Working Reproducibly in the Cloud (Instructor: Brian O’Connor)
Content:
- Introduction to Docker and Dockstore
- Overview of packaging your tools in Docker
- Overview of Dockstore and its current state
- Introduction to the Common Workflow Language
Lab Practical
Module 12: Big Data Analytics in the Cloud (Instructor: Christina Yung)
Content:
- Large-scale biological activities that generate datasets used in the Cloud
Lab Practical
Module 13: Genes to Pathways (Instructor: Jüri Reimand)
Content:
- Introduction to gene lists
- Gene annotations: Gene identifiers and pathway databases
- What is pathway enrichment analysis?
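One common way to answer the question above is an over-representation test. The Python sketch below computes a one-sided hypergeometric p-value for the overlap between a gene list and a pathway. It is an illustration only: the function name `enrichment_p_value` and the example counts are hypothetical, and a real analysis would also correct for testing many pathways.

```python
from math import comb


def enrichment_p_value(n_background: int, n_pathway: int, n_list: int, n_overlap: int) -> float:
    """P(overlap >= n_overlap) when n_list genes are drawn at random.

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_list:       genes in the query list
    n_overlap:    query genes that are also in the pathway
    """
    max_overlap = min(n_pathway, n_list)
    all_draws = comb(n_background, n_list)
    # Sum the hypergeometric probabilities for every overlap at least as large.
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_list - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / all_draws


# Made-up counts: 10 of 100 listed genes fall in a 200-gene pathway,
# against a background of 20,000 genes.
p_value = enrichment_p_value(20000, 200, 100, 10)
print(f"{p_value:.3g}")
```

A small p-value indicates that the pathway contains more genes from the list than random sampling from the background would be expected to produce.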
Lab Practical
Module 14: Variants to Networks (Instructor: Robin Haw)
Content:
- Overview of pathway and network analysis
- Basic network concepts
- Types of pathway and network information:
  - Focus on transcription factor regulatory networks
  - Pathway databases: KEGG, Reactome
Lab Practical: Reactome (Robin Haw)
Module 15: Integration of Clinical Data (Instructor: Lauren Erdman)
Content:
- Introduction to correlating clinical outcomes with cancer genomic data
- How do variants discovered in genomic data result in clinical outcomes?
- Challenges with integration of heterogeneous data types (clinical vs. genomics)
- Survival analysis (univariate and multivariate)
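For the univariate survival analysis listed above, a standard starting point is the Kaplan-Meier estimator. The Python sketch below computes it from follow-up times and event indicators. The function name `kaplan_meier` and the five-patient data set are made up for illustration; in practice a dedicated survival analysis package would normally be used.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is True if the event was observed at times[i],
    and False if the patient was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share this follow-up time.
        while i < len(data) and data[i][0] == current_time:
            deaths += int(data[i][1])
            leaving += 1
            i += 1
        if deaths:
            # Censored patients at this time still count as at risk.
            survival *= 1 - deaths / at_risk
            curve.append((current_time, survival))
        at_risk -= leaving
    return curve


# Made-up follow-up times (months); False marks a censored patient.
curve = kaplan_meier([5, 8, 8, 12, 15], [True, True, False, True, False])
for time, probability in curve:
    print(time, round(probability, 3))
```

For this toy data set the estimated survival drops to 0.8, 0.6, and 0.3 at months 5, 8, and 12. Comparing such curves between patient groups defined by a genomic feature is one way to relate variants to clinical outcome.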
Lab Practical