High-Throughput Biology: From Sequence to Networks (2017)

Course Objectives

With the introduction of next generation sequencing platforms, it is becoming feasible to consider sequencing approaches to address many research projects. Now more than ever, having knowledge of the available bioinformatic resources and the informatic skills to analyze such data is critical.

The Canadian Bioinformatics Workshops, in collaboration with Cold Spring Harbor Laboratory, has developed a comprehensive 7-day course covering the key bioinformatics concepts and tools required to analyze DNA- and RNA-sequence reads using a reference genome. Participants will gain experience in cloud computing and data visualization tools, which will be applied throughout the course. Beginning with the workflow that leads from platform images to sequence data, participants will gain the practical skills to evaluate sequence read quality, map reads to a reference, and analyze sequence reads for variation and expression level. The workshop will conclude with pathway and network analysis of the resulting ‘gene’ list. The tutorials are designed as self-contained units that include example data (e.g. Illumina paired-end data) and detailed instructions for installing all required bioinformatics tools.

Target Audience

Graduates, postgraduates and PIs working with or about to embark on analysis of data from next generation sequencing platforms (Illumina focus). A reference genome is required.

Prerequisites for attendance:

Basic familiarity with the Linux environment and S, R, or Matlab. Must be able to complete and understand the following simple Linux and R tutorials before attending:

You will also require your own laptop computer with wireless internet capability. Minimum requirements: 1024x768 screen resolution, 1.5GHz CPU, 1GB RAM, and a recent version of Windows, Mac OS X or Linux (most computers purchased in the past 3-4 years likely meet these requirements). If you do not have access to your own computer, you may borrow one from CSHL. Please contact CSHL in advance to request a laptop.


Cold Spring Harbor, NY


$2855 USD taxes included

To Apply:

Please visit CSHL Courses to apply. Deadline for applications is January 15, 2017.

Course Outline

Day 1

Module 1 - Introduction to High-Throughput Sequencing (2017) (Instructor: Jared Simpson)

  • Overview of high-throughput sequencing technologies: major players and their strengths and weaknesses

Module 2 - Data Visualization (2017) (Instructor: Florence Cavalli)

  • Data file formats used in genome visualization (FASTA, BED, WIG, GFF, etc)
  • Introduction to genomic data visualization tools and how they can be used to visualize sequencing read data: UCSC, IGV, Savant, GBrowse
  • Integrating other data sets into a browser
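
The file formats listed above do not all count positions the same way: BED intervals are 0-based and half-open, while GFF intervals are 1-based and inclusive. The sketch below shows the conversion for a single interval. The function name `bed_to_gff_coords` and the example interval are hypothetical and are not taken from the course material.

```python
def bed_to_gff_coords(bed_start, bed_end):
    """Convert a BED interval (0-based, half-open) to GFF coordinates
    (1-based, inclusive). Only the start shifts; the end keeps its value."""
    if bed_start < 0 or bed_end < bed_start:
        raise ValueError("invalid BED interval")
    return bed_start + 1, bed_end


# Hypothetical example: the first 100 bases of a chromosome
gff_start, gff_end = bed_to_gff_coords(0, 100)
print(gff_start, gff_end)
```

Both forms describe the same 100 bases. Mixing the two conventions is a common cause of off-by-one errors when loading custom tracks into a browser.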

Lab Practical: Variant detection and visualization within the genome using IGV

Module 3 - Genome Alignment (2017) (Instructor: Mathieu Bourgey)

  • What is involved in mapping reads to a reference genome
  • What are the FASTQ and SAM/BAM file formats
  • Some common terminology used to describe alignments
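
As a rough illustration of the FASTQ format mentioned above, the sketch below parses one four-line record and decodes its quality string. It assumes the Phred+33 encoding (quality score = ASCII code minus 33). The function name `parse_fastq_record` and the record contents are made up for illustration.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (read_id, sequence, scores)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a well-formed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so score = ASCII code - 33
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


# Hypothetical record, not taken from the course data
record = ["@read1", "GATTACA", "+", "IIIII#5"]
read_id, seq, quals = parse_fastq_record(record)
print(read_id, seq, quals)
```

In practice an established parser would normally be used. The point here is only that each read carries one quality character per base.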

Lab Practical:

  • Connecting to the Cloud
  • Genome alignment exercise

Integrated Assignment
Consolidate the skills you learned by performing an alignment.

Day 2

Module 6 - De Novo Assembly (2017) (Instructor: Jared Simpson)

  • Fundamentals of de novo assembly
  • Data structures used by assemblers (de Bruijn graphs and overlap graphs)
  • Common steps that assemblers perform
  • Overview of commonly used software
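
To make the de Bruijn graph idea concrete, here is a minimal sketch that builds one from a few reads: each k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. It is a toy that ignores sequencing errors, reverse complements, and graph simplification, all of which real assemblers must handle. The function name `build_de_bruijn` and the reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it.

    Repeated edges are kept, so the list length reflects how many
    times a k-mer was observed across the reads.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two hypothetical overlapping reads
reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Walking the edges from AC through CG, GT, and TC to CA spells out ACGTCA, the sequence that both reads overlap to form.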

Lab Practical: Perform a de novo assembly task.

Module 4 - Small-Variant Calling and Annotation (2017) (Instructor: Mathieu Bourgey)

  • SNPs, SNVs, and short-INDELs and why to look for them
  • BQ recalibration, duplicate removal, aligner choice
  • Detecting variants and factors taken into account by the SNP callers
  • Different types of SNP calling: haploid/diploid, trio, somatic mutations, pooled
  • Determining which SNPs are good from the millions detected
  • INDEL cleaning
  • Standard file formats for SNPs
  • Introduction to SNP calling tools and how they compare with each other

Lab Practical: SNP detection exercise

Module 5 - Structural Variant Calling (2017) (Instructor: Mathieu Bourgey)

  • Structural variants (SVs), different types, mechanisms that give rise to SVs, and how SVs and CNVs differ
  • Differences between human and model organism genomes
  • Detecting SVs via sequencing (read pair, read depth, combined approach, local de novo assembly) and which SV types are detectable by which strategies
  • Introduction to SV detection tools
  • File formats used to describe SVs

Lab Practical:

  • SV discovery in a single human genome
  • Brief intro to SV visualization and interpretation

Day 3

Module 7 - Introduction to RNA Sequencing and Analysis (2017) (Instructor: Malachi Griffith)

  • Basic introduction to biology of RNA-seq
  • Experimental design and analysis considerations
  • Commonly asked questions

Lab Practical:

  • Introduction to the test data
  • Examine and understand the format of raw FASTQ files
  • Obtain reference genomes (FASTA) and gene annotation resources (GTF/GFF)
  • Perform pre-alignment QC

Module 8 - RNA-Seq alignment and visualization (2017) (Instructor: Fouad Yousif)

  • Use of Bowtie/TopHat
  • Introduction to the BAM format
  • Basic manipulation of BAMs with samtools, Picard etc.
  • Visualization of RNA-seq alignments - IGV
  • BAM read counting and determination of variant allele expression status

Lab Practical:

  • Run Bowtie2/TopHat2 with parameters suitable for gene expression analysis
  • Use samtools to explore the features of the SAM/BAM format and perform basic manipulation of these alignment files (view, sort, index, manipulate headers, extract data, etc.)
  • Use IGV to visualize TopHat2 alignments, view a variant position, load exon junctions files, etc.

Day 4

Module 9 - Expression and Differential Expression (2017) (Instructor: Obi Griffith)

  • Get FPKM style expression estimates using Cufflinks
  • Perform differential expression analysis with Cuffdiff
  • Perform summary analysis with CummeRbund
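
For orientation, the basic FPKM definition (fragments per kilobase of transcript per million mapped fragments) can be written as a one-line calculation. This is the textbook formula only, not a description of how Cufflinks computes its estimates, which relies on a statistical model. The function name `fpkm` and the numbers are illustrative assumptions.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of transcript per million
    mapped fragments. The factor 1e9 combines 1e3 (bases per kilobase)
    and 1e6 (fragments per million)."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical transcript: 500 fragments, 2,000 bp long, 10 million mapped fragments
value = fpkm(500, 2000, 10_000_000)
print(value)
```

Dividing by transcript length and by library size is what makes values comparable between transcripts of different lengths and between libraries of different depths.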

Downstream interpretation of expression analysis (multiple testing, clustering, heatmaps, classification, pathway analysis, etc) will also be discussed.

Lab Practical:

  • Run Cufflinks, Cuffdiff, and CummeRbund
  • Explore the output of these in R

Module 10 - Reference free analysis (2017) (Instructor: Malachi Griffith)

  • Explore the use of Kallisto to get abundance estimates without first aligning to a reference.

Module 11 - Isoform Discovery and Alternate Expression (2017) (Instructor: Malachi Griffith)

  • Explore use of Cufflinks in reference annotation based transcript (RABT) assembly mode and ‘de novo’ assembly mode. Both modes require a reference genome sequence.

Lab Practical: Run Cufflinks in alternate modes more conducive to isoform discovery and explore the results

Day 5

Module 12 - Introduction to Pathway and Network Analysis (2017) (Instructor: Jüri Reimand)

  • Where do gene lists come from and what are they useful for?
  • Pathway and network analysis overview
  • Presenting a workflow of concepts and tools from gene list to pathway analysis
  • Provide examples of multiple paths through the workflow that will be covered in the workshop
  • Sources of pathway and network information: GO biological process, network databases, pathway databases. Examples of pros and cons of each type of information
  • General issues: gene identifiers, data normalization

Module 13 - Finding Over-represented Pathways in Gene Lists (2017) (Instructor: Jüri Reimand)

  • Statistics for detecting over-representation e.g. hypergeometric test, GSEA
  • Multiple testing correction: Bonferroni, Benjamini-Hochberg FDR
  • Filtering Gene Ontology e.g. using evidence codes
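
The two statistical ideas above can be sketched in a few lines of standard-library Python: a one-sided hypergeometric test for over-representation, followed by Benjamini-Hochberg adjustment of the resulting p-values. This is a minimal sketch for intuition, not the implementation used by g:Profiler or any other tool in the course. The function names and all numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of at least k pathway genes in a list of n genes
    drawn from N background genes, K of which belong to the pathway."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical example: 5 of 20 listed genes fall in a 100-gene pathway,
# with a background of 10,000 genes
p = hypergeom_pvalue(k=5, n=20, K=100, N=10_000)
print(p)
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

Bonferroni correction would instead multiply each p-value by the number of tests (capped at 1), which is stricter than the Benjamini-Hochberg adjustment shown here.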

Lab Practical: Performing over-representation analysis

  • Workflow of tools and steps
  • g:Profiler tool for over-representation analysis
  • Gene Set Enrichment Analysis (GSEA) and Enrichment Maps software tool
  • Running gene enrichment tools on your gene list

Module 14 - Network Visualization and Analysis with Cytoscape (2017) (Instructor: Veronique Voisin)

  • Introduction to Cytoscape
  • Cytoscape demo

Lab Practical: Tutorials on Cytoscape

  • Layouts
  • Labels
  • Enrichment maps

Day 6

Module 15 - More depth on Pathway and Network Analysis (2017) (Instructor: Robin Haw)

  • Basic network concepts
  • Types of pathway and network information
  • Network and pathway databases
  • More examples of pathway and network analysis methods
  • Reactome analysis tools: network clustering and PARADIGM

Lab Practical #1: Tutorials on Cytoscape (Veronique Voisin)

  • Networks

Lab Practical #2: Reactome (Robin Haw)

  • Workflow of tools and steps
  • Reactome FI

Module 16 - Gene Function Prediction (2017) (Instructors: Quaid Morris, Veronique Voisin)

  • Functional association networks and gene function prediction
  • Functional relationships, similarity space
  • Guilt-by-association concept
  • GeneMANIA and STRING tools

Lab Practical: (Veronique Voisin)

  • Workflow of tools and steps
  • Using GeneMANIA to assess gene and gene list function

Evening Integrated Assignment Part: g:Profiler, EnrichmentMap, Reactome FI, GeneMANIA (Veronique Voisin)

Day 7

Module 17 - Regulatory Network Analysis (2017) (Instructors: Michael Hoffman, Veronique Voisin)

  • Overview of transcription and transcriptional regulation
  • Data sources for regulatory data - ChIP-seq, DNase-seq, methylation data
  • Using epigenomics data
  • Finding transcription factor binding sites

Lab Practical: Overrepresentation of transcription factor motifs using the Cytoscape app iRegulon

Canadian Bioinformatics Workshops promotes open access. Past workshop content is available under a Creative Commons License.