Informatics for RNA-seq Analysis


Course Objectives

High-throughput sequencing of RNA libraries (RNA-seq) has become increasingly common and has largely supplanted gene expression microarrays for transcriptome profiling. When processed appropriately, RNA-seq data can provide a considerably more detailed view of the transcriptome. The CBW has developed a 3-day course that introduces RNA-seq data analysis, followed by integrated tutorials demonstrating the use of popular RNA-seq analysis packages. The tutorials are designed as self-contained units that include example data (Illumina paired-end RNA-seq data) and detailed instructions for installing all required bioinformatics tools (HISAT, StringTie, etc.).

Participants will gain practical experience and skills to be able to:

  • Perform command-line Linux based analysis on the cloud
  • Assess quality of RNA-seq data
  • Align RNA-seq data to a reference genome
  • Estimate known gene and transcript expression
  • Perform differential expression analysis
  • Discover novel isoforms
  • Visualize and summarize the output of RNA-seq analyses in R
  • Assemble transcripts from RNA-seq data

Target Audience

Graduates, postgraduates, and PIs working on, or about to embark on, an analysis of RNA-seq data. Attendees may be familiar with some aspects of RNA-seq analysis (e.g. gene expression analysis) or have no direct experience.

Prerequisites: Basic familiarity with the Linux environment and with S, R, or MATLAB. You must be able to complete and understand the following simple Linux and R tutorials (up to and including “Descriptive Statistics”) before attending:

You will also require your own laptop computer. Minimum requirements: 1024x768 screen resolution, 1.5GHz CPU, 2GB RAM, 10GB free disk space, and a recent version of Windows, Mac OS X or Linux (most computers purchased in the past 3-4 years likely meet these requirements). If you do not have access to your own computer, you may borrow one from the CBW. Please contact the CBW for more information.

Pre-work and pre-readings can be found at

Course Outline

Day 1

Module 1: Introduction to Cloud Computing (Obi Griffith)

  • Introduction to cloud computing concepts

Lab Practical:

  • Learn to configure, launch, and connect to an Amazon cloud instance.

Module 2: Introduction to RNA sequencing and analysis (Malachi Griffith)

  • Basic introduction to biology of RNA-seq
  • Experimental design and analysis considerations
  • Commonly asked questions

Lab Practical:

  • Introduction to the test data
  • Examine and understand the format of raw FastQ files
  • Obtain reference genomes (fasta) and gene annotation resources (GTF/GFF)
  • Perform pre-alignment QC
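
As a rough illustration of the FastQ format examined in this practical: each read occupies four lines (an identifier line starting with "@", the base sequence, a separator line starting with "+", and a quality string with one character per base). The sketch below is not part of the course material. It assumes the Phred+33 quality encoding, and the function name parse_fastq and the example read are made up for illustration.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, qualities) from a 4-line-per-record FASTQ stream.

    Assumes Phred+33 encoding: quality score = ASCII code of the character - 33.
    Stops at end of file or at the first blank line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, [ord(c) - 33 for c in qual]

# Hypothetical single-read example (not taken from the course test data).
example = "@read1\nACGT\n+\nII5!\n"
records = list(parse_fastq(io.StringIO(example)))
for read_id, seq, quals in records:
    print(read_id, seq, quals)
```

In practice, a dedicated QC tool such as FastQC summarizes these per-base qualities across all reads; the sketch only shows where the numbers come from.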

Module 3: RNA-seq alignment and visualization (Fouad Yousif)

  • RNA-seq alignment challenges and common questions
  • Alignment strategies
  • Introduction to HISAT2
  • Introduction to the BAM and BED formats
  • Basic manipulation of BAMs with samtools, Picard, etc.
  • Visualization of RNA-seq alignments - IGV
  • Alignment QC Assessment
  • BAM read counting and determination of variant allele expression status

Lab Practical:

  • Run HISAT2 with parameters suitable for gene expression analysis
  • Use samtools to explore and manipulate the features of the SAM/BAM files
  • Use IGV to visualize HISAT2 alignments, view a variant position, load exon junction files, etc.
  • Determine BAM read counts at a variant position
  • Use samtools flagstat, samstat, and FastQC to assess quality of alignments
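
When exploring SAM/BAM records, it helps to know that the FLAG column is a bit field defined by the SAM specification, and that summaries such as samtools flagstat are built from these bits. The following is a minimal sketch of decoding a FLAG value. The bit values come from the SAM specification, but the short label strings and the function name decode_flag are our own choices for illustration, not samtools output.

```python
# Bit values from the SAM specification; labels are illustrative only.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value, lowest bit first."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

# 99 and 147 are a typical pair of FLAG values for a properly paired read.
for flag in (99, 147, 4):
    print(flag, decode_flag(flag))
```

For example, a FLAG of 99 (1 + 2 + 32 + 64) decodes to a paired, properly paired, first-in-pair read whose mate maps to the reverse strand.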

Integrated Assignment:

  • Using a subset of data, assess the specific expression of a given gene.

Day 2

Module 4: Expression and differential expression (Obi Griffith)

  • Expression estimation for known genes and transcripts
  • FPKM/TPM expression estimates vs. raw counts
  • Differential expression methods
  • Downstream interpretation of expression and differential expression estimates
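
To make the difference between raw counts and FPKM/TPM concrete, here is a small sketch using the textbook definitions. It is a simplification: the gene names, counts, and lengths are invented, and real tools typically work with fragment counts and effective transcript lengths rather than the plain lengths assumed here.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}

def tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {g: counts[g] / lengths[g] for g in counts}
    scale = sum(rates.values())
    return {g: rates[g] / scale * 1e6 for g in counts}

# Hypothetical genes: raw counts differ 7-fold, but so do the lengths.
counts = {"geneA": 100, "geneB": 200, "geneC": 700}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 7000}  # in bases

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
for gene in counts:
    print(gene, counts[gene], round(fpkm_values[gene], 2), round(tpm_values[gene], 2))
```

In this toy case the three genes have very different raw counts but identical FPKM and TPM values, because each has the same number of fragments per base. TPM values always sum to one million within a sample, which FPKM values generally do not.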

Lab Practical:

  • Generate gene/transcript expression estimates with StringTie
  • Perform differential expression analysis with Ballgown
  • Summarize and visualize differential expression results

Module 5: Reference free alignment (Malachi Griffith)

  • Explore the use of Kallisto to get transcript abundance estimates by pseudoalignment to a transcriptome index, without first performing a full read alignment to a reference genome.

Day 3

Module 6: Genome-Guided and Genome-Free Transcriptome Assembly (Brian Haas)

  • Explore use of StringTie in reference annotation based transcript (RABT) assembly mode and de novo assembly mode. Both modes require a reference genome sequence.
  • Reconstructing transcripts using Trinity
  • Genome-free transcript quantification and differential expression analysis

Lab Practical:

  • Assemble RNA-seq transcripts
  • Explore use of StringTie in reference annotation based transcript (RABT) assembly mode and de novo assembly mode. Both modes require a reference genome sequence.

Module 7: Functional Annotation and Analysis of Transcripts (Brian Haas)

  • Predict coding regions of transcripts
  • Using Trinotate to capture evidence for transcript function
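
One simple way to think about coding region prediction is as a scan for open reading frames (ORFs). The sketch below is a deliberately naive illustration and is not how Trinotate or any particular tool works: it searches only the forward strand, requires an ATG start and an in-frame stop codon, and ignores coding potential scores and homology evidence. The function name longest_orf and the example sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end, orf) for the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based, end-exclusive, and the ORF includes the stop codon.
    Returns None if no complete ORF is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end, seq[start:end])
                start = None  # look for the next ORF in this frame
    return best

# Hypothetical toy transcript.
result = longest_orf("CCATGAAATTTTAGGG")
print(result)
```

A real prediction workflow would also consider the reverse strand, partial ORFs at transcript ends, and a minimum ORF length.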

Lab Practical:

  • Explore TrinotateWeb for navigating transcript annotation and expression data