Cloud Computing in Bioinformatics with Big Data (2017)
Several big data genomics projects, including the International Cancer Genome Consortium (ICGC), have chosen to host their data in the Cloud and to provide access to configurable virtual machines (VMs) for computing on that data, removing the need to purchase and maintain your own compute cluster. Similarly, many labs are moving to renting compute time from cloud providers. Analysis of a single genome or a small selected subset differs from analysis of many genomes, particularly in the compute infrastructure required.
To help researchers navigate this new compute space, the CBW has developed a two-day course that introduces the security and privacy issues involved in working with human genome data and the processes required to access such data. After reviewing cloud computing infrastructure, the workshop provides a hands-on introduction to launching and configuring your own virtual machine (VM), accessing cloud-based data sets, and scaling up the number of VMs to meet your analysis needs. Customizing VMs with your own tools and cloud computing best practices will also be discussed.
Participants will gain practical experience and skills to be able to:
- Launch their own virtual machine (VM)
- Configure a VM with prepackaged tools
- Pull in data sets from Cloud repositories
- Follow best practices in data and workflow management
- Customize a VM with their own tools
- Scale up their VM to meet their analysis needs
This workshop is intended for graduate students, postdoctoral fellows, and PIs who need to learn how to access and work with cloud compute infrastructure in support of their research on big data sets (e.g. cancer data from the ICGC). Attendees must be familiar with the Unix command-line interface.
Prerequisites for attendance: familiarity with a Linux environment and basic UNIX commands. Before attending, you must be able to complete and understand simple Linux command-line tasks such as navigating directories, inspecting and manipulating files, and chaining commands with pipes.
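As a rough gauge of the expected command-line fluency, you should be comfortable with a short session like the following (the file and its contents are made up purely for illustration):

```shell
# Create a small tab-separated sample file to work with
printf 'chr1\t100\nchr2\t200\nchr1\t300\n' > regions.txt

# Inspect, count, filter, and sort -- the level of command-line
# fluency expected before attending
cat regions.txt
wc -l < regions.txt                # count lines in the file
grep -c 'chr1' regions.txt         # count lines matching chr1
sort -k2 -n regions.txt | head -1  # smallest coordinate first
```

If each of these commands is familiar, you meet the command-line prerequisite.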
You will also need your own laptop. Minimum requirements: 1024x768 screen resolution, 1.5 GHz CPU, 4 GB RAM, and a recent version of Windows, Mac OS X, or Linux (most computers purchased in the past 3-4 years meet these requirements). If you do not have access to your own computer, you may borrow one from the CBW. Please contact email@example.com for more information.
Module 1: Introduction to Cloud Computing and Virtual Machines (2017) (Instructor: Francis Ouellette)
- What is a Cloud?
- What is a VM?
- Where can one launch a VM? Collaboratory, OpenStack, AWS, Compute Canada, etc.
- Components of a VM to consider: cores, memory, etc.
- Workflow best practices
- Configuring a VM for your analysis needs
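Once logged in to a Linux VM, a few standard commands reveal the components listed above (cores, memory, disk) — a useful sanity check that the machine matches the flavor you requested:

```shell
# Inspect the resources of the machine you are logged in to
# (works on any Linux VM or workstation)
nproc                          # number of CPU cores
grep MemTotal /proc/meminfo    # total memory
df -h /                        # root disk size and usage
```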
Module 2: Ethics of data usage & security best practices (2017) (Instructor: Mark Phillips)
- Ethical conduct when using genomic data
- Security of big data sets in the Cloud
- VM security and usage best practices: SSH keys, open ports, snapshotting a VM without protected data, shutting down a VM when not in use, etc.
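A concrete starting point for the SSH-key practice above is generating a dedicated key pair for your cloud VMs. This is a minimal sketch; the empty passphrase and the address `ubuntu@203.0.113.10` are placeholders for illustration only — use a real passphrase and your VM's actual user and floating IP:

```shell
# Start clean, then generate a dedicated ed25519 key pair.
# -N '' sets an empty passphrase here only so the example runs
# non-interactively; in practice, always use a passphrase.
rm -f cloud_vm_key cloud_vm_key.pub
ssh-keygen -t ed25519 -f ./cloud_vm_key -N '' -C 'cbw-cloud-vm'

# Keep the private key readable by you alone; ssh refuses keys
# with looser permissions
chmod 600 cloud_vm_key

# Connect with the key (placeholder user/address):
#   ssh -i cloud_vm_key ubuntu@203.0.113.10
```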
Module 3: Working Reproducibly in the Cloud (2017) (Instructor: Brian O'Connor)
- Introduction to Docker and Dockstore
- Overview of packaging your tools in Docker
- Overview of Dockstore and its current state
- Introduction to the Common Workflow Language
- Within the Collaboratory environment:
  - Set up and launch a VM
  - Install the Docker engine
  - Configure a VM with Docker packages
- Package a tool in Docker
- Practice CWL
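To give a flavour of the last two items, here is a minimal sketch of packaging a tool in Docker and describing it in CWL. The tool (samtools), image name (`my-samtools`), and file names are all illustrative placeholders, not the workshop's actual materials:

```shell
# A minimal Dockerfile packaging a command-line tool (samtools is
# just an example; substitute your own tool and base image)
cat > Dockerfile <<'EOF'
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y samtools \
    && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["samtools"]
EOF

# A minimal CWL CommandLineTool running the tool from that image
# (Dockstore entries normally reference an image in a registry,
# not a local tag like this)
cat > samtools-flagstat.cwl <<'EOF'
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, flagstat]
requirements:
  DockerRequirement:
    dockerPull: my-samtools
inputs:
  bam:
    type: File
    inputBinding:
      position: 1
outputs:
  stats:
    type: stdout
EOF

# To build the image and run the tool (requires a Docker engine
# and, for CWL, a runner such as cwltool):
#   docker build -t my-samtools .
#   cwltool samtools-flagstat.cwl --bam sample.bam
```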
Module 4: Sharing and Scaling a VM (2017) (Instructor: George Mihaiescu)
- Considerations when snapshotting a VM
- How to snapshot a VM
- How to share a snapshot with other cloud tenants
- Launching a new VM from a snapshot
- Scaling out your VM fleet to meet your analysis needs
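On an OpenStack-based cloud such as the Collaboratory, the snapshot/share/scale cycle above maps onto a handful of CLI commands. All names below (`my-vm`, `my-analysis-image`, `m1.large`, `PROJECT_ID`) are placeholders; the commands are written into a script so you can review them and run it once your OpenStack credentials are loaded:

```shell
# Write the snapshot/share/scale commands to a reviewable script
# (placeholder names throughout; requires the OpenStack CLI and
# sourced credentials to actually run)
cat > scale_out.sh <<'EOF'
#!/bin/sh
set -e
# 1. Snapshot a running VM into a reusable image
openstack server image create --name my-analysis-image my-vm

# 2. Share the image with another project/tenant
openstack image set --shared my-analysis-image
openstack image add project my-analysis-image PROJECT_ID

# 3. Launch worker VMs from the snapshot to scale out
for i in 1 2 3; do
    openstack server create --image my-analysis-image \
        --flavor m1.large --key-name my-key "worker-$i"
done
EOF
chmod +x scale_out.sh
```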
Lab Practical:
- Snapshot your VM and share it with other cloud tenants
- Launch a new VM from a shared image
- From a given data set analysis task (e.g. VCF of 100 tumor-normal samples), determine the size of your job and your compute needs
- Scale out your VMs to meet the needs of this task
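The job-sizing step comes down to back-of-the-envelope arithmetic. Every figure below is a hypothetical assumption for illustration only — benchmark your own pipeline for real numbers:

```shell
# Hypothetical sizing for the 100-sample task above
SAMPLES=100            # tumor-normal pairs to process
CORE_HRS_PER_SAMPLE=8  # assumed core-hours per pair
VM_CORES=16            # assumed cores per VM flavor
DEADLINE_HRS=24        # finish within one day

TOTAL_CORE_HRS=$((SAMPLES * CORE_HRS_PER_SAMPLE))
# Ceiling divisions: cores to meet the deadline, then VMs to
# supply those cores
CORES_NEEDED=$(((TOTAL_CORE_HRS + DEADLINE_HRS - 1) / DEADLINE_HRS))
VMS_NEEDED=$(((CORES_NEEDED + VM_CORES - 1) / VM_CORES))

echo "Total core-hours: $TOTAL_CORE_HRS"          # 800
echo "Cores for ${DEADLINE_HRS}h deadline: $CORES_NEEDED"  # 34
echo "VMs of ${VM_CORES} cores to launch: $VMS_NEEDED"     # 3
```

Under these assumed numbers, three 16-core VMs comfortably finish the task within a day.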
Module 5: Big Data Analysis in the Cloud (2017) (Instructor: Christina Yung)
- Overview of large-scale biological activities that generate datasets used in the Cloud
- How to access data in the Cloud, including non-protected data sets
- Setup and configuration for worked example
- Initiate a sequence alignment task in the Cloud
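As a sketch of what initiating such a task can look like on a cloud VM, the commands below launch a paired-end alignment in the background so it survives an SSH disconnect. The reference and FASTQ paths are placeholders, and bwa is only one possible aligner (install it via apt, conda, or a Docker image first); the commands are saved to a script to run once your data and tools are in place:

```shell
# Write an illustrative alignment launcher (placeholder paths;
# requires bwa and real data to actually run)
cat > run_alignment.sh <<'EOF'
#!/bin/sh
set -e
REF=ref/genome.fa            # placeholder reference
R1=reads/sample_R1.fastq.gz  # placeholder FASTQ pair
R2=reads/sample_R2.fastq.gz

bwa index "$REF"
# nohup keeps the job alive if the SSH session drops; follow
# progress later with: tail -f bwa.log
nohup sh -c "bwa mem -t $(nproc) $REF $R1 $R2 > aln.sam" \
    > bwa.log 2>&1 &
echo "alignment started (PID $!)"
EOF
chmod +x run_alignment.sh
```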