HBC

Bioinformatics support team at the Harvard School of Public Health. Focus on research computing, NGS and functional analysis.

About the site

The sequencing workshops are run by the Bioinformatics Core at the Harvard School of Public Health, sponsored by the Tools and Technology Program of Harvard Medical School and the Harvard NeuroDiscovery Center.

Courses are run as hands-on sessions with a focus on next-gen sequence data in public health and stem cell biology. We use the Galaxy framework so that little prior computational experience is required, and to help establish a few key practices that are important to any computational research:

  • organize your in silico work just as carefully as any other lab work,
  • focus on the reproducibility of your results, and
  • document your work and make it available to the scientific community.

Those three skills alone will put you ahead of the majority of practicing bioinformaticians.

The importance of reproducible science

If you haven’t heard about the Duke clinical trials and how simple index errors spiraled out of control, ultimately resulting in several clinical trials based on false data, I strongly recommend watching Keith Baggerly’s talk. I promise it’s worth the time.

While the data issues described by Keith were not accidental, not all bad data is generated with malicious intent. As a recent study noted, the majority of biomedical studies cannot be reproduced:

“Of 47 cancer projects at Bayer during 2011, less than one-quarter could reproduce previously reported findings, despite the efforts of three or four scientists working full time for up to a year. Bayer dropped the projects.”

Quite a few of these discrepancies arise from a perceived need to come up with a likable scientific story:

“We went through the paper line by line, figure by figure,” said Begley. “I explained that we re-did their experiment 50 times and never got their result. He said they’d done it six times and got this result once, but put it in the paper because it made the best story. It’s very disillusioning.”

Ioannidis et al. [1] (PDF), writing in Nature Genetics, described similar findings in a survey of microarray data, one of the best understood high-throughput data sources in bioinformatics:

“One table or figure from each article was independently evaluated by two teams of analysts. We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced.”

And the Science Exchange Blog has additional examples if you are still curious.

Keeping notes

To avoid creating similar problems for your peers, keeping notes on your work is paramount. If you are interested in the topic, we recommend working through articles from Titus Brown’s blog and familiarizing yourself with automated systems such as IPython (web-based notebooks for Python), knitr (dynamic reports in R) or initiatives such as Sage’s Synapse (data repositories and workflow documentation).

In this course Galaxy, our analytical framework, will keep most of the notes for you. You will most likely still need to document additional work and keep track of files, papers and other documents. Whatever system you use for this is fine, be it Evernote, GitHub, or a simple folder-based system as recommended by Bill Noble [2] (PDF).
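
As a rough illustration of a folder-based system, the sketch below creates a project skeleton with one dated results folder per working day. The top-level names (data, doc, src, bin, results) loosely follow the layout in Noble's guide; the function name create_project and the README.txt notes file are assumptions made for this example, not part of any course material or tool.

```python
from datetime import date
from pathlib import Path

# Assumed top-level layout, loosely modeled on Noble's guide.
TOP_LEVEL_DIRS = ("data", "doc", "src", "bin", "results")


def create_project(root, today=None):
    """Create the project skeleton and return today's results directory.

    `create_project` is a hypothetical helper written for this example.
    """
    root = Path(root)
    today = today or date.today()

    # Create each top-level folder; existing folders are left untouched.
    for name in TOP_LEVEL_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)

    # One results folder per day keeps experiments in chronological order.
    run_dir = root / "results" / today.isoformat()
    run_dir.mkdir(exist_ok=True)

    # A plain-text notes file (our assumption) to record what was done and why.
    notes = run_dir / "README.txt"
    if not notes.exists():
        notes.write_text(f"Notes for {today.isoformat()}\n")

    return run_dir
```

Calling create_project("my_project") at the start of a working day would give you a fresh, dated place for that day's output and notes without touching earlier results.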

Dissemination

While keeping good notes is one thing, making them available along with primary and derived data is just as important. Frustrated by closed lab protocols, Titus Brown and Brad Chapman have suggested a simple checklist for publications in bioinformatics, and Titus himself leads the way with a recent manuscript, making just about everything available:

  • a link to the paper itself, in preprint form, stored at the arXiv site;
  • a tutorial for running the software on a Linux machine hosted in the Amazon cloud;
  • a git repository for the software itself (hosted on github);
  • a git repository for the LaTeX paper and analysis scripts (also hosted on github), including an ipython notebook for generating the figures (more about that in my next blog post);
  • instructions on how to start up an EC2 cloud instance, install the software and paper pipeline, and build most of the analyses and all of the figures from scratch;
  • the data necessary to run the pipeline;
  • some of the output data discussed in the paper.

In this course almost all of the work we do is captured automatically and can be shared with others through a simple URL, but as you start using tools outside of Galaxy (or similar systems such as GenePattern) we strongly recommend browsing through some of these materials.

Contact information

If you have any questions about the course materials, do not hesitate to ask Radhika Khetani or Oliver Hofmann directly. You can also get hold of Oliver on Twitter or, in emergencies, give him a call at +1 617 365 0984.

  1. Ioannidis, John P A, David B Allison, Catherine A Ball, Issa Coulibaly, Xiangqin Cui, Aedín C Culhane, Mario Falchi, et al. “Repeatability of Published Microarray Gene Expression Analyses.” Nature Genetics 41, no. 2 (February 1, 2009): 149–155.

  2. Noble, William Stafford. “A Quick Guide to Organizing Computational Biology Projects.” PLoS Computational Biology 5, no. 7 (June 30, 2009): e1000424.