Genome in a Bottle FTP site now live at NCBI

Dear GIAB Participants,

Thanks to the efforts of Chunlin Xiao, Chris O’Sullivan, Steve Sherry, and others at NCBI, we have now collected a large number of fastq and bam files for NA12878 on an ftp site at NCBI (ftp://ftp-trace.ncbi.nih.gov/giab/ftp). The collected datasets represent a variety of sequencing platforms and library preparation methods for whole genome sequencing, including Illumina (PCR and PCR-free; 100bp and 250bp paired-end), SOLiD (v2 and v4), Complete Genomics, Ion Torrent, and 454. In addition, the site includes targeted PacBio sequencing and Sanger capillary sequencing of fosmids.

On the ftp site, fastq files are organized by project and then subdivided by platform. The README.ftp_structure file describes the organization of the directories. The README.sequence_data file describes the layout of the current.sequence.index file, which contains detailed information about each file (e.g., project, platform, sequencing center, insert size, and library name). In addition, we have added links to the appropriate directories/files for the datasets in our Google Docs spreadsheet for NA12878 (https://docs.google.com/spreadsheet/ccc?key=0ArAo1qqJJDHQdHo0U1FzQV9JYVZ...). Note that the ftp site has datasets from each of the 3 tabs in the spreadsheet, though most are WGS. The SRR and SRP numbers in the spreadsheet can also help identify the corresponding files for download. Both fastq and bam files are available for many of the selected datasets, though for a few the bam files are not yet available or are still being uploaded. If anyone is interested in generating bam files for any datasets, NCBI is working on a way to upload data in the future.
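
For anyone who would like to pick files out of the index programmatically, here is a minimal Python sketch. It assumes the index is tab-delimited with a header row; the column names (FILE, PLATFORM, CENTER) and the sample rows are hypothetical placeholders, so please check README.sequence_data for the actual layout before relying on it.

```python
import csv
import io


def select_files(index_text, column, value, delimiter="\t"):
    """Return the rows of a delimited index whose `column` matches `value`.

    The comparison ignores case and surrounding whitespace. Rows that lack
    the column are skipped rather than raising an error.
    """
    reader = csv.DictReader(io.StringIO(index_text), delimiter=delimiter)
    wanted = value.strip().lower()
    return [
        row for row in reader
        if (row.get(column) or "").strip().lower() == wanted
    ]


# Hypothetical sample; the real column names and paths are assumptions here.
SAMPLE_INDEX = (
    "FILE\tPLATFORM\tCENTER\n"
    "projA/illumina/run1_1.fastq.gz\tILLUMINA\tcenterX\n"
    "projA/solid/run2.fastq.gz\tSOLID\tcenterY\n"
    "projB/illumina/run3_1.fastq.gz\tILLUMINA\tcenterZ\n"
)

illumina_rows = select_files(SAMPLE_INDEX, "PLATFORM", "illumina")
for row in illumina_rows:
    print(row["FILE"])
```

To use this on the real index, read the downloaded current.sequence.index file into a string and pass it in place of SAMPLE_INDEX, adjusting the column name to match the README.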

Future updates may include:
1. Adding vcf files as we generate them from individual datasets or by combining datasets.
2. Adding data from family members of NA12878 (as well as for future NIST Reference Materials from PGP).
3. Adding the ability to upload new fastq, bam, and vcf files.

We encourage you to explore the ftp site and to post any suggestions as comments on this blog. For those of you who have asked for our NIST integrated calls for NA12878, you can use these bam files to explore some of the evidence we used for our calls. You are also welcome to start downloading and analyzing these data in preparation for our August consortium workshop. We will likely schedule a teleconference in the next couple of weeks to discuss the ftp site and how to analyze the data, so keep an eye on your email. In the meantime, please let us know if you have any questions or suggestions for organizing or improving the ftp site.

Cheers,
Justin Zook