NIST, in collaboration with others, has been developing methods to integrate multiple datasets from different sequencing platforms into a high-confidence set of genotypes for its Reference Materials. So far, we have integrated 12 datasets from 5 sequencing platforms to develop high-confidence SNP and indel calls. We have also developed a bed file that excludes regions and variant locations that are uncertain for any of the following reasons: low coverage, genotypes called in fewer than 3 datasets, unresolved discordant genotypes, evidence of bias in most datasets (systematic sequencing errors, local alignment problems, mapping problems, or abnormal allele balance), variants inside possible deletions, known segmental duplications, and structural variants reported in dbVar for NA12878. In all, these exclusions remove ~15% of the non-N bases in the GRCh37 reference assembly. To assess false positive and false negative rates, it is important to compare only variants inside the bed file regions, since genotypes outside those regions are uncertain. Our vcf and bed files, as well as a README.NIST with up-to-date information, are available at ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/NIST.
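As a rough illustration of restricting a comparison to the bed file regions, the Python sketch below keeps only calls whose position falls inside a bed interval and then counts concordant sites, genotype mismatches, false positives, and false negatives. It is not the method used to build the integrated calls, and all function names are hypothetical. It assumes non-overlapping bed intervals, single-sample VCFs with a GT field, and variants that are represented identically in both files (no left-alignment or decomposition of complex variants is attempted). Only the POS of each record is tested, so an indel that spans a region boundary is not handled specially.

```python
import bisect
import re
from collections import defaultdict


def parse_bed(lines):
    """Return {chrom: (starts, ends)} from BED lines (0-based, half-open)."""
    by_chrom = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))
    regions = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        regions[chrom] = ([s for s, _ in intervals], [e for _, e in intervals])
    return regions


def in_regions(regions, chrom, pos):
    """True if the 1-based VCF position `pos` lies inside a BED interval."""
    if chrom not in regions:
        return False
    starts, ends = regions[chrom]
    p = pos - 1  # VCF POS is 1-based; BED coordinates are 0-based
    i = bisect.bisect_right(starts, p) - 1  # last interval starting at or before p
    return i >= 0 and p < ends[i]


def parse_vcf(lines):
    """Return {(chrom, pos, ref, alt): genotype} for the first sample column."""
    calls = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, fmt, sample = f[0], int(f[1]), f[3], f[4], f[8], f[9]
        gt = sample.split(":")[fmt.split(":").index("GT")]
        alleles = tuple(sorted(re.split(r"[/|]", gt)))  # ignore phasing
        if set(alleles) <= {"0", "."}:
            continue  # skip homozygous-reference and no-call records
        calls[(chrom, pos, ref, alt)] = alleles
    return calls


def compare(truth, query, regions):
    """Count site-level agreement, using only calls inside the BED regions."""
    t = {k: g for k, g in truth.items() if in_regions(regions, k[0], k[1])}
    q = {k: g for k, g in query.items() if in_regions(regions, k[0], k[1])}
    shared = t.keys() & q.keys()
    return {
        "concordant": sum(t[k] == q[k] for k in shared),
        "genotype_mismatch": sum(t[k] != q[k] for k in shared),
        "false_positive": len(q.keys() - t.keys()),
        "false_negative": len(t.keys() - q.keys()),
    }
```

With files on disk, the functions could be called as `compare(parse_vcf(open(truth_vcf)), parse_vcf(open(your_vcf)), parse_bed(open(bed)))`, where the three paths are placeholders for the integrated vcf, your own call set, and the bed file. The sites counted as mismatches, false positives, or false negatives are the ones worth inspecting further.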
If you have sequenced NA12878 or analyzed data from NA12878, we encourage you to compare the genotypes you have found to our vcf and bed files. Feedback from these comparisons will allow us to improve our genotypes over time. In addition to discussing comparisons at our August 15-16 GIAB workshop at NIST, we've created a forum (http://genomeinabottle.org/forum-topic/nist-na12878-integrated-highly-co...) for discussing any discrepancies found between an individual dataset and our integrated genotypes. These discussions will be most productive if you post the names of the integrated vcf and bed files that you compared against, along with how you performed the comparison. Various programs can be used to compare variants, including GATK CombineVariants, USeq VCFComparator (http://sourceforge.net/projects/useq/), bcbio.variation (http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-d...), and GCAT (http://www.bioplanet.com/gcat). We have found that it is often very useful to look at the alignments around any variants that are discordant between datasets. We are in the process of resubmitting a paper describing the methods used to develop our integrated genotypes. Questions about this dataset, or about publishing work based on these data, can be directed to the GIAB admin email or Justin Zook at NIST.