New Human Reference Genomes Reveal Greater Diversity

We will continue to explore how graph-based analyses could be used to benchmark methods that characterize the MHC. It will be important to determine whether these haplotypes can be represented in standard VCF files with respect to the primary GRCh37/38 references in GIAB benchmark sets, or whether existing benchmarks will need new representations and benchmarking tools. Although vg can project haplotypes into a VCF file with respect to the primary reference, it remains to be determined whether this output is compatible with current benchmarking tools for small variants and structural variants.

It is presumed that the latest release of the human reference genome, GRCh38, will contribute more to high-throughput sequencing data analysis by providing greater accuracy. We conducted a study to compare genomic analysis results between the GRCh38 reference and its predecessor, GRCh37. Through analyses of alignment, single nucleotide polymorphisms, small insertions/deletions, copy number variants and structural variants, we show that GRCh38 offers more accurate analysis of human sequencing data overall.

Once the genome has been sequenced, the reads of these individual fragments need to be put back together so that scientists can analyse them, looking for variation in regions of DNA that could have an impact on our health. To do this, scientists refer to what is known as a ‘reference genome’ – a template genome incorporating the most up-to-date information we have on human genomics. The current human reference genome build, known as GRCh38, is used worldwide, but whose genome is it based on and how reliable is it? To answer these questions, we must first go back to 2003 and the completion of the Human Genome Project. Moreover, in applications like tumor profiling, gene expression, or other functional genomics assays, a single reference sequence can be problematic. For a cancer genome, the best reference to which tumor data should be aligned is a matched normal genome from the same patient.

Release 23 of HapMap contains variant calls for 90 CEU individuals based on human reference assembly hg18. Through efforts like the 1KGP and CG public data releases, we are getting a new view that human variation is much more extensive than previously thought. These data also expose several shortcomings of current microarray tools and alter the view of some basic tenets of the allelic variance of the human genome. While it is understood that increasing variation will decrease LD block size, the impact of increased variation has not been documented.

While a large number of SNPs (∼14 million) are shared by both sets, 21% of the SNPs are unique to one of the sets. Part of this large difference could be due to sampling differences; while both projects are sequencing HapMap individuals, there is not a complete overlap between the samples that were sequenced. Hence, the analysis was repeated with the 32 genomes that are shared by the two projects. Because the 1KGP sequencing is at a low depth, it will miss variants that should be detected in the higher-coverage CG sequencing.
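The shared-versus-unique comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the original analysis: the function name `summarize_overlap`, the choice to key each SNP by chromosome, position, reference allele and alternate allele, and the use of the union of both sets as the denominator are all assumptions, and the toy call sets are made up.

```python
from typing import NamedTuple, Set, Tuple

# Assumption: a SNP is identified by (chromosome, position, ref allele, alt allele).
Snp = Tuple[str, int, str, str]


class OverlapSummary(NamedTuple):
    shared: int             # SNPs present in both call sets
    only_a: int             # SNPs present only in set A
    only_b: int             # SNPs present only in set B
    unique_fraction: float  # share of the union found in exactly one set


def summarize_overlap(set_a: Set[Snp], set_b: Set[Snp]) -> OverlapSummary:
    """Count shared and set-specific SNPs between two call sets (hypothetical helper)."""
    shared = len(set_a & set_b)
    only_a = len(set_a - set_b)
    only_b = len(set_b - set_a)
    total = shared + only_a + only_b
    # Assumption: the denominator is the union of both sets.
    unique_fraction = (only_a + only_b) / total if total else 0.0
    return OverlapSummary(shared, only_a, only_b, unique_fraction)


if __name__ == "__main__":
    # Made-up toy data, for illustration only.
    low_depth_calls = {
        ("chr1", 100, "A", "G"),
        ("chr1", 200, "C", "T"),
        ("chr2", 50, "G", "A"),
    }
    high_depth_calls = {
        ("chr1", 100, "A", "G"),
        ("chr1", 200, "C", "G"),  # same site, different alternate allele
        ("chr2", 50, "G", "A"),
        ("chr3", 75, "T", "C"),
    }
    print(summarize_overlap(low_depth_calls, high_depth_calls))
```

Restricting both inputs to the same individuals before calling such a function would correspond to repeating the comparison on the shared genomes. Note that keying on the alternate allele treats two different alleles at the same site as distinct SNPs; keying on position alone would count them as shared.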

Second, the SNP must be one of the two variants for which the array was designed, as variants are assumed to be biallelic. If either of these two conditions is violated, then the microarray probe will not function as well, or at all, leading to false negative or false positive results for some individuals.

The single person closest to the reference genome, RP-11, was almost certainly African-American. I also remember hearing once that an early draft of the reference genome accidentally included the sickle cell allele. I think the article makes an important point, but let’s try to tell a more nuanced story about the reference genome. “There are so many uses of the reference genome, and for every single one it has problems,” said computational biologist Jesse Gillis of Cold Spring Harbor Laboratory, one of many scientists arguing that it’s long past time to fix those problems.


