Our Reference point techniques suggestions within 2021 

There’s A Huge Problem With The Core Of The Human Genome Project

“These regions are essentially invisible to the genetics community until we have a reference genome that includes those regions,” Salzberg tells Inverse. Transitioning from using the hg19 reference to the hg38 reference takes significant time and resources. Through this large-scale study of sequencing data, the researchers aim to ease the burden on labs considering the transition. The study quantifies the benefits and drawbacks of the new reference and validates its utility in a lab setting. “We wanted to provide the list of 206 genes enriched with discordant variants and bring this issue to the attention of the labs working on these genes.”

GnomAD v3.0 contains SNV calls from short-read whole-genome data from 1662 Ashkenazi individuals. Because some variants were only called in a subset of these individuals, we considered only variant sites that were reported in a minimum of 200 people. We then collected major allele SNVs, requiring the allele frequency to be above 0.5 in the sampled population.
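The two filters described above (a minimum of 200 individuals and an allele frequency above 0.5) can be sketched as a simple predicate. This is only an illustration under stated assumptions: the `SnvSite` record and its field names are hypothetical, and the number of individuals is approximated as half the number of called alleles, which holds for diploid autosomal sites only.

```python
from dataclasses import dataclass

MIN_INDIVIDUALS = 200   # threshold stated in the text
MIN_ALLELE_FREQ = 0.5   # "major allele" threshold stated in the text


@dataclass(frozen=True)
class SnvSite:
    """Hypothetical record for one variant site (field names are assumptions)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    allele_count: int    # copies of the alt allele observed
    allele_number: int   # total called alleles (assumed 2 per individual)


def is_major_allele_snv(site: SnvSite) -> bool:
    """Return True if the site is an SNV passing both filters."""
    # Keep single-nucleotide substitutions only.
    if len(site.ref) != 1 or len(site.alt) != 1:
        return False
    # Assumption: diploid site, so individuals = called alleles // 2.
    if site.allele_number // 2 < MIN_INDIVIDUALS:
        return False
    # Strictly above 0.5, as in the text.
    return site.allele_count / site.allele_number > MIN_ALLELE_FREQ


sites = [
    SnvSite("chr1", 1000, "A", "G", 300, 400),   # passes both filters
    SnvSite("chr1", 2000, "C", "T", 300, 398),   # only 199 individuals
    SnvSite("chr1", 3000, "G", "A", 200, 400),   # frequency exactly 0.5
    SnvSite("chr1", 4000, "AT", "A", 390, 400),  # deletion, not an SNV
]
major_sites = [s for s in sites if is_major_allele_snv(s)]
```

Only the first example site is kept; the others fail the sample-size, frequency, or SNV check respectively.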

We also made small variant calls from Ash1 v1.1 relative to GRCh37 and compared these to the v4.0 benchmark variants from GIAB using the Global Alliance for Genomics and Health Benchmark tools. To identify candidates for correction in the assembly, we also excluded false positives (FPs) in UCSC GRCh37 vs. GRCh37 self-chain alignments longer than 10 kb, since these were potential collapses in the assembly that would need to be corrected in a different way. Using the remaining FPs, we corrected 32,814 substitution errors, 6670 insertion errors, and 14,151 deletion errors in the Ash1 assembly. This did not correct any regions in Ash1 that aligned outside the v4.0 benchmark regions for GRCh37.

We examined the translocation between chromosomes 15 and 20, which contains three of the genes in Table 4, by looking more closely at the alignment between GRCh38 and Ash1. The translocation is at the telomere of both chromosomes, from position 65,079,275 to 65,109,824 of Ash1 chr20 and 101,950,338 to 101,980,928 of GRCh38 chr15.

Importantly, all Havana transcripts are included in the final Ensembl/Havana merged gene set. This would also require the adoption of better ways of representing the data (e.g., as a genome graph), along with the development of new informatics tools to make use of the new reference. This file is an odgi visualization of the Zea mays chr10 minimap2/seqwish graph for two species.

All the unmapped reads (including paired-end unmapped reads and single-end unmapped reads) were extracted with SAMtools (Li et al. 2009; Li 2011). We then remapped the unmapped reads to the non-redundant gap-closing sequences, with 99 bp flanking sequences on both sides. Finally, breadth of coverage and depth of coverage of the gap-closing sequences were calculated by SAMtools with related custom scripts.

The issues with the reference are well known, and ancillary approaches to handle sequence and variant analysis exist. Trio analysis strategies inform inheritance for rare variants; ancestry, if not known a priori for an individual, can largely be discerned from sequence data; and long-read data can fill gaps and resolve repetitive regions. The bigger issue will be user interfaces and other digital access tools that allow for exploration and use of that reference for translational research and the practice of medicine.
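The custom scripts used for the coverage step are not shown in the source, so the following is only a minimal sketch of how breadth and depth of coverage could be derived per gap-closing sequence. It assumes the input is the three-column, tab-separated text produced by `samtools depth -a` (sequence name, position, depth); the function name `coverage_summary` and the `min_depth` parameter are hypothetical.

```python
from collections import defaultdict


def coverage_summary(depth_lines, min_depth=1):
    """Summarize per-sequence coverage from `samtools depth -a` style lines.

    Breadth is the fraction of positions with depth >= min_depth;
    mean depth is the average depth over all reported positions.
    """
    # name -> [positions seen, positions covered, summed depth]
    totals = defaultdict(lambda: [0, 0, 0])
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        name, _pos, depth = line.split("\t")[:3]
        depth = int(depth)
        entry = totals[name]
        entry[0] += 1
        entry[1] += 1 if depth >= min_depth else 0
        entry[2] += depth
    return {
        name: {"breadth": covered / n, "mean_depth": depth_sum / n}
        for name, (n, covered, depth_sum) in totals.items()
    }


# Toy input for a hypothetical 4 bp sequence named "gap1".
example = ["gap1\t1\t0", "gap1\t2\t4", "gap1\t3\t2", "gap1\t4\t0"]
summary = coverage_summary(example)
```

The `-a` flag matters for this sketch: without it, zero-depth positions are omitted from the output and breadth would be overestimated.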

The next step would be to map RNA-seq reads to this graph, estimate per-base coverage using vg pack, and compute gene-level quantification using the GENCODE 29 annotation. Linear genomes currently rely on genomic intervals as a core formalism for annotation, but it is difficult to generalize this formalism to reference graphs. If we instead restrict the annotation to one path in the graph, the alternate alleles in the graph are not included in the annotation. We argue that connected subgraphs are a more appropriate formalism for annotating genome graphs. Using a new core formalism for annotation necessarily means that infrastructure to manipulate it does not yet exist.
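To make the connected-subgraph idea concrete, here is a small sketch that treats an annotation as a set of graph node IDs and checks that the nodes form a single connected piece. It is an illustration only, not the infrastructure the text says is missing: the edge-list representation, the integer node IDs, and the function name are all assumptions, and edges are treated as undirected for the connectivity check.

```python
from collections import deque


def is_connected_subgraph(edges, nodes):
    """Return True if `nodes` induce a connected subgraph of the graph `edges`."""
    nodes = set(nodes)
    if not nodes:
        return False
    # Keep only edges with both endpoints inside the annotation.
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)
    # Breadth-first search from an arbitrary annotated node.
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == nodes


# Toy graph with one bubble: nodes 2 and 3 are alternate alleles
# between nodes 1 and 4, and node 5 follows node 4.
graph_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]

whole_bubble = is_connected_subgraph(graph_edges, {1, 2, 3, 4})
alleles_only = is_connected_subgraph(graph_edges, {2, 3})
```

In this toy graph the annotation `{1, 2, 3, 4}` covers both alleles of the bubble and is connected, which a single-path annotation could not express; `{2, 3}` alone is not connected because the two alleles only meet through their flanking nodes.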

