A Complete Reference Genome Improves Analysis Of Human Genetic Variation

The Torrent Suite software allows a particular human reference genome to be downloaded from the Ion Torrent servers. Ion Torrent calls it “hg19”, but it differs from the UCSC hg19. In particular, it uses the UCSC naming conventions (“chr1” to “chr22”, “chrX”, “chrY”, “chrM”) but replaces the outdated UCSC hg19 mitochondrial sequence with the newer GRCh37 one.
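One way to tell which flavor of “hg19” a given reference is would be to inspect its FASTA index. The sketch below is illustrative only: it assumes a standard tab-separated .fai index (contig name in column 1, length in column 2), and the helper names are hypothetical. It relies on the fact that the original UCSC hg19 chrM (NC_001807) is 16,571 bp long, while the revised Cambridge Reference Sequence used by GRCh37 (NC_012920) is 16,569 bp.

```python
import re

# UCSC-style primary contig names: chr1..chr22, chrX, chrY, chrM.
UCSC_PRIMARY = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$")

# Known mitochondrial sequence lengths (bp) and what they indicate.
CHRM_LENGTHS = {
    16571: "UCSC hg19 chrM (NC_001807)",
    16569: "GRCh37 rCRS (NC_012920)",
}


def parse_fai(lines):
    """Parse .fai index lines into a {contig_name: length} dictionary."""
    contigs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        contigs[fields[0]] = int(fields[1])
    return contigs


def describe_reference(contigs):
    """Count UCSC-named primary contigs and identify the chrM sequence by length."""
    ucsc_named = [name for name in contigs if UCSC_PRIMARY.match(name)]
    mito = CHRM_LENGTHS.get(contigs.get("chrM"), "unknown or absent")
    return {"ucsc_primary_contigs": len(ucsc_named), "mitochondrion": mito}


if __name__ == "__main__":
    # Hypothetical index lines; in practice these would be read from a .fai file.
    example = ["chr1\t249250621\t6\t60\t61\n", "chrM\t16569\t100\t60\t61\n"]
    print(describe_reference(parse_fai(example)))
```

A length match is only a quick heuristic; comparing checksums of the sequence itself would be a stronger test.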

After long reads are generated on the PacBio platform, we assemble them using the Falcon algorithm, followed by error correction using Quiver. The output of this step is a FASTA file of unordered and unoriented contigs. We then align the contigs against the BioNano genomic map generated from the same individual, together with clone end sequences, to check for global misassemblies. We break contigs where these data support doing so, and output ordered and oriented contigs based on the map alignments.
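The final ordering and orienting step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes each contig has already been reduced to a single best map alignment, and the `MapAlignment` record and function names are hypothetical.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class MapAlignment:
    contig: str      # contig name
    map_start: int   # start position of the contig on the genomic map
    strand: str      # "+" if the contig matches the map orientation, "-" otherwise


def order_and_orient(contigs, alignments):
    """Sort contigs by map position and flip those aligned to the minus strand."""
    placed = []
    for aln in sorted(alignments, key=lambda a: a.map_start):
        seq = contigs[aln.contig]
        if aln.strand == "-":
            seq = reverse_complement(seq)
        placed.append((aln.contig, seq))
    return placed


if __name__ == "__main__":
    contigs = {"c1": "AAAC", "c2": "GGT"}
    alignments = [MapAlignment("c1", 500, "-"), MapAlignment("c2", 10, "+")]
    for name, seq in order_and_orient(contigs, alignments):
        print(name, seq)
```

Contigs without any map alignment are simply left out here; a real workflow would need to report them separately.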

This is also discussed in the context of gene duplication: for a single-copy gene, mutations are rare because of selection pressure. This selection pressure is reduced when there are two or more copies of the gene, so higher mutation rates are possible for at least one copy.

Such a coordinate system offers a host of advantages, as it allows easier surjection/projection of graph coordinates onto the linear reference coordinates. It also streamlines variant discovery and improves annotation portability.

A genome graph is a non-linear representation of a genome, in which paths in the graph represent individual genomes.
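To make the idea of paths and projection concrete, here is a toy sketch. It is a deliberately simplified assumption-laden model: nodes hold sequence, each genome is an ordered list of node IDs, and each node appears at most once on a path. The data and function names are hypothetical and do not correspond to any particular graph toolkit.

```python
# Toy graph: node ID -> sequence. Each path is an ordered list of node IDs.
NODES = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}
PATHS = {
    "reference": [1, 2, 4],  # spells ACG + T + GGA
    "sample": [1, 3, 4],     # spells ACG + C + GGA
}


def path_sequence(nodes, path):
    """Spell out the linear sequence of one genome by walking its path."""
    return "".join(nodes[node_id] for node_id in path)


def project_to_linear(nodes, ref_path, node_id, offset):
    """Map a graph position (node_id, offset) to a 0-based reference coordinate.

    Returns None when the node does not lie on the reference path.
    """
    position = 0
    for current in ref_path:
        if current == node_id:
            if not 0 <= offset < len(nodes[current]):
                raise ValueError("offset outside node")
            return position + offset
        position += len(nodes[current])
    return None


if __name__ == "__main__":
    print(path_sequence(NODES, PATHS["sample"]))
    print(project_to_linear(NODES, PATHS["reference"], 4, 1))
```

Positions on nodes that are absent from the reference path (node 3 here) have no direct linear coordinate, which is the case a principled coordinate system has to handle.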

Different choices of reference will be useful in different circumstances, but which choice suits which circumstance is very hard to establish when the choice of reference is largely arbitrary. If we pick a reference in a principled way, then those principles can also tell us when we should not pick that reference for our analyses. RefSeq biocurators focus on data curation for eukaryotic organisms, including several aspects of manual curation such as sequence analysis, functional annotation, data validation and community collaboration.

When a probe matches one of these regions, the actual location that is interrogated in the genome is ambiguous. In individuals lacking the SV, the reference location is interrogated, while in individuals with the SV, the variant location is interrogated. We acknowledge that neither 1KGP data nor arrays are perfect, and therefore the exact list of probes found to be potentially problematic will always be a moving target. Nevertheless, the overall counts and distribution of problematic probes would be highly similar if the 1KGP data were error-free. For each probe, we tested whether any SNPs or indels were detected in the 1KGP data within 10 bp of the targeted SNP on either the 5′ or 3′ side of the probe. We also tested whether the probe was contained within an annotated structural variant.
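The two probe tests described above can be sketched as below. This is a minimal illustration under stated assumptions: all positions are 1-based coordinates on the same chromosome, SV annotations are inclusive intervals, and the function names are hypothetical rather than taken from the study's code.

```python
def variants_near_target(target_pos, variant_positions, flank=10):
    """Return known variant positions within `flank` bp of the targeted SNP.

    Variants on either side are reported; the targeted SNP itself is excluded.
    """
    return sorted(
        pos
        for pos in variant_positions
        if pos != target_pos and abs(pos - target_pos) <= flank
    )


def probe_in_sv(probe_start, probe_end, sv_intervals):
    """Return True if the probe lies entirely inside any annotated SV interval."""
    return any(
        sv_start <= probe_start and probe_end <= sv_end
        for sv_start, sv_end in sv_intervals
    )


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    print(variants_near_target(100, [85, 92, 100, 110, 111]))
    print(probe_in_sv(50, 99, [(40, 120)]))
```

For a genome-wide probe set, a linear scan per probe would be slow; sorting the variants and using binary search or an interval tree would be the natural refinement.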

The effort is revealing interesting genomic rearrangements, new repeat predictions, new satellite arrays and transposable elements, new tandem repeats, and new genes, she said. “This has been tremendously useful in gaining data that can go up to 100kb plus,” Miga said. She highlighted the consortium’s partnership with Circulomics, which has helped develop their ultra-long dataset sequencing. “Some labs have been hesitant to use the new reference, but this study provides reassurance and guidance for those who are considering moving over.”

In addition, the pseudo-autosomal regions of chromosome Y have been masked out (replaced with “N”), so that the respective regions in chromosome X may be treated as diploid.
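Masking a region with “N” is a simple string operation, sketched below. The coordinates in the example are toy values chosen for illustration, not the real pseudo-autosomal boundaries, and `mask_regions` is a hypothetical helper; regions are assumed to be 0-based, half-open intervals.

```python
def mask_regions(sequence, regions):
    """Replace every base inside the given [start, end) regions with 'N'."""
    chars = list(sequence)
    for start, end in regions:
        # Clamp to the sequence bounds so out-of-range regions are harmless.
        start = max(start, 0)
        end = min(end, len(chars))
        for i in range(start, end):
            chars[i] = "N"
    return "".join(chars)


if __name__ == "__main__":
    toy_chr_y = "ACGTACGTAC"
    # Toy "pseudo-autosomal" regions at both ends of the toy chromosome.
    print(mask_regions(toy_chr_y, [(0, 3), (8, 10)]))
```

With the chromosome Y copies masked, reads from these regions can only align to chromosome X, which is why the X copies can then be treated as diploid.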


