Rxivist logo

De novo assembly of two Swedish genomes reveals missing segments from the human GRCh38 reference and improves variant calling of population-scale sequencing data

By Adam Ameur, Huiwen Che, Marcel Martin, Ignas Bunikis, Johan Dahlberg, Ida Höijer, Susana Häggqvist, Francesco Vezzi, Jessica Nordlund, Pall Olason, Lars Feuk, Ulf Gyllensten

Posted 18 Feb 2018
bioRxiv DOI: 10.1101/267062 (published DOI: 10.3390/genes9100486)

We have performed de novo assembly of two Swedish genomes using long-read sequencing and optical mapping, resulting in total assembly sizes of nearly 3 Gb and hybrid scaffold N50 values of over 45 Mb. A further analysis revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual. Around 6 Mb of these novel sequences (NS) are shared with a Chinese personal genome. The NS are highly repetitive, have elevated GC-content and are primarily located in centromeric or telomeric regions. A BLAST search showed that 31% of the NS are different from any sequences deposited in nucleotide databases. The remaining NS correspond to human (62%) or primate (6%) nucleotide entries, while 1% of hits show the highest similarity to other species, including mouse and a few different classes of parasitic worms. Up to 1 Mb of NS can be assigned to chromosome Y, and large segments are missing from GRCh38 also at chromosomes 14, 17 and 21. Inclusion of these novel sequences into the GRCh38 reference radically improves the alignment and variant calling of whole-genome sequencing data at several genomic loci. Through a re-analysis of 200 samples from a Swedish population-scale sequencing project, we obtained over 75,000 putative novel SNVs per individual when using a custom version of GRCh38 extended with 17.3 Mb of NS. In addition, about 10,000 false positive SNV calls per individual were removed from the GRCh38 autosomes and sex chromosomes in the re-analysis, with some of them located in protein coding regions.

Download data

  • Downloaded 1,040 times
  • Download rankings, all-time:
    • Site-wide: 16,183 out of 118,994
    • In genomics: 1,759 out of 6,455
  • Year to date:
    • Site-wide: 100,284 out of 118,994
  • Since beginning of last month:
    • Site-wide: 110,129 out of 118,994

Altmetric data


Downloads over time

Distribution of downloads per paper, site-wide


PanLingua

Sign up for the Rxivist weekly newsletter! (Click here for more details.)


News