
De Novo PacBio long-read and phased avian genome assemblies correct and add to genes important in neuroscience research

By Jonas Korlach, Gregory Gedman, Sarah B. Kingan, Chen-Shan Chin, Jason Howard, Lindsey Cantin, Erich J Jarvis

Posted 28 Jan 2017
bioRxiv DOI: 10.1101/103911

Reference-quality genomes are expected to provide a resource for studying gene structure and function. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution to this problem is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna's hummingbird reference, two vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range (N50s of 5.4 and 7.7 Mb, respectively), and representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read assemblies corrected and resolved what we discovered to be misassemblies, including those due to erroneous sequences flanking gaps, complex repeat structure errors in the references, base call errors in difficult-to-sequence regions, and inaccurate resolution of allelic differences between the two haplotypes. We analyzed protein-coding genes widely studied in neuroscience and specialized in vocal learning species, and found numerous assembly and sequence errors in the reference genes that the PacBio-based assemblies resolved completely, validated by single long genomic reads and transcriptome reads. These findings demonstrate, for the first time in non-human vocal learning species, the impact of higher quality, phased and gap-less assemblies for understanding gene structure and function.
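The abstract summarizes assembly contiguity with the contig N50. For readers unfamiliar with the metric, the following is a minimal sketch of how N50 is conventionally computed: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the paper or its data.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")

    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2 * running to total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Toy lengths (hypothetical, in Mb) purely for illustration.
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))
```

On the toy input the total is 20, and the two longest contigs (6 and 5) together reach 11, which passes the halfway mark, so the reported N50 is 5. A fold improvement like those quoted in the abstract would then be the ratio of the new assembly's N50 to the reference assembly's N50.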

Download data

  • Downloaded 1,869 times
  • Download rankings, all-time:
    • Site-wide: 14,899
    • In genomics: 1,342
  • Year to date:
    • Site-wide: 88,262
  • Since beginning of last month:
    • Site-wide: 149,135
