Rxivist logo

Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 62,303 bioRxiv papers from 276,577 authors.

Most downloaded bioRxiv papers, year to date

in category bioinformatics

6,116 results found. For more information, click each entry to expand.

1: Human microbiome aging clocks based on deep learning and tandem of permutation feature importance and accumulated local effects
more details view paper

Posted to bioRxiv 28 Dec 2018

Human microbiome aging clocks based on deep learning and tandem of permutation feature importance and accumulated local effects
8,042 downloads bioinformatics

Fedor Galkin, Alexander Aliper, Evgeny Putin, Igor Kuznetsov, Vadim N Gladyshev, Alex Zhavoronkov

The human gut microbiome is a complex ecosystem that both affects and is affected by its host status. Previous analyses of gut microflora revealed associations between specific microbes and host health and disease status, genotype and diet. Here, we developed a method of predicting the biological age of the host based on the microbiological profiles of gut microbiota using a curated dataset of 1,165 healthy individuals (1,663 microbiome samples). Our predictive model, a human microbiome clock, has an architecture of a deep neural network and achieves the accuracy of 3.94 years mean absolute error in cross-validation. The performance of the deep microbiome clock was also evaluated on several additional populations. We further introduce a platform for biological interpretation of individual microbial features used in age models, which relies on permutation feature importance and accumulated local effects. This approach has allowed us to define two lists of 95 intestinal biomarkers of human aging. We further show that this list can be reduced to 39 taxa that convey the most information on their host's aging. Overall, we show that (a) microbiological profiles can be used to predict human age; and (b) microbial features selected by models are age-related.

2: End-to-end differentiable learning of protein structure
more details view paper

Posted to bioRxiv 14 Feb 2018

End-to-end differentiable learning of protein structure
7,772 downloads bioinformatics

Mohammed AlQuraishi

Accurate prediction of protein structure is one of the central challenges of biochemistry. Despite significant progress made by co-evolution methods to predict protein structure from signatures of residue-residue coupling found in the evolutionary record, a direct and explicit mapping between protein sequence and structure remains elusive, with no substantial recent progress. Meanwhile, rapid developments in deep learning, which have found remarkable success in computer vision, natural language processing, and quantum chemistry raise the question of whether a deep learning based approach to protein structure could yield similar advancements. A key ingredient of the success of deep learning is the reformulation of complex, human-designed, multi-stage pipelines with differentiable models that can be jointly optimized end-to-end. We report the development of such a model, which reformulates the entire structure prediction pipeline using differentiable primitives. Achieving this required combining four technical ideas: (1) the adoption of a recurrent neural architecture to encode the internal representation of protein sequence, (2) the parameterization of (local) protein structure by torsional angles, which provides a way to reason over protein conformations without violating the covalent chemistry of protein chains, (3) the coupling of local protein structure to its global representation via recurrent geometric units, and (4) the use of a differentiable loss function to capture deviations between predicted and experimental structures. To our knowledge this is the first end-to-end differentiable model for learning of protein structure. We test the effectiveness of this approach using two challenging tasks: the prediction of novel protein folds without the use of co-evolutionary information, and the prediction of known protein folds without the use of structural templates. On the first task the model achieves state-of-the-art performance, even when compared to methods that rely on co-evolutionary data. On the second task the model is competitive with methods that use experimental protein structures as templates, achieving 3-7Å accuracy despite being template-free. Beyond protein structure prediction, end-to-end differentiable models of proteins represent a new paradigm for learning and modeling protein structure, with potential applications in docking, molecular dynamics, and protein design.

3: Moving beyond P values: Everyday data analysis with estimation plots
more details view paper

Posted to bioRxiv 26 Jul 2018

Moving beyond P values: Everyday data analysis with estimation plots
7,127 downloads bioinformatics

Joses Ho, Tayfun Tumkaya, Sameer Aryal, Hyungwon Choi, Adam Claridge-Chang

Over the past 75 years, a number of statisticians have advised that the data-analysis method known as null-hypothesis significance testing (NHST) should be deprecated (Berkson, 1942; Halsey et al., 2015). The limitations of NHST have been extensively discussed, with an emerging consensus that current statistical practice in the biological sciences needs reform. However, there is less agreement on the specific nature of reform, with vigorous debate surrounding what would constitute a suitable alternative (Altman et al., 2000; Benjamin et al., 2017; Cumming and Calin-Jageman, 2016). An emerging view is that a more complete analytic technique would use statistical graphics to estimate effect sizes and their uncertainty (Cohen, 1994; Cumming and Calin-Jageman, 2016). As these estimation methods require only minimal statistical retraining, they have great potential to change the current data-analysis culture away from dichotomous thinking towards quantitative reasoning (Claridge-Chang and Assam, 2016). The evolution of statistics has been inextricably linked to the development of improved quantitative displays that support complex visual reasoning (Tufte, 2001). We consider that the graphic we describe here as an estimation plot is the most intuitive way to display the complete statistical information about experimental data sets. However, a major obstacle to adopting estimation is accessibility to suitable software. To overcome this hurdle, we have developed free software that makes high-quality estimation plotting available to all. Here, we explain the rationale for estimation plots by contrasting them with conventional charts used to display NHST data, and describe how the use of these graphs affords five major analytical advantages.

4: Performance of neural network basecalling tools for Oxford Nanopore sequencing
more details view paper

Posted to bioRxiv 07 Feb 2019

Performance of neural network basecalling tools for Oxford Nanopore sequencing
7,091 downloads bioinformatics

Ryan R Wick, Louise M Judd, Kathryn Holt

Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rules consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences ('polishing') with Nanopolish somewhat negates the accuracy differences in basecallers, but prepolish accuracy does have an effect on post-polish accuracy. Basecalling accuracy has seen significant improvements over the last two years. The current version of ONT's Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.

5: Evaluation of UMAP as an alternative to t-SNE for single-cell data
more details view paper

Posted to bioRxiv 10 Apr 2018

Evaluation of UMAP as an alternative to t-SNE for single-cell data
6,165 downloads bioinformatics

Etienne Becht, Charles-Antoine Dutertre, Immanuel W. H. Kwok, Lai Guan Ng, Florent Ginhoux, Evan W Newell

Uniform Manifold Approximation and Projection (UMAP) is a recently-published non-linear dimensionality reduction technique. Another such algorithm, t-SNE, has been the default method for such task in the past years. Herein we comment on the usefulness of UMAP high-dimensional cytometry and single-cell RNA sequencing, notably highlighting faster runtime and consistency, meaningful organization of cell clusters and preservation of continuums in UMAP compared to t-SNE.

6: The art of using t-SNE for single-cell transcriptomics
more details view paper

Posted to bioRxiv 25 Oct 2018

The art of using t-SNE for single-cell transcriptomics
5,951 downloads bioinformatics

Dmitry Kobak, Philipp Berens

Single-cell transcriptomics yields ever growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g. the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls, and develop a protocol for creating more faithful t-SNE visualisations. It includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE.

7: Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes
more details view paper

Posted to bioRxiv 26 Jan 2019

Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes
5,419 downloads bioinformatics

Nicola De Maio, Liam P. Shaw, Alasdair Hubbard, Sophie George, Nick Sanderson, Jeremy Swann, Ryan Wick, Manal AbuOun, Emma Stubberfield, Sarah J Hoosdally, Derrick W Crook, Timothy E. A. Peto, Anna E Sheppard, Mark J. Bailey, Daniel S Read, Muna F. Anjum, A Sarah Walker, Nicole Stoesser, The REHAB consortium

Illumina sequencing allows rapid, cheap and accurate whole genome bacterial analyses, but short reads (<300 bp) do not usually enable complete genome assembly. Long read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly); however, it is not clear how different long-read sequencing methods impact on assembly accuracy. Relative automation of the assembly process is also crucial to facilitating high-throughput complete bacterial genome reconstruction, avoiding multiple bespoke filtering and data manipulation steps. In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either Oxford Nanopore Technologies (ONT) or from SMRT Pacific Biosciences (PacBio) sequencing platforms. We chose isolates from the Enterobacteriaceae family, as these frequently have highly plastic, repetitive genetic structures and complete genome reconstruction for these species is relevant for a precise understanding of the epidemiology of antimicrobial resistance. We de novo assembled genomes using the hybrid assembler Unicycler and compared different read processing strategies. Both strategies facilitate high-quality genome reconstruction. Combining ONT and Illumina reads fully resolved most genomes without additional manual steps, and at a lower cost per isolate in our setting. Automated hybrid assembly is a powerful tool for complete and accurate bacterial genome assembly.

8: MASST: A Web-based Basic Mass Spectrometry Search Tool for Molecules to Search Public Data.
more details view paper

Posted to bioRxiv 28 Mar 2019

MASST: A Web-based Basic Mass Spectrometry Search Tool for Molecules to Search Public Data.
5,291 downloads bioinformatics

Mingxun Wang, Alan K. Jarmusch, Fernando Vargas, Alexander A. Aksenov, Julia M. Gauglitz, Kelly Weldon, Daniel Petras, Ricardo da Silva, Robby Quinn, Alexey V Melnik, Justin J.J. van der Hooft, Andrés Mauricio Caraballo Rodríguez, Louis Felix Nothias, Christine M. Aceves, Morgan Panitchpakdi, Elizabeth Brown, Francesca Di Ottavio, Nicole Sikora, Emmanuel O. Elijah, Lara Labarta-Bajo, Emily C. Gentry, Shabnam Shalapour, Kathleen E. Kyle, Sara P. Puckett, Jeramie D. Watrous, Carolina S. Carpenter, Amina Bouslimani, Madeleine Ernst, Austin D Swafford, Elina I. Zúñiga, Marcy J. Balunas, Jonathan L. Klassen, Rohit Loomba, Rob Knight, Nuno Bandeira, Pieter C Dorrestein

We introduce a web-enabled small-molecule mass spectrometry (MS) search engine. To date, no tool can query all the public small-molecule tandem MS data in metabolomics repositories, greatly limiting the utility of these resources in clinical, environmental and natural product applications. Therefore, we introduce a Mass Spectrometry Search Tool (MASST) (https://proteosafe-extensions.ucsd.edu/masst/), that enables the discovery of molecular relationships among accessible public metabolomics and natural product tandem mass spectrometry data (MS/MS).

9: Using Deep Learning to Annotate the Protein Universe
more details view paper

Posted to bioRxiv 03 May 2019

Using Deep Learning to Annotate the Protein Universe
5,025 downloads bioinformatics

Maxwell L Bileschi, David Belanger, Drew H Bryant, Theo Sanderson, Brandon Carter, D. Sculley, Mark L DePristo, Lucy J Colwell

Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate 1/3 of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. In this paper, we explore an alternative methodology based on deep learning that learns the relationship between unaligned amino acid sequences and their functional annotations across all 17929 families of the Pfam database. Using the Pfam seed sequences we establish rigorous benchmark assessments that use both random and clustered data splits to control for potentially confounding sequence similarities between train and test sequences. Using Pfam full, we report convolutional networks that are significantly more accurate and computationally efficient than BLASTp, while learning sequence features such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space, allowing sequences from novel families to be accurately annotated. These results suggest deep learning models will be a core component of future protein function prediction tools.

10: Fast and accurate long-read assembly with wtdbg2
more details view paper

Posted to bioRxiv 26 Jan 2019

Fast and accurate long-read assembly with wtdbg2
4,925 downloads bioinformatics

Jue Ruan, Heng Li

Existing long-read assemblers require tens of thousands of CPU hours to assemble a human genome and are being outpaced by sequencing technologies in terms of both throughput and cost. We developed a novel long-read assembler wtdbg2 that, for human data, is tens of times faster than published tools while achieving comparable contiguity and accuracy. It represents a significant algorithmic advance and paves the way for population-scale long-read assembly in future.

11: A comparison of single-cell trajectory inference methods: towards more accurate and robust tools
more details view paper

Posted to bioRxiv 05 Mar 2018

A comparison of single-cell trajectory inference methods: towards more accurate and robust tools
4,667 downloads bioinformatics

Wouter Saelens, Robrecht Cannoodt, Helena Todorov, Yvan Saeys

Using single-cell -omics data, it is now possible to computationally order cells along trajectories, allowing the unbiased study of cellular dynamic processes. Since 2014, more than 50 trajectory inference methods have been developed, each with its own set of methodological characteristics. As a result, choosing a method to infer trajectories is often challenging, since a comprehensive assessment of the performance and robustness of each method is still lacking. In order to facilitate the comparison of the results of these methods to each other and to a gold standard, we developed a global framework to benchmark trajectory inference tools. Using this framework, we compared the trajectories from a total of 29 trajectory inference methods, on a large collection of real and synthetic datasets. We evaluate methods using several metrics, including accuracy of the inferred ordering, correctness of the network topology, code quality and user friendliness. We found that some methods, including Slingshot, TSCAN and Monocle DDRTree, clearly outperform other methods, although their performance depended on the type of trajectory present in the data. Based on our benchmarking results, we therefore developed a set of guidelines for method users. However, our analysis also indicated that there is still a lot of room for improvement, especially for methods detecting complex trajectory topologies. Our evaluation pipeline can therefore be used to spearhead the development of new scalable and more accurate methods, and is available at github.com/dynverse/dynverse. To our knowledge, this is the first comprehensive assessment of trajectory inference methods. For now, we exclusively evaluated the methods on their default parameters, but plan to add a detailed parameter tuning procedure in the future. We gladly welcome any discussion and feedback on key decisions made as part of this study, including the metrics used in the benchmark, the quality control checklist, and the implementation of the method wrappers. These discussions can be held at github.com/dynverse/dynverse/issues.

12: A comparison of three programming languages for a full-fledged next-generation sequencing tool
more details view paper

Posted to bioRxiv 22 Feb 2019

A comparison of three programming languages for a full-fledged next-generation sequencing tool
4,577 downloads bioinformatics

Pascal Costanza, Charlotte Herzeel, Wilfried Verachtert

Background: elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file for multiple preparation steps, and keeps sequencing data as much as possible in main memory. Similar to other SAM/BAM tools, management of heap memory is a complex task in elPrep, and it became a serious productivity bottleneck in its original implementation language during recent further development of elPrep. We therefore investigated three alternative programming languages: Go and Java using a concurrent, parallel garbage collector on the one hand, and C++17 using reference counting on the other hand for handling large amounts of heap objects. We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use. Results: The Go implementation performs best, yielding the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs. Our analysis shows that concurrent, parallel garbage collection is better at managing a large heap of objects than reference counting in our case. Conclusions: Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.

13: Visualizing Structure and Transitions for Biological Data Exploration
more details view paper

Posted to bioRxiv 24 Mar 2017

Visualizing Structure and Transitions for Biological Data Exploration
4,536 downloads bioinformatics

Kevin R Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B Burkhardt, William S Chen, Kristina Yim, Antonia van den Elzen, Matthew J Hirn, Ronald R. Coifman, Natalia B Ivanova, Guy Wolf, Smita Krishnaswamy

With the advent of high-throughput technologies measuring high-dimensional biological data, there is a pressing need for visualization tools that reveal the structure and emergent patterns of data in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure in data by an information-geometric distance between datapoints. We perform extensive comparison between PHATE and other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data including continual progressions, branches, and clusters. We define a manifold preservation metric DEMaP to show that PHATE produces quantitatively better denoised embeddings than existing visualization methods. We show that PHATE is able to gain unique insight from a newly generated scRNA-seq dataset of human germ layer differentiation. Here, PHATE reveals a dynamic picture of the main developmental branches in unparalleled detail, including the identification of three novel subpopulations. Finally, we show that PHATE is applicable to a wide variety of datatypes including mass cytometry, single-cell RNA-sequencing, Hi-C, and gut microbiome data, where it can generate interpretable insights into the underlying systems.

14: Third-generation sequencing and the future of genomics
more details view paper

Posted to bioRxiv 13 Apr 2016

Third-generation sequencing and the future of genomics
4,407 downloads bioinformatics

Hayan Lee, James Gurtowski, Shinjae Yoo, Maria Nattestad, Shoshana Marcus, Sara Goodwin, W. Richard McCombie, Michael C. Schatz

Third-generation long-range DNA sequencing and mapping technologies are creating a renaissance in high-quality genome sequencing. Unlike second-generation sequencing, which produces short reads a few hundred base-pairs long, third-generation single-molecule technologies generate over 10,000 bp reads or map over 100,000 bp molecules. We analyze how increased read lengths can be used to address long-standing problems in de novo genome assembly, structural variation analysis and haplotype phasing.

15: Opportunities And Obstacles For Deep Learning In Biology And Medicine
more details view paper

Posted to bioRxiv 28 May 2017

Opportunities And Obstacles For Deep Learning In Biology And Medicine
4,243 downloads bioinformatics

Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A. Kalinin, Brian T. Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M. Hoffman, Wei Xie, Gail L. Rosen, Benjamin J. Lengerich, Johnny Israeli, Jack Lanchantin, Stephen Woloszynek, Anne E Carpenter, Avanti Shrikumar, Jinbo Xu, Evan M. Cofer, Christopher A. Lavender, Srinivas Turaga, Amr Mohamed Alexandari, Zhiyong Lu, David J. Harris, Dave DeCaprio, Yanjun Qi, Anshul Kundaje, Yifan Peng, Laura K. Wiley, Marwin H. S. Segler, Simina M. Boca, S. Joshua Swamidass, Austin Huang, Anthony Gitter, Casey S. Greene

Deep learning, which describes a class of machine learning algorithms, has recently showed impressive results across a variety of domains. Biology and medicine are data rich, but the data are complex and often ill-understood. Problems of this nature may be particularly well-suited to deep learning techniques. We examine applications of deep learning to a variety of biomedical problems - patient classification, fundamental biological processes, and treatment of patients - and discuss whether deep learning will transform these tasks or if the biomedical sphere poses unique challenges. We find that deep learning has yet to revolutionize or definitively resolve any of these problems, but promising advances have been made on the prior state of the art. Even when improvement over a previous baseline has been modest, we have seen signs that deep learning methods may speed or aid human investigation. More work is needed to address concerns related to interpretability and how to best model each problem. Furthermore, the limited amount of labeled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning powering changes at both bench and bedside with the potential to transform several areas of biology and medicine.

16: Modular and efficient pre-processing of single-cell RNA-seq
more details view paper

Posted to bioRxiv 17 Jun 2019

Modular and efficient pre-processing of single-cell RNA-seq
4,121 downloads bioinformatics

Páll Melsted, A. Sina Booeshaghi, Fan Gao, Eduardo da Veiga Beltrame, Lambda Lu, Kristján Eldjárn Hjorleifsson, Jase Gehring, Lior Pachter

Analysis of single-cell RNA-seq data begins with pre-processing of sequencing reads to generate count matrices. We investigate algorithm choices for the challenges of pre-processing, and describe a workflow that balances efficiency and accuracy. Our workflow is based on the kallisto (<https://pachterlab.github.io/kallisto/>) and bustools (<https://bustools.github.io/>) programs, and is near-optimal in speed and memory. The workflow is modular, and we demonstrate its flexibility by showing how it can be used for RNA velocity analyses. Documentation and tutorials for using the kallisto | bus workflow are available at <https://www.kallistobus.tools/>.

17: A Systematic Evaluation of Single Cell RNA-Seq Analysis Pipelines
more details view paper

Posted to bioRxiv 19 Mar 2019

A Systematic Evaluation of Single Cell RNA-Seq Analysis Pipelines
4,089 downloads bioinformatics

Beate Vieth, Swati Parekh, Christoph Ziegenhain, Wolfgang Enard, Ines Hellmann

The recent rapid spread of single cell RNA sequencing (scRNA-seq) methods has created a large variety of experimental and computational pipelines for which best practices have not been established, yet. Here, we use simulations based on five scRNA-seq library protocols in combination with nine realistic differential expression (DE) setups to systematically evaluate three mapping, four imputation, seven normalisation and four differential expression testing approaches resulting in ∼ 3,000 pipelines, allowing us to also assess interactions among pipeline steps. We find that choices of normalisation and library preparation protocols have the biggest impact on scRNA-seq analyses. Specifically, we find that library preparation determines the ability to detect symmetric expression differences, while normalisation dominates pipeline performance in asymmetric DE-setups. Finally, we illustrate the importance of informed choices by showing that a good scRNA-seq pipeline can have the same impact on detecting a biological signal as quadrupling the sample size.

18: Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit
more details view paper

Posted to bioRxiv 26 Jul 2019

Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit
3,997 downloads bioinformatics

Kishwar Shafin, Trevor Pesout, Ryan Lorig-Roach, Marina Haukness, Hugh E Olsen, Colleen Bosworth, Joel Armstrong, Kristof Tigyi, Nicholas Maurer, Sergey Koren, Fritz J. Sedlazeck, Tobias Marschall, Simon Mayes, Vania Costa, Justin M Zook, Kelvin J Liu, Duncan Kilburn, Melanie Sorensen, Katy M Munson, Mitchell R. Vollger, Evan E Eichler, Sofie Salama, David Haussler, Richard E. Green, Mark Akeson, Adam Phillippy, Karen H Miga, Paolo Carnevali, Miten Jain, Benedict Paten

Present workflows for producing human genome assemblies from long-read technologies have cost and production time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63x coverage, 42 Kb read N50, 90% median read identity and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta - a de novo long read assembler, and MarginPolish & HELEN - a suite of nanopore assembly polishing algorithms. On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid and trio-binned human samples in terms of accuracy, cost, and time and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.

19: Deep learning in bioinformatics: introduction, application, and perspective in big data era
more details view paper

Posted to bioRxiv 28 Feb 2019

Deep learning in bioinformatics: introduction, application, and perspective in big data era
3,934 downloads bioinformatics

Yu Li, Chao Huang, Lizhong Ding, Zhongxiao Li, Yijie Pan, Xin Gao

Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated in vast majorities of analysis pipelines. In this review, we provide both the exoteric introduction of deep learning, and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems which are suitable to use deep learning. After that, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoder, and the most recent state-of-the-art architectures. After that, we provide eight examples, covering five bioinformatics research directions and all the four kinds of data type, with the implementation written in Tensorflow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at https://github.com/lykaust15/Deep_learning_examples .

20: A deep learning approach to pattern recognition for short DNA sequences
more details view paper

Posted to bioRxiv 22 Jun 2018

A deep learning approach to pattern recognition for short DNA sequences
3,825 downloads bioinformatics

Akosua Busia, George E. Dahl, Clara Fannjiang, David H. Alexander, Elizabeth Dorfman, Ryan Poplin, Cory Y. McLean, Pi-Chuan Chang, Mark DePristo

Motivation Inferring properties of biological sequences--such as determining the species-of-origin of a DNA sequence or the function of an amino-acid sequence--is a core task in many bioinformatics applications. These tasks are often solved using string-matching to map query sequences to labeled database sequences or via Hidden Markov Model-like pattern matching. In the current work we describe and assess a deep learning approach which trains a deep neural network (DNN) to predict database-derived labels directly from query sequences. Results We demonstrate this DNN performs at state-of-the-art or above levels on a difficult, practically important problem: predicting species-of-origin from short reads of 16S ribosomal DNA. When trained on 16S sequences of over 13,000 distinct species, our DNN achieves read-level species classification accuracy within 2.0% of perfect memorization of training data, and produces more accurate genus-level assignments for reads from held-out species than k -mer, alignment, and taxonomic binning baselines. Moreover, our models exhibit greater robustness than these existing approaches to increasing noise in the query sequences. Finally, we show that these DNNs perform well on experimental 16S mock community dataset. Overall, our results constitute a first step towards our long-term goal of developing a general-purpose deep learning approach to predicting meaningful labels from short biological sequences. Availability TensorFlow training code is available through GitHub (<https://github.com/tensorflow/models/tree/master/research>). Data in TensorFlow TFRecord format is available on Google Cloud Storage (gs://brain-genomics-public/research/seq2species/). Contact seq2species-interest{at}google.com Supplementary information Supplementary data are available in a separate document.

Previous page 1 2 3 4 5 . . . 306 Next page

Sign up for the Rxivist weekly newsletter! (Click here for more details.)