Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 54,982 bioRxiv papers from 253,713 authors.

Most downloaded bioRxiv papers, year to date

in category bioinformatics

5,465 results found.

1: Human microbiome aging clocks based on deep learning and tandem of permutation feature importance and accumulated local effects
Posted to bioRxiv 28 Dec 2018

6,899 downloads bioinformatics

Fedor Galkin, Alexander Aliper, Evgeny Putin, Igor Kuznetsov, Vadim N Gladyshev, Alex Zhavoronkov

The human gut microbiome is a complex ecosystem that both affects and is affected by its host status. Previous analyses of gut microflora revealed associations between specific microbes and host health and disease status, genotype and diet. Here, we developed a method of predicting the biological age of the host based on the microbiological profiles of gut microbiota using a curated dataset of 1,165 healthy individuals (1,663 microbiome samples). Our predictive model, a human microbiome clock, has an architecture of a deep neural network and achieves the accuracy of 3.94 years mean absolute error in cross-validation. The performance of the deep microbiome clock was also evaluated on several additional populations. We further introduce a platform for biological interpretation of individual microbial features used in age models, which relies on permutation feature importance and accumulated local effects. This approach has allowed us to define two lists of 95 intestinal biomarkers of human aging. We further show that this list can be reduced to 39 taxa that convey the most information on their host's aging. Overall, we show that (a) microbiological profiles can be used to predict human age; and (b) microbial features selected by models are age-related.
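As a rough illustration of the permutation feature importance idea named in the abstract above, the sketch below shuffles one feature at a time and records how much the mean absolute error grows. The toy data, `toy_model`, and every function name here are hypothetical stand-ins, not the authors' deep neural network or their code.

```python
import random


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, n_repeats=20, seed=0):
    """Mean increase in MAE when each feature column is shuffled across samples."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y_true, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other feature untouched.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            mae = mean_absolute_error(y_true, [predict(r) for r in shuffled])
            increases.append(mae - baseline)
        importances.append(sum(increases) / n_repeats)
    return baseline, importances


# Hypothetical toy data: feature 0 drives "age", feature 1 is pure noise.
data_rng = random.Random(1)
rows = [[data_rng.uniform(0, 1), data_rng.uniform(0, 1)] for _ in range(200)]
ages = [20 + 60 * r[0] for r in rows]


def toy_model(row):
    # Stand-in for a trained regressor; it ignores feature 1 entirely.
    return 20 + 60 * row[0]


baseline_mae, importances = permutation_importance(toy_model, rows, ages)
```

In this toy setup, shuffling feature 0 raises the error while shuffling feature 1 leaves it unchanged, which is the signal a permutation-based ranking relies on.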

2: End-to-end differentiable learning of protein structure
Posted to bioRxiv 14 Feb 2018

5,767 downloads bioinformatics

Mohammed AlQuraishi

Accurate prediction of protein structure is one of the central challenges of biochemistry. Despite significant progress made by co-evolution methods to predict protein structure from signatures of residue-residue coupling found in the evolutionary record, a direct and explicit mapping between protein sequence and structure remains elusive, with no substantial recent progress. Meanwhile, rapid developments in deep learning, which have found remarkable success in computer vision, natural language processing, and quantum chemistry raise the question of whether a deep learning based approach to protein structure could yield similar advancements. A key ingredient of the success of deep learning is the reformulation of complex, human-designed, multi-stage pipelines with differentiable models that can be jointly optimized end-to-end. We report the development of such a model, which reformulates the entire structure prediction pipeline using differentiable primitives. Achieving this required combining four technical ideas: (1) the adoption of a recurrent neural architecture to encode the internal representation of protein sequence, (2) the parameterization of (local) protein structure by torsional angles, which provides a way to reason over protein conformations without violating the covalent chemistry of protein chains, (3) the coupling of local protein structure to its global representation via recurrent geometric units, and (4) the use of a differentiable loss function to capture deviations between predicted and experimental structures. To our knowledge this is the first end-to-end differentiable model for learning of protein structure. We test the effectiveness of this approach using two challenging tasks: the prediction of novel protein folds without the use of co-evolutionary information, and the prediction of known protein folds without the use of structural templates. 
On the first task the model achieves state-of-the-art performance, even when compared to methods that rely on co-evolutionary data. On the second task the model is competitive with methods that use experimental protein structures as templates, achieving 3-7Å accuracy despite being template-free. Beyond protein structure prediction, end-to-end differentiable models of proteins represent a new paradigm for learning and modeling protein structure, with potential applications in docking, molecular dynamics, and protein design.

3: Moving beyond P values: Everyday data analysis with estimation plots
Posted to bioRxiv 26 Jul 2018

5,190 downloads bioinformatics

Joses Ho, Tayfun Tumkaya, Sameer Aryal, Hyungwon Choi, Adam Claridge-Chang

Over the past 75 years, a number of statisticians have advised that the data-analysis method known as null-hypothesis significance testing (NHST) should be deprecated (Berkson, 1942; Halsey et al., 2015). The limitations of NHST have been extensively discussed, with an emerging consensus that current statistical practice in the biological sciences needs reform. However, there is less agreement on the specific nature of reform, with vigorous debate surrounding what would constitute a suitable alternative (Altman et al., 2000; Benjamin et al., 2017; Cumming and Calin-Jageman, 2016). An emerging view is that a more complete analytic technique would use statistical graphics to estimate effect sizes and their uncertainty (Cohen, 1994; Cumming and Calin-Jageman, 2016). As these estimation methods require only minimal statistical retraining, they have great potential to change the current data-analysis culture away from dichotomous thinking towards quantitative reasoning (Claridge-Chang and Assam, 2016). The evolution of statistics has been inextricably linked to the development of improved quantitative displays that support complex visual reasoning (Tufte, 2001). We consider that the graphic we describe here as an estimation plot is the most intuitive way to display the complete statistical information about experimental data sets. However, a major obstacle to adopting estimation is accessibility to suitable software. To overcome this hurdle, we have developed free software that makes high-quality estimation plotting available to all. Here, we explain the rationale for estimation plots by contrasting them with conventional charts used to display NHST data, and describe how the use of these graphs affords five major analytical advantages.
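To make the idea of estimating an effect size and its uncertainty concrete, here is a minimal sketch that computes a mean difference between two groups together with a percentile bootstrap interval. The choice of a percentile bootstrap, the sample values, and the function name are assumptions made for illustration; this is not the authors' software.

```python
import random
import statistics


def bootstrap_mean_difference(control, test, n_resamples=5000, ci=0.95, seed=0):
    """Mean difference (test - control) with a percentile bootstrap interval."""
    rng = random.Random(seed)
    observed = statistics.mean(test) - statistics.mean(control)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(test, k=len(test))
        diffs.append(statistics.mean(t) - statistics.mean(c))
    diffs.sort()
    lo_idx = int((1 - ci) / 2 * n_resamples)
    hi_idx = int((1 + ci) / 2 * n_resamples) - 1
    return observed, diffs[lo_idx], diffs[hi_idx]


# Hypothetical measurements for two experimental groups.
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6]
test = [6.0, 6.4, 5.8, 6.9, 6.2, 6.6, 5.9, 6.3]

effect, lower, upper = bootstrap_mean_difference(control, test)
```

Reporting `effect` with `[lower, upper]` gives a magnitude and its uncertainty rather than a single significant or not-significant verdict.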

4: MASST: A Web-based Basic Mass Spectrometry Search Tool for Molecules to Search Public Data.
Posted to bioRxiv 28 Mar 2019

5,014 downloads bioinformatics

Mingxun Wang, Alan K. Jarmusch, Fernando Vargas, Alexander A. Aksenov, Julia Gauglitz, Kelly Weldon, Daniel Petras, Ricardo da Silva, Robby Quinn, Alexey Melnik, Justin J.J. van der Hooft, Andres Mauricio Caraballo Rodriguez, Louis Felix Nothias, Christine M. Aceves, Morgan Panitchpakdi, Elizabeth Brown, Francesca Di Ottavio, Nicole Sikora, Emmanuel O. Elijah, Lara Labarta-Bajo, Emily G. Gentry, Shabnam Shalapour, Kathleen E. Kyle, Sara P. Puckett, Jeramie D. Watrous, Carolina S. Carpenter, Amina Bouslimani, Madeleine Ernst, Austin D Swafford, Elina I Zuniga, Marcy J. Balunas, Jonathan L. Klassen, Rohit Loomba, Rob Knight, Nuno Bandeira, Pieter C Dorrestein

We introduce a web-enabled small-molecule mass spectrometry (MS) search engine. To date, no tool can query all the public small-molecule tandem MS data in metabolomics repositories, greatly limiting the utility of these resources in clinical, environmental and natural product applications. Therefore, we introduce a Mass Spectrometry Search Tool (MASST) (https://proteosafe-extensions.ucsd.edu/masst/), that enables the discovery of molecular relationships among accessible public metabolomics and natural product tandem mass spectrometry data (MS/MS).

5: Performance of neural network basecalling tools for Oxford Nanopore sequencing
Posted to bioRxiv 07 Feb 2019

4,892 downloads bioinformatics

Ryan R Wick, Louise M Judd, Kathryn Holt

Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rules consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences ('polishing') with Nanopolish somewhat negates the accuracy differences in basecallers, but prepolish accuracy does have an effect on post-polish accuracy. Basecalling accuracy has seen significant improvements over the last two years. The current version of ONT's Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.
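The two accuracy levels discussed above, per-read accuracy and majority-rules consensus accuracy, can be sketched on toy data as follows. This assumes the reads are already aligned to the reference and have equal length, with no insertions or deletions, which real basecalled reads do not satisfy; the sequences and function names are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Pick the most common base in each alignment column."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


def identity(sequence, reference):
    """Fraction of reference positions that the sequence matches."""
    matches = sum(a == b for a, b in zip(sequence, reference))
    return matches / len(reference)


# Hypothetical reference and three reads, each with one substitution error.
reference = "ACGTACGT"
reads = ["ACGTACGA", "ACCTACGT", "ACGTTCGT"]

read_identities = [identity(r, reference) for r in reads]
consensus = majority_consensus(reads)
consensus_identity = identity(consensus, reference)
```

Because the errors fall at different positions, the consensus recovers the reference even though every individual read contains a mistake. Systematic errors that recur at the same position in most reads would survive the vote.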

6: Evaluation of UMAP as an alternative to t-SNE for single-cell data
Posted to bioRxiv 10 Apr 2018

4,577 downloads bioinformatics

Etienne Becht, Charles-Antoine Dutertre, Immanuel W. H. Kwok, Lai Guan Ng, Florent Ginhoux, Evan W Newell

Uniform Manifold Approximation and Projection (UMAP) is a recently published non-linear dimensionality reduction technique. Another such algorithm, t-SNE, has been the default method for such tasks in recent years. Herein we comment on the usefulness of UMAP for high-dimensional cytometry and single-cell RNA sequencing, notably highlighting faster runtime and consistency, meaningful organization of cell clusters and preservation of continuums in UMAP compared to t-SNE.

7: The art of using t-SNE for single-cell transcriptomics
Posted to bioRxiv 25 Oct 2018

4,072 downloads bioinformatics

Dmitry Kobak, Philipp Berens

Single-cell transcriptomics yields ever growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g. the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls, and develop a protocol for creating more faithful t-SNE visualisations. It includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE.

8: A comparison of three programming languages for a full-fledged next-generation sequencing tool
Posted to bioRxiv 22 Feb 2019

3,971 downloads bioinformatics

Pascal Costanza, Charlotte Herzeel, Wilfried Verachtert

Background: elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file for multiple preparation steps, and keeps sequencing data as much as possible in main memory. Similar to other SAM/BAM tools, management of heap memory is a complex task in elPrep, and it became a serious productivity bottleneck in its original implementation language during recent further development of elPrep. We therefore investigated three alternative programming languages: Go and Java using a concurrent, parallel garbage collector on the one hand, and C++17 using reference counting on the other hand for handling large amounts of heap objects. We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use. Results: The Go implementation performs best, yielding the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs. Our analysis shows that concurrent, parallel garbage collection is better at managing a large heap of objects than reference counting in our case. Conclusions: Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.

9: Fast and accurate long-read assembly with wtdbg2
Posted to bioRxiv 26 Jan 2019

3,900 downloads bioinformatics

Jue Ruan, Heng Li

Existing long-read assemblers require tens of thousands of CPU hours to assemble a human genome and are being outpaced by sequencing technologies in terms of both throughput and cost. We developed a novel long-read assembler wtdbg2 that, for human data, is tens of times faster than published tools while achieving comparable contiguity and accuracy. It represents a significant algorithmic advance and paves the way for population-scale long-read assembly in future.

10: A comparison of single-cell trajectory inference methods: towards more accurate and robust tools
Posted to bioRxiv 05 Mar 2018

3,772 downloads bioinformatics

Wouter Saelens, Robrecht Cannoodt, Helena Todorov, Yvan Saeys

Using single-cell -omics data, it is now possible to computationally order cells along trajectories, allowing the unbiased study of cellular dynamic processes. Since 2014, more than 50 trajectory inference methods have been developed, each with its own set of methodological characteristics. As a result, choosing a method to infer trajectories is often challenging, since a comprehensive assessment of the performance and robustness of each method is still lacking. In order to facilitate the comparison of the results of these methods to each other and to a gold standard, we developed a global framework to benchmark trajectory inference tools. Using this framework, we compared the trajectories from a total of 29 trajectory inference methods, on a large collection of real and synthetic datasets. We evaluate methods using several metrics, including accuracy of the inferred ordering, correctness of the network topology, code quality and user friendliness. We found that some methods, including Slingshot, TSCAN and Monocle DDRTree, clearly outperform other methods, although their performance depended on the type of trajectory present in the data. Based on our benchmarking results, we therefore developed a set of guidelines for method users. However, our analysis also indicated that there is still a lot of room for improvement, especially for methods detecting complex trajectory topologies. Our evaluation pipeline can therefore be used to spearhead the development of new scalable and more accurate methods, and is available at github.com/dynverse/dynverse. To our knowledge, this is the first comprehensive assessment of trajectory inference methods. For now, we exclusively evaluated the methods on their default parameters, but plan to add a detailed parameter tuning procedure in the future. 
We gladly welcome any discussion and feedback on key decisions made as part of this study, including the metrics used in the benchmark, the quality control checklist, and the implementation of the method wrappers. These discussions can be held at github.com/dynverse/dynverse/issues.

11: Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes
Posted to bioRxiv 26 Jan 2019

3,515 downloads bioinformatics

Nicola De Maio, Liam P. Shaw, Alasdair Hubbard, Sophie George, Nick Sanderson, Jeremy Swann, Ryan Wick, Manal AbuOun, Emma Stubberfield, Sarah J Hoosdally, Derrick W Crook, Timothy E. A. Peto, Anna E Sheppard, Mark J. Bailey, Daniel S Read, Muna F. Anjum, A Sarah Walker, Nicole Stoesser, The REHAB consortium

Illumina sequencing allows rapid, cheap and accurate whole genome bacterial analyses, but short reads (<300 bp) do not usually enable complete genome assembly. Long read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly); however, it is not clear how different long-read sequencing methods impact on assembly accuracy. Relative automation of the assembly process is also crucial to facilitating high-throughput complete bacterial genome reconstruction, avoiding multiple bespoke filtering and data manipulation steps. In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either Oxford Nanopore Technologies (ONT) or from SMRT Pacific Biosciences (PacBio) sequencing platforms. We chose isolates from the Enterobacteriaceae family, as these frequently have highly plastic, repetitive genetic structures and complete genome reconstruction for these species is relevant for a precise understanding of the epidemiology of antimicrobial resistance. We de novo assembled genomes using the hybrid assembler Unicycler and compared different read processing strategies. Both strategies facilitate high-quality genome reconstruction. Combining ONT and Illumina reads fully resolved most genomes without additional manual steps, and at a lower cost per isolate in our setting. Automated hybrid assembly is a powerful tool for complete and accurate bacterial genome assembly.

12: Opportunities And Obstacles For Deep Learning In Biology And Medicine
Posted to bioRxiv 28 May 2017

3,465 downloads bioinformatics

Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A. Kalinin, Brian T. Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M. Hoffman, Wei Xie, Gail L. Rosen, Benjamin J. Lengerich, Johnny Israeli, Jack Lanchantin, Stephen Woloszynek, Anne E Carpenter, Avanti Shrikumar, Jinbo Xu, Evan M. Cofer, Christopher A. Lavender, Srinivas Turaga, Amr Mohamed Alexandari, Zhiyong Lu, David J. Harris, Dave DeCaprio, Yanjun Qi, Anshul Kundaje, Yifan Peng, Laura K. Wiley, Marwin H. S. Segler, Simina M. Boca, S. Joshua Swamidass, Austin Huang, Anthony Gitter, Casey S. Greene

Deep learning, which describes a class of machine learning algorithms, has recently shown impressive results across a variety of domains. Biology and medicine are data rich, but the data are complex and often ill-understood. Problems of this nature may be particularly well-suited to deep learning techniques. We examine applications of deep learning to a variety of biomedical problems - patient classification, fundamental biological processes, and treatment of patients - and discuss whether deep learning will transform these tasks or if the biomedical sphere poses unique challenges. We find that deep learning has yet to revolutionize or definitively resolve any of these problems, but promising advances have been made on the prior state of the art. Even when improvement over a previous baseline has been modest, we have seen signs that deep learning methods may speed or aid human investigation. More work is needed to address concerns related to interpretability and how to best model each problem. Furthermore, the limited amount of labeled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning powering changes at both bench and bedside with the potential to transform several areas of biology and medicine.

13: Third-generation sequencing and the future of genomics
Posted to bioRxiv 13 Apr 2016

3,258 downloads bioinformatics

Hayan Lee, James Gurtowski, Shinjae Yoo, Maria Nattestad, Shoshana Marcus, Sara Goodwin, W. Richard McCombie, Michael Schatz

Third-generation long-range DNA sequencing and mapping technologies are creating a renaissance in high-quality genome sequencing. Unlike second-generation sequencing, which produces short reads a few hundred base-pairs long, third-generation single-molecule technologies generate over 10,000 bp reads or map over 100,000 bp molecules. We analyze how increased read lengths can be used to address long-standing problems in de novo genome assembly, structural variation analysis and haplotype phasing.

14: Deep learning in bioinformatics: introduction, application, and perspective in big data era
Posted to bioRxiv 28 Feb 2019

3,126 downloads bioinformatics

Yu Li, Chao Huang, Lizhong Ding, Zhongxiao Li, Yijie Pan, Xin Gao

Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated in vast majorities of analysis pipelines. In this review, we provide both the exoteric introduction of deep learning, and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems which are suitable to use deep learning. After that, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoder, and the most recent state-of-the-art architectures. After that, we provide eight examples, covering five bioinformatics research directions and all the four kinds of data type, with the implementation written in Tensorflow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at https://github.com/lykaust15/Deep_learning_examples .

15: A deep learning approach to pattern recognition for short DNA sequences
Posted to bioRxiv 22 Jun 2018

2,671 downloads bioinformatics

Akosua Busia, George E. Dahl, Clara Fannjiang, David H. Alexander, Elizabeth Dorfman, Ryan Poplin, Cory Y. McLean, Pi-Chuan Chang, Mark DePristo

Motivation: Inferring properties of biological sequences--such as determining the species-of-origin of a DNA sequence or the function of an amino-acid sequence--is a core task in many bioinformatics applications. These tasks are often solved using string-matching to map query sequences to labeled database sequences or via Hidden Markov Model-like pattern matching. In the current work we describe and assess a deep learning approach which trains a deep neural network (DNN) to predict database-derived labels directly from query sequences. Results: We demonstrate this DNN performs at state-of-the-art or above levels on a difficult, practically important problem: predicting species-of-origin from short reads of 16S ribosomal DNA. When trained on 16S sequences of over 13,000 distinct species, our DNN achieves read-level species classification accuracy within 2.0% of perfect memorization of training data, and produces more accurate genus-level assignments for reads from held-out species than k-mer, alignment, and taxonomic binning baselines. Moreover, our models exhibit greater robustness than these existing approaches to increasing noise in the query sequences. Finally, we show that these DNNs perform well on an experimental 16S mock community dataset. Overall, our results constitute a first step towards our long-term goal of developing a general-purpose deep learning approach to predicting meaningful labels from short biological sequences. Availability: TensorFlow training code is available through GitHub (https://github.com/tensorflow/models/tree/master/research). Data in TensorFlow TFRecord format is available on Google Cloud Storage (gs://brain-genomics-public/research/seq2species/). Contact: seq2species-interest@google.com. Supplementary information: Supplementary data are available in a separate document.

16: Generative modeling and latent space arithmetics predict single-cell perturbation response across cell types, studies and species
Posted to bioRxiv 29 Nov 2018

2,668 downloads bioinformatics

Mohammad Lotfollahi, F. Alexander Wolf, Fabian J. Theis

Accurately modeling cellular response to perturbations is a central goal of computational biology. While such modeling has been proposed based on statistical, mechanistic and machine learning models in specific settings, no generalization of predictions to phenomena absent from training data ('out-of-sample') has yet been demonstrated. Here, we present scGen, a model combining variational autoencoders and latent space vector arithmetics for high-dimensional single-cell gene expression data. In benchmarks across a broad range of examples, we show that scGen accurately models dose and infection response of cells across cell types, studies and species. In particular, we demonstrate that scGen learns cell type and species specific response implying that it captures features that distinguish responding from non-responding genes and cells. With the upcoming availability of large-scale atlases of organs in healthy state, we envision scGen to become a tool for experimental design through in silico screening of perturbation response in the context of disease and drug treatment.
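The latent space vector arithmetic mentioned above can be sketched in a few lines, assuming the cells have already been encoded into latent vectors. The encoder and decoder of the variational autoencoder are omitted, and the two-dimensional vectors and function names below are hypothetical, not scGen's API.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def perturbation_delta(z_control, z_stimulated):
    """Difference between the mean stimulated and mean control latent vectors."""
    mean_ctrl = mean_vector(z_control)
    mean_stim = mean_vector(z_stimulated)
    return [s - c for s, c in zip(mean_stim, mean_ctrl)]


def predict_stimulated(z_cells, delta):
    """Shift unperturbed latent vectors by the learned perturbation direction."""
    return [[z + d for z, d in zip(cell, delta)] for cell in z_cells]


# Hypothetical latent codes: cell type A observed in both conditions,
# cell type B observed only as control.
z_control_a = [[0.0, 1.0], [2.0, 1.0]]
z_stimulated_a = [[1.0, 3.0], [3.0, 3.0]]
z_control_b = [[5.0, 5.0]]

delta = perturbation_delta(z_control_a, z_stimulated_a)
z_predicted_b = predict_stimulated(z_control_b, delta)
```

In a full model, the shifted vectors would then be passed through the decoder to obtain predicted expression profiles for the unseen condition.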

17: Visualizing Structure and Transitions for Biological Data Exploration
Posted to bioRxiv 24 Mar 2017

2,574 downloads bioinformatics

Kevin R Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel Burkhardt, William Chen, Kristina Yim, Antonia van den Elzen, Matthew J Hirn, Ronald R. Coifman, Natalia B Ivanova, Guy Wolf, Smita Krishnaswamy

With the advent of high-throughput technologies measuring high-dimensional biological data, there is a pressing need for visualization tools that reveal the structure and emergent patterns of data in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure in data by an information-geometric distance between datapoints. We perform extensive comparison between PHATE and other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data including continual progressions, branches, and clusters. We define a manifold preservation metric DEMaP to show that PHATE produces quantitatively better denoised embeddings than existing visualization methods. We show that PHATE is able to gain unique insight from a newly generated scRNA-seq dataset of human germ layer differentiation. Here, PHATE reveals a dynamic picture of the main developmental branches in unparalleled detail, including the identification of three novel subpopulations. Finally, we show that PHATE is applicable to a wide variety of datatypes including mass cytometry, single-cell RNA-sequencing, Hi-C, and gut microbiome data, where it can generate interpretable insights into the underlying systems.

18: Droplet scRNA-seq is not zero-inflated
Posted to bioRxiv 19 Mar 2019

2,523 downloads bioinformatics

Valentine Svensson

Potential users of single cell RNA-sequencing often encounter a choice between high-throughput droplet based methods and high sensitivity plate based methods. In particular there is a widespread belief that single-cell RNA-sequencing will often fail to generate measurements for particular gene, cell pairs due to molecular inefficiencies, causing data to have an overabundance of zero-values. Investigation of published data of technical controls in droplet based single cell RNA-seq experiments demonstrates the number of zeros in the data is consistent with count statistics, indicating that over-abundances of zero-values in biological data are likely due to biological variation as opposed to technical shortcomings.
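One way to read "consistent with count statistics" is to compare the observed fraction of zeros for a gene with the fraction a plain count distribution would predict at the same mean. The sketch below does this for Poisson and negative binomial models. The counts, the dispersion value, and the function names are hypothetical, and the parameterization (variance = mean + dispersion × mean²) is an assumption of this example.

```python
import math


def expected_zero_fraction_poisson(mean_count):
    """P(count = 0) for a Poisson-distributed count with the given mean."""
    return math.exp(-mean_count)


def expected_zero_fraction_negbin(mean_count, dispersion):
    """P(count = 0) for a negative binomial with variance mean + dispersion * mean**2."""
    if dispersion == 0:
        return expected_zero_fraction_poisson(mean_count)
    return (1 + dispersion * mean_count) ** (-1 / dispersion)


def observed_zero_fraction(counts):
    """Fraction of observations that are exactly zero."""
    return sum(1 for c in counts if c == 0) / len(counts)


# Hypothetical counts for one gene across ten droplets.
counts = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
mean_count = sum(counts) / len(counts)

observed = observed_zero_fraction(counts)
poisson_zeros = expected_zero_fraction_poisson(mean_count)
negbin_zeros = expected_zero_fraction_negbin(mean_count, dispersion=1.0)
```

If the observed fraction sits close to what such a model predicts, there is no excess of zeros that would call for a separate zero-inflation component.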

19: So you think you can PLS-DA?
Posted to bioRxiv 21 Oct 2017

2,437 downloads bioinformatics

Daniel Ruiz Perez, Giri Narasimhan

Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a useful feature selector and classifier. In an effort to understand its strengths and weaknesses, we performed a series of experiments with synthetic data and compared its performance to its close relative from which it was initially invented, namely Principal Component Analysis (PCA). We demonstrate that even though PCA ignores the information regarding the class labels of the samples, this unsupervised tool can be remarkably effective as a dimensionality reducer and a feature selector. In some cases, it outperforms PLS-DA, which is made aware of the class labels in its input. Our experiments range from looking at the signal-to-noise ratio in the feature selection task, to considering many practical distributions and models for the synthetic data sets used. Our experiments consider many useful distributions encountered when analyzing bioinformatics and clinical data, especially in the context of machine learning, where it is hoped that the program automatically extracts and/or learns the hidden relationships.

20: Harmonization and Annotation of Single-cell Transcriptomics data with Deep Generative Models
Posted to bioRxiv 29 Jan 2019

2,427 downloads bioinformatics

Chenling Xu, Romain Lopez, Edouard Mehlman, Jeffrey Regier, Michael I Jordan, Nir Yosef

As single-cell transcriptomics becomes a mainstream technology, the natural next step is to integrate the accumulating data in order to achieve a common ontology of cell types and states. However, owing to various nuisance factors of variation, it is not straightforward how to compare gene expression levels across data sets and how to automatically assign cell type labels in a new data set based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of cohorts of single-cell RNA-seq data sets, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage any available cell state annotations --- for instance when only one data set in a cohort is annotated, or when only a few cells in a single data set can be labeled using marker genes. We demonstrate that scVI and scANVI compare favorably to the existing methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings such as a hierarchical structure of cell state labels. We further show that different from existing methods, scVI and scANVI represent the integrated datasets with a single generative model that can be directly used for any probabilistic decision making task, using differential expression as our case study. scVI and scANVI are available as open source software and can be readily used to facilitate cell state annotation and help ensure consistency and reproducibility across studies.
