NS-Forest: A machine learning method for the objective identification of minimum marker gene combinations for cell type determination from single cell RNA sequencing

By Brian Aevermann, Yun Zhang, Mark Novotny, Trygve E. Bakken, Jeremy Andrew Miller, Rebecca D Hodge, Boudewijn Lelieveldt, Ed Lein, Richard H. Scheuermann

Posted 24 Sep 2020
bioRxiv DOI: 10.1101/2020.09.23.308932

Single cell genomics is rapidly advancing our knowledge of cell phenotypic types and states. Driven by single cell/nucleus RNA sequencing (scRNA-seq) data, comprehensive atlas projects covering a wide range of organisms and tissues are currently underway. As a result, it is critical that the cell transcriptional phenotypes discovered are defined and disseminated in a consistent and concise manner. Molecular biomarkers have historically played an important role in biological research, from defining immune cell types by surface protein expression to defining diseases by molecular drivers. Here we describe a machine learning-based marker gene selection algorithm, NS-Forest version 2.0, which leverages the non-linear attributes of random forest feature selection and a binary expression scoring approach to discover the minimal marker gene expression combinations that precisely capture the cell type identity represented in the complete scRNA-seq transcriptional profiles. The marker genes selected provide a barcode of the necessary and sufficient characteristics for semantic cell type definition and serve as useful tools for downstream biological investigation. The use of NS-Forest to identify marker genes for human brain middle temporal gyrus cell types reveals the importance of cell signaling and non-coding RNAs in neuronal cell type identity.

Competing Interest Statement

The authors have declared no competing interest.
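
As a rough illustration of the idea of searching for a minimal marker gene combination with binary expression scoring, the sketch below may help. It is not the NS-Forest implementation. It assumes a ranked list of candidate genes is already available (for example from random forest feature importances, which are not computed here), and it assumes an "all markers on" calling rule, fixed per-gene thresholds, and an F-beta score that favours precision. The function names, thresholds, and toy data are all hypothetical.

```python
from itertools import combinations


def fbeta(tp, fp, fn, beta=0.5):
    """F-beta from confusion counts; beta < 1 weights precision over recall."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def binarize(expr, thresholds):
    """Turn each cell's expression into on/off calls (assumed rule: value > threshold)."""
    return [{g: cell[g] > t for g, t in thresholds.items()} for cell in expr]


def best_marker_combo(expr, labels, target, candidates, thresholds,
                      max_size=3, beta=0.5):
    """Return (score, combo) for the smallest best-scoring gene combination.

    A cell is called `target` only when every gene in the combination is on.
    Smaller combinations are tried first and ties keep the earlier result,
    so the returned combination is minimal among equally scoring ones.
    """
    on = binarize(expr, thresholds)
    best = (0.0, ())
    for k in range(1, max_size + 1):
        for combo in combinations(candidates, k):
            tp = fp = fn = 0
            for cell, label in zip(on, labels):
                called = all(cell[g] for g in combo)
                if called and label == target:
                    tp += 1
                elif called:
                    fp += 1
                elif label == target:
                    fn += 1
            score = fbeta(tp, fp, fn, beta)
            if score > best[0]:
                best = (score, combo)
    return best


# Toy data (hypothetical): five cells, three genes, three cell types.
expr = [
    {"A": 5, "B": 4, "C": 0},  # type X
    {"A": 6, "B": 3, "C": 0},  # type X
    {"A": 5, "B": 0, "C": 4},  # type Y
    {"A": 0, "B": 5, "C": 0},  # type Z
    {"A": 0, "B": 0, "C": 3},  # type Y
]
labels = ["X", "X", "Y", "Z", "Y"]
thresholds = {"A": 1, "B": 1, "C": 1}   # assumed fixed cut-offs
candidates = ["A", "B", "C"]            # assumed to come from an upstream ranking

score, combo = best_marker_combo(expr, labels, "X", candidates, thresholds)
print(combo, round(score, 3))
```

In this toy data neither gene A nor gene B alone separates type X from the other types, but the pair does, which is the kind of combinatorial marker the abstract describes.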

Download data

  • Downloaded 558 times
  • Download rankings, all-time:
    • Site-wide: 67,056
    • In bioinformatics: 6,365
  • Year to date:
    • Site-wide: 56,035
  • Since beginning of last month:
    • Site-wide: 32,311
