Background: Predicting outcomes in human genetic studies is difficult because the number of variables (genes) is often much larger than the number of observations (human subject tissue samples). We investigated means of improving model performance on the types of under-constrained problems that are typical in human genetics, where the genes (features) are strongly correlated and may exceed 10,000 in number, while the number of study participants (observations) may be limited to under 1,000. Methods: We created 'train', 'validate' and 'test' datasets from 240 microarray observations of 127 subjects diagnosed with autism spectrum disorder (ASD) and 113 'typically developing' (TD) subjects. We trained a neural network model (the 'naive' model) on 10,422 genes using the 'train' dataset, composed of 70 ASD and 65 TD subjects. We restricted the model to a single fully connected hidden layer to minimize the number of trainable parameters, and included a drop-out layer to further thin the network. We experimented with alternative network architectures, tuned the hyperparameters using the 'validate' dataset, and performed a single, final evaluation on the hold-out 'test' dataset. Next, we trained a neural network model with the identical architecture and identical genes to predict tissue type in GTEx data. We transferred that learning by replacing the top layer of the GTEx model with a layer that predicts ASD outcome, and we retrained on the ASD dataset, again using the identical 10,422 genes. Findings: The 'naive' neural network model achieved AUROC=0.58 on the task of predicting ASD outcomes; transfer learning yielded a statistically significant 7.8% improvement.
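The under-constrained setting and the top-layer swap described above can be sketched in NumPy as follows. Only the 10,422-gene input width comes from the abstract; the hidden width, dropout rate, and GTEx tissue-class count are illustrative assumptions (the paper tuned its hyperparameters on the 'validate' set), and the sketch omits training entirely.

```python
import numpy as np

N_GENES = 10_422   # input features, as in the paper
N_HIDDEN = 64      # hypothetical hidden width, not from the paper
N_TISSUES = 30     # hypothetical number of GTEx tissue classes
DROPOUT = 0.5      # hypothetical drop-out rate

rng = np.random.default_rng(0)

def init_dense(n_in, n_out):
    """One fully connected layer with He-style weight initialization."""
    return {"W": rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)),
            "b": np.zeros(n_out)}

def n_params(model):
    """Count trainable parameters across all layers."""
    return sum(layer["W"].size + layer["b"].size for layer in model.values())

def forward(model, x, train=False):
    """One hidden ReLU layer, optional inverted dropout, sigmoid output."""
    h = np.maximum(x @ model["hidden"]["W"] + model["hidden"]["b"], 0.0)
    if train:  # dropout randomly thins the network during training
        h *= (rng.random(h.shape) >= DROPOUT) / (1.0 - DROPOUT)
    logits = h @ model["out"]["W"] + model["out"]["b"]
    return 1.0 / (1.0 + np.exp(-logits))  # probability of ASD outcome

# 'Naive' model: one fully connected hidden layer feeding a binary output.
# Even this minimal architecture has hundreds of thousands of weights,
# versus only ~135 'train' subjects -- the under-constrained regime.
naive = {"hidden": init_dense(N_GENES, N_HIDDEN),
         "out": init_dense(N_HIDDEN, 1)}

# GTEx model: identical architecture, but a multi-class tissue-type head.
gtex = {"hidden": init_dense(N_GENES, N_HIDDEN),
        "out": init_dense(N_HIDDEN, N_TISSUES)}

# Transfer learning: keep the GTEx hidden layer, replace the top layer
# with a fresh binary ASD head, then retrain on the ASD 'train' set.
transferred = {"hidden": gtex["hidden"],        # reused representation
               "out": init_dense(N_HIDDEN, 1)}  # new ASD prediction head
```

The design point the sketch makes concrete: the hidden layer dominates the parameter count, so reusing it from a model trained on abundant GTEx data leaves only the small top layer to be learned from scratch on the limited ASD dataset.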
Interpretation: We demonstrated that neural network learning can be transferred from models trained on large RNA-Seq gene expression datasets to a model trained on a small microarray gene expression dataset, with clinical utility in mitigating over-training on small sample sizes. Incidentally, we built a highly accurate classifier of tissue type with which to perform the transfer learning. Author Summary: Image recognition and natural language processing have enjoyed great success in reusing computational effort and data sources to overcome the problem of over-training a neural network on a limited dataset. Other domains that use deep learning, including genomics and clinical applications, have been slower to benefit from transfer learning. Here we demonstrate data preparation and modeling techniques that allow genomics researchers to take advantage of transfer learning to increase the utility of limited clinical datasets. We show that the performance of a non-pretrained 'naive' model can be improved by 7.8% by transferring learning from a highly performant model trained on GTEx data to solve a similar problem.