Rxivist logo

GOLabeler: Improving Sequence-based Large-scale Protein Function Prediction by Learning to Rank

By Ronghui You, Zihan Zhang, Yi Xiong, Fengzhu Sun, Hiroshi Mamitsuka, Shangfeng Zhu

Posted 03 Jun 2017
bioRxiv DOI: 10.1101/145763 (published DOI: 10.1093/bioinformatics/bty130)

Motivation: Gene Ontology (GO) has been widely used to annotate functions of proteins and understand their biological roles. Currently only <1% of more than 70 million proteins in UniProtKB have experimental GO annotations, implying the strong necessity of automated function prediction (AFP) of proteins, where AFP is a hard multi-label classification problem due to one protein with a diverse number of GO terms. Most of these proteins have only sequences as input information, indicating the importance of sequence-based AFP (SAFP: sequences are the only input). Furthermore, homology-based SAFP tools are competitive in AFP competitions, while they do not necessarily work well for so-called difficult proteins, which have <60% sequence identity to proteins with annotations already. Thus, the vital and challenging problem now is to develop a method for SAFP, particularly for difficult proteins. Methods: The key of this method is to extract not only homology information but also diverse, deep-rooted information/evidence from sequence inputs and integrate them into a predictor in an efficient and also effective manner. We propose GOLabeler, which integrates five component classifiers, trained from different features, including GO term frequency, sequence alignment, amino acid trigram, domains and motifs, and biophysical properties, etc., in the framework of learning to rank (LTR), a new paradigm of machine learning, especially powerful for multi-label classification. Results: The empirical results obtained by examining GOLabeler extensively and thoroughly by using large-scale datasets revealed numerous favorable aspects of GOLabeler, including significant performance advantage over state-of-the-art AFP methods.

Download data

  • Downloaded 971 times
  • Download rankings, all-time:
    • Site-wide: 21,873
    • In bioinformatics: 2,591
  • Year to date:
    • Site-wide: 84,286
  • Since beginning of last month:
    • Site-wide: 99,722

Altmetric data


Downloads over time

Distribution of downloads per paper, site-wide


PanLingua

Sign up for the Rxivist weekly newsletter! (Click here for more details.)


News