In phenotype annotations curated from the biological and medical literature, considerable human effort must be invested to select ontological classes that capture the expressivity of the original natural language descriptions, and finer annotation granularity can also entail higher computational costs for particular reasoning tasks. Do coarse annotations suffice for certain applications? Here, we measure how annotation granularity affects the statistical behavior of semantic similarity metrics. We use a randomized dataset of phenotype profiles drawn from 57,051 taxon-phenotype annotations in the Phenoscape Knowledgebase. We compared query profiles having variable proportions of matching phenotypes to subject database profiles using both pairwise and groupwise Jaccard (edge-based) and Resnik (node-based) semantic similarity metrics, and compared statistical performance for three different levels of annotation granularity: entities alone, entities plus attributes, and entities plus qualities (with implicit attributes). All four metrics examined showed more extreme values than expected by chance when approximately half the annotations matched between the query and subject profiles, with a more sudden decline for pairwise statistics and a more gradual one for the groupwise statistics. Annotation granularity had a negligible effect on the position of the threshold at which matches could be discriminated from noise. These results suggest that coarse annotations of phenotypes, at the level of entities with or without attributes, may be sufficient to identify phenotype profiles with statistically significant semantic similarity.
- Downloaded 479 times
- Download rankings, all-time:
- Site-wide: 69,618
- In bioinformatics: 6,603
- Year to date:
- Site-wide: 144,233
- Since beginning of last month:
- Site-wide: 151,849
Downloads over time
Distribution of downloads per paper, site-wide
- 27 Nov 2020: The website and API now include results pulled from medRxiv as well as bioRxiv.
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!