

Large-scale annotated image datasets like ImageNet and CIFAR-10 have been essential for developing and testing sophisticated new machine learning algorithms for natural vision tasks. Such datasets enable the development of neural networks that make the visual discriminations humans perform in everyday activities, e.g., distinguishing classes of vehicles. An emerging field, computational pathology, applies such machine learning algorithms to the highly specialized vision task of diagnosing cancer or other diseases from pathology images. Importantly, labeling pathology images requires pathologists with decades of training, and because of the demands on pathologists' time (e.g., clinical service), obtaining a large annotated dataset of pathology images for supervised learning is difficult. To facilitate advances in computational pathology on a scale similar to those achieved in natural vision tasks with ImageNet, we leverage the power of social media. Pathologists worldwide share annotated pathology images on Twitter, which together provide thousands of diverse pathology images spanning many sub-disciplines. From Twitter, we assembled a dataset of 2,746 images from 1,576 tweets by 13 pathologists from 8 countries; each tweet includes both images and text commentary. To demonstrate the utility of these data for computational pathology, we apply machine learning to our new dataset to test whether we can accurately identify different stains and discriminate between different tissues. Using a Random Forest, we report (i) 0.959 ± 0.013 Area Under the Receiver Operating Characteristic curve [AUROC] when identifying single-panel human hematoxylin and eosin [H&E] stained slides that are not overdrawn and (ii) 0.996 ± 0.004 AUROC when distinguishing H&E from immunohistochemistry [IHC] stained microscopy images.
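The binary stain-identification setup described above (e.g., H&E vs. IHC) can be sketched as a Random Forest scored by cross-validated AUROC. This is an illustrative sketch, not the authors' pipeline: image feature extraction is assumed away and replaced with synthetic features, and all names and parameters here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-in for image features (e.g., color histograms that could
# separate pink/purple H&E from brown IHC stains).
n, d = 400, 16
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n)   # toy labels: 0 = H&E, 1 = IHC
X[y == 1, 0] += 1.5              # inject a separable signal into one feature

clf = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predicted probabilities, then a single AUROC over all samples.
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
auroc = roc_auc_score(y, proba)
print(f"cross-validated AUROC: {auroc:.3f}")
```

Cross-validated out-of-fold probabilities avoid the optimistic bias of scoring on training data; with real features extracted from the tweeted images, the same scoring code would apply unchanged.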
Moreover, we distinguish all pairs of breast, dermatological, gastrointestinal, genitourinary, and gynecological [gyn] pathology tissue types, with mean AUROC for any pairwise comparison ranging from 0.771 to 0.879 (0.815 to 0.879 if gyn is excluded). We report 0.815 ± 0.054 AUROC when all five tissue types are considered in a single multiclass classification task. Our goal is to make this large-scale annotated dataset publicly available so that researchers worldwide can develop, test, and compare their machine learning methods, an important step toward advancing the field of computational pathology.
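The five-way tissue-type task can be sketched the same way, using a one-vs-rest macro-averaged AUROC as the multiclass score. Again a hedged sketch under synthetic features; the class names match the abstract, but the signal injection and all parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(1)
tissues = ["breast", "dermatological", "gastrointestinal",
           "genitourinary", "gynecological"]
n_per, d = 80, 16

# Build a toy five-class dataset: each tissue type gets a class-specific
# signal in a different synthetic feature.
X_parts, y_parts = [], []
for i, _name in enumerate(tissues):
    Xi = rng.normal(size=(n_per, d))
    Xi[:, i] += 1.5
    X_parts.append(Xi)
    y_parts.append(np.full(n_per, i))
X = np.vstack(X_parts)
y = np.concatenate(y_parts)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# (n_samples, n_classes) out-of-fold probabilities, scored one-vs-rest.
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")
macro_auroc = roc_auc_score(y, proba, multi_class="ovr", average="macro")
print(f"macro one-vs-rest AUROC: {macro_auroc:.3f}")
```

The macro average weights each tissue type equally, so a class that is hard to separate (as gyn appears to be in the reported pairwise results) pulls the overall score down visibly.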

