Precision Medicine Corpus

Welcome to the new homepage of the Precision Medicine Corpus!

This is a public resource providing a manually annotated corpus and related resources for information extraction in the biomedical domain. The corpus is developed collaboratively by researchers at the Institute of Medical Information, Chinese Academy of Medical Sciences.

Corpus data (UPDATED 07.05.2019)

A version of the corpus based on Precision Medicine Ontology V1.0 is available here: liver cancer V1.0 and intestinal cancer V1.0.

Corpus format

Every ZIP archive contains two collections: the raw documents in .txt format (UTF-8 encoding) and the annotations in .ann format (text-based). Each document is named by its PubMed ID and publication year (e.g. 28581518_2017.txt).
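
As a rough illustration of this naming convention, the sketch below splits a document file name into its PubMed ID and publication year and pairs each .txt file with an .ann file. The function names are hypothetical, and the sketch assumes that an annotation file shares the stem of its document (e.g. 28581518_2017.ann), which this page does not state explicitly.

```python
from pathlib import Path


def parse_document_name(file_name):
    """Split a name such as '28581518_2017.txt' into (pubmed_id, year)."""
    stem = Path(file_name).stem            # '28581518_2017'
    pubmed_id, year = stem.split("_", 1)   # assumes a single '_' separator
    return pubmed_id, int(year)


def pair_documents(directory):
    """Yield (txt_path, ann_path) pairs from an unpacked ZIP directory.

    Assumption: the annotation file has the same stem as its document.
    """
    for txt_path in sorted(Path(directory).glob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        if ann_path.exists():
            yield txt_path, ann_path
```

Documents without a matching .ann file are skipped rather than reported, which may or may not suit a given workflow.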

All annotations follow the same basic structure: each line contains one annotation, and each annotation has an ID that appears first on the line, separated from the rest of the annotation by a single TAB character. The rest of the structure varies by annotation type.
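
The common structure described above can be sketched as a split on the first TAB character. The function name and the sample line are hypothetical; the sample is made up for illustration and is not taken from the corpus.

```python
def split_annotation_line(line):
    """Return (annotation_id, rest) for one line of an .ann file."""
    # Only the first TAB separates the ID from the rest of the annotation;
    # any later TAB characters belong to the type-specific part.
    annotation_id, rest = line.rstrip("\n").split("\t", 1)
    return annotation_id, rest


# Hypothetical line for illustration only.
annotation_id, rest = split_annotation_line("T1\tGenes 0 4\tTP53\n")
print(annotation_id)  # T1
```

The remainder returned by this function still has to be interpreted according to the annotation type.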

Examples of annotation for an entity (T1) and a relation (R1) are shown in the following.

Each entity annotation has a unique ID and is defined by type (e.g. Genes) and the span of characters containing the entity mention (represented as a "start end" offset pair).
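
A minimal sketch of reading one entity line follows. It assumes the layout ID, TAB, "TYPE START END", TAB, mention text, and that the end offset is exclusive; both are assumptions, so they should be checked against the actual .ann files. The class name, function name, sample line and sample sentence are all hypothetical.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    entity_id: str
    entity_type: str
    start: int
    end: int
    text: str


def parse_entity(line):
    """Parse one entity line under the assumed layout described above."""
    entity_id, span, text = line.rstrip("\n").split("\t")
    entity_type, start, end = span.split(" ")
    return Entity(entity_id, entity_type, int(start), int(end), text)


# Hypothetical line and document text, not taken from the corpus.
entity = parse_entity("T1\tGenes 0 4\tTP53")
document_text = "TP53 is a gene."
# With an exclusive end offset, slicing recovers the mention.
print(document_text[entity.start:entity.end] == entity.text)
```

The sketch handles only a single contiguous "start end" pair; any other span form would raise an error instead of being parsed.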

Each relation annotation is defined by its relation type and two arguments, which are commonly identified simply as Arg1 and Arg2.
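
The following sketch reads one relation line. It assumes the layout ID, TAB, "TYPE Arg1:ID Arg2:ID", where each argument refers to the ID of another annotation; this layout is an assumption and is not spelled out on this page. The function name, the relation type "AssociatedWith" and the sample line are hypothetical.

```python
def parse_relation(line):
    """Return (relation_id, relation_type, arg1, arg2) for a relation line."""
    relation_id, body = line.rstrip("\n").split("\t", 1)
    relation_type, *arguments = body.split()
    # Each argument is assumed to look like 'Arg1:T1'.
    roles = dict(argument.split(":", 1) for argument in arguments)
    return relation_id, relation_type, roles["Arg1"], roles["Arg2"]


# Hypothetical line for illustration only.
print(parse_relation("R1\tAssociatedWith Arg1:T1 Arg2:T2"))
```

Looking the arguments up by role name rather than by position keeps the sketch working if the two arguments appear in the opposite order.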

Note annotations provide a way to associate freeform text with either the document or a specific annotation. Note lines begin with the number (or "hash") sign #.
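
Note lines can be recognised by their leading # sign. The sketch below assumes the layout #ID, TAB, "TYPE TARGET", TAB, free text, and covers only notes attached to a specific annotation; the layout, the function name, the note type "AnnotatorNotes" and the sample line are assumptions made for illustration.

```python
def parse_note(line):
    """Return (note_id, note_type, target, text), or None for non-note lines."""
    if not line.startswith("#"):
        return None
    note_id, header, text = line.rstrip("\n").split("\t", 2)
    note_type, target = header.split(" ", 1)
    return note_id, note_type, target, text


# Hypothetical line for illustration only.
print(parse_note("#1\tAnnotatorNotes T1\tspan boundary uncertain"))
```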

Contact
The corpus is still under development, and there may be annotation errors in the data. If you identify any issues in the corpus data, we would like to know about them! Please address any comments and questions to Shaoping Fan: fan.shaoping@imicams.ac.cn

License
The Precision Medicine Corpus is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.