Documents

The dataset is gathered from open public data sources in the Health Sciences innovation area. It consists of three document corpora, which can be downloaded here:

  • Projects:
    • A corpus of 2607 granted innovation projects in Health Sciences, taken from ISCIII (the Spanish equivalent of the US NIH) and funded by FIS (Fondo de Investigación en Salud).
    • For each project, the title and abstract (both in Spanish) will be provided.
  • Publications: a corpus of scientific publications taken from Scielo, a collection of Ibero-American journals about Health Sciences (Neves, 2016). The title and abstract will be provided for each paper.
  • Patents: a corpus of granted patent applications, taken from the web service of the Spanish patent office (OEPM, Oficina Española de Patentes y Marcas).

Each corpus will be split into two sub-collections, one for training and one for the evaluation of the subtasks.

Reference graphs

Participants should compute similarity graphs that provide a similarity value for every pair of documents from the corpora.
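As an illustration of what such a graph could look like in memory, the following minimal sketch maps every unordered pair of document identifiers to a similarity value. The function names (token_overlap, similarity_graph), the dictionary layout and the token-overlap measure are assumptions made for this example only; the task description does not prescribe a similarity method or a submission format.

```python
from itertools import combinations


def token_overlap(text_a, text_b):
    """Jaccard overlap of lowercase tokens; a placeholder, not a task baseline."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity_graph(docs, similarity=token_overlap):
    """Map each unordered pair of document ids to a similarity value.

    `docs` is a hypothetical dict {document id: text}, e.g. title plus abstract.
    """
    return {
        (id_a, id_b): similarity(docs[id_a], docs[id_b])
        for id_a, id_b in combinations(sorted(docs), 2)
    }
```

Any pairwise similarity function can be passed in place of token_overlap; the graph has one entry per pair, so its size grows quadratically with the number of documents.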

The evaluation of the subtasks will be based on comparing these similarity graphs with a set of reference graphs.

We have computed the reference graphs from metadata available in the original corpora: citations, keywords and/or standard categories. These metadata were produced directly by authors and experts with knowledge of the documents and the domain, a task that would be difficult for a non-expert annotator.

Reference graphs computed from the training data are provided to participants. Reference graphs computed from the evaluation data will be used to evaluate the task.

The similarity values in the reference graphs have been computed as follows:

  • Subtask IberRDI-U: similarity between homogeneous documents:
    1. Projects: the reference similarity measure is based on the available keywords. Specifically, for each pair of projects we calculate Kessler’s similarity (Gipp, 2014) as the number of common keywords divided by the square root of the product of the number of keywords in each project.
    2. Publications: the reference similarity measure is based on common citations. Specifically, we will calculate Kessler’s similarity between papers as the number of common out-citations divided by the square root of the product of the number of citations included in each paper.
    3. Patents: the reference similarity is based on the available IPC classification codes, taking into account the multilabel structure and hierarchical nature of the IPC system (the quality of CPC codes as an alternative for document similarity will also be examined). More specifically, the similarity between each pair of patents will be computed as the average number of hops in the IPC tree over all pairs of IPC codes assigned to the two documents.
  • Subtask IberRDI-H: similarity between heterogeneous documents:
    1. Projects vs Publications: since keywords are available for both corpora, we will use the same keyword-based reference metric as in subtask IberRDI-U to compare projects and publications.
    2. Publications vs Patents: citation information is available for the patent data, including citations to scientific papers. It will therefore be possible to calculate the reference similarities between items from both corpora using the same out-citation scheme proposed for subtask IberRDI-U.
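
The reference measures described in the list above can be sketched as follows. This is a minimal illustration, not the organizers' implementation. The function names are hypothetical. Returning 0 for documents without metadata is an assumption, as is the simplified five-level IPC hierarchy (section, class, subclass, main group, subgroup), which ignores the finer nesting among subgroups. The text does not say how the average hop count is turned into a similarity score, so the sketch returns the average hop count itself, where lower values mean closer documents.

```python
from itertools import product
from math import sqrt


def coupling_similarity(items_a, items_b):
    """Shared items over the square root of the product of the two set sizes.

    Items can be keywords (projects) or out-citations (publications).
    """
    a, b = set(items_a), set(items_b)
    if not a or not b:
        return 0.0  # assumption: documents without metadata get similarity 0
    return len(a & b) / sqrt(len(a) * len(b))


def ipc_path(code):
    """Root-to-node path for an IPC code written like 'A61K 31/10'.

    Assumed levels: section, class, subclass, main group, subgroup.
    A main group ('.../00') stops at the fourth level.
    """
    symbol, _, group = code.partition(" ")
    main_group, _, subgroup = group.partition("/")
    path = [symbol[:1], symbol[:3], symbol[:4], f"{symbol[:4]} {main_group}"]
    if subgroup != "00":
        path.append(code)
    return path


def ipc_hops(code_a, code_b):
    """Number of tree edges between two IPC codes via their deepest common ancestor."""
    path_a, path_b = ipc_path(code_a), ipc_path(code_b)
    common = 0
    for node_a, node_b in zip(path_a, path_b):
        if node_a != node_b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


def mean_ipc_hops(codes_a, codes_b):
    """Average hop count over all pairs of codes, one taken from each patent.

    Both code lists are assumed to be non-empty.
    """
    pairs = list(product(codes_a, codes_b))
    return sum(ipc_hops(a, b) for a, b in pairs) / len(pairs)
```

For example, two projects with four keywords each and one keyword in common get a coupling similarity of 1 / sqrt(4 * 4) = 0.25, and the same function applies unchanged to lists of cited references.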