Description

Motivation

The Spanish National Plan for the Advancement of Language Technologies (PTL, Plan de Tecnologías del Lenguaje) is encouraging public administrations to take advantage of the high degree of digitization to develop smart public services based on the application of language technologies. Within the field of Research, Development and Innovation (RDI), large repositories of open data resources (publications, patent records, project proposals) can be processed to identify the structure and dynamics of the RDI activity in Spanish.

A key component of smart systems for the analysis of RDI production is the identification of different types of relations between documents. Beyond the metadata available in the datasets (e.g., funding entities for projects, journal categories for papers), in this task we focus on the automatic extraction of semantic similarity among RDI items based solely on the text available in the documents.

To this end, we propose a task involving the analysis of three corpora from the Health sector, which is one of the prioritized areas of the PTL (Villegas, 2017). The goal is to explore different metrics for identifying similarities between documents that could work efficiently in different circumstances.

The overall aim of IberRDI is to bring together actors across sectors from Academia, Industry (NLP and RDI policy makers and innovation evaluation agents) and Public Administration.

We expect contributions by researchers from different fields, from NLP to Machine Learning. The task will be an opportunity to test technologies ranging from text analysis and topic models (Alexander, 2015) to word or document embeddings (Kusner, 2015).

The target community is the set of universities, RDI funding institutions and policy-makers. Specialists in innovation surveillance are potential users of efficient methods to compute similarities between documents in this field.

Background

The computation of semantic similarity measurements between words, sentences, paragraphs or documents is a key component of many NLP tasks. It is also a major component of the analysis of large corpora. Tasks related to the evaluation of text similarities have been proposed in relation to text retrieval (Aslam, 2013) and plagiarism detection (Potthast, 2013; Kasprzak, 2009).

Evaluation tasks related to semantic similarity and text classification in Spanish are not new (Rosso, 2018). The design of efficient semantic similarity metrics has been the main purpose of many text analysis tasks. The SemEval workshop series has proposed several tasks on Semantic Textual Similarity between sentences (2012, 2013, 2014, 2015, 2016) and tweets (2015), in English, in Spanish, and in cross-lingual pairs (Agirre, 2016). In those tasks, similarity values are expected to be maximal in the case of meaning equivalence.

Our focus here is on document similarity, and our goal is not to identify meaning equivalence (as in plagiarism detection competitions) but thematic similarity. The present task also focuses on three types of RDI documents: project proposals, scientific papers, and patent applications.

Use cases

There are two main use cases:

Evaluators of innovative projects must assess the degree of innovation of an RDI project against the current state of the art. This requires examining other similar projects (submitted to the same or other funding bodies), analyzing related patents that could invalidate the business model and, finally, looking for related scientific publications, both to analyze the state of the art and to select possible project evaluators.

On the other hand, those directing public policies on innovation need to analyze the strengths and weaknesses of specific sectors (e.g. AI, IoT, blockchain…), compare these sectors between countries, analyze their temporal dynamics, quantify the relationships between sectors, etc.

The classifications associated with these corpora (e.g. IPC/CPC for patents, MeSH/DeCS for medical scientific publications) have several limitations: they do not have the appropriate granularity; each one classifies only one corpus; they do not show the degree to which documents belong to the classes; documents are rarely annotated with multiple classes; and the classifications are not updated often, which is a problem in the RDI sector.

For this reason, a fine-grained analysis of the textual content of a heterogeneous collection of innovation document corpora is necessary. The basis of all the use cases described above is the similarity between documents.

These corpora, in addition to metadata on authors, dates, classifications, etc., have a rich collection of links or citations between documents.

This task aims to exploit the relationships between peer documents (established by the authors of the documents themselves or by their evaluators) to find the optimal representation of the documents as well as a measure of their similarity.
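As a purely illustrative baseline, and not a method prescribed by the task, the sketch below represents each document as a TF-IDF vector and compares documents with cosine similarity. The whitespace tokenization, the smoothed IDF formula, the function names and the three toy documents are all assumptions made for the example; participants are free to use any representation and metric.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumption: naive lowercase whitespace tokenization, chosen only
    to keep the example short.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant): always strictly positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, one per document type in the task.
docs = [
    "clinical trial of a new vaccine for influenza",    # project proposal
    "influenza vaccine efficacy in a clinical trial",   # scientific paper
    "device for measuring blood glucose levels",        # patent application
]
vectors = tfidf_vectors(docs)
sim_project_paper = cosine(vectors[0], vectors[1])
sim_project_patent = cosine(vectors[0], vectors[2])
sim_paper_patent = cosine(vectors[1], vectors[2])
```

In this toy setting the project and the paper share most of their vocabulary, so their similarity is higher than that of either document with the patent. Under the same assumptions, known links or citations between documents could serve as reference pairs to check whether a given representation ranks related documents above unrelated ones.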

Public institutions and private organizations could take advantage of efficient methods to compute similarity graphs for evaluation processes.