Semantics For Big Data Integration

November 2016 - November 2019
69.500 €
Funding organization: Italian Ministry of Education, University and Research

Person(s) in charge: 
Executive summary: 

Knowledge Graphs (KGs), which encode information in the form of entities and relationships, are gaining attention in several areas of computer science, from the improvement of search engines to the development of intelligent systems.
On the one hand, since 2012 Google has exploited its KG (currently known as Knowledge Vault) to provide better search results, including the analysis of natural language questions.
On the other hand, personal assistants such as Apple Siri, Microsoft Cortana, and Amazon Alexa employ high-quality KGs to improve their services and their interaction with users. KGs are able to harmonize the variety dimension of Big Data, standardizing concepts, data types, and relations among data by means of domain ontologies. The semantic integration of such diverse and scattered data from different sources is a crucial step to maximize the value of information.
The value of data published in high-quality KGs, in terms of semantic accuracy, is demonstrated by the increasing use of such information to train Machine Learning (ML) algorithms, including those based on Neural Networks (NNs) to model natural language data.


The integration of the vast amount of scattered data available on the Web in different formats (HTML tables, CSV and JSON files and, more generally, information retrieved through API services) is a key ingredient to build large-scale KGs.
In Semantic Web research, a common approach to combining information from multiple heterogeneous sources exploits domain ontologies to produce a semantic description of data sources through a process called semantic mapping (SM).

This process builds a bridge between the attributes of a specific data source (for instance, the fields of a table) and the data types and relationships described by a domain ontology. Considering the variety of data released on the Web, the manual generation of SMs requires significant effort and expertise and, although desirable, the automation of this task is currently a challenging problem.
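As a minimal sketch of what a semantic mapping looks like in practice, the snippet below pairs the columns of a toy table with ontology terms. All column names and ontology URIs are invented for illustration; a real SM would be produced against an actual domain ontology.

```python
# Hypothetical sketch of a semantic mapping (SM): each attribute of a tabular
# source is linked to a class or property of a domain ontology.
# The column names and ontology namespaces below are invented for the example.

FOAF = "http://xmlns.com/foaf/0.1/"
SCHEMA = "http://schema.org/"

def build_semantic_mapping(columns, annotations):
    """Pair each source attribute with an ontology term.

    Columns without an annotation stay mapped to None, signalling that a
    human expert (or an automatic SM system) still has to resolve them.
    """
    return {col: annotations.get(col) for col in columns}

# A toy table about people and their employers.
columns = ["full_name", "birth_date", "employer"]
annotations = {
    "full_name": FOAF + "name",
    "birth_date": SCHEMA + "birthDate",
    "employer": SCHEMA + "worksFor",
}

sm = build_semantic_mapping(columns, annotations)
```

Automating SM generation amounts to filling in the `annotations` dictionary automatically, which is exactly the step where the manual effort and expertise mentioned above are currently required.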


Automatic SM is a well-known research problem in the field of ontology-based data integration systems. A number of systems to support SM generation have been developed and, consequently, generic and effective benchmarks for their evaluation are also available.
Recently, Neural Language Models (NLMs) have been employed to assign to the concepts mentioned within a KG a distributed vector representation (embedding) that stores semantic and syntactic information. Nevertheless, there is still room to explore these techniques for semantic data integration.
For these reasons, the research objective is twofold: (i) the development of an NLM-based approach to generate SMs in an automatic (or semi-automatic) way; (ii) the comparison of such an approach with existing techniques and its evaluation in terms of semantic accuracy.


During the reporting period, we have explored an approach that trains NLMs with SPARQL queries performed on KGs, in order to reconstruct the semantics of data sources. This approach is described in the article entitled “Training Neural Language Models with SPARQL queries for Semi-Automatic Semantic Mapping”, which will be presented at Semantics 2018, a leading European conference on Semantic Technologies and AI.
On the one hand, this technique will be used to automate the construction process and to improve the semantic accuracy of the KG (see 3.6 Public Contracts).
On the other hand, it will be exploited to improve the semantic classification service provided by the TellMeFirst tool developed by the Nexa Center.
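The core idea of training language models on KG data can be sketched as follows. The snippet linearizes (subject, predicate, object) rows, of the kind returned by a SPARQL SELECT query, into token sequences on which a word-embedding model could be trained; the triples and prefixes are invented stand-ins, and a real pipeline would query a live endpoint and pass the sequences to an embedding library such as word2vec.

```python
# Hypothetical sketch: turn triple rows (as returned by a SPARQL query) into
# token sequences ("sentences") suitable for training a neural language model.
# The triples below are toy data, not real query results.

def triples_to_sequences(triples):
    """Linearize each (subject, predicate, object) triple into three tokens."""
    return [[s, p, o] for (s, p, o) in triples]

def vocabulary(sequences):
    """Collect the distinct tokens the model would learn embeddings for."""
    return {token for seq in sequences for token in seq}

# Toy rows standing in for the results of a SPARQL SELECT over DBpedia-like data.
triples = [
    ("dbr:Turin", "dbo:country", "dbr:Italy"),
    ("dbr:Turin", "dbo:region", "dbr:Piedmont"),
    ("dbr:Milan", "dbo:country", "dbr:Italy"),
]

corpus = triples_to_sequences(triples)
vocab = vocabulary(corpus)
```

Entities and properties that co-occur in such sequences end up with nearby embeddings, which is what makes the learned vectors usable to suggest ontology terms for the attributes of a new data source.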

  • Large-scale (and cloud-based) data management systems for storing and processing (RDF) information
  • Related Publications:
    Giuseppe Futia, Alessio Melandri, Antonio Vetrò, Federico Morando, and Juan Carlos De Martin
    14th European Semantic Web Conference, 28 May – 1 June 2017
    Giuseppe Futia, Federico Morando, Alessio Melandri, Lorenzo Canova, Francesco Ruggiero
    Third Workshop on Legal Knowledge and the Semantic Web (LK&SW-2016), 19 November 2016

    Project news can be found on the following channel: GitHub
