The e-Infrastructure Knowledge Base (KB) is one of the largest existing e-Infrastructure-related digital information systems. It currently contains information, gathered both from dedicated surveys and from other web and documentary sources, for well over half of the countries in the world.

Information is presented to visitors through geographic maps and tables.

In the “country view”, for example, users can choose a continent on the map and, for each country where a marker is displayed, see the Regional Network the country is connected to, its National Research & Education Network(s), National Grid Initiative, Certification Authority and Identity Federation, and the Regional Operation Centre the country is associated with.

Besides network and e-Infrastructure-related services, the e-Infrastructure KB publishes information about more than 4,000 Open Access Document Repositories (OADRs), Open Data Repositories (DRs) and Open Educational Resources (OERs) in the world.


The Semantic Search Engine

Although it is quite useful to have a central access point to thousands of repositories and millions of documents and datasets, with both geographic and tabular information, the OADR and DR part of the KB is only a demonstrator with limited impact on scientists’ day-to-day work. To find a document or a dataset, users must know beforehand what they are looking for, and there is no way to correlate documents and data, which would actually be one of the most important facilitators. To overcome these limitations and turn the KB into a powerful research tool, the metadata related to the OADRs and DRs gathered in the KB are semantically enriched, and a search engine on the resulting linked data has been made available.

The multi-layered architecture of the Sci-GaIA Semantic Search Engine (SSE) is sketched in the figure below, where the official and de facto Semantic Web standards and technologies adopted are indicated by small logos.

Starting from the bottom of the figure, the first two components of the service are described below.

The metadata harvester is a process able to run on both Grid and Cloud infrastructures, which performs the following steps:

  • Get the address of each repository publishing an OAI-PMH standard endpoint
  • Retrieve, using the OAI-PMH repository address, the related Dublin Core-encoded metadata in XML format
  • Extract the records from the XML files and, using the Apache Jena API, transform the metadata into RDF format
  • Save the RDF files into a Virtuoso triple store according to an OWL-compliant ontology built using Protégé.
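The core of the pipeline above, retrieving Dublin Core records and turning them into RDF triples, can be sketched as follows. The OAI-PMH and Dublin Core namespaces are standard; everything else (the sample response, the `urn:` subject IRIs, the helper name) is illustrative, and a real harvester would fetch the XML over HTTP, follow resumption tokens, and load the output into Virtuoso rather than print it.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses carrying Dublin Core metadata (standard).
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A minimal ListRecords response, standing in for XML fetched from a repository.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample paper</dc:title>
          <dc:creator>A. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def records_to_ntriples(xml_text):
    """Turn Dublin Core records into simple RDF triples in N-Triples syntax."""
    root = ET.fromstring(xml_text)
    triples = []
    for record in root.findall(".//oai:record", NS):
        ident = record.findtext(".//oai:identifier", namespaces=NS)
        subject = "<urn:%s>" % ident  # illustrative subject IRI scheme
        for field in ("title", "creator"):
            for el in record.findall(".//dc:%s" % field, NS):
                pred = "<http://purl.org/dc/elements/1.1/%s>" % field
                triples.append('%s %s "%s" .' % (subject, pred, el.text))
    return triples

triples = records_to_ntriples(SAMPLE)
print(triples[0])
# → <urn:oai:example.org:1> <http://purl.org/dc/elements/1.1/title> "Sample paper" .
```

The KB's actual harvester uses the Apache Jena API for this transformation; the string-based version above only shows the shape of the mapping.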

Each Resource Description Framework (RDF) file retrieved and saved in a Virtuoso-enabled triple store is mapped onto a Virtuoso Graph that contains the ontology expressly developed for the search engine, shown in the figure below for the sake of completeness.

The ontology, built using Dublin Core and FOAF standards, consists of:

  • Classes that describe the general concepts of the domain: Resource, Author, Organisation, Repository and Dataset (where Resource is a given open access document)
  • Object properties that describe the relationships among the ontology classes; the ontology developed for the service described in this paper has several specific properties such as hasAuthor (i.e., the relation between Resources and Authors) and hasDataSet (i.e., the relation between Resources and Datasets)
  • Data properties (or attributes) that contain the characteristics, or parameters, of the classes.
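As a rough illustration of how the object properties listed above connect individuals, the toy triple set below links a Resource to an Author and a Dataset. The class and property names come from the ontology description; the instance names (`doc1`, `alice`, `ds1`) are invented for the example.

```python
# Toy in-memory triple store illustrating the ontology's object properties.
# Instance names are invented; class and property names follow the text above.
triples = [
    ("doc1",  "rdf:type",   "Resource"),
    ("alice", "rdf:type",   "Author"),
    ("ds1",   "rdf:type",   "Dataset"),
    ("doc1",  "hasAuthor",  "alice"),  # relation between Resources and Authors
    ("doc1",  "hasDataSet", "ds1"),    # relation between Resources and Datasets
]

def objects(subject, predicate):
    """Return all objects linked to `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("doc1", "hasAuthor"))   # → ['alice']
print(objects("doc1", "hasDataSet"))  # → ['ds1']
```

In the real service these triples live in the Virtuoso store and are traversed with SPARQL rather than list comprehensions.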

The third, and highest-level, component is the search engine itself, which translates human-understandable searches into SPARQL queries. Searches can be made by title, subject, author, type, format and publisher, and in more than 100 different languages.
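A title search of this kind might be translated into a SPARQL query along the following lines. The `dc:title` property is standard Dublin Core, but the query shape is an assumption, since the engine's actual query templates are not reproduced in the text.

```python
def title_query(term, lang="en"):
    """Build a SPARQL query for resources whose title contains `term`
    in the given language. Illustrative sketch, not the engine's real template."""
    return (
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
        "SELECT ?resource ?title WHERE {\n"
        "  ?resource dc:title ?title .\n"
        '  FILTER(CONTAINS(LCASE(STR(?title)), "%s") && LANG(?title) = "%s")\n'
        "}" % (term.lower(), lang)
    )

print(title_query("malaria", "en"))
```

The `LANG(?title)` filter is what would let the same template serve searches in any of the 100+ languages mentioned above.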

Search results are ranked according to the Ranking Web of Repositories and, for each result, links are provided to its citations on Google Scholar, if any, to its altmetrics (via APIs), if any, and to a graphic representation of the corresponding Linked Data (by means of Lodlive).

The Semantic Search Engine can also be used programmatically, through a RESTful API created for this purpose.
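Programmatic access could then look like the sketch below. The endpoint URL, parameter names and response shape are all hypothetical, since the API specification is not reproduced here; consult the actual API documentation before use.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint -- placeholder, not the real service URL.
BASE_URL = "https://example.org/sse/api/search"

def build_request_url(field, term, limit=10):
    """Compose a search request URL (hypothetical parameter names)."""
    return "%s?%s" % (BASE_URL, urlencode({"field": field, "q": term, "limit": limit}))

def parse_results(body):
    """Extract titles from a JSON body assumed shaped like
    {"results": [{"title": ...}, ...]}."""
    return [r["title"] for r in json.loads(body).get("results", [])]

url = build_request_url("author", "Smith")
sample_body = '{"results": [{"title": "Example record"}]}'
print(parse_results(sample_body))  # → ['Example record']
```

A real client would issue an HTTP GET against `url` and feed the response body to `parse_results`; the sample body stands in for that response.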