
In the last 30 years or so, scientific computing has steadily evolved from centralized to more distributed environments. This has been driven by the concurrent availability of cost-effective “Commercial Off-The-Shelf” (COTS) components and the decreasing cost of Local Area Networks. In the first half of the 1990s, cluster computing established itself for High Throughput Computing (HTC) applications, and “farms” of computers with many-core processors, interconnected by low-latency networks, became the norm. This eventually extended to the domain of High Performance Computing (HPC), to the extent that about 80% of the Top500 machines built in the last 5 years are based on a cluster architecture [1].

Furthermore, the steep decrease in the cost of high-bandwidth Wide Area Networks has in recent years fostered the spread and uptake of the Grid Computing paradigm, and the distributed computing ecosystem has become even more complex with the recent emergence of Cloud Computing.

At the onset of the 21st century, all these developments led to the new concept of e-Infrastructure, defined as: “an environment where research resources (hardware, software and content) can be readily shared and accessed where necessary to promote better and more effective research; such environment integrate hard-, soft- and middle-ware components, networks, data repositories, and all sorts of support enabling virtual research collaborations to flourish globally” [2].

Indeed, e-Infrastructures have been built over several years, both in Europe and in the rest of the world, to support diverse multi- and inter-disciplinary Virtual Research Communities (VRCs) [3]. There is a shared vision for 2020 that e-Infrastructures will allow scientists across the world to do better (and faster) research, irrespective of where they are and of the paradigm(s) adopted to build them.

E-Infrastructure components can be key platforms to support the Scientific Method [4], the “knowledge path” followed in many aspects by scientists since the time of Galileo Galilei (see figure below).

With reference to the figure above, Distributed Computing and Storage Infrastructures (local HPC/HTC resources, Grids, Clouds, long-term data preservation services) are ideal both for the creation of new datasets and for the analysis of existing ones, while Data Infrastructures (including Open Access Document Repositories – OADRs – and Data Repositories – DRs) are essential to evaluate existing data and annotate them with the results of the analysis of new data produced by experiments and/or simulations. Last but not least, Semantic Web-based enrichment of data is key to correlating documents and data, making it easy for scientists to discover new knowledge and to engage in a more robust scholarly discourse.

One of the cornerstones of the Scientific Method, and a key driver along the knowledge path, is the reproducibility of science. In recent years, the issue of the reproducibility of scientific results has attracted increasing attention worldwide, both inside and outside scholarly communities, as a recent Special Edition of Nature [5] testifies. As striking examples, Begley and Ellis [6] could not reproduce the results of 47 out of 53 “landmark” publications in cancer research, and Fang et al. [7] identified more than 2,000 articles listed in PubMed [8] as retracted since the first identified retraction in 1977.

The problem goes well beyond the topic of cancer. In March 2012 a committee of the US National Academy of Sciences heard testimony that the number of scientific papers that had to be retracted increased more than tenfold over the last decade, while the number of journal articles published rose only 44 percent over the same period [9]. At the current rate, by 2045 as many papers will be retracted as are published.
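The extrapolation behind that figure can be sketched numerically. The growth rates below come from the testimony cited in [9]; the absolute 2012 baseline counts are illustrative assumptions, not data from the source, and the crossing year depends strongly on them:

```python
import math

def crossing_year(start_year, papers, retractions,
                  papers_growth_per_decade, retractions_growth_per_decade):
    """Year at which exponentially growing retractions catch up with publications.

    papers(t)     = papers     * papers_growth_per_decade     ** (t / 10)
    retractions(t) = retractions * retractions_growth_per_decade ** (t / 10)
    The two are equal when (g_r / g_p) ** (t / 10) = papers / retractions.
    """
    t = 10 * math.log(papers / retractions) / math.log(
        retractions_growth_per_decade / papers_growth_per_decade)
    return start_year + t

# Assumed 2012 baselines (illustrative): ~1.4 million papers and ~400
# retractions per year. Growth from [9]: retractions up ~10x per decade,
# publications up ~44% per decade.
year = crossing_year(2012, 1_400_000, 400, 1.44, 10.0)
```

With these assumed baselines the two curves cross around the mid-2050s; smaller publication counts or larger retraction counts move the crossing point earlier, towards the 2045 figure quoted above.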

In light of these findings, researchers and other scholarly observers have recently been proposing and conducting initiatives to help the scientific community address the issue of reproducibility. Some of the most interesting ones are gathered under the umbrella of the Reproducibility Initiative [10], jointly started by the lab-services start-up Science Exchange [11] and the open-access journal PLoS ONE [12]. Scientists can submit to Science Exchange studies that they would like to see replicated. An independent scientific advisory board selects studies for replication, and service providers are then chosen at random to conduct the experiments. The results are returned to the original investigators, who can then publish them in a special issue of PLoS ONE; studies that are successfully replicated are awarded a “certificate of reproducibility”.

Although the Science Exchange initiative is commendable, it is limited to the health domain, authors have to pay to have their results reproduced, and the choice of studies to be reproduced rests entirely with the advisory board.

Furthermore, some very important considerations are in order.

    1. As pointed out by C. Drummond [13], reproducibility and replicability are different concepts and “replicability is not reproducibility”.
    2. The “re-’s” of the Scientific Method go beyond re-plicability and re-producibility and indeed include both re-peatability and re-usability.
    3. In the last 2-3 decades, science has become increasingly computationally intensive, and computer simulations are actually “reconciling” the inductive and deductive approaches of the Scientific Method. In particular:
  • “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures” [14].
  • “Scientific communication relies on evidence that cannot be entirely included in publications, but the rise of computational science has added a new layer of inaccessibility. Although it is now accepted that data should be made available on request, the current regulations regarding the availability of software are inconsistent. We argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation. The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, but withholding code increases the chances that efforts to reproduce results will fail” [15].
  • “The publication and open exchange of knowledge and material form the backbone of scientific progress and reproducibility and are obligatory for publicly funded research. Despite increasing reliance on computing in every domain of scientific endeavor, the computer source code critical to understanding and evaluating computer programs is commonly withheld, effectively rendering these programs “black boxes” in the research work flow” [16].

For all the above reasons, real reproducibility of science should include full access to papers, datasets, data collections, algorithms, configurations, tools and applications, codes, workflows, scripts, libraries, services, system software, infrastructure, compilers, hardware, etc. To ensure all that, besides and beyond e-Science, the new concept of o-Science (Open Science – also referred to as Open Knowledge) is emerging.
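As a minimal illustration of what such "full access" can mean in practice (the file name, field names and toy computation below are hypothetical, not components of any actual platform), a computational result can be published together with a machine-readable record of the seed and environment that produced it, so that anyone can re-run the analysis and obtain the same numbers:

```python
import json
import platform
import random
import sys

def run_analysis(seed):
    """Toy 'experiment': a Monte Carlo estimate of pi that is exactly
    reproducible because its only source of randomness is the given seed."""
    rng = random.Random(seed)
    inside = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0
                 for _ in range(100_000))
    return 4.0 * inside / 100_000

seed = 42
result = run_analysis(seed)

# Record, alongside the result, everything needed to re-run the computation.
manifest = {
    "result": result,
    "seed": seed,
    "python_version": sys.version.split()[0],
    "platform": platform.platform(),
    "script": "analysis.py",   # hypothetical file name
}
with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)

# Re-running with the recorded seed reproduces the result exactly.
assert run_analysis(seed) == result
```

Real analyses of course also depend on library versions, input data and hardware, which is why the text above insists on releasing the complete environment and not just the final figures.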

According to a recently published seminal book [17], Open Science “refers to a scientific culture that is characterized by its openness. Scientists share results almost immediately and with a very wide audience”.

Five schools of thought on Open Science have been identified so far [18], characterised by their central assumptions, the involved stakeholder groups, their aims, and the tools and methods used to achieve and promote these aims (see the figure below). The infrastructure school is concerned with the technical infrastructure that enables emerging research practices on the Internet, for the most part software tools and applications, as well as computing networks. The infrastructure school regards Open Science as a technological challenge and focuses on the technological requirements that facilitate particular research practices, such as Grid and, more recently, Cloud Computing.

The Sci-GaIA project consortium very much supports the Open Science paradigm and has a strong focus on the application of the guidelines of the infrastructure school.

The project has deployed an Open Science Platform for re-producible and re-usable science across Europe and Africa, whose components are depicted in the various areas of the figure shown on the right.


[1] Go to http://top500.org/statistics/overtime/, select Category = Architecture, choose Type = Systems Share, and then click on Submit to generate the graph.

[2] This definition of e-Infrastructure appears in a European Commission web page: http://cordis.europa.eu/ictresults/index.cfm?ID=90825&section=news&tpl=article

[3] G. Andronico et al., “E-Infrastructures for International Cooperation”, in “Computational and Data Grids: Principles, Applications and Design” (N. Preve, Editor), IGI Global, 2011, DOI: 10.4018/978-1-61350-113-9; see also www.igi-global.com/book/computational-data-grids/51946.

[4] There are many equivalent definitions and depictions of the Scientific Method, both on the web and in textbooks. In this document we refer to http://home.badc.rl.ac.uk/lawrence/blog/2009/04/16/scientific_method, from which we have re-used the picture included in Figure 1.

[5]  www.nature.com/nature/focus/reproducibility/.

[6] C. Glenn Begley and Lee M. Ellis, “Drug development: Raise standards for preclinical cancer research”, Nature 483, 531–533 (29 March 2012), doi:10.1038/483531a.

[7] Ferric C. Fang, R. Grant Steen and Arturo Casadevall, “Misconduct accounts for the majority of retracted scientific publications”, Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 42, pp. 17028–17033, doi: 10.1073/pnas.1212247109.

[8] www.pubmed.org.

[9] www.reuters.com/article/2012/03/28/us-science-cancer-idUSBRE82R12P20120328.

[10] http://validation.scienceexchange.com.

[11] https://www.scienceexchange.com.

[12] www.plosone.org.

[13] C. Drummond, “Replicability is not reproducibility: nor is it good science”, Proc. Eval. Methods Mach. Learn. Workshop 26th ICML (2009), Montreal, Quebec, Canada. http://goo.gl/7f8WX9.

[14] Jonathan B. Buckheit and David L. Donoho, “WaveLab and Reproducible Research”, Lecture Notes in Statistics, Volume 103, 1995, pp. 55-81.

[15] Darrel C. Ince, Leslie Hatton and John Graham-Cumming, “The case for open computer programs”, Nature 482, p. 485–488 (23 February 2012), doi:10.1038/nature10836.

[16] A. Morin et al, “Shining Light into Black Boxes”, Science (13 April 2012) Vol. 336 no. 6078 pp. 159-160, DOI: 10.1126/science.1218263.

[17] “Opening Science – The Book”, DOI: 10.1007/978-3-319-00026-8. http://book.openingscience.org.

[18] Fecher, B., Friesike, S.: “Open Science: One Term, Five Schools of Thought”. A chapter of “Opening Science – The Book”, DOI: 10.1007/978-3-319-00026-8.
