Earth Observations on Grid

cloud
earth-observation
grid

(Sean Murray) #1

Hi

Hopefully @StewartBernard_550c does not mind me bastardising his email about our meeting this afternoon .....

Email

There are large external (Phakisa Oceans, the Oceans and Coastal Information Management System - OCIMS) and internal projects underway to develop significant marine/aquatic earth observation (& modelling) capabilities. There is also significant new satellite capability to exploit, notably the European Sentinel series (Sentinels 3, 2, 1 in order of priority). We are looking at how best to partner with CHPC on the capabilities needed for this. These capabilities are:

  1. Acquisition (±20GB per day typically, growing as more Sentinels are launched or our capabilities mature). Data will mostly be acquired at L1 & L2, ideally through a combination of EUMETCAST-Terrestrial (an operational push service, see attached) and downloads via FTP etc. Occasional demand for large data downloads, e.g. initially or to reprocess...

  2. Processing. Routine near-real-time processing using Python and Python APIs to e.g. SNAP (ESA), SeaDAS (NASA) etc., plus our own algorithms in Python, plus the more infrequent (every 3-24 months) research/reprocessing of larger chunks of data, e.g. 10-100TB (see the sketch after this list).

  3. Storage. Storage of both raw data and intermediate & final products, typically organised hierarchically with metadata in a database. Most of the raw & intermediate products are only processed intermittently after the initial processing, with a core front-end serving dataset of ±2TB queried constantly in real time.

  4. Serving. The primary product servers for OCIMS are in Pretoria, so data will mostly just pass through, but there is some need for basic data serving here. The datasets will be freely available (logistics aside).
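To make item 2 above a bit more concrete, a single near-real-time processing step might look roughly like the sketch below. It assumes SNAP's snappy Python bindings are installed on the processing node; the operator, parameters and paths are purely illustrative, not our actual pipeline.

```python
# Minimal near-real-time processing sketch, assuming SNAP's snappy bindings are installed.
# Operator name, parameters and paths are illustrative only, not the operational pipeline.
import snappy
from snappy import ProductIO, GPF

HashMap = snappy.jpy.get_type('java.util.HashMap')

def process_scene(in_path, out_path, resolution_m='300'):
    """Read an L1/L2 Sentinel product, resample it, and write a NetCDF product."""
    product = ProductIO.readProduct(in_path)        # e.g. a Sentinel-3 OLCI scene

    params = HashMap()
    params.put('targetResolution', resolution_m)    # illustrative parameter
    resampled = GPF.createProduct('Resample', params, product)

    # Our own Python algorithms (bio-optical inversion etc.) would slot in here,
    # operating on the resampled product's bands before the write step.

    ProductIO.writeProduct(resampled, out_path, 'NetCDF4-CF')
    product.dispose()

if __name__ == '__main__':
    process_scene('/data/incoming/S3A_OL_1_EFR_example.SEN3',
                  '/data/products/S3A_OL_1_EFR_example.nc')
```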

Summary

The SAGrid capability, with Sean as contact, can be used reasonably soon, i.e. < 1 month, while the existing ACCESS server can continue to be used for immediate acquisition & processing of small volumes. The SAGrid capability includes preferential bandwidth, storage of potentially several hundred TB, and an Infiniband link to the SAGrid compute cluster, with web access to data. Very excellent. Ongoing interaction to be continued with DIRISA around longer-term capability, but this is not likely to be available within a year...

Actions:

@StewartBernard_550c - set up contact between EUMETCAST & @SeanMurray_59b6/Andy to test EUMETCAST-Terrestrial on the SAGrid site once ready. Docs attached - the first test requires giving the EUMETCAST login to a basic machine within our network. Our thinking was that it is better to test on SAGrid (rather than ACCESS), as the operational system will be located there.

@StewartBernard_550c/@SeanMurray_59b6 - contact @brucellino as SAGrid lead to help formalise and get the EO capability recognised

@SeanMurray_59b6/@StewartBernard_550c - @SeanMurray_59b6 to provide docs for SAGrid workflows to get code approved - GitHub for CODE-RADE etc.

A login node for external users: either ldap01.chpc.ac.za or one of the other Sun boxes, dedicated to this.

Some training on submitting jobs.

Questions

  • I don't suppose they need their own VO?
  • Is there one in Europe already for EUMETCAST?

Storage

  • We need to sort out the storage; currently a Sun box is connected to the 107B of storage (EMC).
  • The above storage is going to be used for a CVMFS stratum (1/2/3?) and Lustre; the 104TB must come back, but its status is unknown at the moment - it's like running around a bush.

The SAGrid third of the 1.2PB to be ordered can be used as well, but I have no idea of the timelines there.

The rest can come as and when we need it.

Hopefully that gets comments started.

Sean


EUMETSAT EO on the grid - software stack
(Stewart Bernard) #2

Thanks to Sean for this. Some further points:

  • Many similar groups around the world are facing similar problems - how to drink from the firehose of Sentinel data. Cool new satellites = cool new data volumes too... so SAGrid solutions have broader relevance than just our potential South African solution...

  • EUMETCAST is a common earth observation delivery mechanism across Europe using K-band sat comms, and in Africa using C-band sat comms. But we're looking specifically at EUMETCAST-Terrestrial for the significantly larger data volumes of 300m resolution Sentinel 3 data, which is a more specific and less commonly used solution - and we don't need to maintain a C-band dish (high overhead), EUMETSAT doesn't have to pay a high price for satellite transponder bandwidth, and we have the bandwidth anyway through SANREN etc...

  • Broader relevance to Africa is coming soon with the GMES-Africa project, which will implement services, e.g. marine & coastal, around the Sentinel data in 2017 - an analogue to the EC Copernicus programme. So the solution will need to be scaled up soon...

Thanks
Stewart


(Bruce Becker) #3

Thanks @StewartBernard_550c and @SeanMurray_59b6! This looks like a very exciting project. I've passed it on to friends at EGI for some suggestions. Will have a reply in a bit...


(Bruce Becker) #4

@SeanMurray_59b6, @StewartBernard_550c thanks very much for this thread. This seems like a very complex and mature workflow and science case, which a distributed infrastructure is very good for.

Homework

I would suggest first of all that we all do our homework - the idea is to design a relevant infrastructure for the community.

Design your own e-Infrastructure

I would say please check out the recent Design your own e-Infrastructure event held around the Digital Infrastructures for Research conference in Krakow. The "pitches" of the EUDAT, EGI, OpenAIRE, and Geant infrastructures are clearly laid out there. We can offer almost any of those services in some form or another, thanks to the interoperability between the Africa-Arabia ROC and EGI.

Identify your platform

Also, please read the service catalogue of the Indigo DataCloud - we recently ran a few hackfests wherein we pulled whatever components were needed into a final "product". Another place to look is the Sci-GaIA service catalogue.

Hack it together.

If you could initially try to identify whatever components in these lists you think you need, we can define a time and place to get together and hack a platform together to exploit the underlying infrastructure. This is pretty standard in our communities, and helps a lot to get prototypes out in the field for user testing.

I would like to suggest that you consider becoming one of the Sci-GaIA Champions.


OK, let me get to specific items here:

This DAQ problem needs some coordination from the network providers, I think. If the idea is to have a store at a central DIRISA node, then we need to know what endpoints and transport protocols are required for the data transfer. It's a secondary issue to the "grid" side of things, but it will help in determining the components of the platform.
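As a strawman for that endpoint/protocol discussion, a simple pull from an FTP endpoint into a staging area would look roughly like this (hostname, directories and credentials are hypothetical; the operational feed would be the EUMETCAST-Terrestrial push rather than a poll like this):

```python
# Strawman FTP pull into a staging area; host, directories and credentials are hypothetical.
import os
from ftplib import FTP

REMOTE_HOST = 'ftp.example.org'      # hypothetical data provider endpoint
REMOTE_DIR = '/sentinel3/olci/l2'
STAGING_DIR = '/data/staging'

def pull_new_files():
    """Fetch any remote files not already present in the local staging directory."""
    with FTP(REMOTE_HOST) as ftp:
        ftp.login()                  # anonymous here; a real feed would need credentials
        ftp.cwd(REMOTE_DIR)
        for name in ftp.nlst():
            local_path = os.path.join(STAGING_DIR, name)
            if os.path.exists(local_path):
                continue             # already staged
            with open(local_path, 'wb') as fh:
                ftp.retrbinary('RETR ' + name, fh.write)
            print('staged', name)

if __name__ == '__main__':
    pull_new_files()
```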

So a realtime processing system obviously needs to be set up in a semi-static way, and fully distributed data processing is not such a great idea. I would like to hear more about this, however, since personally I'm very interested in realtime processing, as I'm sure @SeanMurray_59b6 is, given our work on the ALICE HLT. Anyway, on to the reprocessing:

This can likely be done very efficiently using the sites on the grid. We have several options for staging data, using metadata catalogues and accessing repositories via API. See the options from EUDAT, Indigo and EGI - all of them are in principle available, but the easiest to get off the ground is the EGI stack - UMD (because it's already deployed at all the sites).

Do you guys have a metadata schema and a preferred repository? We suggest using Invenio, but there are several options here.

Do you need persistent identifiers? (Yes, you do, but the question is: do you want DataCite DOIs, or do you have your own system internal to the community?)

So, we're talking Open Access here. Again, we suggest Invenio for this. Publishing data from the grid to the repository can be done via API; we have several examples of how to do that. Invenio is OAI-PMH compliant - just look at Zenodo if you want to see it in action. @roberto_barbera can comment some more here...
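As a rough illustration of that kind of API-driven publishing, here is what a deposit against Zenodo's REST deposit API could look like using requests (token, metadata and file path are placeholders; an Invenio instance of your own would have a similar but not identical API, so check the relevant docs):

```python
# Rough sketch of publishing a product to Zenodo via its REST deposit API.
# Token, metadata and file path are placeholders; treat this as an outline only.
import requests

ZENODO_API = 'https://zenodo.org/api/deposit/depositions'
TOKEN = {'access_token': 'REPLACE_ME'}

def publish(file_path, title, description):
    # 1. Create an empty deposition
    r = requests.post(ZENODO_API, params=TOKEN, json={})
    r.raise_for_status()
    dep = r.json()

    # 2. Upload the data file into the deposition's bucket
    filename = file_path.split('/')[-1]
    with open(file_path, 'rb') as fh:
        requests.put(f"{dep['links']['bucket']}/{filename}",
                     data=fh, params=TOKEN).raise_for_status()

    # 3. Attach minimal metadata (creator name is a placeholder)
    metadata = {'metadata': {'title': title,
                             'upload_type': 'dataset',
                             'description': description,
                             'creators': [{'name': 'CSIR EO group'}]}}
    requests.put(f"{ZENODO_API}/{dep['id']}",
                 params=TOKEN, json=metadata).raise_for_status()

    # 4. Publish the deposition (this mints the DOI)
    requests.post(f"{ZENODO_API}/{dep['id']}/actions/publish",
                  params=TOKEN).raise_for_status()
```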

As for the code, I am quite happy to work with you to get that into CODE-RADE, so that you can process data everywhere. See http://www.africa-grid.org/applications/ for what we have in the repo and what state it's in, and then help us understand your stack. If it's just Python modules, we can easily put them on top of the 2.7 or 3.4 dependency tree.
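To give a feel for what that would involve, a minimal setup.py for the EO modules might look something like this (package name, dependencies and entry point are hypothetical placeholders; the actual CODE-RADE submission requirements live in its own documentation):

```python
# Hypothetical minimal setup.py for packaging the EO algorithms as an installable
# Python module; names and pins are placeholders, not the real stack.
from setuptools import setup, find_packages

setup(
    name='eo-aquatic-algos',            # hypothetical package name
    version='0.1.0',
    packages=find_packages(),
    python_requires='>=2.7',
    install_requires=[
        'numpy',                        # numerics for the inversion models
        'netCDF4',                      # reading/writing product files
    ],
    entry_points={
        'console_scripts': [
            'eo-process = eo_aquatic_algos.cli:main',  # hypothetical CLI entry point
        ],
    },
)
```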

This project is going to rule so hard!


(Stewart Bernard) #5

Hi Bruce,

Thanks - so good to get such an enthusiastic response! This has the potential to work out very, very well. I'll bring the rest of the team into the discussion and start doing the homework through the links you've provided. Once we've got a first-order feel for the components in the infrastructure/component lists, we can have a go at the hack. Becoming a Champion sounds great - maybe best post-hack, when we've got a clearer idea of how to structure things.

On some of the specifics:

DIRISA links: sounds good, but progress on the DIRISA side is slow, so we can pragmatically bank that one till next year, probably...

This is an exciting aspect because it's where the meat of the numerics lies, and also where our cool EO science gets used - bio-physical inversion models, fuzzy logic classification schemes, multi-sensor & multi-algorithm blending, interpolation schemes from MDS, etc. The same numerics plus others get applied to the re-processing, so there's some nice workflow/architectural development there...

This is important - we've used a Postgres/Django scheme in the past for the metadata (including some data summaries), with simple hierarchical file structures for the data, but we're totally open to whatever works best - remember we're EO rather than computer scientists, so we need help!
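For what it's worth, the kind of Postgres/Django metadata scheme I mean could be as simple as the sketch below (field names are illustrative only, not the schema we actually used):

```python
# Illustrative Django models for scene/product metadata over Postgres;
# field names are made up for the example, not the schema actually used.
from django.db import models

class Scene(models.Model):
    sensor = models.CharField(max_length=32)            # e.g. 'OLCI', 'MSI'
    acquired = models.DateTimeField()
    processing_level = models.CharField(max_length=8)   # e.g. 'L1', 'L2'
    path = models.CharField(max_length=512)              # location in the hierarchical store

class Product(models.Model):
    scene = models.ForeignKey(Scene, on_delete=models.CASCADE)
    algorithm = models.CharField(max_length=64)          # e.g. a bio-physical inversion
    created = models.DateTimeField(auto_now_add=True)
    summary = models.JSONField(default=dict)             # needs Django >= 3.1; otherwise use
                                                         # contrib.postgres JSONField or TextField
```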

And the Open Access is big - the EO community is very big on data democracy, and we have a strong focus on open repositories, public service platforms, empowering non-specialists to use the data, etc...

Anyway, thanks again for all the energy, more later...


(Bruce Becker) #6

Hi @StewartBernard_550c, all. After about a year of bouncing around, I have come back to this project. I had a chat with the guys in the EO unit here at Meraka about what you're working on and about the various pipelines and application stacks. I think things may be re-converging. I'd like to hear what progress has been made or what has changed on your side.

Thanks!