OneData for H3ABioNet

onedata
datamanagement

(Bruce Becker) #1

Continuing the discussion from EGI Conference / Indigo Summit 2017:

Hi @PetervanHeusden_73bd - first of all, I realise that we might need to start a new category here to discuss the infrastructure needs of research groups, but that's a topic for a different time. I'd like to follow up on your interest in using OneData. This was one of the components we used in the e-Research Hackfests, so we have a bit of experience amongst colleagues in Sci-GaIA when it comes to integrating it into scientific workflows.

Just for some reference for others who may arrive here later: OneData is a kind of "overlay" on existing storage, bringing storage management into research communities' hands. To use it, you store your data with a provider and attach the providers to a zone, which takes care of authentication, movement, and so on. In my opinion, though, the really great thing about OneData is the API.
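
To give a feel for that API, here is a minimal sketch of listing a space's contents over a OneProvider's REST interface. The host is made up, and the exact path (`/api/v3/oneprovider/files/...`) and the `X-Auth-Token` header are assumptions to check against the docs of whatever OneData version gets deployed.

```python
# Minimal sketch: list the contents of a space via a OneProvider REST API.
# Host, path and auth header are assumptions - verify against the OneData docs.
import requests

ONEPROVIDER = "https://oneprovider.example.org"  # made-up endpoint
TOKEN = "..."                                    # access token from the zone

resp = requests.get(
    f"{ONEPROVIDER}/api/v3/oneprovider/files/MySpace",
    headers={"X-Auth-Token": TOKEN},
    timeout=30,
)
resp.raise_for_status()
for entry in resp.json():
    print(entry)
```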

If H3ABioNet wanted to use OneData to serve and move data to researchers' applications, we could look into a small side project to do this. We would need some sites to provide the actual storage, wherever it is, and we would need to either set up a zone ourselves or join an existing one.

From my side, I would just like to map out a little what the endpoints might look like. I would obviously like to include existing storage at grid sites, without disrupting them, and also include storage from DIRISA and ARC where possible.


(Timothy John Carr) #2

Hi Bruce,

How different is OneData from iRODS? The two seem very similar. One feature that makes iRODS shine is its complex rules engine for automating data movement and using micro-services as part of the pipeline. I will give OneData a go to understand it and get a good feel for it.

Cheers
Tim


(Bruce Becker) #3

Hey @TimothyJohnCarr_925f - actually, I think they are quite different, though with some complementarities: they may be able to do the same thing in some areas (automating data movement), but complement each other in others (for example, using iRODS as a backend to a OneProvider).

The real issue is what H3ABioNet wants and needs... at least in this particular discussion. I've been holding off on OneData so far, waiting for a mature enough discussion like this one to come along before having a serious chat about it :slight_smile:

If this is the first time you're hearing about OneData, have a look at the video lectures we recorded during the Hackfest - https://www.youtube.com/playlist?list=PLRNChYjPMFFtuXcTUAQvLtln1Yk2tLQxe - and take it from there...


(Peter van Heusden) #4

There are similarities between OneData and iRODS. Basically, iRODS uses a PostgreSQL database for metadata storage to complement the data storage, and there's a rule engine to do things like manage the number of replicas. An iRODS deployment is typically institution-level, with an authentication layer that controls access to things. OneData, by contrast, is designed from the ground up for data distribution at cross-institutional scale. Its metadata is stored in Couchbase, an eventually-consistent distributed database, and that eventual consistency carries across to OneData as a whole: there's no locking, for instance.
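
To make the iRODS side concrete, here is a minimal sketch using python-irodsclient to attach and read back AVU metadata (the attribute-value-unit triples stored in the PostgreSQL catalogue). The host, zone, credentials, path and attribute names are all placeholders.

```python
from irods.session import iRODSSession

# All connection details and paths below are placeholders for illustration
with iRODSSession(host="irods.example.org", port=1247,
                  user="alice", password="secret", zone="tempZone") as session:
    obj = session.data_objects.get("/tempZone/home/alice/reads.fastq")

    # AVU metadata lives in the iCAT (PostgreSQL) catalogue, alongside
    # the rule engine that manages things like replica counts
    obj.metadata.add("sample_id", "H3A-0042")

    for avu in obj.metadata.items():
        print(avu.name, avu.value, avu.units)
```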

The components of OneData are, as Bruce says, a OneProvider that provides the actual storage (on Lustre, Ceph, S3, Swift, etc.). Multiple OneProviders can form part of a Space, which is basically your filesystem. Then the OneZone links to various authentication providers and is basically the top-level interface: you log in to OneZone, create a space, and add providers to the space. There is a rule interface that you can write against, but it is emphasised less than iRODS' rules engine.
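
A rough sketch of that flow against the OneZone REST interface, purely for illustration: the host is made up, and the `/api/v3/onezone/...` paths and payload shape are assumptions to verify against the REST docs of the OneData release in use.

```python
import requests

ONEZONE = "https://onezone.example.org"   # made-up host
HEADERS = {"X-Auth-Token": "..."}         # token obtained after OneZone login

# Create a space - the user-facing "filesystem" that providers will support
r = requests.post(f"{ONEZONE}/api/v3/onezone/user/spaces",
                  json={"name": "h3abionet"}, headers=HEADERS, timeout=30)
r.raise_for_status()

# List the spaces this user belongs to
r = requests.get(f"{ONEZONE}/api/v3/onezone/user/spaces",
                 headers=HEADERS, timeout=30)
print(r.json())
```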

The design of OneData is very much clustered / shared-nothing.

So the thing that is replicated across a space by default is the metadata view - both filesystem metadata and any extended metadata you choose to add. Data, by default, is replicated on read. Remember, there is no locking, so if two remote users hit the same file blocks at the same time, it's a problem. In terms of resources, the Couchbase install is I/O-heavy and is apparently best backed by SSD. For replication they use their own protocol - apparently fast, but not as fast as GridFTP.
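
As a small illustration of the extended-metadata side, here is a sketch that treats custom metadata as POSIX extended attributes through a oneclient FUSE mount. The mount point, file and attribute name are made up, and whether a given deployment exposes metadata as xattrs is an assumption to verify (the calls are also Linux-only).

```python
import os

# Made-up mount point and file on a oneclient FUSE mount
path = "/mnt/onedata/MySpace/sample.vcf"

# Attach a piece of extended metadata (the attribute name is invented)
os.setxattr(path, "user.h3a.sample_id", b"H3A-0042")

# Read it back: on OneData the metadata view is replicated across the space
# by default, while the file's blocks are only replicated on read
print(os.getxattr(path, "user.h3a.sample_id"))
```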


(Bruce Becker) #5

@PetervanHeusden_73bd, all

I wonder if there has been any movement on this topic?

Apart from the discussion about performance and other technical matters, I was wondering whether we have any takers in the H3A community who would use these services. Perhaps you can invite them to participate in this topic?

Maybe after some discussion we can have another round? I can set a reminder for whenever your next meeting will be.

