D4.4


D4.4 Overview of external datasets, strategy of access methods, and implications on the portal architecture

Due April 2017, responsibility STFC. Part of:

Task 4.3: Programmatic Access to datasets

Leader: STFC. Participants: LUNA, INFN.

Many services rely on existing external datasets. The infrastructure will have to be capable of leveraging this external data upon users’ request. The task will review the relevant datasets to be made available, and it will define the architecture and appropriate interfaces to access them, including possible strategies for making “caching” copies of the data. It will be built on the metadata services to be offered in WP6. Together with T4.4, security issues (authorization and user identity delegation in particular) will be addressed. Unlike the other tasks of WP4, T4.3 brings functionality that is not widely present in current portal solutions. Therefore, it starts later (year 2) in order to build on the experience and intermediate outcomes of WP6.

Relevant datasets:

  • PDB
  • PDB-REDO
  • Experimental data at synchrotrons
  • Experimental data at EM centres
  • Experimental data at other facilities
  • Data linked from publications
  • EMDB and EMPIAR
  • BMRB
  • MyTardis
  • SBGrid
  • Uniprot

Strategies

File system level

  • Use a driver to mount a dataset repository directly on the custom VM. The dataset is then available for analysis at file system level, and caching is managed by the driver.
    • Example: for WebDAV, the mount.davfs driver does this job.
    • Other FUSE-based drivers also give access at file system level.
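
As a minimal sketch of the WebDAV case, the helper below only builds the argument list for a davfs2 mount; it does not run it, since mounting normally needs root privileges or a matching fstab entry. The function name, the example URL and the mount point are hypothetical, and the read-only default is an assumption about how a dataset repository would typically be attached.

```python
import shlex
from typing import List


def davfs_mount_command(url: str, mount_point: str, read_only: bool = True) -> List[str]:
    """Build the argument list for mounting a WebDAV share with davfs2.

    The command is returned, not executed, so it can be inspected or
    passed to subprocess.run() by a caller with sufficient privileges.
    """
    command = ["mount", "-t", "davfs"]
    if read_only:
        # Assumption: analysis jobs only read the dataset.
        command += ["-o", "ro"]
    command += [url, mount_point]
    return command


# Hypothetical repository URL and mount point, for illustration only.
cmd = davfs_mount_command("https://example.org/webdav/dataset", "/mnt/dataset")
print(shlex.join(cmd))
```

Once mounted, analysis software sees ordinary files under the mount point, and caching is left to the driver as described above.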

API level

  • Most datasets offer an API (e.g. a REST API) for accessing the data for further analysis. This is the domain of the web application and needs to be implemented by the analysis tool.
  • A component of the West-Life project, e.g. the Virtual Folder, will have to download selected datasets and offer them at file system level from a local cache.
    • The custom VM cache is limited to the scratch disk space (about 50 GB, of which 1-8 GB is used by the installed OS and applications).
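
Because the scratch disk bounds the local cache, a downloading component needs some eviction policy. The sketch below assumes least-recently-used eviction, which is one possible choice and not a decision recorded here. It only does the bookkeeping (names and sizes); deleting the evicted files is left to the caller. The class name and the 42 GB budget (50 GB minus up to 8 GB for the OS and applications) are illustrative assumptions.

```python
from collections import OrderedDict


class BoundedCache:
    """Track cached dataset files and report which ones to evict
    (least recently used first) when a byte budget would be exceeded."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # name -> size in bytes, oldest first

    def touch(self, name: str) -> bool:
        """Mark an entry as recently used; return False if it is not cached."""
        if name not in self._entries:
            return False
        self._entries.move_to_end(name)
        return True

    def add(self, name: str, size_bytes: int) -> list:
        """Register a downloaded file; return the names evicted to make room."""
        if size_bytes > self.capacity_bytes:
            raise ValueError("file is larger than the whole cache")
        if name in self._entries:
            # Re-adding replaces the old record for the same name.
            self.used_bytes -= self._entries.pop(name)
        evicted = []
        while self.used_bytes + size_bytes > self.capacity_bytes:
            old_name, old_size = self._entries.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_name)
        self._entries[name] = size_bytes
        self.used_bytes += size_bytes
        return evicted


GB = 10**9
cache = BoundedCache(42 * GB)          # assumed usable scratch space
cache.add("dataset-a", 20 * GB)
cache.add("dataset-b", 20 * GB)
cache.touch("dataset-a")               # dataset-a was used again
print(cache.add("dataset-c", 10 * GB)) # ['dataset-b']
```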

Custom storage

  • EGI Fedcloud allows cloud storage to be created and mounted, so that the file system level strategy can be used.
    • Block storage with a capacity of up to 2-5 TB can be mounted on a custom VM instance and accessed from the VM file system.
    • Such mounted cloud storage has to be created first, and the data needs to be copied/cached from its original location.

PDB and PDB-REDO

Data from these sources will be available through the PDB REST API. Work in WP6 will provide a Web Component for use by service portals to support searching of these databases and reuse of the data in them.

In 2015 there were a total of 526,126,409 downloads from the PDB, mostly for Molecular Replacement pipelines. In addition there are synoptic studies: many papers report on work that begins by downloading the whole PDB and then runs a program that analyses all the structures to obtain generalized knowledge.

Where: In an online database
What: .pdb/.mmcif plus .mtz, tens of kB
How: Make a URL containing the accession code, or use the search API
Possible output: Usually a revised .pdb/.mmcif, e.g. after Molecular Replacement
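
The "make a URL containing the accession code" route can be sketched as a small pure function. The URL template below is an assumption for illustration (it follows the pattern of a public PDB file download service); a portal would substitute the template of whichever archive or mirror it actually uses. The check covers only the classic four-character accession codes, and the function name is hypothetical.

```python
import re

# Assumed URL template; replace with the template of the archive in use.
DEFAULT_TEMPLATE = "https://files.rcsb.org/download/{code}.{ext}"


def entry_url(accession_code: str, ext: str = "cif",
              template: str = DEFAULT_TEMPLATE) -> str:
    """Return a download URL for a classic four-character PDB accession code."""
    code = accession_code.strip().upper()
    # Classic codes: a digit 1-9 followed by three letters or digits.
    if not re.fullmatch(r"[1-9][A-Z0-9]{3}", code):
        raise ValueError("not a four-character PDB accession code: %r" % accession_code)
    return template.format(code=code, ext=ext)


print(entry_url("1abc"))          # https://files.rcsb.org/download/1ABC.cif
print(entry_url("1abc", "pdb"))   # https://files.rcsb.org/download/1ABC.pdb
```

Keeping the template as a parameter lets the same code address PDB-REDO or a local mirror without other changes.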

Experimental data at synchrotrons (mature infrastructures)

These are mature infrastructures with sophisticated data management and processing. Data collected at synchrotrons and neutron sources is often stored in iCAT repositories. TODO STFC.

Where: Usually in an iCAT repository
What: Diffraction images, 100s of 1-10 MB files
How: iCAT UI to request staging from tape
Possible output: Merged reflections, a few MB

Experimental data at EM centres

These are terabyte datasets. TODO STFC, INFN, CSIC discuss plan for EM data.

Where: eBIC, CSIC, etc.
What: Several hundred movies per day, 1.5 GB each (800 movies/day = 1.2 TB/day). A microscope can produce about 1 movie per minute, but only during acquisition time; some time is spent on setup and configuration, loading, or screening areas.
How: The key question!
Possible output: Particle images, a few MB
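
The figures above can be checked, and compared with the storage options listed under Strategies, with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it uses decimal units, takes 800 movies/day as an average rate, and compares against the approximately 50 GB scratch disk and a 5 TB block storage volume mentioned earlier.

```python
MB = 10**6
GB = 10**9
TB = 10**12

movie_size = 1500 * MB      # 1.5 GB per movie
movies_per_day = 800

daily_volume = movie_size * movies_per_day
print(daily_volume / TB)    # 1.2 (TB per day)

# Peak rate while acquiring at 1 movie per minute.
peak_rate = movie_size / 60
print(peak_rate / MB)       # 25.0 (MB per second)


def hours_to_fill(capacity_bytes: int, daily_bytes: int) -> float:
    """Hours until a store of the given capacity is full at the average daily rate."""
    return 24 * capacity_bytes / daily_bytes


print(hours_to_fill(50 * GB, daily_volume))  # 1.0
print(hours_to_fill(5 * TB, daily_volume))   # 100.0
```

At this rate the VM scratch disk holds about one hour of acquisition and a 5 TB volume about four days, which is why the access method for EM data cannot simply reuse the caching approach that works for PDB-sized files.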

Experimental data at other facilities

These often offer a USB interface. Instruct’s Data Management Policy says “storage of data is the responsibility of the User to whom it belongs ... Instruct Centres are not required to take responsibility for storing data beyond the immediate acquisition visit or the time taken for post experimental analysis if the latter is also provided by the Centre. However, Instruct Centres aspire to offer an archive to store data, especially in cases where the data volume makes this more practical that transferring the data ...” See D6.2.


Where: Experimental facility, e.g. CERM
What: Various, kB to 1 GB
How: Plug in a USB drive
Possible output: Usually reduced data. For NMR: spectra, GB.

Data linked from publications

Increasingly, journals require that experimental data is open and that a paper contains links to it. These data are often but not always in one of the repositories mentioned here. TODO We will investigate the possibility of using the Europe PMC API to find cited data.


Where: Zenodo, B2SHARE, university repository, and as above
What: Any
How: Resolve the DOI
Possible output: Any
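
Resolving a DOI starts by turning it into a URL at the doi.org resolver, which then redirects to the landing page of the repository holding the data. The sketch below covers only that first step and makes no network request; the function name is hypothetical, the list of accepted prefixes is an assumption, and the example DOI is a placeholder, not a real dataset.

```python
from urllib.parse import quote

# Prefixes commonly seen in front of a bare DOI (assumed list, not exhaustive).
_PREFIXES = ("doi:", "https://doi.org/", "http://doi.org/",
             "https://dx.doi.org/", "http://dx.doi.org/")


def doi_url(doi: str) -> str:
    """Normalise a DOI as cited in a paper into a resolver URL."""
    doi = doi.strip()
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with "10." and has a prefix/suffix separator.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError("not a DOI: %r" % doi)
    # Percent-encode characters that are unsafe in a URL path.
    return "https://doi.org/" + quote(doi, safe="/")


# Placeholder DOI for illustration only.
print(doi_url("doi:10.1234/example"))   # https://doi.org/10.1234/example
```

An HTTP client that follows redirects from this URL reaches the landing page; finding the actual data files from there depends on the repository.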

EMDB and EMPIAR

TODO discuss with CSIC

Other

TODO STFC We will investigate other data sources as resources permit.