- 1 D4.4 Overview of external datasets, strategy of access methods, and implications on the portal architecture
- 2 Task 4.3: Programmatic Access to datasets
- 3 Relevant datasets:
- 4 Strategies
- 5 PDB and PDB-REDO
- 6 Experimental data at synchrotrons (mature infrastructures)
- 7 Experimental data at EM centres
- 8 Experimental data at other facilities
- 9 Data linked from publications
- 10 EMDB and EMPIAR
- 11 Other
D4.4 Overview of external datasets, strategy of access methods, and implications on the portal architecture
Due April 2017, responsibility STFC. Part of:
Task 4.3: Programmatic Access to datasets
Leader: STFC
Participants: LUNA, INFN
Many services rely on existing external datasets. The infrastructure must be able to draw on this external data at the user's request. The task will review the relevant datasets to be made available and define the architecture and appropriate interfaces for accessing them, including possible strategies for keeping cached copies of the data. It will build on the metadata services to be offered in WP6. Security issues (in particular authorization and user identity delegation) will be addressed together with T4.4. Unlike the other tasks of WP4, T4.3 brings functionality that is not widely present in current portal solutions. It therefore starts later (year 2) so that it can build on the experience and intermediate outcomes of WP6.
Relevant datasets:
- Experimental data at synchrotrons
- Experimental data at EM centres
- Experimental data at other facilities
- Data linked from publications
- EMDB and EMPIAR
Strategies
File system level
- Use a driver to mount a dataset repository directly on the custom VM. The dataset is then available for analysis at file system level; caching is managed by the driver.
- Example: for WebDAV, the mount.davfs driver does this job.
- Other FUSE-based drivers also give access at file system level.
- Most dataset repositories offer an API (e.g. a REST API) for accessing the datasets for further analysis. This is the domain of the web application and has to be implemented by the analytical tool.
- A component of the West-life project, e.g. Virtual Folder, will have to download selected datasets and offer them at file system level from a local cache.
- The custom VM cache is limited to the scratch disk space (about 50 GB, of which 1-8 GB is used by the installed OS and applications).
- EGI Fedcloud allows cloud storage to be created and mounted at file system level.
- Block storage with a capacity of up to 2-5 TB can be mounted on a custom VM instance and accessed from the VM file system.
- Such mounted cloud storage has to be created first, and the data has to be copied/cached from its original location.
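The choice between the VM scratch disk and mounted block storage can be sketched as a simple size check. This is only an illustration under stated assumptions, not part of any project component: the function name `choose_cache_location` and the three return labels are hypothetical, and the constants are the approximate figures quoted in the list above, taken at their conservative ends.

```python
# Figures from the text, taken conservatively; all are approximate.
SCRATCH_TOTAL_GB = 50.0     # scratch disk of a custom VM
OS_FOOTPRINT_GB = 8.0       # upper end of the 1-8 GB used by OS and applications
BLOCK_STORAGE_GB = 2000.0   # lower end of the 2-5 TB block storage range


def choose_cache_location(dataset_gb: float) -> str:
    """Return a (hypothetical) label for where a dataset could be cached."""
    if dataset_gb < 0:
        raise ValueError("dataset size must be non-negative")
    scratch_free_gb = SCRATCH_TOTAL_GB - OS_FOOTPRINT_GB
    if dataset_gb <= scratch_free_gb:
        return "scratch"          # fits in the VM's local cache
    if dataset_gb <= BLOCK_STORAGE_GB:
        return "block-storage"    # needs mounted cloud storage
    return "remote-only"          # too large to copy; access in place


if __name__ == "__main__":
    for size_gb in (10, 1200, 6000):
        print(size_gb, "GB ->", choose_cache_location(size_gb))
```

A real implementation would query the free space on the VM rather than rely on fixed constants.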
PDB and PDB-REDO
Data from these sources will be available through the PDB REST API. Work in WP6 will provide a Web Component for use by service portals to support searching of these databases and reuse of the data in them.
In 2015 there were a total of 526,126,409 downloads from the PDB, mostly for Molecular Replacement pipelines. In addition there are synoptic studies: many papers report work that begins by downloading the whole PDB and then runs a program over all the structures to extract generalized knowledge.
|Where||In an online database|
|What||.pdb/.mmcif plus .mtz, tens of KB|
|How||Make URL containing accession code or use search API|
|Possible output||Usually revised .pdb/.mmcif, e.g. after Molecular Replacement|
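Building a download URL from an accession code might look like the following sketch. The base URL is an assumption about one PDB download site and should be checked against that site's documentation; PDB-REDO and other mirrors use different layouts. The helper name `entry_url` is hypothetical.

```python
import re

# Assumed URL layout of one PDB download site; not confirmed by this document.
DEFAULT_BASE_URL = "https://files.rcsb.org/download"

# Classic PDB accession codes: one digit followed by three alphanumerics.
_ACCESSION = re.compile(r"[0-9][A-Za-z0-9]{3}")


def entry_url(accession: str, fmt: str = "cif",
              base_url: str = DEFAULT_BASE_URL) -> str:
    """Build the download URL for one entry in the given file format."""
    if not _ACCESSION.fullmatch(accession):
        raise ValueError(f"not a valid accession code: {accession!r}")
    if fmt not in ("pdb", "cif"):
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{base_url}/{accession.upper()}.{fmt}"


if __name__ == "__main__":
    print(entry_url("1cbs"))
    print(entry_url("4hhb", fmt="pdb"))
```

The resulting URL can be fetched with a standard call such as `urllib.request.urlretrieve`; validating the accession code first avoids sending malformed requests when the code comes from user input.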
Experimental data at synchrotrons (mature infrastructures)
These are mature infrastructures with sophisticated data management and processing. Data collected at synchrotrons and neutron sources is often stored in iCAT repositories. TODO STFC.
|Where||Usually in iCAT repository|
|What||Diffraction images, hundreds of files of 1-10 MB each|
|How||iCAT UI to request staging from tape|
|Possible output||Merged reflections, a few MB|
Experimental data at EM centres
These are terabyte datasets. TODO STFC, INFN, CSIC discuss plan for EM data.
|Where||eBIC, CSIC, etc|
|What||Several hundred movies per day at 1.5 GB each (800 movies/day = 1.2 TB/day). A microscope can produce 1 movie per minute, but only during acquisition time; some time is spent on setting up, configuring, loading, or screening areas.|
|How||The key question!|
|Possible output||Particle images, a few MB|
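The data volume quoted in the table, and what it implies for moving the data off site, can be checked with a short calculation. The movie count and size come from the table; the 1 Gbit/s sustained link speed is purely an assumption for illustration, and the function names are hypothetical.

```python
def daily_volume_tb(movies_per_day: float, gb_per_movie: float) -> float:
    """Daily data volume in TB (decimal units, 1 TB = 1000 GB)."""
    return movies_per_day * gb_per_movie / 1000.0


def transfer_hours(volume_tb: float, link_gbit_per_s: float) -> float:
    """Hours needed to move volume_tb over a link sustaining link_gbit_per_s."""
    bits = volume_tb * 1e12 * 8               # TB -> bytes -> bits
    seconds = bits / (link_gbit_per_s * 1e9)  # Gbit/s -> bit/s
    return seconds / 3600.0


if __name__ == "__main__":
    volume = daily_volume_tb(800, 1.5)   # figures from the table
    hours = transfer_hours(volume, 1.0)  # assumed 1 Gbit/s sustained
    print(f"{volume:.1f} TB/day, {hours:.1f} h to transfer")
```

Under that assumed link speed, one day of acquisition takes a little under three hours to transfer, so the feasibility of remote processing depends strongly on the bandwidth actually available between the EM centre and the compute site.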
Experimental data at other facilities
These often offer a USB interface. Instruct’s Data Management Policy says “storage of data is the responsibility of the User to whom it belongs ... Instruct Centres are not required to take responsibility for storing data beyond the immediate acquisition visit or the time taken for post experimental analysis if the latter is also provided by the Centre. However, Instruct Centres aspire to offer an archive to store data, especially in cases where the data volume makes this more practical that transferring the data ...” See D6.2.
|Where||Experimental facility e.g. CERM|
|What||Various, KB to 1 GB|
|How||Plug in a USB drive|
|Possible output||Usually reduced data. For NMR: spectra, GB.|
Data linked from publications
Increasingly, journals require that experimental data is open and that a paper contains links to it. These data are often but not always in one of the repositories mentioned here. TODO We will investigate the possibility of using the Europe PMC API to find cited data.
|Where||Zenodo, B2SHARE, university repository, and as above|
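As a starting point for that investigation, a query against the Europe PMC REST search service could be constructed as below. This is a sketch under assumptions: the endpoint and the `ACCESSION_TYPE` search field reflect our understanding of the Europe PMC search syntax and should be verified against its documentation, and the helper name `cited_data_query_url` is hypothetical. The code only builds the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Europe PMC REST search service.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def cited_data_query_url(accession_type: str, page_size: int = 25) -> str:
    """Build a search URL for articles citing data of one accession type.

    accession_type is a database label such as "pdb"; the ACCESSION_TYPE
    field name is an assumption about the Europe PMC query syntax.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    params = {
        "query": f"ACCESSION_TYPE:{accession_type}",
        "format": "json",
        "pageSize": page_size,
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(cited_data_query_url("pdb"))
```

Data held in Zenodo, B2SHARE, or university repositories is cited by DOI or plain URL rather than by accession code, so a query of this kind would cover only part of the cited data.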
EMDB and EMPIAR
TODO discuss with CSIC
Other
TODO STFC We will investigate other data sources as resources permit.