Task 7.3

From West-Life

Planned approach

This task aims to leverage STFC's connection with IBM Research to develop a small number of prototypes demonstrating novel approaches to Big Data, in the context of integrative structural biology.

We aim to have a workshop in summer 2017 to finalise plans for these prototypes. Before then, we will use this wiki to work up ideas. MW spoke with Kirk Jordan of IBM Research on 14/10/2016 to confirm this approach. Realistically, the amount of help we get from IBM probably depends on whether a prototype overlaps with the interests of a particular IBM researcher.

This task is connected to M7.7 / M30 "Big Data software introduced" month 24, and D7.8 "Report on prototypes constructed using Big Data approaches" month 30.

Workshop planning

Host this at STFC Daresbury Lab, Cheshire, UK in order to involve as many IBM staff as possible.

West-Life attendees (tentative): Martyn (organiser), Chris, Tomas, Jose-Maria, Sameer

Dates: needs to be well in advance of milestone 7.7, due at the end of October 2017.

Agenda: 1) Novel hardware e.g. Power8 and successors, 2) Machine learning applications, 3) NLP

Draft ideas

Watson based cognitive computing

Note, Watson is a set of technologies, so there is a choice here. The developing Cognitive Centre of Competence at Daresbury may be able to advise.

One possible task is to develop an ontology for structure interpretation, by text analysis of open access papers.

Machine learning

E.g. DeepPicker for cryoEM particle picking (https://arxiv.org/pdf/1605.01838), which uses a convolutional neural network.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2559866/ "ANGLOR: A Composite Machine-Learning Algorithm for Protein Backbone Torsion Angle Prediction". Chris: This study chose an important problem but did not get very good results. One weakness is that they used a training set of just 500 molecules - you could use the whole PDB (less a test set). I also think that it was a mistake to aim for a single prediction. A Bayesian approach would yield a context-dependent Ramachandran plot. It could be validated by using it for structure validation, and it might help improve structure prediction.

Antony Vassileiou at Strathclyde "A Robust Machine Learning Approach for the Prediction of Allosteric Binding Sites" http://dx.doi.org/10.15129/d40aa95f-cd9a-47e2-abd4-a08381668b47

LBNL recently advertised a position including "Implement convolutional neural network (CNN) code on neuromorphic computing hardware such as the IBM TrueNorth chip. Example problems include the identification of positive diffraction events in X-ray free-electron laser diffraction, conformational classification in CryoEM single particle reconstruction, and the identification of 3D sections for CryoET subtomogram averaging."

Rosetta has a library of motifs for local folds. With modelling software it should be possible to expand this to a predictor for torsion angles from local sequence. Possible modelling methods are a neural net (specifically a Restricted Boltzmann Machine) and a CDHMM. The output could be represented as a set of context-specific Ramachandran plots, which could be used for Bayesian refinement, or for simulated annealing in ab initio methods.
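As a rough illustration of what a context-specific Ramachandran plot could look like, the Python sketch below bins (phi, psi) observations per sequence context and adds a uniform Dirichlet pseudocount, giving a posterior mean probability for each bin. The three-residue context key, the 30-degree bin width, the pseudocount and all function names are assumptions made for illustration, not part of Rosetta or any existing tool, and the observations are made-up toy values.

```python
from collections import defaultdict
import math

BIN_WIDTH = 30              # degrees per bin (assumed)
N_BINS = 360 // BIN_WIDTH   # 12 bins along each of phi and psi


def angle_bin(angle):
    """Map an angle in degrees to a bin index in [0, N_BINS), wrapping at +/-180."""
    return int(math.floor((angle + 180.0) % 360.0 / BIN_WIDTH))


def build_counts(observations):
    """Count (phi, psi) bins per context from (context, phi, psi) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for context, phi, psi in observations:
        counts[context][(angle_bin(phi), angle_bin(psi))] += 1
    return counts


def posterior_probability(counts, context, phi, psi, pseudocount=1.0):
    """Posterior mean probability of the bin containing (phi, psi),
    under a symmetric Dirichlet prior over all bins."""
    cell_counts = counts.get(context, {})
    total = sum(cell_counts.values())
    n_cells = N_BINS * N_BINS
    cell = (angle_bin(phi), angle_bin(psi))
    return (cell_counts.get(cell, 0) + pseudocount) / (total + pseudocount * n_cells)


# Toy data: the context is the residue flanked by its two neighbours (hypothetical).
observations = [
    ("AGA", -60, -45),
    ("AGA", -55, -40),
    ("AGA", 60, 45),
    ("VPV", -70, 140),
]
counts = build_counts(observations)
helical = posterior_probability(counts, "AGA", -58, -42)
unseen = posterior_probability(counts, "WWW", -58, -42)
```

With a real training set (for example the whole PDB less a test set, as suggested above) the same table could be used to score how plausible an observed torsion pair is in its local sequence context. A context that was never observed falls back to the uniform prior.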

An alternative approach to the same problem is to build a library of common substructures, probably as maps rather than atomic coordinates.

Locality-sensitive hashing functions. Google does this for 2D images. Can we do it for 3D volumes, for fold recognition?
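To make the hashing idea concrete, here is a minimal Python sketch of one common locality-sensitive scheme, random-hyperplane hashing (SimHash), applied to a flattened voxel grid. It is only an illustration: it assumes the volumes are already on the same grid and in the same orientation (the hash is not rotation invariant), and the function names, the 16-bit signature length and the toy volume are all hypothetical.

```python
import random


def make_hyperplanes(n_voxels, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes, each with n_voxels components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_voxels)] for _ in range(n_bits)]


def flatten(volume):
    """Flatten a nested [z][y][x] list of densities into one list."""
    return [v for plane in volume for row in plane for v in row]


def lsh_signature(volume, hyperplanes):
    """One bit per hyperplane: which side the mean-centred volume falls on."""
    voxels = flatten(volume)
    mean = sum(voxels) / len(voxels)
    centred = [v - mean for v in voxels]
    bits = 0
    for plane in hyperplanes:
        dot = sum(c * w for c, w in zip(centred, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


# Toy 4x4x4 volume and a rescaled copy of it.
volume = [[[float((x + 2 * y + 3 * z) % 5) for x in range(4)]
           for y in range(4)] for z in range(4)]
scaled = [[[2.0 * v for v in row] for row in plane] for plane in volume]

planes = make_hyperplanes(n_voxels=64, n_bits=16)
sig_a = lsh_signature(volume, planes)
sig_b = lsh_signature(scaled, planes)
```

Volumes whose centred densities point in similar directions tend to share most signature bits, so a small Hamming distance can serve as a cheap pre-filter before a full structural comparison. Because only the sign of each projection is kept, rescaling the density leaves the signature unchanged.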

Natural Language Processing

Review suggestion: http://openminted.eu/

Extraction of related metadata from the literature. STFC has a pilot project extracting relevant terms in the area of biofilms from open access articles in Europe PMC https://europepmc.org/

Issues include:

  • building a list of generic stopwords, e.g. "is" or "but"
  • building a list of domain-specific stopwords, e.g. "protein" is not useful if you have already narrowed down to structural biology
  • identifying positive or negative senses of a phrase
  • allowing manual annotation to train the system
  • providing an intelligent search system
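
The first two issues in the list above can be sketched in a few lines of Python: tokens are filtered against a generic stopword list and a separate domain-specific one before counting candidate terms. The word lists, the tokenising regular expression and the function names are illustrative assumptions, not taken from the STFC pilot project.

```python
import re
from collections import Counter

# Tiny illustrative lists; real lists would be much longer.
GENERIC_STOPWORDS = {"is", "but", "the", "a", "of", "in", "and", "to"}
DOMAIN_STOPWORDS = {"protein", "structure"}  # uninformative within structural biology


def tokenize(text):
    """Lower-case the text and split into words, keeping hyphenated terms whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def candidate_terms(text, generic=GENERIC_STOPWORDS, domain=DOMAIN_STOPWORDS):
    """Count the tokens that survive both stopword lists."""
    stop = generic | domain
    return Counter(tok for tok in tokenize(text) if tok not in stop)


sentence = ("The protein is a biofilm-associated adhesin, "
            "but the adhesin structure is unknown.")
terms = candidate_terms(sentence)
```

Keeping the two lists separate means the domain-specific list can be swapped when the corpus changes, while the generic list stays fixed.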

Suggestion to build an NLP model for papers connected to PDB/EMDB entries:

  • Recognize sentences about the significance and interpretation of the structure. Extract keywords from such sentences and build an ontology.
  • Identify which figures are relevant to the structure (figures available through collaborative work with EuropePMC)
  • Understand figure legends (legends available through collaborative work with EuropePMC)
  • Other relevant proteins / small molecules (available at EuropePMC website)
  • Classify other relevant papers (EuropePMC classification of citations into "reviews" and "research publications")
  • Look at references with same authors, for further work on biochemical/mutational analysis of the same protein or complex. Mutational or ligand-binding studies (functional assignments) for positions within the 3D structure.

Note that a high proportion of the structural literature is open access. Collaboration with the IUCR may also be possible. See https://www.cmpe.boun.edu.tr/~ozgur/projects/biocontext.html

Large computations

Suggestion from Sameer: search for structural similarity of new deposition against existing PDB, using Gesamt from CCP4. Computationally challenging.

Hadoop or Spark

Apache Spark extends the popular Hadoop MapReduce model for processing large datasets on compute clusters. One main feature seems to be holding intermediate results in (distributed) memory to minimise file I/O. It is also apparently more flexible and more user-friendly than Hadoop.

If we start from basic MapReduce, then that works with large datasets that can be represented as many key-value pairs, which are distributed over the compute cluster and manipulated locally. The STFC partner has implemented a MapReduce algorithm for k-mer clustering https://github.com/kimosaby2001/MR-Inchworm/wiki using an MPI implementation of MapReduce. That doesn't feel like a natural way of dealing with single structural biology datasets (reflection data or particle images), but maybe it would suit data at the level of the full PDB?
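For readers unfamiliar with the key-value pattern, the following single-process Python sketch imitates the map, shuffle and reduce steps for simple k-mer counting. It is not the MR-Inchworm clustering algorithm and uses no MPI or Hadoop; the function names and toy sequences are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(sequence, k):
    """Map step: emit a (kmer, 1) pair for every k-mer in one sequence."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group values by key, as a framework would do across nodes."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_kmers(sequences, k):
    """Run map, shuffle and reduce over a list of sequences."""
    mapped = chain.from_iterable(map_kmers(s, k) for s in sequences)
    return reduce_counts(shuffle(mapped))


kmer_counts = count_kmers(["ACGTAC", "GTACG"], k=3)
```

On a cluster, each node would run the map step on its own share of the sequences, and only the grouped key-value pairs would move between nodes.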

A quick Google search turns up http://udspace.udel.edu/handle/19716/17204 "Enabling scalable data analysis for large computational structural biology datasets on large distributed memory systems supported by the MapReduce paradigm". This indeed seems to be about classifying large sets of structures.

Flash storage / SSD

SSD provides intermediate storage between high-capacity slow spinning disk and low-capacity fast volatile memory. STFC has a system with IBM ESS GS4 storage arrays, providing 96 x 800 GB SSDs. Large datasets can be placed on the SSD for fast access. This is the Power8 "Panther" system, which also has a mixture of CPU and GPU. IBM would be interested in porting systems to this architecture. "Paragon" is coming next.

Data stream operations

Possibly useful for on-the-fly processing of diffraction or EM images as they are collected? Is there something like this in Scipion Box?

cf. MinION sequencer streams reads for rapid alignment and analysis.
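
A minimal Python sketch of the streaming idea: statistics are updated as each frame arrives, so nothing needs to be held back until the end of data collection. It uses Welford's online algorithm for the running mean and variance of a per-frame summary value; the choice of mean intensity as that summary, the function name and the toy frames are assumptions for illustration only.

```python
def running_stats(frames):
    """Yield (n, mean, sample variance) of the per-frame mean intensity
    after each frame arrives, using Welford's online update."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for frame in frames:
        value = sum(frame) / len(frame)  # per-frame summary (assumed)
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        variance = m2 / (n - 1) if n > 1 else 0.0
        yield n, mean, variance


# Toy "frames": flat lists of pixel intensities arriving one at a time.
incoming = iter([[1.0, 3.0], [4.0, 4.0], [6.0, 6.0]])
history = list(running_stats(incoming))
```

Because `running_stats` is a generator, it can consume frames from any iterator, for example one that yields images as a detector writes them, and could be used to flag frames whose summary value drifts far from the running mean.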