NCSI: Structural Prediction Using Machine Learning

Status	Completed
Mentor Name	Neranjan Edirisinghe
Mentor's XSEDE Affiliation	Campus Champion
Mentor Has Been in XSEDE Community	4-5 years
Project Title	Structural Prediction Using Machine Learning
Summary	While it is fundamentally accepted that structure affects function, predicting changes in function from perturbations to the structure poses a challenge. In a structure as complex as a protein, mutations often result in diminished or even no protein activity. The protein p53, for example, triggers the destruction of its own cell upon the detection of heavily damaged DNA and plays a critical role in the suppression of tumors. However, mutated forms of the protein show decreased activity, which adversely affects the body's capability to resist tumor growth. Interestingly, some second mutations, referred to as "rescue mutations", have been shown to partially restore function to these mutated proteins and, by extension, their ability to suppress the spread of cancerous cells. Predicting which structural regions to alter in order to elicit a functional change would make it possible to design rescue mutations similar to those observed for p53, allowing a more targeted approach to the development of treatments on a molecular basis.
Molecular dynamics (MD) data has been generated for a protein, Cyclophilin A. This protein catalyzes the cis-trans isomerization of proline residues in protein backbones, playing a crucial role in HIV development in hosts and in immunosuppression in organ transplants. Using the generated data for this system, as well as the data from several mutated forms of this system, a supervised learning network will be developed to predict the functional effects of structural changes. A scoring function will be used to extract structural information from the generated MD data, denoting intramolecular contact between different residue pairs. The data set is an SSV file containing a time series that records whether two residues are in contact at specific frames in the simulation (Fig 1). As the data has been reduced to a binary state, the information is carried by the contact dynamics of the system.
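The binary contact time series described above can be illustrated with a minimal sketch. The file layout is an assumption, since Fig 1 is not reproduced here: the sketch assumes whitespace-separated values with one row per frame and one column per residue pair, each entry being 0 (no contact) or 1 (contact). The function names `parse_contact_series` and `count_transitions` are hypothetical and do not come from the project.

```python
def parse_contact_series(text):
    """Parse whitespace-separated 0/1 rows into a list of frames.

    Assumed layout (not confirmed by the project): one row per frame,
    one column per residue pair.
    """
    frames = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = [int(tok) for tok in line.split()]
        if any(v not in (0, 1) for v in row):
            raise ValueError("contact values must be 0 or 1")
        if frames and len(row) != len(frames[0]):
            raise ValueError("every frame must list the same residue pairs")
        frames.append(row)
    return frames


def count_transitions(frames):
    """Count how often each residue pair switches between contact states."""
    n_pairs = len(frames[0]) if frames else 0
    counts = [0] * n_pairs
    for prev, curr in zip(frames, frames[1:]):
        for j in range(n_pairs):
            if prev[j] != curr[j]:
                counts[j] += 1
    return counts


# Toy data: 4 frames, 3 residue pairs.
sample = """\
1 0 0
1 1 0
1 0 0
1 0 0
"""
frames = parse_contact_series(sample)
print(count_transitions(frames))  # [0, 2, 0]
```

In this toy series only the middle pair forms and breaks a contact, so it is the only column that carries dynamical information.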
As residues contributing to primary and secondary protein structures will be in contact for the entire simulation, they will convey little information about the system's dynamics; conversely, distant residues will have no contact and contribute very little to the system's dynamical structure. Thus, only residue pairs that are in contact for more than 10% and less than 90% of the simulation will be considered; these are referred to as dynamic contacts. By learning the dynamic structure of a control system, the subsequent structure of an altered system can be predicted, providing a means to intelligently target mutations.
Using the Keras framework in Python, a recurrent neural network will be used to predict the contact state of the nth frame from the preceding frames. As Keras is a high-level framework, it supports rapid prototyping and quick testing of learning networks. A recurrent neural network was chosen for its ability to retain past information as context. More specifically, a long short-term memory (LSTM) network will be used because it handles long-term dependencies in the data better (Fig 2). Calculations will be conducted using the nodes allotted to Dr. Edirisinghe on the Bridges, Comet, and XStream systems.
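A minimal sketch of the two steps described above, selecting dynamic contacts and framing next-frame prediction as a supervised problem, might look as follows. It assumes the contact data is already available as a list of frames, each a list of 0/1 values with one entry per residue pair. The helper names, the window length, and the single-layer network in `build_lstm` are illustrative assumptions, not the project's actual model; `build_lstm` imports Keras lazily so the data helpers run without it.

```python
def dynamic_pairs(frames, low=0.10, high=0.90):
    """Return column indices in contact more than `low` and less than `high` of the time."""
    if not frames:
        return []
    n_frames = len(frames)
    keep = []
    for j in range(len(frames[0])):
        fraction = sum(row[j] for row in frames) / n_frames
        if low < fraction < high:
            keep.append(j)
    return keep


def make_windows(frames, columns, window):
    """Pair each run of `window` consecutive frames with the frame that follows it."""
    reduced = [[row[j] for j in columns] for row in frames]
    inputs, targets = [], []
    for start in range(len(reduced) - window):
        inputs.append(reduced[start:start + window])
        targets.append(reduced[start + window])
    return inputs, targets


def build_lstm(window, n_pairs, units=32):
    """Hypothetical single-layer LSTM; sizes are illustrative, not tuned."""
    from tensorflow import keras  # lazy import: only needed when building the model

    model = keras.Sequential([
        keras.Input(shape=(window, n_pairs)),
        keras.layers.LSTM(units),
        # One sigmoid output per residue pair: probability of contact in the next frame.
        keras.layers.Dense(n_pairs, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model


# Toy data: 5 frames, 3 residue pairs (always, sometimes, never in contact).
frames = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
]
cols = dynamic_pairs(frames)
X, y = make_windows(frames, cols, window=2)
print(cols, len(X), y)  # [1] 3 [[0], [1], [1]]
```

To train, the windows would be converted to arrays (for example with `numpy.asarray(X, dtype="float32")`) and passed to `model.fit`. Binary cross-entropy is used here because each output is a 0/1 contact state; other losses or architectures may suit the real data better.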
Job Description	Weeks 1-2: The first two weeks will consist of gaining an intimate familiarity with the data and the architecture of recurrent neural networks. Developing an understanding of the underlying mathematics that drives the network will allow more precise control over the way the input data is interpreted. Papers will be read to learn about current trends in recurrent neural networks and to avoid pitfalls in the project.
Week 3: This time will be spent experimenting with the Python framework Keras and learning its more essential functions. Simple experimental models will be created to gain a deeper understanding of the data and to arrive at a better topology for the final model. Using the resources provided by Dr. Edirisinghe, simple test problems will be run on the allocated clusters to become familiar with the tools.
Week 4: Building a working prototype of the network will be essential during this time frame. Fine-tuning the network for performance and correctness will come later; for now, a functional network that the data can be run through will provide a means to gauge the viability of the project. By this point, the data will be formatted with the expected analysis in mind.
Weeks 5-7: During this interval, the model will be refined and updated to incorporate findings about the data from preliminary runs. The topology of the network will be altered to determine a better layout. Several versions of the final network will be tested and analyzed in an attempt to pinpoint the role that various structural components play in the analysis of the data.
Week 8 onward: Findings will be polished and visualized. The information gathered from the RNN model will be appraised to determine its viability for future research. Depending on the outcome, future work will be dedicated to improving the model or to a transition to a different area.
Computational Resources	Bridges, Comet, and XStream will initially be tested using the Campus Champion (CC) allocation. Based on that testing, we will apply for a startup allocation.
Contribution to Community
Position Type	Apprentice
Training Plan	I will initially guide the student through the basics of RNNs, LSTMs, Keras, and TensorFlow. The student will also attend the XSEDE (PSC) training on big data. He is already sufficiently familiar with MD simulations and the input data.
Student Prerequisites/Conditions/Qualifications	The student will also have access to local resources where he can develop and test codes.
Duration	Quarter
Start Date	01/08/2018
End Date	03/05/2018