We will investigate the performance and scalability of various machine learning algorithms on the Spark platform, using data from a mass spectrometer. This study will extend the scope of current mass spectrometer data processing and modeling and evaluate its impact on healthcare.
Job Description
In this project, the student will extract data from a database that stores raw output from mass spectrometer runs, determine which features to use for analysis, and process the data so that machine learning algorithms can recognize combinations of proteins and peptides that match particular strains of bacteria. The student will develop machine learning algorithms in the map-reduce framework on Spark, then implement and test them to ensure high-speed processing of the large volume of raw data.
The project concerns mining big data on the molecules that make up, and are produced by, beneficial microbes found in humans. We use a state-of-the-art mass spectrometer system, Bruker's timsTOF Pro, to collect data and identify metabolites that are important to health, including mind, mood, and mental health. Although the instrument produces a huge volume of data (about 3-5 GB in a 20-minute run), data mining has become a bottleneck; it is needed to improve instrument performance, answer biological questions, and direct the development of analytical methods.
We have used decision tree and rule-based methods to analyze protein and peptide data and to build a model that identifies the strain of bacteria. We need to extend our program so that a large volume of data can be processed in a reasonable amount of time, and to test the performance of various machine learning algorithms in classifying the datasets. To this end, high-performance machine learning models will be implemented on Spark, a map-reduce platform for big data analysis, and used for data classification and testing.
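As a rough illustration of the map-reduce pattern mentioned above, the following plain-Python sketch counts how often each peptide is detected per bacterial strain. It does not use Spark and is not the project's actual code. The record layout, the strain and peptide names, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy records: (strain label, peptides detected in one run).
# Real records would come from the mass spectrometer database.
RUNS = [
    ("strain_A", ["PEP1", "PEP2"]),
    ("strain_A", ["PEP1", "PEP3"]),
    ("strain_B", ["PEP2", "PEP3"]),
]


def map_run(run):
    """Map step: emit a count of 1 for each (strain, peptide) pair in a run."""
    strain, peptides = run
    return Counter({(strain, peptide): 1 for peptide in set(peptides)})


def reduce_counts(left, right):
    """Reduce step: merge two partial results by summing counts per key."""
    return left + right


def count_peptides_by_strain(runs):
    """Apply the map step to every run, then fold the results together."""
    return reduce(reduce_counts, map(map_run, runs), Counter())


counts = count_peptides_by_strain(RUNS)
print(counts[("strain_A", "PEP1")])  # prints 2
```

Because the reduce step is associative, partial counts can be computed on separate chunks of data and merged in any order, which is what allows this kind of aggregation to be distributed. On Spark, the same two steps would roughly correspond to `map` followed by `reduceByKey` on an RDD.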
Computational Resources
We will use a high-performance computing allocation from an XSEDE constituent institution to carry out the project. In recent years, we have used NCSA’s Blue Waters allocation in education and research projects for student interns. We also have a local cluster that can be used for daily development and testing.
Contribution to Community
Position Type
Apprentice
Training Plan
The student is expected to have basic knowledge of statistics and machine learning and basic programming skills. The student should be willing to study machine learning with the map-reduce framework on Spark. The faculty mentor has used the Spark platform on the Blue Waters supercomputer in previous big data analytics courses and can help the student get started.