Distributed Artificial Intelligence Model Training and Evaluation
Summary
Machine Learning (ML), and Neural Networks (NN) in particular, are currently used for image and video processing, speech recognition, and other tasks. The goal of a supervised NN is to classify raw input data according to patterns learned from a training set. Training and validating an NN is computationally intensive. In this project we want to develop an NN infrastructure that uses distributed systems techniques to accelerate model training (specifically hyper-parameter tuning) and model inference, or prediction. With a single set of training data, our application will run different classifiers on different servers, each running a model with tweaked hyper-parameters. To give the user more control over the automation, the degree to which these hyper-parameters are tweaked can be set before a run. To make our implementation robust to common distributed system failures (servers going down, loss of communication among some nodes, and others), we can use a heartbeat- or gossip-style protocol for failure detection and recovery.
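The workflow described above can be sketched in a few lines of Python. This is only a minimal, single-process illustration under stated assumptions, not the project's actual implementation: the names `tweak_grid`, `assign`, `HeartbeatMonitor`, and `reassign` are hypothetical, the "degree of tweak" is assumed to be a relative step applied to each base hyper-parameter, jobs are assumed to be handed out round-robin, and a server is assumed to have failed once no heartbeat has arrived within a fixed timeout. No real networking or model training is performed.

```python
import itertools


def tweak_grid(base, step, levels=1):
    """Build every combination of base values scaled by (1 + k*step).

    `step` is the user-chosen degree of tweak (assumed to be relative),
    and k runs from -levels to +levels for each hyper-parameter.
    """
    names = sorted(base)
    axes = [
        [base[name] * (1 + k * step) for k in range(-levels, levels + 1)]
        for name in names
    ]
    return [dict(zip(names, combo)) for combo in itertools.product(*axes)]


def assign(configs, servers):
    """Distribute configurations over servers in round-robin order."""
    plan = {server: [] for server in servers}
    for i, config in enumerate(configs):
        plan[servers[i % len(servers)]].append(config)
    return plan


class HeartbeatMonitor:
    """Timeout-based failure detector (hypothetical, in-memory only)."""

    def __init__(self, servers, timeout):
        self.timeout = timeout
        # Every server starts with a last-seen time of 0.0.
        self.last_seen = {server: 0.0 for server in servers}

    def beat(self, server, now):
        """Record that a heartbeat from `server` arrived at time `now`."""
        self.last_seen[server] = now

    def failed(self, now):
        """Return the servers whose last heartbeat is older than the timeout."""
        return sorted(
            s for s, seen in self.last_seen.items() if now - seen > self.timeout
        )


def reassign(plan, failed):
    """Move the configurations of failed servers onto the surviving ones."""
    alive = [s for s in plan if s not in failed]
    if not alive:
        raise RuntimeError("no servers left to take over the work")
    new_plan = {s: list(plan[s]) for s in alive}
    orphaned = [config for s in failed if s in plan for config in plan[s]]
    for i, config in enumerate(orphaned):
        new_plan[alive[i % len(alive)]].append(config)
    return new_plan


if __name__ == "__main__":
    # Hypothetical base hyper-parameters and server names.
    configs = tweak_grid({"learning_rate": 0.01, "dropout": 0.2}, step=0.5)
    plan = assign(configs, ["node-a", "node-b", "node-c"])

    monitor = HeartbeatMonitor(plan.keys(), timeout=5.0)
    monitor.beat("node-a", now=10.0)
    monitor.beat("node-c", now=9.0)  # node-b never reports

    recovered = reassign(plan, monitor.failed(now=10.0))
    for server, jobs in recovered.items():
        print(server, len(jobs))
```

In a real deployment each server would train one model per assigned configuration, and heartbeats would travel over the network (for example between Golang or MPI processes); a gossip-style variant would have nodes exchange their last-seen tables with one another rather than report to a single monitor.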
Job Description
The student will:
- Implement Python scripts to run different Machine Learning models, preferably using Keras/TensorFlow.
- Implement scripts in Golang and/or MPI to run code on a distributed system. The mentor already has some starter code for this project that the student can expand.
- Take existing pathways and recalculate intermediates with hybrid functional.
- Run tests and collect performance metrics.
The results can be added to a manuscript being prepared for submission to a peer-reviewed journal. The student will be a co-author on this article.
Computational Resources
We will need to request a research/educational allocation on Expanse, including GPU access, to be able to run on a real distributed system. Our university does not have a cluster. We do have a lab with computers that have relatively new CPUs and GPUs, but they are not connected together, and we need to continuously "hack" the system to make it work as a distributed system.
Contribution to Community
The intent of this project is to attract more undergraduate students to HPC. Our students usually need to work during the summer, and we are in an area where housing costs are high, so they need a stipend to be able to stay and do research. This project will hopefully encourage participation in HPC research by groups traditionally underrepresented.
Position Type
Apprentice
Training Plan
The student will get training in how to access and use the XSEDE research allocation. The student is at the Apprentice level, so some experience with Python or another scripting language is assumed. I can provide some videos from my previous classes covering what is needed for Python and Golang before the start of the grant, so the student can be up to speed when the program starts.