NCSI

   

Distributed Artificial Intelligent Model Training and Evaluation



Status: Completed
Mentor Name: Maria Pantoja
Mentor's XSEDE Affiliation: Education Allocation
Mentor Has Been in XSEDE Community: 1-2 years
Project Title: Distributed Artificial Intelligent Model Training and Evaluation
Summary: Machine learning (ML), and neural networks (NNs) in particular, are currently used for image and video processing, speech recognition, and other tasks. The goal of a supervised NN is to classify raw input data according to patterns learned from a training set. Training and validating NNs is very computationally intensive. In this project we want to develop an NN infrastructure that uses distributed systems techniques to accelerate model training, specifically hyper-parameter tuning, and model inference (prediction). With a single set of training data, our application will run different classifiers on different servers, each running models with tweaked hyper-parameters. To give more control over the automation process, the degree to which these hyper-parameters are tweaked can be set by the user before a run. To make our implementation robust to common distributed system failures (servers going down, loss of communication among some nodes, and others), we can use a heartbeat/gossip-style protocol for failure detection and recovery.
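As a rough illustration of the hyper-parameter tweaking described in the Summary, the sketch below generates variants of a base configuration and spreads them across servers. It is only a minimal sketch under stated assumptions: the function names (`tweak_values`, `build_configs`, `assign_round_robin`), the multiplicative tweak rule, and the round-robin placement are all hypothetical choices made for this example, not the project's actual design.

```python
import itertools


def tweak_values(base, degree, steps=1):
    """Return base scaled by (1 + k * degree) for k in -steps..steps.

    `degree` plays the role of the user-set tweak amount (assumed here
    to be a relative fraction, e.g. 0.1 for +/-10%).
    """
    return [base * (1 + k * degree) for k in range(-steps, steps + 1)]


def build_configs(base_params, degree, steps=1):
    """Build every combination of tweaked hyper-parameters."""
    names = sorted(base_params)
    grids = [tweak_values(base_params[name], degree, steps) for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*grids)]


def assign_round_robin(configs, servers):
    """Spread configurations across servers in round-robin order."""
    plan = {server: [] for server in servers}
    for i, config in enumerate(configs):
        plan[servers[i % len(servers)]].append(config)
    return plan


# Hypothetical base values and server names, for illustration only.
base = {"learning_rate": 0.01, "dropout": 0.5}
configs = build_configs(base, degree=0.1)
plan = assign_round_robin(configs, ["node0", "node1", "node2"])
```

Each server would then train one model per configuration in its list; the training step itself (for example with Keras/TensorFlow) is left out of this sketch.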
Job Description: The student will:
- Implement Python scripts to run different machine learning models, preferably using Keras/TensorFlow.
- Implement scripts in Golang and/or MPI to run code on a distributed system. The mentor already has some starting code for this project that the student can expand.
- Take existing pathways and recalculate intermediates with hybrid functional
- Run tests and performance metrics.
- Their results can be added to a manuscript being prepared for submission to a peer-reviewed journal. The student will be a co-author on this article.
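The heartbeat-style failure detection mentioned in the Summary could look roughly like the sketch below. It assumes a single coordinator that records the last time each node reported in, declares a node failed after a fixed timeout, and moves that node's jobs to the surviving nodes. The class and function names (`HeartbeatMonitor`, `reassign_jobs`) and the timeout policy are hypothetical; a real gossip protocol would exchange membership lists between peers instead of relying on one coordinator.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per node and flag nodes that go silent."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_seen = {}

    def record(self, node):
        """Call this whenever a heartbeat message arrives from `node`."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self):
        """Return nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


def reassign_jobs(plan, failed):
    """Move jobs owned by failed nodes onto the remaining nodes, round-robin."""
    alive = [node for node in plan if node not in failed]
    if not alive:
        raise RuntimeError("no live nodes left to take over the jobs")
    new_plan = {node: list(plan[node]) for node in alive}
    orphaned = [job for node in failed for job in plan.get(node, [])]
    for i, job in enumerate(orphaned):
        new_plan[alive[i % len(alive)]].append(job)
    return new_plan
```

In use, each worker would send a small message at a regular interval, the coordinator would call `record` on receipt, and a periodic check of `failed_nodes` would trigger `reassign_jobs`.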
Computational Resources: We will need to request a research/educational allocation on Expanse, including GPU access, so that we can run on a real distributed system. Our university does not have a cluster. We do have a lab of computers with relatively new CPUs and GPUs, but they are not connected together, and we need to continuously "hack" the system to make it work as a distributed system.
Contribution to Community: The intent of this project is to attract more undergraduate students to HPC. Our students usually need to work during the summer, and we are in an area where housing costs are high, so they need a stipend to be able to stay and do research. This project will hopefully encourage participation in HPC research by traditionally underrepresented groups.
Position Type: Apprentice
Training Plan: The student will get training in how to access and use the XSEDE research allocation.
The student is at the Apprentice level, so some experience with Python or another scripting language is assumed. I can provide videos from my previous classes covering the Python and Golang needed, before the start of the grant, so the student is up to speed when the program starts.
Student Prerequisites/Conditions/Qualifications: Programming skills, preferably in Python
Duration: Semester
Start Date: 03/01/2022
End Date: 05/22/2022
