Establishing the Electronic Genome of Functionalized Polycyclic Aromatic Hydrocarbons (PAHs) using High-throughput DFT and Machine Learning



Status: Completed
Mentor Name: Bohdan Schatschneider
Mentor's XSEDE Affiliation: Previous research allocation recipient, currently applying for more time
Mentor Has Been in XSEDE Community: 4-5 years
Project Title: Establishing the Electronic Genome of Functionalized Polycyclic Aromatic Hydrocarbons (PAHs) using High-throughput DFT and Machine Learning
Summary: This project has two objectives: (1) generate large amounts of structural and electronic properties data in order to establish the electronic genomes of many functionalized PAHs, and (2) use the calculated data to generate quantitative structure-property relationships with a variety of machine learning algorithms.
Job Description: Two undergraduate researchers (UGRs) will generate all structural permutations of a given functional group (e.g., OH, NH2, Me, Et, NO2, CN) on a specified PAH backbone (e.g., benzene, naphthalene, anthracene, pyrene) using the ChemDraw software suite, yielding tens of thousands of structures. The structures will be translated into a "study table" that automatically queues them on the XSEDE resource and retrieves the necessary data (structural and property descriptors) from the density functional theory (DFT) outputs. The UGRs will run the electronic structure calculations with the ORCA software suite on the Open Science Grid resource, then save the metadata and process the DFT-derived descriptors. The structural and electronic property information from the calculations will then be used to develop quantitative structure-property relationships (QSPRs) with a variety of machine learning algorithms, such as ordinary least squares (OLS) linear regression, artificial neural networks (ANNs), support vector machines (SVMs), and random forests (RFs), within the R software environment.
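The enumeration step described above can be illustrated with a minimal sketch. It assumes the simplest case only: one kind of substituent placed on a single six-membered ring such as benzene, where ring rotations and reflections make many placements equivalent. The function names `canonical` and `unique_substitution_patterns` are hypothetical and are not taken from the project's own script; fused backbones such as naphthalene or pyrene have different symmetry and would need their own equivalence rules.

```python
from itertools import combinations


def canonical(pattern, n_sites):
    """Return the smallest equivalent form of a substitution pattern.

    Two patterns on an n-membered ring are treated as the same structure
    if one can be turned into the other by rotating or reflecting the ring.
    """
    images = []
    for shift in range(n_sites):
        rotated = tuple(sorted((p + shift) % n_sites for p in pattern))
        reflected = tuple(sorted((shift - p) % n_sites for p in pattern))
        images.append(rotated)
        images.append(reflected)
    return min(images)


def unique_substitution_patterns(n_sites, n_substituents):
    """List symmetry-distinct ways to place identical substituents on a ring."""
    seen = set()
    for combo in combinations(range(n_sites), n_substituents):
        seen.add(canonical(combo, n_sites))
    return sorted(seen)


# Disubstituted benzene: the ortho, meta and para isomers.
print(unique_substitution_patterns(6, 2))
```

For two identical substituents on benzene this yields three patterns, corresponding to the ortho, meta and para isomers. Removing duplicates before queuing avoids spending allocation time on equivalent structures.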
Computational Resources: We have applied for a startup allocation on the Open Science Grid resource to run high-throughput, single-molecule electronic-structure calculations using density functional theory (DFT) in the ORCA software suite. All structural permutations for a given functional group on a specific backbone will be generated by a script on computers in our lab. These structures will then be loaded into a study table and queued to run in ORCA on the Open Science Grid.
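As a rough illustration of what a study table might look like, the sketch below builds one row per structure and marks each as queued. The column names, the identifier scheme, and the `status` values are all assumptions made for the example; the listing does not describe the actual table layout or how jobs are submitted to the Open Science Grid.

```python
import csv
import io

# Hypothetical column layout; the real study table format is not specified.
FIELDS = ["structure_id", "backbone", "functional_group", "positions", "status"]


def build_study_table(backbone, functional_group, patterns):
    """Create one table row per substitution pattern, initially queued."""
    rows = []
    for index, positions in enumerate(patterns, start=1):
        rows.append({
            "structure_id": f"{backbone}-{functional_group}-{index:04d}",
            "backbone": backbone,
            "functional_group": functional_group,
            "positions": "-".join(str(p) for p in positions),
            "status": "queued",
        })
    return rows


def to_csv(rows):
    """Serialize the table to CSV text so it can be saved or inspected."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def next_queued(rows):
    """Return the rows that still need a calculation."""
    return [row for row in rows if row["status"] == "queued"]


# Example substitution patterns, written as ring positions 0-5.
table = build_study_table("benzene", "OH", [(0, 1), (0, 2), (0, 3)])
table[0]["status"] = "done"  # pretend the first calculation has finished
print(to_csv(table))
print(len(next_queued(table)), "structures still queued")
```

Keeping a status column in the same table as the structure metadata makes it straightforward to resume a batch after an interruption and to attach extracted descriptors to the right structure later.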
Contribution to Community
Position Type: Intern
Training Plan: One position covers writing the source code for the repeated calculations and data extraction in the high-throughput routines. This work will be done by our current programming specialist, William Riddle (working on a BS in Computer Science), so no training is needed for this part of the project. The second position covers establishing the DFT parameters needed to produce benchmarkable electronic properties. It is assigned to Noryn Rosario, who is experienced with the ORCA DFT software suite. Both William and Noryn will be involved in developing the QSPR models using the ML algorithms in R. My collaborator, Dr. Carsten Lange, will train the two UGRs in the use of R.
Student Prerequisites/Conditions/Qualifications
Duration: Summer
Start Date: 06/01/2019
End Date: 08/20/2019