Genetic Diversity Modeling in the Atlantic Forest Biodiversity Hotspot



Status: Completed
Mentor Name: Ana Carnaval
Mentor's XSEDE Affiliation: Research Allocation
Mentor Has Been in XSEDE Community: Less than 1 year
Project Title: Genetic Diversity Modeling in the Atlantic Forest Biodiversity Hotspot
Summary: This project seeks to predict the amount and the geographic distribution of genetic diversity within species of birds and amphibians of the Atlantic Forest by modeling where in the landscape species will encounter obstacles to gene flow that promote genetic differentiation. To do so, we will generate models of genetic diversity and divergence from environmental and ecological predictor variables that have the potential to explain the location of barriers to gene flow in the ecosystem. We propose to use machine learning (ML) methods that incorporate climatic data and species-specific life history data to this end. The ML framework should allow us to navigate the large amount of genetic data available for these groups and to infer mechanistic relationships between predictor and response variables.
Job Description: The student will be expected to:
- search the extensive genetic diversity literature that describes patterns and levels of genetic variation in birds and amphibians of the Atlantic Forest of Brazil.
- download DNA sequences of all species of birds and amphibians available in public databases and the grey literature
- align sequences and estimate genetic population parameters that describe genetic structure (Fsts) and diversity (haplotypic and genotypic diversity)
- search the literature for ecological and morphological descriptions of the species of birds and amphibians for which genetic data are available
- build a database of biological traits for those species, including body size, wing span, diet, reproductive strategies, foraging strata, and habitat.
- identify and download GIS layers providing environmental descriptions of the Atlantic Forest, based on bioclimatic variables representing averages and extremes in temperature and precipitation (e.g., the WorldClim database) over the past 50 years. If available, download layers describing projections of climate into the deeper past (the last 100,000 years, which are known to have left significant structure on present-day levels of genetic diversity)
- explore multiple Machine Learning algorithms to assess which genetic diversity parameters are best predicted by a subset of environmental descriptors and biological characteristics
- if the models prove sufficiently accurate, use them to predict areas of high genetic diversity shifts based on environmental data and biological descriptions of species for which no genetic data are available.
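As an illustration of the diversity and structure parameters named in the job description, the following minimal Python sketch computes haplotype diversity and a pairwise Fst from aligned sequences. It is an assumption-laden example, not the project's method: the listing does not say which estimators will be used, so the sketch picks Nei's unbiased haplotype diversity and the Hudson-style Fst (1 minus the ratio of within-population to between-population mean pairwise differences). All function names and the toy sequences are hypothetical; in practice, packages such as Arlequin or DnaSP would typically produce these estimates.

```python
from collections import Counter
from itertools import combinations, product


def haplotype_diversity(seqs):
    """Nei's unbiased haplotype diversity: n/(n-1) * (1 - sum(p_i^2))."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    counts = Counter(seqs)  # identical aligned sequences = same haplotype
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


def hamming(a, b):
    """Number of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_within(seqs):
    """Mean number of pairwise differences among sequences of one population."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def hudson_fst(pop1, pop2):
    """Hudson-style Fst = 1 - Hw / Hb for two populations."""
    h_within = (mean_pairwise_within(pop1) + mean_pairwise_within(pop2)) / 2
    between = [hamming(a, b) for a, b in product(pop1, pop2)]
    h_between = sum(between) / len(between)
    if h_between == 0:
        return 0.0  # no differences at all: treat as no structure
    return 1.0 - h_within / h_between


if __name__ == "__main__":
    # Toy, made-up alignments (not real data) for two hypothetical populations.
    pop_a = ["ACGT", "ACGT", "ACGA"]
    pop_b = ["TCGT", "TCGT", "TCGA"]
    print("Haplotype diversity (pop A):", haplotype_diversity(pop_a))
    print("Hudson-style Fst (A vs B):", hudson_fst(pop_a, pop_b))
```

The sketch treats every alignment column equally and ignores gaps, missing data, and unequal sample sizes, all of which a real analysis would need to handle.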
Computational Resources: The Carnaval laboratory has five computers for student use, including a Dell with an Intel Xeon W-2155 3.3GHz processor, 4.5GHz Turbo (64GB); an Apple G5 with 2 quad-core 2.26GHz Intel Xeon processors, 12 GB RAM, 2 TB hard disk space and a 24" Apple Flat Panel Display; one Apple 24" iMac with a 2.93GHz Intel processor, 4 GB RAM and 650 GB hard disk; one 3.4 GHz Dell computer with 4 GB RAM and a 19" LCD display; and one 3.39 GHz Dell computer with 5 GB RAM and a 19" LCD display. Computers have Microsoft Office, ArcView, Adobe Creative Suite, Sequencher, CodonCodeAligner, GoogleEarthPro, PAUP, and Geneious installed. Additional GIS-based, phylogenetic and population genetic software to be used in this study is free (e.g. ArcGIS, DIVA-GIS, Arlequin, Fluctuate, DnaSP, msBayes, RAxML, MrBayes, Structure). The latest versions will be downloaded and installed before the data are analyzed. The laboratory is equipped with an 802.11n wireless network and a Brother 9840CDW double-sided color Multi-Function Center. Additionally, the Carnaval lab has access to two recently acquired Titan X499 - Dual CPUs Intel Xeon E5-2600 v4 Broadwell-EP Series 3D / CAD Workstation PCs for heavy data analyses. Each comes with 2x Intel Xeon E5-2630 v4 Broadwell-EP 2.2GHz (3.1GHz TB) 85W 25MB L3 (20 Cores / 40 Threads Total). For RAM, one comes with 128GB (4 x 32GB) whereas the second comes with 256GB (4 x 64GB).


HPC Cluster Access: In 2005, the Chancellor of CUNY designated the years 2005 to 2015 as the “Decade of Science”, renewing the University’s commitment to creating a healthy pipeline to science, mathematics, technology, and engineering fields by advancing science at the highest levels, training students to teach in these areas, and encouraging young people, particularly women and minorities, to study in these disciplines. CUNY as a whole has established a new program for cluster hiring of faculty with expertise in computational science and cyber-infrastructure. As part of the Decade of Science, the Chancellor of CUNY created the CUNY HPCC. The rationale was to increase synergy across the research and educational communities by moving away from isolated pockets of cluster computing to a larger, more capable, better supported HPCC, which could leverage shared experiences and provide cutting-edge technology that would allow individual researchers to expand beyond the limitations of their local computer laboratory and would help them to recognize that they are part of an important and synergistic community. The CUNY HPCC was opened in June 2007 and is located on CSI’s 200-acre park-like campus. A list of the CUNY HPCC systems is appended. The HPCC is connected by a 1 gigabit dedicated line to CUNYnet. CUNYnet provides the HPCC with interconnectivity to all the CUNY campuses, the New York State Education and Research Network (NYSERnet), the Internet, and other national research networks.
CUNY is committed to providing additional HPC capability and expanded network connectivity to keep pace with its growth in computational science. CSI completed a $400,000 upgrade to the facility, which included installing 1,400 square feet of new raised floor, new air conditioning systems, and additional electrical power. The facility occupies a 4,500 square foot area in a building that has three megawatts of available electric power. CSI is investing at least an additional $3 million to upgrade this facility. The upgraded facility will be capable of supporting systems requiring one megawatt of power (net of cooling systems). Funding is also in place for designing a new 90,000 sq. ft. computational science building, to be built to the U.S. Green Building Council’s LEED environmental gold standard and expected to be completed in 2014.
The HPCC is operated on a 24x7 basis. The operations staff includes full-time system programmers, network and communication specialists, and computer security specialists. The full-time staff is supplemented by part-time personnel who are generally graduate students working under the direction of full-time specialists.
High Performance Computing Systems:
a) Dell PowerEdge 1850 consisting of one head node and 96 compute nodes. Each compute node has two sockets for Intel 2.86 GHz Woodcrest dual-core processors, i.e., four cores per node. It has a total of 384 cores available for user computations. The 384 cores have 2 Gbytes of memory per core (four cores with a total of 8 Gbytes to a node). The interconnect network is Gbit Ethernet.
b) Dell PowerEdge 1950 consists of one head node and 16 compute nodes. Eight of the compute nodes have two sockets for Intel 3.0 GHz quad-core Harpertown processors, i.e., eight cores per node or a total of 64 cores. Another 8 compute nodes are single socket Intel 2.86 GHz Woodcrest dual-core processors. The 80 processors each have 2 Gbytes of memory per core. Each node has a 300 GByte disk drive for user temporary files. The interconnect network is Gbit Ethernet.
c) Dell PowerEdge system that consists of 240 cores of quad-core processors. The 240 processors have 2 Gbytes of memory per core (eight cores with a total of 16 Gbyte of memory to a node). The interconnect network is Infiniband (10 Gbit/second).
d) A cluster that consists of 360 cores of Intel 2.93 GHz quad-core Intel Core 7 (Nehalem) processors with a 1600 MHz front side bus. The 360 processors have 3 Gbytes of memory per core (eight cores with a total of 24 Gbyte of memory to a node). The interconnect network is dual rail Infiniband (20 Gbit/second).
e) A software development system that includes two Nehalem-based servers supporting a NVidia S1070 Tesla system with four graphics processors (960 cores).
f) An upgrade to the system described in (d), scheduled for a March 2010 installation, will consist of an additional 48 nodes (96 sockets or 384 cores) of Intel 2.93 GHz quad-core Intel Core 7 (Nehalem) processors with a 1600 MHz front side bus. The 384 processors will have 3 Gbytes of memory per core (eight cores with a total of 24 Gbyte of memory to a node). The interconnect network will be quad rail Infiniband (40 Gbit/second). The 96 sockets will be connected to 96 NVidia Fermi graphics processing units via PCI-E-16. This system enhancement is funded under NSF award CNS-0855217.
Systems (a) through (f) are or will be located in the CUNY High Performance Computing Center at the College of Staten Island and are operated as a shared resource facility for the entire CUNY community. The systems are operational 24x7 except for periods of preventative or emergency maintenance.

Contribution to Community: If successful, this project will establish a pipeline for bioinformatics research that allows the immersive training of undergraduate students in environmental, ecological and genetic big data analysis, and which can be easily transferred to other institutions. Insights from the modeling exercise will be used to guide conservation on the ground by mapping areas of high genetic diversity and divergence, and will allow scientists to predict the location of high genetic diversity and turnover even in the absence of genetic sampling.
Position Type: Apprentice
Training Plan: The student will meet with PI Carnaval and a PhD student mentor (Rilquer Mascarenhas) weekly to be trained in:
- efficient modes of data search and download
- data management, storage and curation through the use of a relational database system
- essential population genetic parameter estimation (especially genetic structure and diversity metrics)
- regression analysis based on deep learning, using a multilayer neural network based approach to estimate population genetic parameters from the environmental and biological descriptors
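To make the last training item concrete, here is a minimal sketch of neural-network regression in plain Python. It is a deliberately simplified stand-in, not the lab's pipeline: it uses a single hidden layer trained by full-batch gradient descent rather than a deep multilayer model, and the predictors and response are synthetic toy values (labeled as such in the code), not real environmental, trait, or genetic data. All function names, the network size, and the learning rate are assumptions chosen for illustration.

```python
import math
import random


def init_network(n_inputs, n_hidden, seed=0):
    """Create a one-hidden-layer network with small random weights."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(net, x):
    """Return (hidden activations, predicted value) for one sample."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    output = sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]
    return hidden, output


def mse(net, X, y):
    """Mean squared error of the network over a dataset."""
    return sum((forward(net, x)[1] - t) ** 2 for x, t in zip(X, y)) / len(X)


def train(net, X, y, lr=0.05, epochs=2000):
    """Full-batch gradient descent on the mean squared error."""
    n = len(X)
    n_hidden, n_inputs = len(net["w2"]), len(net["w1"][0])
    for _ in range(epochs):
        g_w1 = [[0.0] * n_inputs for _ in range(n_hidden)]
        g_b1 = [0.0] * n_hidden
        g_w2 = [0.0] * n_hidden
        g_b2 = 0.0
        for x, t in zip(X, y):
            hidden, out = forward(net, x)
            d_out = 2.0 * (out - t) / n  # dLoss/dOutput for this sample
            g_b2 += d_out
            for j in range(n_hidden):
                g_w2[j] += d_out * hidden[j]
                # Backpropagate through tanh: derivative is 1 - tanh^2.
                d_hid = d_out * net["w2"][j] * (1.0 - hidden[j] ** 2)
                g_b1[j] += d_hid
                for i in range(n_inputs):
                    g_w1[j][i] += d_hid * x[i]
        net["b2"] -= lr * g_b2
        for j in range(n_hidden):
            net["w2"][j] -= lr * g_w2[j]
            net["b1"][j] -= lr * g_b1[j]
            for i in range(n_inputs):
                net["w1"][j][i] -= lr * g_w1[j][i]
    return net


def make_toy_data(n=30, seed=1):
    """SYNTHETIC data: two predictors scaled to [0, 1] and a made-up response."""
    rng = random.Random(seed)
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [0.2 + 0.5 * x0 - 0.3 * x1 for x0, x1 in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    net = init_network(n_inputs=2, n_hidden=4)
    print("MSE before training:", mse(net, X, y))
    train(net, X, y)
    print("MSE after training:", mse(net, X, y))
```

A real analysis would more likely rely on an established library, hold out test data, and standardize predictors; this sketch only shows the forward pass, the error measure, and the gradient update that such tools automate.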

Weekly meetings will alternate between reviews of concepts and readings, hands-on activities with the data, presentation of results, and check-ins for updates and assessment of what needs to be adapted. The student will also be required to participate in PI Carnaval's weekly lab meetings, where she will be exposed to similar work developed by senior graduate students and fellow undergraduates, and to attend regular research presentations and discussions.
The student will also meet with the PI and her PhD mentor once a week to review her educational goals and milestones and to discuss plans about where to present her work to the lab group and a broader audience. The PI will support the student's participation in a regional or international scientific meeting (most likely that of the International Biogeography Society or the Society for the Study of Evolution), where she will be able to present her work, obtain feedback from the scientific community, and network with future job opportunities in mind.
Student Prerequisites/Conditions/Qualifications: Commitment to and interest in biodiversity informatics. Excellent organizational skills and a demonstrated ability to work with and curate our genetic databases. Ability to use GUI-based software independently and with ease.
Duration: Semester
Start Date: 01/03/2022
End Date: 03/25/2022
