Studying evolutionary changes in pre- and post-Hispanic Mexico through ancient DNA analysis
Highlighted article:
- Villa-Islas V, Izarraras-Gomez A, Larena M, …, Ávila-Arcos MC. Demographic history and genetic structure in pre-Hispanic Central Mexico. Science. 2023 May 12;380(6645):eadd6142.
Other articles linked to the project:
- François O, Jay F. Factor analysis of ancient population genomic samples. Nature Communications. 2020 Sep 16;11(1):4661.
- Cury J, Sanchez T, Bray E, Medina-Tretmanis J, Avila-Arcos MC, Huerta-Sanchez E, Charpiat G, Jay F. Inferring effective population sizes of bacterial populations while accounting for unknown recombination and selection: a deep learning approach. ECML-PKDD2022 workshop on Machine Learning for Microbial Genomics. 2022.
- (peer-reviewed 8-page conference paper without proceedings)
- Conference link
- Medina-Tretmanis J, Jay F*, Ávila-Arcos MC*, Huerta-Sanchez E*. Simulation-based benchmarking of ancient haplotype inference for detecting population structure. Human Population Genetics and Genomics. 2024 Mar 19;4(1).
Software:
- TFA: R package implementing a factor analysis algorithm for temporal or ancient DNA, adjusting individual scores for the effect of allele frequency drift through time, and providing estimates of ancestry proportions. (repository)
Outreach:
- Radio show: “Une population fantôme découverte dans le génome des civilisations préhispaniques”, radio interview for Le Journal des sciences, France Culture, May 18th, 2023
- Press: “Visualiser les relations génétiques entre les populations anciennes” by Clara Barrau, Journal du CNRS, October 28th, 2020
Context
This work is part of a large collaborative project involving UNAM (Mexico), Brown University (US), and LISN (France) to study evolutionary changes in human hosts and their pathogens before, during, and after first contact in the New World. The three co-PIs were funded by the Human Frontier Science Program (HFSP). The history of Mexican populations is complex and understudied; they went through a series of events that shaped their genomes. Among these was the massive European colonization that began at the end of the 15th century. Isolated populations, like those in the Americas, had their whole environment transformed by this colonization. The arrival of Europeans and domesticated animals introduced new pathogens, and it has been speculated that a large proportion of Native Americans died because they lacked immunity to them. We use ancient DNA to study the changes in both pathogen and human genetic diversity before and after European colonization, characterizing how population structure changed and how pathogens may have influenced the evolution of specific genetic loci during this transition.
Contribution
To this aim, the first step is to precisely characterize the genetic diversity of populations pre-dating the European colonization. In our paper “Demographic history and genetic structure in pre-Hispanic Central Mexico”, we overcame the DNA-degradation challenges posed by warm climates and offered new insights into previously overlooked human populations, highlighting the genetic legacy of central Mexican populations in modern Indigenous groups. This study analyzed ancient DNA from pre-Hispanic Mexico, unraveled the size decline and demographic structure of these populations, and revealed contributions from an unsequenced and unidentified “ghost” population, challenging established genetic histories. It also evidenced genetic continuity without population replacement despite the severe droughts that occurred in the 10th century. These findings relied on multiple tools, such as TFA, which we developed in collaboration with TIMC (France). TFA is the first method that can visualize ancient samples in a reduced space and estimate their ancestry coefficients without being confounded by temporal drift, a well-known evolutionary force.
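The drift-correction idea can be illustrated with a minimal sketch: compute ordinary factor (PCA) scores, then regress out the component explained by sample age. This toy snippet is only an illustration of the underlying principle under simplifying assumptions (a linear age effect, simulated data); it is not the TFA algorithm or its R interface.

```python
import numpy as np

def drift_adjusted_scores(genotypes, ages, n_factors=2):
    """Toy illustration: PCA scores with a linear effect of sample age
    regressed out. NOT the actual TFA model, only the idea of separating
    temporal drift from population structure."""
    X = genotypes - genotypes.mean(axis=0)           # center each SNP
    U, S, _ = np.linalg.svd(X, full_matrices=False)  # plain PCA via SVD
    scores = U[:, :n_factors] * S[:n_factors]        # individual scores
    # Regress each factor on age and keep the residuals, removing the
    # component of the scores explained by temporal drift
    A = np.column_stack([np.ones_like(ages), ages])
    beta, *_ = np.linalg.lstsq(A, scores, rcond=None)
    return scores - A @ beta

rng = np.random.default_rng(0)
ages = rng.uniform(0.0, 3000.0, size=20)                 # years before present
geno = rng.integers(0, 3, size=(20, 100)).astype(float)  # 0/1/2 genotypes
adj = drift_adjusted_scores(geno, ages)                  # age-corrected scores
```

By construction, the residual scores are uncorrelated with sample age, so remaining axes of variation reflect structure rather than the temporal sampling of the individuals.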
The HFSP project's main object of study is the ancient DNA of humans and pathogens, which usually consists of degraded, fragmentary, and error-prone genetic data. In two additional studies, we therefore investigated the robustness of existing tools, and of novel ones developed by our lab, with respect to data quality and the specificities of ancient DNA. We particularly focused on phasing inference and its impact on population structure detection (Medina-Tretmanis et al. 2024), and on likelihood-free inference based on neural networks (Cury et al. ECML workshop 2022). This is of major importance because many tools widely used in population genetics are applied directly to paleogenetics without any evaluation of their performance in this context. Quantifying the uncertainty around estimates, as we proposed, is thus an important step to avoid serious misinterpretations.
Impact
The highlighted publication appeared in Science in 2023, and the three related works appeared in Nature Communications (2020), Human Population Genetics and Genomics (2024), and a workshop of ECML-PKDD 2022 (peer-reviewed paper without proceedings). These findings were the topic of a radio show (interview in Le Journal des sciences, France Culture), and the methodology was highlighted in an outreach article (Le Journal du CNRS). More broadly, the HFSP project produced, and will keep producing, extremely valuable genomic data from a region and time period that are understudied. The tools developed are also all publicly available.
Genomics and AI
Highlighted production:
- Radio show “Génomique et IA : les liaisons fructueuses”, radio interview for La méthode scientifique, France Culture, January 12th, 2022
Linked publications (2 selected):
- Yelmen B, Decelle A, Ongaro L, Marnetto D, Tallec C, Montinaro F, Furtlehner C, Pagani L, Jay F. Creating artificial human genomes using generative neural networks. PLoS Genetics. 2021 Feb 4;17(2):e1009303.
- Sanchez T, Cury J, Charpiat G, Jay F. Deep learning for population size history inference: Design, comparison and combination with approximate Bayesian computation. Molecular Ecology Resources. 2021 Nov;21(8):2645-60.
Linked software:
- DNADNA, an open-source software which aims to foster the development, reproducibility, and sharing of novel deep neural networks in the population genetic field.
- Website
- Sanchez T*, Bray EM*, Jobic P, Guez J, Letournel AC, Charpiat G, Cury J°, Jay F°. dnadna: a deep learning framework for population genetics inference. Bioinformatics. 2023 Jan 1;39(1):btac765.
Linked press articles:
- “Sur la piste des génomes artificiels”, interview by S. Escalón, Journal du CNRS, November 22nd, 2021 (JDC306)
- “Une intelligence artificielle fabrique de l’ADN pour la première fois” by S. Gavilan, Science et Vie, February 22nd, 2021
- “La première intelligence artificielle capable de créer des génomes humains” by C. Gaubert, Sciences&Avenir, February 12th, 2021
Associated keynotes in:
- RECOMB-Genetics 2023
- ISMB/ECCB SFBI symposium 2023
- MCEB 2023
- SMBE 2021 symposium on Machine-learning applications in population genetics and phylogenomics
- JOBIM 2020
Context
As population genetic datasets keep increasing in size, it is common to observe millions of genomic markers sequenced for hundreds of individuals, opening the possibility of answering intricate biological questions. However, extracting relevant information from these genomic datasets is not trivial, due to their size and the complexity of the underlying mechanisms, and is sometimes impossible due to the privacy rules that govern several human genome databases. Since 2017, the bioinfo team has been a pioneer in introducing deep learning approaches at different levels of population genetics, for parameter inference, data visualization, and data generation.
Contribution
Over the past years we have designed tailored neural architectures for evolutionary inference (demographic and selective past histories) and for high-dimensional genomic data generation. We have paid particular attention to comparing these novel approaches with well-established methods (Sanchez et al. 2021) and have developed software to facilitate the sharing and reproducibility of neural networks in the field of population genetics (DNADNA, Sanchez*, Bray* et al. 2023). Regarding the generation of highly realistic genomes (i.e., artificial genomes that can be merged with real genomes into a single usable dataset), no methodology pre-existed in population genetics, and we have worked towards establishing evaluation standards (Yelmen et al. 2021 and following works).
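To give a flavor of what such an evaluation standard can look like, the sketch below compares per-site allele frequencies between a real and an artificial haplotype matrix. This is a deliberately minimal, hypothetical check (published evaluations also cover, e.g., linkage disequilibrium and privacy leakage), and all names and data in it are illustrative.

```python
import numpy as np

def allele_frequency_fit(real, synthetic):
    """Correlation between per-site allele frequencies of real and
    synthetic haplotype matrices (individuals x sites, 0/1 alleles).
    One of the simplest sanity checks for generated genomes."""
    return np.corrcoef(real.mean(axis=0), synthetic.mean(axis=0))[0, 1]

rng = np.random.default_rng(1)
freqs = rng.uniform(0.05, 0.95, size=200)            # ground-truth frequencies
real = (rng.random((100, 200)) < freqs).astype(int)  # "real" haplotypes
good = (rng.random((100, 200)) < freqs).astype(int)  # matches real frequencies
bad = rng.integers(0, 2, size=(100, 200))            # ignores them

print(allele_frequency_fit(real, good))  # close to 1
print(allele_frequency_fit(real, bad))   # near 0
```

A high-quality generator should score close to 1 on this check, but passing it alone is not sufficient: a model can reproduce marginal frequencies while missing the correlation structure between sites, which is why multiple complementary metrics are needed.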
Impact
This line of work led to multiple publications, such as the original Yelmen et al. 2021 paper, which has obtained 98 citations according to Google Scholar. It also attracted the attention of the popular-science press (Science et Vie, Sciences&Avenir, Journal du CNRS). Finally, it was the topic of an interview on France Culture's radio program “La méthode scientifique” (January 2022). The episode, entitled “Génomique et IA : les liaisons fructueuses”, explores the links between genomics and artificial intelligence, featuring, for example, the use of generative AI to produce DNA sequences (the studio interview), as well as a report on the deep-learning-based inference research for microbial populations conducted at LISN.
Reproducible Science
- Article:
- C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M. R. Crusoe, K. Peters, and D. Schober. FAIR Computational Workflows. Journal of Data Intelligence, 2:108–121, 2020.
- PDF version of the article (from HAL)
- HAL: hal-04402238v1
- DOI: 10.1162/dint_a_00033
- Associated keynotes in
- Radio show “Cause commune” on “La reproductibilité des environnements logiciels pour la recherche” (reproducibility of software environments for research), September 13th, 2022
- Launch of the French node of the Reproducible Research Network
- Members of the steering committee: Sarah Cohen-Boulakia, Arnaud Legrand, Frédéric Lemoine, Céline Robert, Nicolas Rougier
- Website of the network
- Organization of the kick off days 2023
Context
Over the last twenty years, reproducibility has emerged as a major concern in a number of fields, initially in the social and psychological sciences and later extending to areas like pre-clinical research, the life sciences, and computational sciences. While wet-lab biological experiments inherently exhibit variability due to factors such as the complexity of the phenomena studied, measurement methods, and sample diversity, it has long been assumed that computational biology and bioinformatics analyses were inherently reproducible owing to their machine-driven automation. However, many factors can impede this reproducibility, including insufficient method documentation, data accessibility that does not adhere to FAIR principles, technical hurdles like variations in operating systems, hardware configurations (e.g., HPC schedulers), diverse software environments (e.g., tool and library versions), and the influence of stochastic algorithms. These problems raise major scientific challenges.
Contribution
The bioinfo group has been highly engaged in advancing reproducible science through multiple avenues.
We have delved into the issue of reproducibility through computational workflows, which encapsulate the intricate multi-step processes involved in data collection, preparation, analysis, predictive modeling, and simulation, culminating in the creation of new data products. Specifically, we have focused on enhancing the FAIRness of workflows. Interestingly, workflows inherently align with the FAIR data principles by adhering to established metadata standards, generating metadata during data processing, and meticulously tracking data provenance. These attributes facilitate quality assessment and enable secondary data utilization. Moreover, workflows are digital entities in their own right, necessitating the formulation of FAIR principles tailored to their unique characteristics, considering aspects such as their composition of executable software steps, provenance, and development challenges.
We have actively advocated for reproducible science through various channels, including engagements with the scientific community through keynotes and invited talks, as well as outreach efforts targeting a broader audience via a dedicated radio show on the subject.
Furthermore, we have played a pivotal role in the establishment of the French Reproducibility Network, an interdisciplinary national initiative comprising researchers interested in exploring factors contributing to research robustness, conducting training activities, and disseminating best practices and recommendations. Our network serves as the French node within a larger international network of reproducible research, encompassing 21 nodes globally.
Impact
Since its release in 2020 up to May 2024, the FAIR workflow paper has obtained 147 citations according to Google Scholar.
The reproducibility network has attracted over 300 registered researchers, with attendance ranging from 130 to 170 participants at each kick-off event. Notably, the network receives support from the Ministry of Research and Higher Education.
Integration of all the clinical trials on COVID-19
- Article
- V. T. Nguyen, P. Rivière, P. Ripoll, J. Barnier, R. Vuillemot, G. Ferrand, S. Cohen-Boulakia, P. Ravaud, and I. Boutron. Research response to coronavirus disease 2019 needed better coordination and collaboration: a living mapping of registered trials. Journal of Clinical Epidemiology, 130:107–116, 2021.
- DOI: 10.1016/j.jclinepi.2020.10.010
- HAL: hal-02995875
- PDF version of the article (from HAL)
- Website of the COVID-NMA project
- Web page where the data we have integrated can be visualized
Context
The Bioinfo team has played a pivotal role in the COVID-NMA international initiative.
CRESS (Centre of Research in Epidemiology and Statistics) and Cochrane, in collaboration with the World Health Organization (WHO), provide meta-reviews based on clinical trial results to recommend treatments for various pathologies.
During the pandemic, the influx of clinical trials increased dramatically, surging from a handful to hundreds per week. In response, CNRS issued a call for volunteers to expedite the collection, integration, and analysis of numerous treatments concurrently tested against the virus (later including vaccines). Under the leadership of CRESS, COVID-NMA aimed to dynamically map all registered COVID-19 clinical trials, assisting funders and researchers in strategizing future trials.
S. Cohen-Boulakia has been in charge of the data integration facet of COVID-NMA and led a team of fifteen engineers, researchers, and master's students from LISN, LIMOS, and LIRIS. Their mandate involved extracting, analyzing, integrating, and loading highly diverse clinical-trial data sourced from five major clinical registries.
Contribution
We developed a data warehouse containing all the clinical trials associated with COVID-19. We designed and implemented a semi-automatic process to extract information from five registries, developing web scrapers and several wrappers to feed the warehouse. We designed rules based on epidemiologists' expertise to harmonize and integrate these highly heterogeneous datasets, exploiting their complementarity and minimizing redundancy. Throughout this process, we implemented features such as database historization, provenance tracking, and knowledge base management for treatment/intervention ontologies, and we used Natural Language Processing (NLP) techniques for entity recognition of interventions.
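To illustrate the rule-based harmonization step, the sketch below maps status fields from two hypothetical registries onto a shared vocabulary. The registry names, field names, identifiers, and rules are invented for illustration; they are not the actual COVID-NMA schema or rule set.

```python
# Hypothetical rules mapping raw recruitment statuses, which differ
# across registries, onto one shared vocabulary (illustrative only).
STATUS_RULES = {
    "recruiting": "ongoing",
    "enrolling by invitation": "ongoing",
    "not yet recruiting": "planned",
    "completed": "completed",
}

def uniformize(record, registry):
    """Map a raw registry record onto a common schema (toy example)."""
    # Each registry exposes the status under a different field name
    field = "overall_status" if registry == "registry_a" else "trial_status"
    status = record.get(field, "").strip().lower()
    return {
        "trial_id": record["id"],
        "status": STATUS_RULES.get(status, "unknown"),
    }

a = uniformize({"id": "NCT0001", "overall_status": "Recruiting"}, "registry_a")
b = uniformize({"id": "EUCTR002", "trial_status": "Completed"}, "registry_b")
```

In the real pipeline, such mapping rules were written with epidemiologists and applied alongside provenance tracking, so every harmonized value could be traced back to the raw registry record it came from.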
Impact
This effort has effectively synthesized data from 4,634 trials, aiding the World Health Organization (WHO) and numerous stakeholders from 12 major institutions in making informed decisions regarding COVID-19 treatments and vaccines.
In terms of publications, this endeavor has resulted in ten publications by the COVID-NMA consortium, with one specifically focusing on integration and analysis, published in the Journal of Clinical Epidemiology. As of May 2024, this paper has obtained 26 citations according to Google Scholar.
Furthermore, this work was recognized with a CNRS crystal medal honoring all the engineers involved, including two engineers from LISN in the SEME team.