2024-2025 Evaluation campaign - Group E

SD department - Data Science

Portfolio of team BioInfo
Bioinformatics

Studying evolutionary changes in pre- and post-Hispanic Mexico through ancient DNA analysis

Highlighted article:

Other articles linked to the project:

Software:

Outreach:

Context

This work is part of a large collaborative project that includes UNAM (Mexico), Brown University (US), and LISN (France) to study evolutionary changes in human hosts and their pathogens before, during, and after first contact in the New World. The three co-PIs were financed by the Human Frontier Science Program (HFSP). The history of Mexican populations is complex and understudied: they went through a series of events that shaped their genomes, among which the massive European colonization that started at the end of the 15th century. Isolated populations, like those in the Americas, had their whole environment transformed by this colonization. The arrival of Europeans and domesticated animals introduced new pathogens, and it has been speculated that a large proportion of Native Americans died because they lacked immunity to them. We use ancient DNA to study the changes in both pathogen and human genetic diversity before and after European colonization, to characterize how population structure changed and how pathogens may have influenced the evolution of specific genetic loci during this transition.

Contribution

To this end, the first step is to precisely characterize the genetic diversity of the populations pre-dating the European colonization. In our paper “Demographic history and genetic structure in pre-Hispanic Central Mexico”, we overcome the DNA-degradation challenges posed by warm climates and offer new insights into previously overlooked human populations, highlighting the genetic legacy of central Mexican populations in modern Indigenous groups. This study analyzed ancient DNA from pre-Hispanic Mexico, unraveled the size decline and demographic structure of these populations, and revealed contributions from an unsequenced and unidentified “ghost” population, challenging established genetic histories. It also evidenced genetic continuity without population replacement despite the severe droughts that occurred in the 10th century. To reach these findings, we applied multiple tools, including TFA, a method developed in collaboration with TIMC (France). TFA is the first method that can visualize ancient samples in a reduced space and estimate their ancestry coefficients without being confounded by temporal drift (a well-known evolutionary force).
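To make the idea concrete, the sketch below illustrates, in a deliberately simplified form, what correcting a low-dimensional representation of ancient genomes for temporal drift can look like. It is not the published TFA algorithm, only an illustration of the general idea (project, then remove the component explained by sample age); all names are hypothetical.

```python
# Simplified illustration of drift-corrected factors for ancient genomes.
# NOT the published TFA algorithm; a sketch of the general idea only.
import numpy as np

def drift_corrected_factors(genotypes, sample_ages, n_factors=2):
    """genotypes: (n_samples, n_snps) matrix of 0/1/2 calls (no missing data here);
    sample_ages: (n_samples,) ages in years BP (0 for present-day samples)."""
    # Center the genotype matrix and take the leading principal components.
    X = genotypes - genotypes.mean(axis=0, keepdims=True)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    factors = U[:, :n_factors] * S[:n_factors]          # raw factor coordinates

    # Regress each factor on sample age and keep the residuals, so that the
    # corrected coordinates are no longer confounded by how old a sample is.
    A = np.column_stack([np.ones_like(sample_ages, dtype=float), sample_ages])
    beta, *_ = np.linalg.lstsq(A, factors, rcond=None)
    return factors - A @ beta                            # drift-corrected factors

# Toy usage: 20 samples, 500 SNPs, ages between 0 and 1500 years BP.
rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(20, 500))
ages = rng.uniform(0, 1500, size=20)
corrected = drift_corrected_factors(geno, ages)
```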

The HFSP project's main object of study is the ancient DNA of humans and pathogens, which usually consists of degraded, fragmentary, and error-prone genetic data. In two additional studies, we therefore investigated the robustness of existing tools, and of novel ones developed by our lab, with respect to data quality and the specificities of ancient DNA. We particularly focused on phasing inference and its impact on population structure detection (Medina-Tretmanis et al. 2024), and on likelihood-free inference based on neural networks (Cury et al., ECML workshop 2022). This is of major importance, as many tools widely used in population genetics are directly applied to paleogenetics with no evaluation of their performance in this context. Further quantifying the uncertainty around estimates, as we proposed, is thus an important step to avoid drastically wrong conclusions.

Impact

The highlighted publication appeared in Science in 2023 and the three related works in Nature Communications (2020), Human Population Genetics and Genomics (2024), and a workshop of ECML-PKDD 2022 (peer-reviewed paper, no proceedings). These findings were the topic of a radio show (interview in Le journal des sciences, France Culture) and the methodology was highlighted in an outreach article (Le journal du CNRS). More broadly, the HFSP project produced, and will keep producing, extremely valuable genomic data from a region and time period that are understudied. All the tools developed are publicly available.

Genomics and AI

Highlighted production:

Linked publications (2 selected):

Linked software:

Linked press articles:

Associated keynotes in:

Context

As population genetic datasets keep increasing in size, it is now common to observe millions of genomic markers sequenced for hundreds of individuals, opening the possibility of answering intricate biological questions. However, extracting relevant information from these genomic datasets is not trivial, owing to their size and the complexity of the underlying mechanisms, and is sometimes impossible due to the privacy rules that govern several human genome databases. Since 2017, the BioInfo team has been a pioneer in introducing deep learning approaches at different levels of population genetics, for parameter inference, data visualization, or data generation.

Contribution

Over the past years we have designed tailored neural architectures for evolutionary inference (demographic and selective past histories) and for high-dimensional genomic data generation. We have paid particular attention to the comparison of these novel approaches with well-established methods (Sanchez et al. 2021) and have developed a software package to facilitate the sharing and reproducibility of neural networks in the field of population genetics (DNADNA, Sanchez*, Bray* et al. 2022). Regarding highly realistic genome generation (i.e., artificial genomes that can be merged with real genomes into a single usable dataset), no methodology pre-existed in population genetics, and we have worked towards establishing evaluation standards (Yelmen et al. 2021 and following works).
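As a rough illustration of this line of work, the sketch below shows a toy convolutional network regressing demographic parameters from a genotype matrix. It is intentionally much simpler than the published architectures (e.g., Sanchez et al. 2021), does not use the DNADNA API, and assumes PyTorch is available; all layer sizes are illustrative.

```python
# Minimal sketch of a convolutional network regressing demographic parameters
# from a genotype matrix, in the spirit of (but much simpler than) the
# architectures published by the team.
import torch
import torch.nn as nn

class TinyDemoNet(nn.Module):
    def __init__(self, n_params=2):
        super().__init__()
        # 1 input channel: the individuals x SNPs genotype matrix treated as an image.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(1, 5)), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=(1, 5)), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),             # makes the output size-invariant
        )
        self.head = nn.Linear(32, n_params)      # e.g. population sizes, event times

    def forward(self, x):                        # x: (batch, 1, n_indiv, n_snps)
        return self.head(self.features(x).flatten(1))

# Training would pair such a network with genotype matrices simulated under
# known demographic scenarios (likelihood-free inference): the simulator's
# parameters are the regression targets.
model = TinyDemoNet()
fake_batch = torch.rand(4, 1, 50, 400)           # 4 simulated datasets
print(model(fake_batch).shape)                   # torch.Size([4, 2])
```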

Impact

This line of work led to multiple publications, such as the original Yelmen et al. 2021 paper, which has obtained 98 citations according to Google Scholar. It also attracted the attention of the outreach press (Science & Vie, Sciences et Avenir, Journal du CNRS). Finally, it was the topic of an interview on France Culture's radio program "La Méthode Scientifique" (January 2022). The episode, entitled “Génomique et IA : les liaisons fructueuses”, explores the potential links between genomics and artificial intelligence, with, for example, the use of generative AI to produce DNA sequences (radio interview), as well as a report on the research in deep-learning-based inference for microbial populations conducted at LISN.

Reproducible Science

Context

Over the last twenty years, reproducibility has emerged as a major concern in a number of fields, initially in the social and psychological sciences and later extending to areas such as pre-clinical research, life sciences, and computational sciences. While wet-lab biological experiments inherently exhibit variability due to multiple factors, such as the complexity of the phenomena, the measurement methods, and sample diversity, it has long been assumed that computational biology and bioinformatics analyses were inherently reproducible owing to their machine-driven automation. However, many factors can impede this reproducibility, including insufficient method documentation, data accessibility that does not adhere to the FAIR principles, technical hurdles such as variations in operating systems, hardware configurations (e.g., HPC schedulers), diverse software environments (e.g., tool and library versions), and the influence of stochastic algorithms. These problems raise major scientific challenges.

Contribution

The BioInfo group has been highly engaged in advancing reproducible science through multiple avenues.

We have delved into the issue of reproducibility through computational workflows, which encapsulate the intricate multi-step processes involved in data collection, preparation, analysis, predictive modeling, and simulation, culminating in the creation of new data products. Specifically, we have focused on enhancing the FAIRness of workflows. Interestingly, workflows inherently align with the FAIR data principles by adhering to established metadata standards, generating metadata during data processing, and meticulously tracking data provenance. These attributes facilitate quality assessment and enable secondary data utilization. Moreover, workflows are digital entities in their own right, necessitating the formulation of FAIR principles tailored to their unique characteristics, considering aspects such as their composition of executable software steps, provenance, and development challenges.
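As a concrete, hypothetical illustration of what generating metadata and tracking provenance can mean at the level of a single workflow step, the short sketch below records a step's command, input checksums, software version, and timestamp alongside its output. It is not tied to any particular workflow system, and the helper and file names are made up.

```python
# Minimal, hypothetical sketch of a workflow step that records provenance
# metadata (inputs, outputs, versions, timestamp) alongside its result,
# in the spirit of the FAIR workflow principles discussed above.
import hashlib, json, platform, subprocess, sys
from datetime import datetime, timezone
from pathlib import Path

def run_step(name, cmd, inputs, output):
    """Run a shell command and write a small provenance record next to its output."""
    subprocess.run(cmd, shell=True, check=True)
    record = {
        "step": name,
        "command": cmd,
        "inputs": {p: hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in inputs},
        "output": output,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    Path(output + ".prov.json").write_text(json.dumps(record, indent=2))

# Example usage for a single step (command and file names are made up):
# run_step("sort", "sort input.txt > sorted.txt", ["input.txt"], "sorted.txt")
```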

We have actively advocated for reproducible science through various channels, including engagements with the scientific community through keynotes and invited talks, as well as outreach efforts targeting a broader audience via a dedicated radio show on the subject.

Furthermore, we have played a pivotal role in the establishment of the French Reproducibility Network, an interdisciplinary national initiative comprising researchers interested in exploring factors contributing to research robustness, conducting training activities, and disseminating best practices and recommendations. Our network serves as the French node within a larger international network of reproducible research, encompassing 21 nodes globally.

Impact

Since its release in 2020 up to May 2024, the FAIR workflow paper has obtained 147 citations according to Google Scholar.

The reproducibility network has attracted over 300 registered researchers, with attendance ranging from 130 to 170 participants at each kick-off event. Notably, the network receives support from the French Ministry of Higher Education and Research.

Integration of all the clinical trials on COVID-19

Context

The BioInfo team has played a pivotal role in the COVID-NMA international initiative.

CRESS (Centre of Research in Epidemiology and Statistics) and Cochrane, in collaboration with the World Health Organization (WHO), provide meta-reviews based on clinical trial results to recommend current treatments for various pathologies.

During the pandemic, the influx of clinical trials increased dramatically, surging from a handful to hundreds per week. In response, CNRS issued a call for volunteers to expedite the collection, integration, and analysis of numerous treatments concurrently tested against the virus (later including vaccines). Under the leadership of CRESS, COVID-NMA aimed to dynamically map all registered COVID-19 clinical trials, assisting funders and researchers in strategizing future trials.

S. Cohen-Boulakia has been in charge of the data integration facet of COVID-NMA and led a team comprising fifteen engineers, researchers, and master's students from LISN, LIMOS, and LIRIS. Their mandate involved extracting, analyzing, integrating, and loading highly diverse data from clinical trials sourced from five major clinical registries.

Contribution

We developed a data warehouse containing all the clinical trials associated with COVID-19. We designed and implemented a semi-automatic process to extract information from five registries, developing web scrapers and several wrappers to feed the warehouse. We designed rules based on epidemiologists' expertise to standardize and integrate these highly heterogeneous datasets, exploiting their complementarity and minimizing redundancy. Throughout this process, we implemented features such as database historization, provenance tracking, and knowledge base management for treatment/intervention ontologies, and used Natural Language Processing (NLP) techniques for entity recognition of interventions.
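As an illustration of the kind of wrapper involved, the hypothetical sketch below maps a scraped trial record to a common schema, applies a toy expert rule to normalize intervention names, and timestamps each load for historization. Field names and the rule table are illustrative only and do not reflect the actual COVID-NMA schema.

```python
# Hypothetical sketch of one registry "wrapper": map a scraped trial record to
# a common schema, normalize the intervention name with a simple rule table,
# and keep history by appending timestamped versions rather than overwriting.
from datetime import datetime, timezone

INTERVENTION_SYNONYMS = {          # toy excerpt of an expert-curated rule table
    "hydroxychloroquine sulfate": "hydroxychloroquine",
    "hcq": "hydroxychloroquine",
}

def normalize_intervention(raw):
    return INTERVENTION_SYNONYMS.get(raw.strip().lower(), raw.strip().lower())

def to_common_schema(raw_record, registry):
    return {
        "registry": registry,
        "trial_id": raw_record["id"],
        "title": raw_record["public_title"],
        "intervention": normalize_intervention(raw_record["intervention"]),
        "loaded_at": datetime.now(timezone.utc).isoformat(),  # historization key
    }

warehouse = []                      # stand-in for the actual database
scraped = {"id": "NCT00000000", "public_title": "Example trial",
           "intervention": "HCQ"}
warehouse.append(to_common_schema(scraped, registry="ClinicalTrials.gov"))
```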

Impact

This effort has effectively synthesized data from 4,634 trials, aiding the World Health Organization (WHO) and numerous stakeholders from 12 major institutions in making informed decisions regarding COVID-19 treatments and vaccines.

In terms of publications, this endeavor has resulted in ten publications by the COVID-NMA consortium, with one specifically focusing on integration and analysis, published in the Journal of Clinical Epidemiology. As of May 2024, this paper has obtained 26 citations according to Google Scholar.

Furthermore, this work has been acknowledged with a CNRS Crystal Medal, honoring all the engineers involved, including two engineers from LISN in the SEME team.