2024-2025 Evaluation campaign - Group E

STL department - Human Language Science and Technology

Portfolio of team SEME
Semantics and Information Extraction
SEMantique et Extraction d’information

Study: NLP4NLP

This study is a bibliometric analysis of 55 years (1965 to 2020) of research in automatic speech and natural language processing, based on the 34 major conferences and journals in the field. It covers the whole range of topics addressed at NLP conferences and identifies the major changes during this period, related in particular to the use of AI methods, especially Deep Learning, and to the turnover in the international scientific community.

With an automatic analysis of the full content of close to 90,000 articles (67,000 authors, 590,000 references, and around 380 million words), it is the first of its kind for the domain addressed, the period covered, and the number of documents analyzed. Undertaken at the initiative of J. Mariani (M3) with G. Francopoulo (TAGMATICA company), P. Paroubek (SEME), and F. Vernier (AVIZ), the study was performed over a period of 8 years (2013-2021), in two phases. The first one covered the initial 50-year span (1965 to 2015) and resulted in the publication of two long synthesis articles providing the first integrated, comprehensive picture of the field viewed through the lens of its major publications [1][2]. It was followed by an updated study covering the subsequent 5 years (2016-2020), in order to take into account the deep neural revolution that swept the field [3][4].

Context

The initial motivation for J. Mariani to assemble a working group aiming to draw a comprehensive global picture of the field of NLP was the observation that the two communities which had formed, since the early history of computers, around the study of speech and of text respectively had each developed its own family of conferences and journals. With the rapid progress in science supported by constant technological advances in computing, the two fields were growing, reaching out to new application areas, new media such as image/video, or addressing new forms of language communication, e.g. sign languages.

While the opportunities for inter- or trans-disciplinary collaboration were becoming more numerous, the rapidly increasing number of published articles made it harder to build a picture of the current state of NLP. However, with the progress in text analysis, language understanding, information retrieval/extraction, etc., the community had itself built the tools that would make this enterprise possible, and the data was also available in electronic form. The working group thus undertook the building of what was at the time the largest electronic corpus of scientific articles addressing all modalities of language communication (speech, text, and sign), in order to draw a comprehensive picture of the whole domain of NLP with the support of the automatic tools and language resources provided by the domain itself.

Contribution

Through the publication of the synthesis articles, the study has contributed to providing a global picture of the various works and research directions that the community addressed during its evolution over the period. Supported by quantitative information and content analysis, the study made explicit the evolution of the various topics addressed by researchers, the adoption of best practices, and the development of new algorithms and ideas.

Furthermore, the statistics drawn from the metadata and the bibliographic references gave the means to identify sub-communities and their links, and to quantitatively pinpoint pivotal changes in scientific practice, such as the steady increase in the number of co-authors per paper.
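
As an illustration of the kind of statistics involved (a minimal sketch with toy data and a standard graph library, not the actual NLP4NLP pipeline; the metadata field names are assumptions), a co-authorship graph can be built from paper metadata, sub-communities extracted by modularity clustering, and the number of co-authors per paper tracked over time:

```python
# Hypothetical sketch: co-authorship statistics from paper metadata (toy data).
from itertools import combinations
from collections import defaultdict
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

papers = [
    {"title": "Paper A", "authors": ["Alice", "Bob"], "year": 1995},
    {"title": "Paper B", "authors": ["Alice", "Carol", "Dan"], "year": 2005},
    {"title": "Paper C", "authors": ["Carol", "Dan", "Eve", "Frank"], "year": 2015},
]

# Co-authorship graph: one node per author, edge weights count co-signed papers.
G = nx.Graph()
for paper in papers:
    for a, b in combinations(paper["authors"], 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# Sub-communities via modularity-based clustering.
communities = greedy_modularity_communities(G, weight="weight")
print("communities:", [sorted(c) for c in communities])

# Average number of co-authors per paper, per year (the kind of trend reported in the study).
per_year = defaultdict(list)
for paper in papers:
    per_year[paper["year"]].append(len(paper["authors"]))
for year in sorted(per_year):
    print(year, sum(per_year[year]) / len(per_year[year]))
```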

The study showed sudden rises in the number of publications at certain points in time, the slow evolution of gender bias in authorship, and the emergence of new countries among the top contributors to the field. It also demonstrated that, while automatic methods are of great help to process large amounts of bibliographic data, they do not suffice to provide a reliable picture of the field: the inherent noise and ambiguity in the data (particularly in proper names) make it necessary for domain experts to check the results. Although a small part of the corpus collected during the experiment is not freely available because of copyright restrictions, the main part of it is freely accessible through the sites of a few scientific associations of the domain and constitutes a valuable resource for the community.

Impact

All the articles synthesizing the results of the study were made freely accessible online on the Frontiers website. The first two articles (first phase, covering 1965 to 2015), which were published simultaneously, attracted respectable attention from the community, as the viewing and download statistics provided by Frontiers show.

The results of the second phase had a somewhat smaller but still respectable impact, probably because the time span covered was shorter (5 years) and the novelty had somewhat abated, since it was an update of the initial two articles motivated by the revolution brought to the field by deep neural approaches. Overall, the feedback collected at conferences from all kinds of domain actors when communicating about the progress of the study was very positive.

Article: CharacterBERT

Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, Hiroshi Noji, Pierre Zweigenbaum, Jun'ichi Tsujii. CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters. International Conference on Computational Linguistics, Dec. 2020, Barcelona (online), Spain, pages 6903-6915. DOI: 10.18653/v1/2020.coling-main.609 - HAL: hal-03100665

Context

The Transformer-based neural language models that are currently used quasi-universally to address natural language processing tasks can only handle a limited number of different input words. To cope with this constraint, methods have been designed to create an optimal, limited-size vocabulary of word pieces for the text corpus that is used to pre-train the language model. Any word can then be segmented into these word pieces, which become the units known by the language model. One of the problems caused by this word-piece segmentation principle is that the word pieces that are statistically optimal for the pre-training corpus may not be optimal for texts in a different, specialized domain. Intuitively, this shows up when domain-specific technical words are segmented into word pieces whose original meaning is unrelated to these technical words. We thus looked for alternative methods to deal with the limited vocabulary constraint in technical domains.
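
As a minimal illustration of the issue (assuming the Hugging Face transformers library and the general-domain bert-base-uncased vocabulary; the example word is ours), a technical medical term gets split into word pieces that carry no domain-related meaning:

```python
# Minimal sketch of the word-piece issue with a general-domain vocabulary.
# Requires the Hugging Face "transformers" package; the example term is ours.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A domain-specific medical term absent from the general-domain vocabulary...
print(tokenizer.tokenize("choledocholithiasis"))
# ...comes out as a sequence of short pieces (the exact segmentation depends on
# the vocabulary), none of which relates to the underlying medical concept.
```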

During his PhD at LIMSI, Hicham El Boukkouri performed a 3-month research visit at the National Institute of Advanced Industrial Science and Technology (AIST, Tokyo, Japan). There, he was granted access to the computational resources of the AI Bridging Cloud Infrastructure (ABCI), one of the world's first large-scale open AI computing infrastructures, slightly before access to CNRS's Jean Zay GPU infrastructure was opened to research teams in France. This enabled him to carry out the resource-intensive experiments needed to train language models, which would have been possible only later at LIMSI.

Contribution

Our contribution is to replace word-piece segmentation with a method that relates any word to its component characters. This method relies on a character-based convolutional neural network that builds a representation of an entire word from its characters. For this, we build upon the Character-CNN module of the ELMo language model.
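
The sketch below gives the general shape of such a character-level word encoder (a simplified PyTorch re-implementation for illustration, not the exact ELMo/CharacterBERT module; dimensions and kernel sizes are placeholders):

```python
# Simplified sketch of a Character-CNN word encoder (illustrative dimensions,
# not the actual ELMo/CharacterBERT configuration).
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars=262, char_dim=16, kernel_sizes=(2, 3, 4), n_filters=32, word_dim=128):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(nn.Conv1d(char_dim, n_filters, k) for k in kernel_sizes)
        self.proj = nn.Linear(n_filters * len(kernel_sizes), word_dim)

    def forward(self, char_ids):
        # char_ids: (batch, n_words, n_chars) integer character ids
        b, w, c = char_ids.shape
        x = self.char_embed(char_ids.view(b * w, c))        # (b*w, n_chars, char_dim)
        x = x.transpose(1, 2)                               # (b*w, char_dim, n_chars)
        # Convolve over character positions and max-pool each filter.
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        word_repr = self.proj(torch.cat(pooled, dim=1))     # one vector per word
        return word_repr.view(b, w, -1)                     # built only from characters

# Example: a batch of 2 sentences, 5 words each, 12 characters per word.
encoder = CharCNNWordEncoder()
dummy = torch.randint(1, 262, (2, 5, 12))
print(encoder(dummy).shape)  # torch.Size([2, 5, 128])
```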

In this paper:

This work focused only on the English language and the medical (clinical and biomedical) domain; it has since been instantiated in other languages by other teams, as illustrated below.

Impact

The paper has been cited 174 times at the time of writing according to Google Scholar, and the CharacterBERT models were downloaded over 100 times from the Hugging Face repository in the single month preceding this writing. This high number of citations and downloads reflects the breadth of reuse of the CharacterBERT models and code by other teams, for instance to analyze the NArabizi variety of written Arabic used in social media, to detect sentiment and offensive language in Dravidian languages, or to detect heart disease risk factors from electronic health records. This illustrates the expected qualities of CharacterBERT models: better robustness in resource-scarce and noisy scenarios, including in the medical domain.

Article: Scalar Adjective Identification and Multilingual Ranking

Aina Garí Soler and Marianna Apidianaki. Scalar Adjective Identification and Multilingual Ranking. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4653–4660, Online. Association for Computational Linguistics, Jun. 2021. DOI: 10.18653/v1/2021.naacl-main.370 - HAL: hal-04414448v1

Context

Scalar adjectives modify entities (e.g., wonderful view, delicious meal) and can be positioned on an intensity scale: good < wonderful, tasty < delicious. They are relevant for several NLP tasks such as natural language inference (NLI) and common-sense reasoning. While previous work exists for English, no such datasets were available for other languages. In this paper, presented at the NAACL 2021 conference, we worked on producing multilingual datasets of scalar adjectives and on applying existing word representations to assess the usefulness of these datasets.
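
As a toy illustration of the kind of ranking such datasets allow to evaluate (our own schematic sketch with made-up vectors, not the method of the paper), adjective representations can be projected onto a scale direction defined by a mild and an extreme anchor, and sorted by the resulting score:

```python
# Toy sketch of embedding-based scalar adjective ranking (made-up 3-d vectors;
# real experiments would use pre-trained or contextualized word representations).
import numpy as np

vectors = {
    "good":      np.array([0.9, 0.1, 0.2]),
    "great":     np.array([0.8, 0.4, 0.2]),
    "wonderful": np.array([0.7, 0.9, 0.3]),
}

# Scale direction from the mildest to the most extreme anchor adjective.
scale = vectors["wonderful"] - vectors["good"]

# Rank adjectives by projecting their vector onto the scale direction.
scores = {adj: float(np.dot(v, scale) / np.linalg.norm(scale)) for adj, v in vectors.items()}
for adj, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{adj}: {score:.2f}")   # prints good, then great, then wonderful
```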

Contribution

Starting from existing English datasets of scalar adjectives (DeMelo, Wilkinson, and Crowd), we manually translated and ranked scalar adjectives in French, Spanish, and Greek. The two human translators are native speakers of Greek and Spanish. Two datasets were produced:

Impact

This work makes available new datasets for languages other than English in order to promote research on scalar adjectives. In the era of Large Language Models, the existence of manually curated linguistic resources remains particularly relevant for training and evaluating models on such fine-grained NLP tasks.

European project: Unidive

The UniDive project (Universality, diversity, and idiosyncrasy in language technology) is a 4-year project funded as a European Cooperation in Science and Technology (COST) Action. Coordinated by Agata Savary, it aims to reconcile language diversity (among dozens of languages), whether intra-language or inter-language, with rapid progress in language technology. It gathers about 250 interdisciplinary experts (linguists, computational linguists, computer scientists, psycholinguists, and industry partners) from almost 40 countries.

Context

COST (European Cooperation in Science and Technology) is a funding organization for research and innovation networks. Its Actions help connect research initiatives across Europe and beyond, and enable researchers and innovators to grow their ideas in any science and technology field by sharing them with their peers. COST Actions are bottom-up networks with a duration of four years that boost research, innovation, and careers.

This COST Action emerged from three pre-existing initiatives concerned with universality, i.e. the definition of cross-linguistically consistent and applicable language descriptions. These three initiatives, Universal Dependencies (UD), PARSEME, and UniMorph, address linguistic phenomena at the level of morphosyntax, idiomaticity, and morphology, respectively. They notably develop multilingual annotation guidelines in which they define categories and values hypothesized as universal. These emerging standards are partly incompatible and competing.
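
For illustration (our own minimal toy example, not taken from the project's materials), the kind of cross-linguistically defined categories involved can be seen in a UD-style CoNLL-U token annotation, here read with plain Python:

```python
# Minimal sketch: reading one token line of a UD-style CoNLL-U annotation.
# The example sentence fragment and its annotation are our own toy example.
conllu_line = "2\tdogs\tdog\tNOUN\t_\tNumber=Plur\t3\tnsubj\t_\t_"

fields = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]
token = dict(zip(fields, conllu_line.split("\t")))

# UPOS (part of speech), FEATS (morphological features), and DEPREL (dependency
# relation) draw on inventories hypothesized as universal across languages.
print(token["UPOS"], token["FEATS"], token["DEPREL"])  # NOUN Number=Plur nsubj
```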

Steps towards converging them were taken in two Dagstuhl Seminars on Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics (on 30-31 August 2021 and 8-12 May 2023).

The UniDive COST action is a major effort towards further unification and extension of these universality-oriented efforts.

Contribution

On the networking side, UniDive brings together Universal Dependencies, PARSEME, and UniMorph, which contribute fine-grained, high-quality language descriptions in very many languages (over 160 in the case of UD), most of them under-resourced. Within UniDive these resources are being maintained and enhanced, their consistency is being improved, and new languages are being added thanks to the networking effect. UniDive also leverages the COST inclusiveness policy by promoting young researchers, attending to gender balance, and integrating increasingly many experts from less research-intensive countries. As a result, we actively contribute to better accounting for language diversity within language technology.

On the theoretical side, UniDive is contributing to the formalization of the notion of linguistic diversity and to its quantification. It is also the major NLP initiative in universality-driven language modeling.

On the practical side, we coordinate, among other things, the development of fine-grained unified language resources and tools.

Impact

UniDive is in its second year, so its impact is not yet easy to measure. Expected mid- and long-term impact includes:

Seminars: Impact of digital technology

In this series of seminars for academics and the general public, we raise awareness of the impact of digital technology on the environment.

Context

Since 2017 and the advent of the Transformer architecture, NLP work has relied widely on statistical models trained on large amounts of data, usually general language data. Models have been produced for each language. In addition, to improve the performance of available models, fine-tuning is carried out on data from a specialized domain. The initial creation of these models and their adaptation to other languages and specialized domains is costly in terms of time and energy. In this context of growing digital resource consumption, we gave several seminars analyzing the impact of digital technology.
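
As a rough illustration of the kind of estimate discussed in these seminars (the figures below are placeholders, not measurements from our work), the energy and carbon footprint of a training or fine-tuning run can be approximated from hardware power draw, run time, data-centre overhead (PUE), and grid carbon intensity:

```python
# Back-of-the-envelope footprint estimate for a training run.
# All numbers are illustrative placeholders, not measurements.
def training_footprint(gpu_count, gpu_power_w, hours, pue=1.5, grid_gco2_per_kwh=60.0):
    """Return (energy in kWh, emissions in kg CO2e) for a training run."""
    energy_kwh = gpu_count * gpu_power_w * hours * pue / 1000.0
    co2_kg = energy_kwh * grid_gco2_per_kwh / 1000.0
    return energy_kwh, co2_kg

# Example: fine-tuning on 4 GPUs at 300 W for 24 hours on a low-carbon grid (~60 gCO2e/kWh).
energy, co2 = training_footprint(gpu_count=4, gpu_power_w=300, hours=24)
print(f"{energy:.0f} kWh, {co2:.1f} kg CO2e")  # ~43 kWh, ~2.6 kg CO2e
```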

Contribution

We have carried out several studies on the impact of digital technology, covering the carbon footprint of NLP work and the use of digital tools. The resulting seminars help to disseminate knowledge on the subject and can lead to research collaborations.

Impact

These seminars provide an opportunity to communicate on the environmental impact of digital technology, and of AI in particular, by raising public awareness, training PhD students, and fostering scientific collaborations.