Study: NLP4NLP
This study is a bibliometric analysis of 55 years (1965 to 2020) of research in automatic speech and natural language processing, based on 34 major conferences and journals of the field. It covers the whole range of topics addressed at NLP conferences and identifies the major changes during this period, notably the adoption of AI methods, especially Deep Learning, and the turnover in the international scientific community.
With an automatic analysis of the full content of close to 90,000 articles (67,000 authors, 590,000 references, and around 380 million words), it is the first of its kind for the domain addressed, the period covered, and the number of documents analyzed. Undertaken at the initiative of J. Mariani (M3) with G. Francopoulo (TAGMATICA company), P. Paroubek (SEME), and F. Vernier (AVIZ), the study was carried out over a period of 8 years (2013-2021), in two phases. The first phase covered the initial 50-year span (1965 to 2015) and resulted in two long synthesis articles providing the first integrated, comprehensive picture of the field viewed through the lens of its major publications [1][2]. It was followed by an updated study of the subsequent 5 years (2016-2020) in order to take into account the deep neural revolution that swept the field [3][4].
- [1] Joseph Mariani, Gil Francopoulo and Patrick Paroubek (2019). The NLP4NLP Corpus (I): 50 Years of Publication, Collaboration and Citation in Speech and Language Processing, Frontiers in Research Metrics and Analytics, Frontiers Media, vol. 3. DOI: 10.3389/frma.2018.00036. HAL: hal-02413751
- [2] Joseph Mariani, Gil Francopoulo, Patrick Paroubek and Frédéric Vernier (2019). The NLP4NLP Corpus (II): 50 Years of Research in Speech and Language Processing, Frontiers in Research Metrics and Analytics, Frontiers Media, vol. 3. DOI: 10.3389/frma.2018.00037. HAL: hal-02413749
- [3] Joseph Mariani, Gil Francopoulo, Patrick Paroubek and Frédéric Vernier (2023). NLP4NLP+5: The Deep (R)evolution in Speech and Language Processing, Procs. of O-COCOSDA 2023, the 26th Conference of the Oriental COCOSDA - International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques, Delhi, India. HAL: hal-04415467
- [4] Joseph Mariani, Gil Francopoulo and Patrick Paroubek (2018). Le corpus NLP4NLP pour l'analyse bibliométrique de 50 années de recherches en traitement automatique de la parole et du langage naturel, Document numérique - Revue des sciences et technologies de l'information. Série Document numérique, Hermès, vol. 20, pp 31-78. HAL: hal-02413744
Context
The initial motivation for J. Mariani to assemble a working group aiming at a comprehensive, global picture of the field of NLP was the observation that the two communities which had formed, since the early history of computers, around the study of speech and of text respectively had each developed its own family of conferences and journals. With the rapid progress of science supported by constant technological advances in computing, the two fields were growing, reaching out to new application areas, to new media such as image and video, and to new forms of language communication, e.g. sign languages.
While the opportunities for inter- or trans-disciplinary collaboration were becoming more numerous, the rapidly increasing number of published articles made it harder to build a picture of the current state of NLP. However, the community had itself built the tools that made this enterprise possible, through progress in text analysis, language understanding, information retrieval/extraction, etc., and the data were also available in electronic form. The working group thus undertook the construction of what was at the time the largest electronic corpus of scientific articles addressing all modalities of language communication (speech, text, and sign), in order to draw a comprehensive picture of the whole domain of NLP with the support of the automatic tools and language resources provided by the domain itself.
Contribution
Through the publication of the synthesis articles, the study has contributed a global picture of the various works and research directions that the community addressed over the period. Supported by quantitative information and content analysis, it charted the evolution of the topics addressed by researchers, the adoption of best practices, and the development of new algorithms and ideas.
Furthermore, the statistics drawn from the metadata and the bibliographic references made it possible to identify sub-communities and their links, and to quantitatively pinpoint pivotal changes in scientific practice, such as the steady increase in the number of co-authors per paper.
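As an illustration of the kind of quantitative analysis involved, the sketch below computes the average number of co-authors per paper and per year; the record layout is a hypothetical simplification, not the actual NLP4NLP data format.

```python
from collections import defaultdict

# Hypothetical metadata records; the real NLP4NLP corpus stores much richer
# information (venue, references, full text), but the principle is the same.
papers = [
    {"year": 1975, "authors": ["A. Smith"]},
    {"year": 1975, "authors": ["B. Jones", "C. Lee"]},
    {"year": 2015, "authors": ["D. Kim", "E. Wu", "F. Chen", "G. Patel"]},
]

def coauthors_per_year(records):
    """Average number of authors per paper, grouped by publication year."""
    totals = defaultdict(lambda: [0, 0])  # year -> [total authors, total papers]
    for paper in records:
        totals[paper["year"]][0] += len(paper["authors"])
        totals[paper["year"]][1] += 1
    return {year: n_auth / n_pap for year, (n_auth, n_pap) in sorted(totals.items())}

print(coauthors_per_year(papers))  # {1975: 1.5, 2015: 4.0}
```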
The study showed sudden rises in the number of publications at certain points, the slow evolution of gender bias in authorship, and the emergence of new countries among the top contributors to the field. It also demonstrated that, while automatic methods are of great help for processing large amounts of bibliographic data, they do not suffice to provide a reliable picture of the field: the inherent noise and ambiguity in the data (particularly in proper names) make it necessary to have the results checked by domain experts. Although a small part of the corpus collected during the experiment is not freely available because of copyright restrictions, the main part is freely accessible through the sites of a few scientific associations of the domain and constitutes a valuable resource for the community.
Impact
The articles synthesizing the results of the study are freely accessible online. The first two articles, which cover the first phase (1965 to 2015) and were published simultaneously, have attracted respectable attention from the community, as the viewing and download statistics provided by Frontiers show:
- more than 21,000 views for the first article (which places it in the top 3% of the most-read Frontiers articles), with 1,500 downloads
- and more than 15,000 views for the second one, with 1,000 downloads
The results of the second phase had a somewhat smaller but still decent impact, probably because the time span covered was shorter (5 years) and the novelty had somewhat abated, since it was an update of the initial two articles motivated by the revolution brought to the field by deep neural approaches. Overall, the feedback collected at conferences from all kinds of domain actors when communicating about the progress of the study was very positive.
Article: CharacterBERT
Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, Hiroshi Noji, Pierre Zweigenbaum, Jun'ichi Tsujii. CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters. International Conference on Computational Linguistics, Dec. 2020, Barcelona (online), Spain, pages 6903-6915. DOI: 10.18653/v1/2020.coling-main.609 - HAL: hal-03100665
Context
The Transformer-based neural language models that are currently used quasi-universally for natural language processing tasks can only handle a limited number of distinct input words. To cope with this constraint, methods have been designed to create an optimal, limited-size vocabulary of word pieces from the text corpus used to pre-train the language model. Any word can then be segmented into these word pieces, which become the units known to the language model. One problem caused by this word-piece segmentation principle is that the word pieces that are statistically optimal for the pre-training corpus may not be optimal for texts from a different, specialized domain. Intuitively, some domain-specific, technical words end up segmented into word pieces whose original meanings are unrelated to these technical words. We thus looked for alternative methods to deal with the limited-vocabulary constraint in technical domains.
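As a concrete illustration (a minimal sketch assuming the Hugging Face transformers library and the general-domain bert-base-uncased vocabulary), a technical medical term is typically fragmented into pieces whose individual meanings have little to do with the term itself:

```python
from transformers import AutoTokenizer

# General-domain WordPiece vocabulary, learned on non-specialized text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A common word usually stays intact, while a domain-specific medical term is
# fragmented into sub-word pieces that carry no medical meaning of their own.
for word in ["infection", "choledocholithiasis"]:
    print(word, "->", tokenizer.tokenize(word))
# The exact pieces depend on the vocabulary; the medical term is typically
# split into several '##'-prefixed fragments.
```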
During his PhD at LIMSI, Hicham El Boukkouri performed a 3-month research visit at the National Institute of Advanced Industrial Science and Technology (AIST, Tokyo, Japan). There, he was granted access to the computational resources of the AI Bridging Cloud Infrastructure (ABCI), one of the world's first large-scale open AI computing infrastructures, slightly before access to CNRS's Jean Zay GPU infrastructure was opened to research teams in France. This enabled him to carry out the resource-intensive experiments needed to train language models, which would have been possible only later at LIMSI.
Contribution
Our contribution is to replace word-piece segmentation with a method that relates any word to its component characters. This method relies on a character-based convolutional neural network that represents entire words by consulting their characters. For this purpose, we build upon the Character-CNN module of the ELMo language model.
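As an illustration of this principle, here is a minimal PyTorch sketch of a character-level CNN that produces one vector per word; it is a simplified stand-in, not the exact ELMo/CharacterBERT Character-CNN, which combines several filter widths, highway layers, and a final projection.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Builds a single vector per word from its characters (illustrative sketch)."""
    def __init__(self, n_chars=262, char_dim=16, n_filters=128, kernel_size=3, out_dim=768):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size, padding=1)
        self.proj = nn.Linear(n_filters, out_dim)  # map to the Transformer hidden size

    def forward(self, char_ids):
        # char_ids: (batch, n_words, max_word_len) integer character ids
        b, w, c = char_ids.shape
        x = self.char_embed(char_ids.view(b * w, c))   # (b*w, max_word_len, char_dim)
        x = self.conv(x.transpose(1, 2))               # (b*w, n_filters, max_word_len)
        x = torch.relu(x).max(dim=2).values            # max-pool over character positions
        return self.proj(x).view(b, w, -1)             # (batch, n_words, out_dim)

# One embedding per word, produced without any wordpiece vocabulary:
char_ids = torch.randint(1, 262, (2, 5, 20))  # dummy ids: 2 sentences of 5 words, 20 chars each
print(CharCNNWordEncoder()(char_ids).shape)   # torch.Size([2, 5, 768])
```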
In this paper:
- We provide preliminary evidence that general-domain wordpiece vocabularies are not suitable for specialized domain applications.
- We propose CharacterBERT, a new variant of BERT that produces word-level contextual representations by consulting characters.
- We evaluate CharacterBERT on multiple specialized medical tasks and show that it outperforms BERT without requiring a wordpiece vocabulary.
- We exhibit signs of improved robustness to noise and misspellings in favor of CharacterBERT.
- We enable the reproducibility of our experiments by sharing our pre-training and fine-tuning code. Furthermore, we also share our pre-trained representation models to benefit the NLP community. The model and code are shared on the Hugging Face repository of language models (https://huggingface.co/helboukkouri/character-bert); see the retrieval sketch after this list.
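A minimal retrieval sketch, assuming only the huggingface_hub client: it downloads the shared checkpoint locally; instantiating the model afterwards relies on the CharacterBERT code released with the paper rather than on a stock BERT class.

```python
from huggingface_hub import snapshot_download

# Download the shared CharacterBERT files locally; loading the model itself is
# done with the CharacterBERT code released alongside it, since the character
# embedding module is not part of the standard BERT implementation.
local_dir = snapshot_download(repo_id="helboukkouri/character-bert")
print("Model files downloaded to:", local_dir)
```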
This work focused only on the English language and the medical (clinical and biomedical) domain; instantiations of the approach for other languages have since been observed, as illustrated below.
Impact
The paper has been cited 174 times at the time of writing according to Google Scholar, and the CharacterBERT models were downloaded over 100 times from the Hugging Face repository in the single month preceding this writing. This high number of citations and downloads reflects the breadth of reuse of the CharacterBERT models and code by other teams, for instance to analyze the NArabizi variety of written Arabic used in social media, to detect sentiment and offensive language in Dravidian languages, or to detect heart disease risk factors in electronic health records. This illustrates the expected qualities of CharacterBERT models: better robustness in resource-scarce and noisy scenarios, including in the medical domain.
Article: Scalar Adjective Identification and Multilingual Ranking
Aina Garí Soler and Marianna Apidianaki. Scalar Adjective Identification and Multilingual Ranking. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4653–4660, Online. Association for Computational Linguistics, Jun. 2021. DOI: 10.18653/v1/2021.naacl-main.370 - HAL: hal-04414448v1
Context
Scalar adjectives modify entities (e.g., wonderful view, delicious meal) and can be positioned on an intensity scale: good < wonderful, tasty < delicious. They are relevant for several NLP tasks such as natural language inference (NLI) and common-sense reasoning. While previous work exists for English, no datasets existed for other languages. In this paper, presented at the NAACL 2021 conference, we worked on producing multilingual datasets of scalar adjectives and on applying existing word representations to assess the usefulness of those datasets.
Contribution
Starting from existing English datasets of scalar adjectives (DeMelo, Wilkinson, and Crowd), we manually translated and ranked scalar adjectives in French, Spanish, and Greek. The two human translators are native speakers of Greek and Spanish. Two datasets were produced:
- SCAL-REL is a balanced monolingual dataset of scalar and relational adjectives. This dataset of 886 English adjectives can be used to train models to classify adjectives and to identify scalar adjectives. Each line corresponds to one adjective, together with its class (scalar or relational) and its split (train, dev, or test). The file relational_sentences.pkl contains the 10 sentences per adjective used in our experiments.
- MULTI-SCALE is a multilingual benchmark for scalar adjective ranking (e.g., bad<awful<terrible<horrible has been translated into malo<terrible<horrible<horroroso in Spanish) composed of 108 sets of scalar adjectives. We then applied monolingual and multilingual contextual language model representations (multilingual BERT, FlauBERT for French, Spanish BERT, and GREEK-BERT) to this benchmark to assess its usefulness and to provide baseline results (an illustrative ranking sketch follows this list). The folders follow the same structure as the English scales in the original data folder. The file all_translations.csv contains the whole dataset, including the original English scales. Each file sentences_[LANG].pkl contains the sentences used in our experiments for the corresponding language.
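As an illustration of how contextual representations can be applied to such a benchmark, the sketch below ranks one Spanish scale with multilingual BERT by projecting each adjective's contextual embedding onto a mild-to-extreme direction; this is a generic illustrative baseline, not the exact method or data-loading code used in the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(adj, template="La película era {} ."):
    """Mean-pooled contextual embedding of the adjective inside a template sentence."""
    n_prefix = len(tokenizer.tokenize(template.split("{}")[0].strip()))
    n_adj = len(tokenizer.tokenize(adj))
    inputs = tokenizer(template.format(adj), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    # skip [CLS] and the prefix tokens, keep the adjective's subword vectors
    return hidden[1 + n_prefix : 1 + n_prefix + n_adj].mean(dim=0)

# One scale from MULTI-SCALE (Spanish); the direction points from mild to extreme.
scale = ["malo", "terrible", "horrible", "horroroso"]
direction = embed(scale[-1]) - embed(scale[0])
scores = {adj: torch.dot(embed(adj), direction).item() for adj in scale}
print(sorted(scale, key=scores.get))  # predicted mild-to-extreme order
```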
Impact
This work makes new datasets available for languages other than English in order to promote research on scalar adjectives. In the era of Large Language Models, manually curated linguistic resources remain highly relevant for training and evaluating models on such fine-grained NLP tasks.
European project: UniDive
The UniDive project (Universality, diversity, and idiosyncrasy in language technology) is a 4-year Action funded by the European Cooperation in Science and Technology (COST) and coordinated by Agata Savary. It aims to reconcile language diversity (among dozens of languages), whether intra-language or inter-language, with the rapid progress of language technology. It gathers about 250 interdisciplinary experts (linguists, computational linguists, computer scientists, psycholinguists, and industry practitioners) from almost 40 countries.
Context
COST (European Cooperation in Science and Technology) is a funding organization for research and innovation networks. COST Actions help connect research initiatives across Europe and beyond, and enable researchers and innovators to grow their ideas in any field of science and technology by sharing them with their peers. COST Actions are bottom-up networks with a duration of four years that boost research, innovation, and careers.
This COST Action emerged from three pre-existing initiatives concerned with universality, i.e. defining cross-linguistically consistent and applicable language descriptions. These three initiatives, Universal Dependencies (UD), PARSEME, and UniMorph, address linguistic phenomena at the level of morphosyntax, idiomaticity, and morphology, respectively. They notably develop multilingual annotation guidelines in which they define categories and values hypothesized as universal. These emerging standards are partly incompatible and competing.
Steps towards converging them were taken in two Dagstuhl Seminars on Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics (on 30-31 August 2021 and 8-12 May 2023).
The UniDive COST action is a major effort towards further unification and extension of these universality-oriented efforts.
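To make the nature of these universality-oriented descriptions concrete, the short sketch below parses a small hand-written CoNLL-U fragment (the Universal Dependencies annotation format) into tokens with their universal part-of-speech tags and dependency relations; the example sentence is illustrative and not taken from a UniDive treebank.

```python
# A hand-written CoNLL-U sentence; columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
CONLLU = """\
1\tShe\tshe\tPRON\tPRP\tCase=Nom|Number=Sing|Person=3|PronType=Prs\t2\tnsubj\t_\t_
2\treads\tread\tVERB\tVBZ\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_
3\tbooks\tbook\tNOUN\tNNS\tNumber=Plur\t2\tobj\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_
"""

def parse_conllu(block):
    """Yield (form, upos, head, deprel) for each token line of one CoNLL-U sentence."""
    for line in block.strip().splitlines():
        cols = line.split("\t")
        yield cols[1], cols[3], int(cols[6]), cols[7]

for form, upos, head, deprel in parse_conllu(CONLLU):
    print(f"{form:8} {upos:6} head={head} {deprel}")
```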
Contribution
On the networking side, UniDive brings together Universal Dependencies, PARSEME, and UniMorph, which contribute fine-grained, high-quality language descriptions for very many languages (over 160 in the case of UD), most of them under-resourced. Within UniDive these resources are being maintained and enhanced, their consistency is being improved, and new languages are being added thanks to the networking effect. UniDive also leverages the COST inclusiveness policy by promoting young researchers, attending to gender balance, and integrating an increasing number of experts from less research-intensive countries. As a result, we actively contribute to a better account of language diversity within language technology.
On the theoretical side, UniDive is contributing to the formalization of the notion of linguistic diversity and to its quantification. It is also the major NLP initiative in universality-driven language modeling.
On the practical side, we coordinate inter alia the development of fine-grained unified language resources and tools.
Impact
UniDive is in its second year, so its impact is not yet easy to measure. Expected mid- and long-term impacts include:
- Better understanding of language universals and providing a unified practical framework that allows implementing and testing theoretical advances in this field
- Establishment of de facto standards and best practices for the development, maintenance, reproducibility, and evaluation of language resources and computational models derived from them
- Creation of national or regional projects and spin-off initiatives coordinated by the Action's participants to pursue more specific goals (e.g. the development of a corpus or tools for a particular language)
- Long-lasting collaboration among multilingual and interdisciplinary experts from many countries
- Shifting the mentalities and practices of the scientific NLP community toward a better account of language diversity
Seminars: Impact of digital technology
In this series of seminars for academics and the general public, we raise awareness of the impact of digital technology on the environment.
- SCAI, Sorbonne Université - December 3rd, 2021. Anne-Laure Ligozat. Whatever the cost? Calculating the environmental impact of scientific computing.
- Centre Borelli, ENS Paris-Saclay - March 16th, 2022. Anne-Laure Ligozat and Gauthier Roussilhe. Environmental crisis and digitalization, the case of AI.
- Journées LCG-France - June 2022. Anne-Laure Ligozat. Invited talk. Impact environnemental des infrastructures et des services de calcul [Environmental impact of computing infrastructures and services].
- La Gaîté Lyrique - June 9th, 2022. Gauthier Roussilhe, Benjamin Sonntag, and Anne-Laure Ligozat. Invited talk. NØ LAB : Mode économie d'énergie. Rencontre autour de l'empreinte énergétique du numérique. [NØ LAB: energy-saving mode. A discussion of the energy footprint of digital technology.]
- Inria-DFKI European Summer School on AI - August 2022. Anne-Laure Ligozat. Sustainable AI.
- MIAI days - December 6th, 2022. Anne-Laure Ligozat. Panel. L'intelligence artificielle dans le contexte des transitions énergétique et écologiques [Artificial intelligence in the context of the energy and ecological transitions].
- Workshop Data Center Sustainability. Best practices and future scenarios - December 2022. Anne-Laure Ligozat. Invited talk. Computing the carbon footprint of a supercomputer.
- GreenDays - March 28th, 2023. Anne-Laure Ligozat. Invited talk. Côté obscur de l'IA : quels bénéfices réels de l'IA pour faire face aux crises environnementales ? [The dark side of AI: what real benefits does AI bring in addressing the environmental crises?]
- SustaiNLP workshop 2023 - July 13th, 2023. Anne-Laure Ligozat. Invited talk. How to make NLP sustainable?
- Carrefour Pathologie - November 2023. Clément Morand. Invited talk. IA: évaluation des impacts et intégration des préoccupations environnementales aux travaux d’IA en médecine [AI: assessing impacts and integrating environmental concerns into AI work in medicine].
Context
Since 2017 and the introduction of the Transformer architecture, NLP work has widely relied on statistical models trained on large amounts of data, usually general-language data. Models have been produced for each language and, to improve their performance, fine-tuning is carried out on data from specialized domains. The initial creation of these models and their adaptation to other languages and specialized domains are costly in terms of time and energy. In this context of increasing use of digital resources, we gave several seminars analyzing the impact of digital technology.
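To give an order of magnitude of this energy dimension, the sketch below estimates the electricity consumption and the associated CO2-equivalent emissions of a training run from GPU power draw, training time, data-centre overhead (PUE) and grid carbon intensity; all numbers are hypothetical placeholders, not measurements from our studies, and the estimate ignores non-GPU power and embodied impacts.

```python
def training_footprint(n_gpus, gpu_power_w, hours, pue=1.5, grid_gco2_per_kwh=300.0):
    """Rough estimate of the energy (kWh) and emissions (kg CO2e) of a training run."""
    energy_kwh = n_gpus * gpu_power_w / 1000.0 * hours * pue   # IT power x time x facility overhead
    co2e_kg = energy_kwh * grid_gco2_per_kwh / 1000.0          # apply grid carbon intensity
    return energy_kwh, co2e_kg

# Hypothetical example: 8 GPUs drawing 300 W each for 72 hours of fine-tuning.
energy, co2e = training_footprint(n_gpus=8, gpu_power_w=300, hours=72)
print(f"{energy:.0f} kWh, about {co2e:.0f} kg CO2e")  # 259 kWh, about 78 kg CO2e
```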
Contribution
We have carried out several studies on the environmental impact of digital technology, in particular on the carbon footprint of NLP work and of the use of digital tools. These seminars help to disseminate knowledge on the subject and can lead to research collaborations.
Impact
These seminars provide an opportunity to communicate about the environmental impact of digital technology, and of AI in particular. They contribute to raising public awareness, to training PhD students, and to building scientific collaborations.