2024-2025 Evaluation campaign - Group E

STL department - Human Language Science and Technology

Portfolio of team LIPS
Langue Interaction Parole et Signes

The ViQuAE dataset

The ViQuAE dataset was created for the task of Knowledge-based Visual Question Answering about named Entities (KVQAE). Shah et al. initiated the task in 2019, but their dataset offered limited question diversity and focused primarily on person-named entities. ViQuAE, in contrast, encompasses a broad spectrum of entity types, leading to diverse visual representations, coupled with manually crafted questions.

Context

The development of the ViQuAE dataset took place in the context of the ANR MEERQAT project, which aims to tackle the problem of analyzing ambiguous visual and textual content by learning and combining their representations. Integrating various modalities, such as images and text, to extract pertinent information poses a complex and enduring challenge due to the differing semantic levels of these modalities. Designed as a benchmark, the dataset serves to monitor advances in KVQAE systems. The task offers a clearly defined objective, facilitating straightforward evaluation and making it ideal for tracking improvements in the quality of multimodal entity representations.

Contribution

The dataset, together with a baseline multimodal question-answering system, was published at the SIGIR 2022 conference (available here).

Paul Lerner, Olivier Ferret, Camille Guinaudeau, Hervé Le Borgne, Romaric Besançon, José G. Moreno and Jesús Lovón Melgarejo. ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2022.

The dataset is shared with the community on the Huggingface platform.
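
For illustration, the dataset can be loaded in a few lines with the Hugging Face `datasets` library. The sketch below is a minimal example; the repository identifier and split names are assumptions and should be checked against the dataset card on the Hub.

```python
# Minimal sketch of loading ViQuAE from the Hugging Face Hub.
# The repository id and split names below are assumptions; see the dataset card.
from datasets import load_dataset

viquae = load_dataset("PaulLerner/viquae_dataset")  # assumed repository id
print(viquae)                # available splits and their sizes
sample = viquae["train"][0]  # one manually crafted question about an entity
print(sample.keys())         # question text, answer, associated image reference, etc.
```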

Impact

Over the past month, the dataset has been downloaded 200 times, and it has been cited 17 times in scholarly works. Notably, research papers from institutions such as Google DeepMind and Meta in collaboration with Carnegie Mellon University have made prominent use of this dataset. Its use has significantly contributed to advancing research in multimodal question answering, a field that was previously relatively underdeveloped, as well as to the evaluation of multimodal representation quality.

The MEDIA SLU Recipe

The French MEDIA Spoken Language Understanding (SLU) dataset, distributed since 2005, has been used as a benchmark in numerous research works.

We carried out in-depth work on the MEDIA corpus, correcting its segmentation and annotation, and published a MEDIA recipe for speech understanding in SpeechBrain, an already popular, open-source, all-in-one conversational AI toolkit based on PyTorch.

Context

The MEDIA dataset has been shown to be the most challenging among those accessible to the research community. Distributed by ELRA, this corpus has been free for academic research since 2020. Unfortunately, it remains little used beyond the French research community.

Contribution

To facilitate its use, a complete recipe, including data preparation, training and evaluation scripts, has been built and integrated into SpeechBrain. Moreover, a significant amount of data collected during the construction of the MEDIA corpus in the 2000s had never been used until now: we present the first results obtained on this subset, which is also included in the MEDIA SpeechBrain recipe.

It is available on the GitHub platform.
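
As an illustration of how such a recipe is typically used, the sketch below shows the standard SpeechBrain entry-point pattern, a train.py script driven by a HyperPyYAML configuration. The exact file names, hyperparameters and invocation of the MEDIA recipe may differ; refer to the recipe itself for the authoritative instructions.

```python
# Sketch of the usual SpeechBrain recipe entry point, typically invoked as:
#   python train.py hparams/train.yaml
# File names and hyperparameters here are illustrative, not those of the MEDIA recipe.
import sys
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml

if __name__ == "__main__":
    # Parse the command line: path to the YAML file, run options, and overrides.
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
    with open(hparams_file) as f:
        hparams = load_hyperpyyaml(f, overrides)
    # The recipe then instantiates a sb.Brain subclass from `hparams`
    # and calls its fit() and evaluate() methods on the prepared MEDIA data.
```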

This work was the subject of a publication at the LREC 2022 conference, available here.

Impact

We expect a growing community to use our recipe to start working on the MEDIA corpus. While the research community is increasingly interested in SLU problems and benchmarks, MEDIA remains one of the most challenging corpora, even in the era of deep learning. For this reason, it constitutes a relevant dataset for investigating new solutions that can have a real impact on such human-machine applications.

"AZee" representation model for Sign Language

Context

AZee is an approach to the formal representation of Sign Language discourse. We chose to highlight it because it developed significantly over the period, thanks in part to funding from three projects (one European H2020 and two national PSPC): from a mostly theoretical proposition, it has become the core of multiple software demonstrations, in particular the state-of-the-art input for Sign synthesis with avatars used internationally, and the basis for recent work on graphical representation (AZVD).

Contributions

In the course of the period, four main contributions were made to foster the AZee system.

Impact

The first two contributions listed above came with an observed impact. The AZee expression corpus now provides a reference for the community, which did not previously exist. It is already being used by colleagues in other institutions. Before the AZee–Paula bridge, SL synthesis was based on sign sequences, whereas AZee allows all additional and necessary body gestures to be controlled in parallel, with fine synchronisation. This has been noticed by the language and scientific communities, and our proposed system now constitutes the state of the art.

Paper related to the "Excellence" grant for the OTELO research project

Mathilde Hutin, Adèle Jatteau, Yaru Wu, Ioana Vasilescu, Lori Lamel, Martine Adda-Decker.
A corpus-based study of the distribution of word-final schwa in Standard French and what it teaches us about its phonological status.
Isogloss Open Journal of Romance Linguistics, 7:1–27, 2021.
[DOI: 10.5565/rev/isogloss.152] [hal-04039142]

The paper is a case study of the impact that large-scale corpora and speech technologies can have on answering classical linguistic questions, such as the behavior of schwa in contemporary French.

Context

This research has been carried out in the framework of the OTELO project, awarded the "Excellence" grant in 2020 (jointly funded by MSH Paris Saclay and the DATAIA Institute). In this project, we took advantage of corpora gathered in the framework of speech technology projects, as well as of speech recognition systems and knowledge databases, to answer linguistic questions and to enrich the study of variation, which may in turn be fruitful for speech technology applications.

Contribution

The paper selected to illustrate the project underlines the benefit of approaches that combine large amounts of data with an interdisciplinary methodology to answer in-depth research questions, for both language technologies and linguistic theory. The linguistic phenomenon investigated, here word-final schwa, is analyzed on more than 100 hours of speech as a function of sociolinguistic, orthographic, phonotactic and phonetic factors, with the help of speech recognition methodology. The conclusions are that word-final schwa is affected by speech style, gender, orthography, phonotactics (i.e., the number of adjacent consonants and their sonority profile), and the phonological properties of the codas.
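
To make the methodological idea concrete, the sketch below illustrates, on simulated data, the kind of analysis involved: a logistic regression of schwa realization on speech style, speaker gender and coda phonotactics. It is purely illustrative and does not reproduce the authors' actual pipeline, factors or results.

```python
# Purely illustrative sketch of a corpus-based analysis of word-final schwa.
# The data are simulated; in the study, observations come from automatically
# aligned transcriptions of more than 100 hours of French speech.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
style = rng.choice(["read", "spontaneous"], size=n)
gender = rng.choice(["F", "M"], size=n)
n_coda_cons = rng.integers(1, 4, size=n)  # consonants preceding the potential schwa

# Simulated effect (arbitrary coefficients): schwa realized more often in read
# speech and after heavier consonant clusters.
logit_p = -0.5 + 1.2 * (style == "read") + 0.6 * (n_coda_cons - 1)
p = 1.0 / (1.0 + np.exp(-logit_p))
schwa_realized = rng.binomial(1, p)

df = pd.DataFrame({"schwa_realized": schwa_realized, "style": style,
                   "gender": gender, "n_coda_cons": n_coda_cons})
model = smf.logit("schwa_realized ~ C(style) + C(gender) + n_coda_cons",
                  data=df).fit()
print(model.summary())  # estimated effect of each factor on schwa realization
```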

Impact

The cited paper, as well as the numerous other scientific contributions within the OTELO project, provides insights into the study of variation in speech from a multidimensional perspective that takes into account well-documented linguistic phenomena, speaking styles, different language levels from segments to words, and speaker profiles extracted from a knowledge database. These achievements have provided the basis for expanded research directions through two ongoing ANR projects (DIPVAR and VOLI).

Paper related to Expressive Speech

Albert Rilliard, Christophe d’Alessandro, and Marc Evrard.
Paradigmatic variation of vowels in expressive speech: Acoustic description and dimensional analysis.
Journal of the Acoustical Society of America, 143(1):109–122, 2018. Joint with M3 Team.
[DOI: 10.1121/1.5018433] - [hal-01914497]

The paper provides an acoustic description and dimensional analysis of vowels in expressive speech, characterizing how the voice changes with the expressed emotion.

Context

This paper was written within the framework of a collaborative project with industrial partners interested in the expressive characteristics of voices for the creation of audiovisual avatars. This contribution focuses on the acoustic description of voice changes linked to different levels of emotional arousal.

Contribution

It shows that the voice source (the pattern of air streaming through the glottis) may be summarized in a few parameters that correlate with psychological characteristics of the expressed emotions. A main dimension of vocal variation is linked to vocal effort, which explains most of the change in vocal fold vibration. A secondary dimension is linked to changes in fundamental frequency that are independent of vocal effort, and a third to supplementary noise (breathiness). The distinction between effort-induced and effort-independent frequency changes is important for separating the arousal dimension from the dominance dimension. A better understanding of vocal characteristics is fundamental for models of vocal expressivity; this has implications for human-machine interaction, as well as for the description of vocal changes, pathology and therapy.
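
For illustration, the sketch below shows, on simulated data, the general shape of such a dimensional analysis: a principal component analysis over per-utterance voice-source descriptors. The feature names, values and latent "effort" factor are hypothetical and do not reproduce the study's material, parameter set or results.

```python
# Purely illustrative sketch of a dimensional analysis of voice-source descriptors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_utterances = 200
# Simulated descriptors, loosely tied together by a latent "vocal effort" factor:
effort = rng.normal(size=n_utterances)
features = np.column_stack([
    180 + 30 * effort + rng.normal(scale=10, size=n_utterances),  # mean f0 (Hz)
    8 - 3 * effort + rng.normal(scale=2, size=n_utterances),      # H1-H2 spectral tilt (dB)
    15 + 2 * effort + rng.normal(scale=3, size=n_utterances),     # harmonics-to-noise ratio (dB)
    65 + 6 * effort + rng.normal(scale=3, size=n_utterances),     # intensity (dB SPL)
])

X = StandardScaler().fit_transform(features)
pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)  # share of variance captured by each dimension
print(pca.components_)                # loadings of each descriptor on each dimension
```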

Impact

This paper served as a basis for several other works on the relation between vocal effort and other dimensions of voice change during spoken communication. It led to a better understanding of the importance of subglottal pressure for the production of fundamental frequency changes that are not perceived as variations in vocal pitch. This work was also a building block for the current VERS project, which targets a better description of vocal effort and voice strength, measurements that matter for phenomena as varied as voice fatigue, the expression of affect, and the pragmatics of spoken interaction.