The ViQuAE dataset
The ViQuAE dataset was created for the task of Knowledge-based Visual Question Answering about named Entities (KVQAE). Shah et al. initiated the task in 2019, but offered a dataset with limited question diversity that focused primarily on person named entities. In contrast, ViQuAE encompasses a broad spectrum of entity types, leading to diverse visual representations, coupled with manually crafted questions.
Context
The ViQuAE dataset was developed in the context of the ANR MEERQAT project, which aims to tackle the problem of analyzing ambiguous visual and textual content by learning and combining their representations. Integrating various modalities, such as images and text, to extract pertinent information remains a complex and enduring challenge due to the differing semantic levels of these modalities. Designed as a benchmark, the dataset serves to monitor advances in KVQAE systems. The task offers a clearly defined objective, facilitating straightforward evaluation and making it ideal for tracking improvements in the quality of multimodal entity representations.
Contribution
The dataset, together with a baseline model for multimodal question answering, was published at the SIGIR 2022 conference (available here).
Paul Lerner, Olivier Ferret, Camille Guinaudeau, Hervé Le Borgne, Romaric Besançon, José G. Moreno and Jesús Lovón Melgarejo. ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022.
The dataset is shared with the community on the Hugging Face platform.
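As a minimal sketch, the dataset can be loaded with the Hugging Face `datasets` library; the repository identifier used below is an assumption based on the public hub page and may differ.

```python
# Minimal sketch: loading ViQuAE with the Hugging Face `datasets` library.
# The repository identifier is an assumption; check the hub page if it differs.
from datasets import load_dataset

viquae = load_dataset("PaulLerner/viquae_dataset")  # assumed identifier
print(viquae)                        # inspect the available splits
print(viquae["train"][0].keys())     # inspect the fields of one question sample
```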
Impact
During the past month, the dataset has been downloaded 200 times, and it has been referenced in scholarly works 17 times. Notably, research papers from Google DeepMind, and from Meta in collaboration with Carnegie Mellon University, have made prominent use of the dataset. Its use has significantly contributed to advancing research on multimodal question answering, a field that was previously underdeveloped, as well as on evaluating the quality of multimodal representations.
The MEDIA SLU Recipe
The French MEDIA Spoken Language Understanding (SLU) dataset, distributed since 2005, has served as a benchmark for a large number of research works.
We carried out in-depth work on the MEDIA corpus, correcting its segmentation and annotation, and published the MEDIA recipe for speech understanding in SpeechBrain, an already popular open-source, all-in-one conversational AI toolkit based on PyTorch.
Context
The MEDIA dataset has been shown to be the most challenging SLU dataset among those accessible to the research community. Distributed by ELRA, the corpus has been free for academic research since 2020. Unfortunately, it is rarely used beyond the French research community.
Contribution
To facilitate its use, a complete recipe, including data preparation, training, and evaluation scripts, has been built and integrated into SpeechBrain. Moreover, a significant amount of data collected during the construction of the MEDIA corpus in the 2000s had never been used until now: we present the first results obtained on this subset, which is also included in the MEDIA SpeechBrain recipe.
It is available on the GitHub platform.
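As an illustration of the evaluation side of such a recipe, MEDIA systems are classically scored with a Concept Error Rate (CER), the edit distance between reference and hypothesis concept sequences normalized by the reference length. The sketch below is a generic reimplementation for illustration, not the recipe's own code, and the concept labels are hypothetical.

```python
# Illustrative sketch (not the recipe's code): a Concept Error Rate (CER),
# i.e. Levenshtein distance over concept tokens, normalized by the number
# of reference concepts, as classically used to score MEDIA systems.

def concept_error_rate(reference: list, hypothesis: list) -> float:
    """Edit distance between concept sequences / number of reference concepts."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[n][m] / max(n, 1)

# Hypothetical concept sequences in the spirit of MEDIA's hotel-booking domain.
ref = ["command-tache", "nombre-chambre", "localisation-ville"]
hyp = ["command-tache", "localisation-ville"]
print(f"CER = {concept_error_rate(ref, hyp):.2f}")  # one deletion -> 0.33
```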
This work was the subject of a publication at the LREC 2022 conference (available here).
Impact
We expect a growing community to use our recipe to start working on the MEDIA corpus. While the research community is increasingly interested in SLU problems and benchmarks, MEDIA remains one of the most challenging corpora, even in the era of deep learning. For this reason, it constitutes a relevant dataset for investigating new solutions that can have a real impact on human/machine applications.
"AZee" representation model for Sign Language
Context
AZee is an approach to the formal representation of Sign Language discourse. We chose to highlight it because it developed significantly over the period, thanks in part to funding from three projects (one European H2020 and two national PSPC projects), growing from a mostly theoretical proposition to the core of multiple software demonstrations, in particular as the state-of-the-art input for sign synthesis with avatars used internationally, and as the basis for recent work on graphical representation (AZVD).
Contributions
Over this period, four main contributions were made to foster the AZee system.
- The first is a reference corpus, made available in 2022 as an extension to the 40 brèves corpus (see example entry).
- The second contribution is in SL synthesis from AZee. In collaboration with DePaul University, we achieved synthesis of a full AZee discourse expression with their avatar Paula, a result published in 2022 (see rendered video). We also started an in-house PhD work on synthesis from AZee with Blender, to tackle the lower level of description also provided by AZee.
- Thirdly, based on a study, published in 2020, of spontaneous productions by signers representing their language with diagrams, we developed the AZVD proposition for a graphical Sign Language representation. The goal is to maximise its potential for adoption by the signing community, while keeping it entirely synthesisable by construction (see example AZVD representing the full "2B-JP" entry of the 40 brèves). To allow users to experience and discuss the AZVD approach, and ultimately assess it as a standardised graphical form for Sign Language representation, we implemented a software prototype supporting AZVD editing, an effort supported by the European project EASIER.
- Lastly, in the context of the Rosetta project, AZee was also used to explore example-based machine translation (EBMT). A corpus of alignments was created between French texts (mostly news titles or items, linear in nature) and the AZee expressions for their signed LSF translations (hierarchical in nature). An algorithm based on substitution was proposed and published in 2023; a toy sketch of the substitution idea is given after this list.
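The sketch below is a toy illustration of substitution-based EBMT under the assumptions stated in the comments; the data structures, the operator names in the expression strings, and the matching strategy are all placeholders, not the published algorithm.

```python
# Toy sketch of substitution-based example-based MT, assuming aligned
# (French text, AZee-like expression) pairs plus fragment-level alignments.
# All names and expression strings are illustrative placeholders.

EXAMPLES = {
    # French sentence -> expression for its signed translation
    "il pleut à Paris": "info-about(weather(rain), place(Paris))",
}
FRAGMENTS = {
    # aligned sub-segments and the expressions they translate to
    "Paris": "place(Paris)",
    "Toulouse": "place(Toulouse)",
}

def translate(sentence):
    """Find an example whose source differs from `sentence` by one known
    fragment, then substitute that fragment's translation in the example's
    target expression."""
    for src, tgt in EXAMPLES.items():
        for frag_a, tr_a in FRAGMENTS.items():
            for frag_b, tr_b in FRAGMENTS.items():
                if src.replace(frag_a, frag_b) == sentence and tr_a in tgt:
                    return tgt.replace(tr_a, tr_b)
    return None  # no example close enough to adapt

print(translate("il pleut à Toulouse"))
# -> info-about(weather(rain), place(Toulouse))
```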
Impact
The first two contributions listed above came with an observed impact. The AZee expression corpus now provides a reference for the community, which did not previously exist, and it is already being used by colleagues in other institutions. Before the AZee–Paula bridge, SL synthesis was based on sign sequences, whereas using AZee allows control of all additional and necessary body gestures in parallel, with fine synchronisation. This has been noticed by the language and scientific community, and our proposed system now constitutes the state of the art.
Paper related to the "Excellence" grant for the OTELO research project
Mathilde Hutin, Adèle Jatteau, Yaru Wu, Ioana Vasilescu, Lori Lamel, Martine Adda-Decker.
A corpus-based study of the distribution of word-final schwa in Standard French and what it teaches us about its phonological status.
Isogloss Open Journal of Romance Linguistics, 7:1–27, 2021.
[DOI: 10.5565/rev/isogloss.152] [hal-04039142]
The paper is a case study of the impact that large-scale corpora and speech technologies can have on answering classical linguistic questions, such as the behavior of schwa in contemporary French.
Context
This research was carried out in the framework of the OTELO project, awarded the "Excellence" grant in 2020 (jointly funded by MSH Paris Saclay and the DATAIA Institute). In this project, we took advantage of corpora gathered in the framework of speech technology projects, as well as of speech recognition systems and knowledge databases, to answer linguistic questions and to enrich the study of variation, which may in turn be fruitful for speech technology applications.
Contribution
The paper selected to illustrate the project underlines the benefit of approaches that combine large amounts of data with an interdisciplinary methodology to answer in-depth research questions, for both language technologies and linguistic theory. Based on more than 100 hours of speech, the linguistic phenomenon investigated, here word-final schwa, is analyzed as a function of sociolinguistics, orthography, phonotactics, and phonetics, with the help of speech recognition methodology. The conclusions are that word-final schwa is impacted by speech style, gender, orthography, phonotactics (i.e., the number of adjacent consonants and their sonority profile), and the phonological properties of the codas.
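For illustration only, the sketch below shows the kind of quantitative breakdown such a corpus study relies on: computing schwa realization rates per factor from forced-alignment decisions. The records, field names, and values are hypothetical, and this is not the study's code.

```python
# Toy illustration (not the study's code): estimating word-final schwa
# realization rates from forced-alignment output, broken down by a factor.
# Records, field names, and values are hypothetical.
from collections import defaultdict

tokens = [  # one record per word-final schwa site, as decided by the aligner
    {"style": "prepared", "gender": "F", "schwa": True},
    {"style": "prepared", "gender": "M", "schwa": False},
    {"style": "spontaneous", "gender": "F", "schwa": False},
    {"style": "spontaneous", "gender": "M", "schwa": False},
]

def realization_rate(records, factor):
    """Proportion of realized schwas for each value of `factor`."""
    counts = defaultdict(lambda: [0, 0])  # factor value -> [realized, total]
    for rec in records:
        counts[rec[factor]][0] += rec["schwa"]
        counts[rec[factor]][1] += 1
    return {value: realized / total
            for value, (realized, total) in counts.items()}

print(realization_rate(tokens, "style"))  # {'prepared': 0.5, 'spontaneous': 0.0}
```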
Impact
The cited paper, as well as the numerous other scientific contributions within the OTELO project, provides insights into the study of variation in speech from a multidimensional perspective that takes into account well-documented linguistic phenomena, speaking styles, different language levels from segments to words, and speaker profiles extracted from a knowledge database. These achievements have provided the basis for expanded research directions through two ongoing ANR projects (DIPVAR and VOLI).
Paper related to Expressive Speech
Albert Rilliard, Christophe d’Alessandro, and Marc Evrard.
Paradigmatic variation of vowels in expressive speech: Acoustic description and dimensional analysis.
Journal of the Acoustical Society of America, 143(1):109–122, 2018. Joint with M3 Team.
[DOI: 10.1121/1.5018433] [hal-01914497]
Context
This paper was written within the framework of a collaborative project with industrial partners interested in the expressive characteristics of voices for creating audiovisual avatars. The contribution focuses on the acoustic description of voice changes linked to different levels of emotional arousal.
Contribution
It shows that the voice source (the pattern of air streaming through the glottis) may be summarized in a few parameters that correlate with psychological characteristics of the expressed emotions. A main dimension of vocal variation is linked to vocal effort, which explains most of the change in vocal fold vibration. A secondary dimension is linked to changes in fundamental frequency that are independent of vocal effort, and a third to supplementary noise (breathiness). The difference between effort-induced and effort-independent frequency changes is important for distinguishing the arousal dimension from the dominance dimension. A better understanding of vocal characteristics is fundamental for models of vocal expressivity; this has implications for human-machine interaction, as well as for the description of vocal changes, pathology, and therapy.
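As a generic illustration of what a dimensional analysis does here, the sketch below runs a principal component analysis over a matrix of acoustic voice parameters. The feature names and the data are placeholders, not the measurements or the exact method of the published study.

```python
# Generic illustration of dimensional analysis: principal component analysis
# over acoustic voice parameters. Feature names and data are hypothetical
# placeholders, not the measurements of the published study.
import numpy as np

rng = np.random.default_rng(0)
features = ["H1-H2", "f0", "spectral_tilt", "HNR"]   # placeholder parameters
X = rng.normal(size=(200, len(features)))            # 200 fake vowel tokens

Xc = X - X.mean(axis=0)                  # centre each acoustic parameter
cov = Xc.T @ Xc / (len(Xc) - 1)          # covariance between parameters
eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition (ascending order)
order = np.argsort(eigvals)[::-1]        # sort components by variance explained
explained = eigvals[order] / eigvals.sum()
print({f"dim{i+1}": round(v, 2) for i, v in enumerate(explained)})
```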
Impact
This paper served as a basis for several other works on the relation between vocal effort and other dimensions of voice change during spoken communication. It led to a better understanding of the importance of subglottal pressure for the production of fundamental frequency changes that are not perceived as variations in vocal pitch. This work was also a building block for the current VERS project, which targets a better description of vocal effort and voice strength, both important measurements for phenomena as varied as vocal fatigue, the expression of affects, and the pragmatics of spoken interaction.