Challenge Sessions

SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning

Self-supervised learning (SSL) has emerged as a popular approach in speech processing for reducing the dependency on large labeled datasets. However, the attributes that make SSL effective across diverse conditions and tasks remain under-explored. The goal of the SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning is to benchmark, in multiple aspects, the capability of SSL speech representations under a standard and comprehensive framework designed to provide more comparable results and analyses. With the challenge, we hope to work jointly with the community to understand the mechanism and efficacy of popular SSL techniques under various conditions and to further inspire innovation in the area. The evaluation framework of the challenge is similar to the one introduced in the SUPERB Benchmark, where various SSL representations are fine-tuned on various speech processing tasks with consistent recipes. The challenge includes 10 tasks, drawn from the SUPERB and SUPERB-SG Benchmarks, to measure the Content, Speaker, Paralinguistics, Semantics, and Generation capabilities of SSL representations. To encourage innovation for gains beyond accuracy, such as computational efficiency and a low memory footprint, we employ diverse metrics, including memory usage and number of operations. We do NOT provide an overall metric across tasks, accuracy, and computation and memory efficiency to rank submissions, for two reasons: 1) to motivate a holistic understanding of the attributes of SSL techniques rather than an arms race for accuracy or a single metric, and 2) to welcome submissions on subsets of tasks so that more researchers can participate.
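
To make the evaluation setup more concrete, below is a minimal, hypothetical sketch of a SUPERB-style downstream probe, assuming layer-wise features from an SSL upstream are combined with a learnable weighted sum and fed to a lightweight prediction head. It is not the official challenge recipe; the dimensions, layer count, and the simple parameter-count bookkeeping at the end are illustrative assumptions only.

```python
# Hypothetical sketch (not the official challenge recipe): a SUPERB-style
# downstream probe that mixes per-layer upstream features with a learnable
# weighted sum and applies a small classification head.
import torch
import torch.nn as nn


class WeightedSumProbe(nn.Module):
    def __init__(self, num_layers: int, feat_dim: int, num_classes: int):
        super().__init__()
        # One learnable scalar weight per upstream layer (softmax-normalised).
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        # Lightweight downstream head trained on top of the representations.
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, layer_feats: list) -> torch.Tensor:
        # layer_feats: list of (batch, time, feat_dim) tensors, one per layer.
        stacked = torch.stack(layer_feats, dim=0)           # (layers, B, T, D)
        weights = torch.softmax(self.layer_weights, dim=0)  # (layers,)
        mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)
        pooled = mixed.mean(dim=1)                          # mean-pool over time
        return self.head(pooled)                            # (B, num_classes)


if __name__ == "__main__":
    # Fake upstream outputs: 12 layers of 768-dim features for a 2-utterance batch.
    feats = [torch.randn(2, 100, 768) for _ in range(12)]
    probe = WeightedSumProbe(num_layers=12, feat_dim=768, num_classes=10)
    logits = probe(feats)
    print(logits.shape)  # torch.Size([2, 10])

    # Efficiency-style bookkeeping in the spirit of the challenge metrics:
    # count trainable downstream parameters (MACs and memory usage would be
    # measured separately with a profiler of choice).
    n_params = sum(p.numel() for p in probe.parameters() if p.requires_grad)
    print(f"trainable downstream parameters: {n_params}")
```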

The first COG-MHEAR Audio-Visual Speech Enhancement (AVSEC) Challenge

Human performance in everyday noisy situations is known to depend on both aural and visual senses, which are contextually combined by the brain’s multi-level integration strategies. The multimodal nature of speech is well established, with listeners known to unconsciously lip-read to improve the intelligibility of speech in real noisy environments. It has been shown that the visual aspect of speech has a potentially strong impact on the ability of humans to focus their auditory attention on a particular stimulus. The aim of the first AVSEC Challenge is to bring together the wider computer vision, hearing and speech research communities to explore novel approaches to multimodal speech-in-noise processing. Both raw and pre-processed AV datasets – derived from TED talk videos – will be made available to participants for training and development of audio-visual models to perform speech enhancement and speaker separation at SNR levels that will be significantly more challenging than those typically used in audio-only scenarios. Baseline models will be provided along with scripts for objective evaluation. Challenge evaluation will utilise established objective measures such as STOI and PESQ as well as subjective intelligibility tests with human subjects.
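
As a hedged illustration of the objective side of such an evaluation (not the official challenge scripts), the sketch below computes STOI and PESQ for an enhanced signal against its clean reference using the third-party pystoi and pesq packages. The file names and the 16 kHz sample rate are assumptions for the example.

```python
# Illustrative example (not the official AVSEC evaluation scripts): compute
# STOI and wideband PESQ for an enhanced utterance against its clean reference.
import soundfile as sf
from pystoi import stoi
from pesq import pesq

FS = 16000  # PESQ wideband mode expects 16 kHz audio

# Load a clean reference and the corresponding enhanced output (hypothetical paths).
clean, fs_clean = sf.read("clean_utt.wav")
enhanced, fs_enh = sf.read("enhanced_utt.wav")
assert fs_clean == fs_enh == FS, "signals must share the expected sample rate"

# Trim to a common length in case enhancement changed the signal length slightly.
n = min(len(clean), len(enhanced))
clean, enhanced = clean[:n], enhanced[:n]

# Short-Time Objective Intelligibility (0..1, higher is better).
stoi_score = stoi(clean, enhanced, FS, extended=False)

# Wideband PESQ (roughly 1..4.5, higher is better).
pesq_score = pesq(FS, clean, enhanced, "wb")

print(f"STOI: {stoi_score:.3f}  PESQ: {pesq_score:.2f}")
```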