This year, we are experimenting with a new approach to sessions. Each session includes papers from all areas, which maximizes authors’ chances of seeing other posters in their own area. We hope this will encourage more interaction between participants, and we will collect feedback to hear your opinion.
The poster ID is as follows: [Day]-[Session]-[Poster Number]-[Topic],
where:
[Day] is the day of the conference (1, 2, 3 or 4),
[Session] is 1 for morning and 2 for afternoon sessions,
[Poster Number] is the number of the poster within the session and
[Topic] is the technical area of the work as follows
Poster ID | Paper Title | Paper ID |
1-1-1-MLP | Exploration of Language-Specific Self-Attention Parameters for Multilingual End-to-End Speech Recognition | 13 |
1-1-2-ASR | ASBERT: ASR-SPECIFIC SELF-SUPERVISED LEARNING WITH SELF-TRAINING | 71 |
1-1-3-ASR | SUB-8-BIT QUANTIZATION FOR ON-DEVICE SPEECH RECOGNITION: A REGULARIZATION-FREE APPROACH | 134 |
1-1-4-ASR | G-AUGMENT: SEARCHING FOR THE META-STRUCTURE OF DATA AUGMENTATION POLICIES FOR ASR | 240 |
1-1-5-MLP | How Do Phonological Properties Affect Bilingual Automatic Speech Recognition? | 329 |
1-1-6-MLP | Scaling Up Deliberation for Multilingual ASR | 109 |
1-1-7-ASR | Context-aware Neural Confidence Estimation for Rare Word Speech Recognition | 278 |
1-1-8-ASR | Flickering reduction with partial hypothesis reranking for streaming ASR | 86 |
1-1-9-ASR | InterDecoder: Using Attention Decoders as Intermediate Regularization for CTC-based Speech Recognition | 187 |
1-1-10-SLP | Automatic Rating of Spontaneous Speech for Low-Resource Languages | 199 |
1-1-11-SLP | Mixture of Domain Experts for Language Understanding: An Analysis of Modularity, Task Performance, and Memory Tradeoffs | 25 |
1-1-12-SES | MULTI-STAGE PROGRESSIVE AUDIO BANDWIDTH EXTENSION | 69 |
1-1-13-SES | JOINT OPTIMIZATION OF DIFFUSION PROBABILISTIC-BASED MULTICHANNEL SPEECH ENHANCEMENT WITH FAR-FIELD SPEAKER VERIFICATION | 243 |
1-1-14-ANA | Learning Invariant Representation and Risk Minimized for Unsupervised Accent Domain Adaptation | 175 |
1-1-15-MLS | Speed-Robust Keyword Spotting via Soft Self-Attention on Multi-Scale Features | 79 |
1-1-16-ASR | CCC-WAV2VEC 2.0: CLUSTERING AIDED CROSS CONTRASTIVE SELF-SUPERVISED LEARNING OF SPEECH REPRESENTATIONS | 363 |
1-1-17-TLP | Fine Grained Spoken Document Summarization Through Text Segmentation | 7 |
1-1-18-MMP | Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection | 42 |
1-1-19-MMP | Towards visually prompted keyword localisation for zero-resource spoken languages | 178 |
1-1-20-EMR | SPEECH EMOTION RECOGNITION WITH COMPLEMENTARY ACOUSTIC REPRESENTATIONS | 315 |
1-1-21-TTS | WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration | 141 |
1-1-22-TTS | On granularity of prosodic representations in expressive text-to-speech | 258 |
1-1-23-TTS | Can we use Common Voice to train a Multi-Speaker TTS system? | 81 |
1-1-24-MLS | Distilling Sequence-to-Sequence Voice Conversion Models For Streaming Conversion Applications | 180 |
1-1-25-MLS | AUTOMATIC PREDICTION OF INTELLIGIBILITY OF WORDS AND PHONEMES PRODUCED ORALLY BY JAPANESE LEARNERS OF ENGLISH | 355 |
1-1-26-SUP | On the Utility of Self-supervised Models for Prosody-related Tasks | 313 |
Poster ID | Paper Title | Paper ID |
1-2-1-ASR | JOIST: A Joint Speech and Text Streaming Model For ASR | 23 |
1-2-2-MLP | Code-switched language modelling using a code predictive LSTM in under-resourced South African languages | 76 |
1-2-3-ASR | A CONTEXT-AWARE KNOWLEDGE TRANSFERRING STRATEGY FOR CTC-BASED ASR | 147 |
1-2-4-ASR | Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR | 252 |
1-2-5-MLP | IMPROVING LUXEMBOURGISH SPEECH RECOGNITION WITH CROSS-LINGUAL SPEECH REPRESENTATIONS | 342 |
1-2-6-ASR | Alternate Intermediate Conditioning with Syllable-level and Character-level Targets for Japanese ASR | 121 |
1-2-7-ASR | E-Branchformer: Branchformer with Enhanced merging for speech recognition | 310 |
1-2-8-ASR | CONFORMER-BASED ON-DEVICE STREAMING SPEECH RECOGNITION WITH KD COMPRESSION AND TWO-PASS ARCHITECTURE | 169 |
1-2-9-ASR | Accelerator-Aware Training for Transducer-based Speech Recognition | 220 |
1-2-10-SLP | A DATA-DRIVEN INVESTIGATION OF NOISE-ADAPTIVE UTTERANCE GENERATION WITH LINGUISTIC MODIFICATION | 85 |
1-2-11-SLP | On the Use of Semantically-Aligned Speech Representations for Spoken Language Understanding | 111 |
1-2-12-SES | Spatial-DCCRN: DCCRN Equipped with Frame-level Angle Feature and Hybrid Filtering for Multi-channel Speech Enhancement | 70 |
1-2-13-SES | IMPROVED NORMALIZING FLOW-BASED SPEECH ENHANCEMENT USING AN ALL-POLE GAMMATONE FILTERBANK FOR CONDITIONAL INPUT REPRESENTATION | 245 |
1-2-14-ANA | VSAMETER: EVALUATION OF A NEW OPEN-SOURCE TOOL TO MEASURE VOWEL SPACE AREA AND RELATED METRICS | 237 |
1-2-15-SLR | FREQUENCY AND MULTI-SCALE SELECTIVE KERNEL ATTENTION FOR SPEAKER VERIFICATION | 136 |
1-2-16-DIA | Joint speaker diarisation and tracking in switching state-space model | 10 |
1-2-17-TLP | AN ANALYSIS OF THE EFFECTS OF DECODING ALGORITHMS ON FAIRNESS IN OPEN-ENDED LANGUAGE GENERATION | 63 |
1-2-18-MMP | Exploiting information from native data for non-native automatic pronunciation assessment | 123 |
1-2-19-MLP | Textual Data Augmentation for Arabic-English Code-Switching Speech Recognition | 270 |
1-2-20-EMR | A ZERO-SHOT APPROACH TO IDENTIFYING CHILDREN’S SPEECH IN AUTOMATIC GENDER CLASSIFICATION | 322 |
1-2-21-TTS | GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models | 167 |
1-2-22-TTS | Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy | 273 |
1-2-23-RES | STOP: A DATASET FOR SPOKEN TASK ORIENTED SEMANTIC PARSING | 120 |
1-2-24-MLS | SVLDL: Improved Speaker Age Estimation Using Selective Variance Label Distribution Learning | 204 |
1-2-25-MLS | PEPPANET: EFFECTIVE MISPRONUNCIATION DETECTION AND DIAGNOSIS LEVERAGING PHONETIC, PHONOLOGICAL, AND ACOUSTIC CUES | 368 |
Time | Sponsor | Title |
17:00 - 17:20 | QNRF | Dr. Ali Alaboudy, “Introducing QNRF Funding Programs: Digital Technology Track” |
17:20 - 17:40 | | Fadi Biadsy, “Speech Model Personalization: From Research to Production” |
17:40 - 18:00 | Amazon | Björn Hofmeister, “All-Neural ASR - The Next Challenges” |
Poster ID | Paper Title | Paper ID |
2-1-1-ASR | Untied Positional Encodings for Efficient Transformer-based Speech Recognition | 29 |
2-1-2-ASR | Match to Win: Analysing Sequences Lengths for Efficient Self-supervised Learning in Speech and Audio | 92 |
2-1-3-ASR | Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition | 150 |
2-1-4-ASR | Damage Control during Domain Adaptation for Transducer Based Automatic Speech Recognition | 254 |
2-1-5-ASR | PADA: PRUNING ASSISTED DOMAIN ADAPTATION FOR SELF-SUPERVISED SPEECH REPRESENTATIONS | 361 |
2-1-6-ASR | MFCCA: Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario | 137 |
2-1-7-ASR | Fast Entropy-Based Methods of Word-Level Confidence Estimation for End-To-End Automatic Speech Recognition | 157 |
2-1-8-MMP | TRANSFORMER-BASED LIP-READING WITH REGULARIZED DROPOUT AND RELAXED ATTENTION | 84 |
2-1-9-ASR | Residual Adapters for Targeted Updates in RNN-Transducer Based Speech Recognition System | 227 |
2-1-10-SLP | Response Timing Estimation for Spoken Dialog Systems based on Syntactic Completeness Prediction | 309 |
2-1-11-SLP | Weak-Supervised Dysarthria-invariant Features for Spoken Language Understanding using an FHVAE and Adversarial Training | 194 |
2-1-12-SES | Exploring WavLM on Speech Enhancement | 149 |
2-1-13-SES | Adaptive-FSN: Integrating full-band extraction and adaptive sub-band encoding for monaural speech enhancement | 247 |
2-1-14-ANA | INVESTIGATING THE IMPORTANT TEMPORAL MODULATIONS FOR DEEP-LEARNING-BASED SPEECH ACTIVITY DETECTION | 276 |
2-1-15-SLR | AN ATTENTION-BASED BACKEND ALLOWING EFFICIENT FINE-TUNING OF TRANSFORMER MODELS FOR SPEAKER VERIFICATION | 179 |
2-1-16-DIA | Diarisation using location tracking with agglomerative clustering | 11 |
2-1-17-TLP | N-BEST HYPOTHESES RERANKING FOR TEXT-TO-SQL SYSTEMS | 236 |
2-1-18-MMP | SpeechCLIP: Integrating Speech with Pre-trained Vision and Language Model | 146 |
2-1-19-EMR | Distribution-based Emotion Recognition in Conversation | 22 |
2-1-20-TTS | StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models | 43 |
2-1-21-TTS | Learning accent representation with multi-level VAE towards controllable speech synthesis | 185 |
2-1-22-TTS | vTTS: visual-text to speech | 314 |
2-1-23-MLP | FLEURS: FEW-SHOT LEARNING EVALUATION OF UNIVERSAL REPRESENTATIONS OF SPEECH | 133 |
2-1-24-MLS | Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection | 216 |
2-1-25-SUP | Improving generalizability of distilled self-supervised speech processing models under distorted settings | 53 |
2-1-26-SES | AVSE CHALLENGE: AUDIO-VISUAL SPEECH ENHANCEMENT CHALLENGE | 374 |
Poster ID | Paper Title | Paper ID |
2-2-1-ASR | IMPROVED NOISY ITERATIVE PSEUDO-LABELING FOR SEMI-SUPERVISED SPEECH RECOGNITION | 65 |
2-2-2-ASR | GUIDED CONTRASTIVE SELF-SUPERVISED PRE-TRAINING FOR AUTOMATIC SPEECH RECOGNITION | 94 |
2-2-3-ASR | Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous Speech Recognition | 166 |
2-2-4-ASR | NAM+: TOWARDS SCALABLE END-TO-END CONTEXTUAL BIASING FOR ADAPTIVE ASR | 279 |
2-2-5-DIA | Continual Self-supervised Domain Adaptation for End-to-end Speaker Diarization | 4 |
2-2-6-ASR | Modular Hybrid Autoregressive Transducer | 143 |
2-2-7-ASR | How Does Pre-trained Wav2Vec 2.0 Perform on Domain-Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications | 164 |
2-2-8-MLP | Improving Semi-supervised E2E ASR using CycleGAN and Inter-domain Losses | 115 |
2-2-9-ASR | Internal Language Model Personalization of E2E Automatic Speech Recognition Using Random Encoder Features | 256 |
2-2-10-SLP | Building Markovian Generative Architectures over Pretrained LM Backbones for Efficient Task-Oriented Dialog Systems | 330 |
2-2-11-SLP | NON-AUTOREGRESSIVE END-TO-END APPROACHES FOR JOINT AUTOMATIC SPEECH RECOGNITION AND SPOKEN LANGUAGE UNDERSTANDING | 226 |
2-2-12-SES | TEA-PSE 2.0: SUB-BAND NETWORK FOR REAL-TIME PERSONALIZED SPEECH ENHANCEMENT | 172 |
2-2-13-SES | EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers | 316 |
2-2-14-SLR | Flow-ER: a Flow-based Embedding Regularization Strategy for Robust Speech Representation Learning | 3 |
2-2-15-SLR | UNSUPERVISED DOMAIN ADAPTATION OF NEURAL PLDA USING SEGMENT PAIRS FOR SPEAKER VERIFICATION | 277 |
2-2-16-DIA | Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization | 93 |
2-2-17-MLS | TDOA ESTIMATION OF SPEECH SOURCE IN NOISY REVERBERANT ENVIRONMENTS | 312 |
2-2-18-MMP | YFACC: A Yorùbá Speech-Image Dataset for Cross-lingual Keyword Localisation through Visual Grounding | 153 |
2-2-19-MLP | MULTILINGUAL SPEECH EMOTION RECOGNITION WITH MULTI-GATING MECHANISM AND NEURAL ARCHITECTURE SEARCH | 113 |
2-2-20-TTS | Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech | 50 |
2-2-21-TTS | Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion | 217 |
2-2-22-MLP | Disentangled Speech Representation Learning for One-Shot Cross-lingual Voice Conversion Using $\beta$-VAE | 335 |
2-2-23-RES | Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition | 274 |
2-2-24-MLS | Phoneme Segmentation Using Self-Supervised Speech Models | 268 |
2-2-25-SUP | Exploring Efficient-tuning Methods in Self-supervised Speech Models | 106 |
Poster ID | Paper Title | Paper ID |
2-3-1-DEMO | ISPEAK: INTERACTIVE SPOKEN LANGUAGE UNDERSTANDING SYSTEM FOR CHILDREN WITH SPEECH AND LANGUAGE DISORDERS | DEMO |
2-3-2-DEMO | LUX-ASR: BUILDING AN ASR SYSTEM FOR THE LUXEMBOURGISH LANGUAGE | DEMO |
2-3-3-DEMO | ON-DEVICE STREAMING TARGET-SPEAKER ASR WITH NEURAL TRANSDUCER | DEMO |
2-3-4-DEMO | VOICE-ENABLED AUDIOVISUAL AGENT FOR QUESTION ANSWERING IN ENGLISH AND ARABIC | DEMO |
Time | Sponsor | Title |
17:30 - 17:50 | AppTek | Mohammad Zeineldeen, “Fully Automatic Video Dubbing at AppTek” |
17:50 - 18:00 | 3M | Dr. Jing Su, “From pre-trained language models to practical medical scribing solutions” |
18:00 - 18:10 | LXT | Martha Hakvoort, “Powering AI innovation with High-quality data” |
18:10 - 18:20 | DataForce | Dr. Dorota Iskra, “Data-Centric Approach to AI” |
18:20 - 18:40 | SCAI | Dr. Areeb Alowisheq, “SCAI: unlocking value with AI” |
Poster ID | Paper Title | Paper ID |
3-1-1-ASR | Towards End-to-end Unsupervised Speech Recognition | 66 |
3-1-2-MLP | Exploring a unified ASR for multiple south Indian languages leveraging multilingual acoustic and language models | 97 |
3-1-3-ASR | Monotonic segmental attention for automatic speech recognition | 197 |
3-1-4-ASR | STREAMING, FAST AND ACCURATE ON-DEVICE INVERSE TEXT NORMALIZATION FOR AUTOMATIC SPEECH RECOGNITION | 323 |
3-1-5-ASR | DUAL LEARNING FOR LARGE VOCABULARY ON-DEVICE ASR | 27 |
3-1-6-ASR | STREAMING BILINGUAL END TO END ASR MODEL USING ATTENTION OVER MULTIPLE SOFTMAX | 190 |
3-1-7-ASR | End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation | 222 |
3-1-8-ASR | Fully Unsupervised Training of Few-Shot Keyword Spotting | 127 |
3-1-9-ASR | Learning a Dual-Mode Speech Recognition Model via Self-Pruning | 287 |
3-1-10-SLP | Improving Noise Robustness for Spoken Content Retrieval using semi-supervised ASR and N-best transcripts for BERT-based ranking models | 170 |
3-1-11-SLP | A STUDY ON THE INTEGRATION OF PRE-TRAINED SSL, ASR, LM AND SLU MODELS FOR SPOKEN LANGUAGE UNDERSTANDING | 264 |
3-1-12-SES | LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION | 177 |
3-1-13-ANA | A MULTI-MODAL ARRAY OF INTERPRETABLE FEATURES TO EVALUATE LANGUAGE AND SPEECH PATTERNS IN DIFFERENT NEUROLOGICAL DISORDERS | 107 |
3-1-14-SLR | THE CLEVER HANS EFFECT IN VOICE SPOOFING DETECTION | 20 |
3-1-15-SLR | INVESTIGATING ACTIVE-LEARNING-BASED TRAINING DATA SELECTION FOR SPEECH SPOOFING COUNTERMEASURE | 284 |
3-1-16-DIA | BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications | 162 |
3-1-17-TLP | Efficient Text Analysis with Pre-trained Neural Network Models | 300 |
3-1-18-MMP | ON THE USE OF MODALITY-SPECIFIC LARGE-SCALE PRE-TRAINED ENCODERS FOR MULTIMODAL SENTIMENT ANALYSIS | 154 |
3-1-19-EMR | Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora | 188 |
3-1-20-TTS | SIMD-SIZE AWARE WEIGHT REGULARIZATION FOR FAST NEURAL VOCODING ON CPU | 64 |
3-1-21-TTS | Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech | 219 |
3-1-22-TTS | Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation | 352 |
3-1-23-MLS | An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition | 17 |
3-1-24-TLP | Four-in-One: A Joint Approach to Inverse Text Normalization, Punctuation, Capitalization, and Disfluency for Automatic Speech Recognition | 275 |
3-1-25-SUP | On Compressing Sequences for Self-Supervised Speech Models | 238 |
Poster ID | Paper Title | Paper ID |
| JSALT 2022 Report: Eighth Frederick Jelinek Memorial Summer Workshop | |
| JSALT 2022 Report: Speech Translation for Under-Resourced Languages | |
| JSALT 2022 Report: Multilingual and Code-Switching Speech Recognition | |
| JSALT 2022 Report: Leveraging Pre-Training Models for Speech Processing | |
4-2-1-SUP | SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning | 373 |
Poster ID | Paper Title | Paper ID |
4-1-1-ASR | Inter-KD: Intermediate Knowledge Distillation for CTC-Based Automatic Speech Recognition | 67 |
4-1-2-ASR | HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch | 100 |
4-1-3-ASR | Domain Adaptation of low-resource Target-Domain models using well-trained ASR Conformer Models | 235 |
4-1-4-ASR | Personalization of CTC Speech Recognition Models | 328 |
4-1-5-MLP | A Truly Multilingual First Pass and Monolingual Second Pass Streaming On-Device ASR System | 108 |
4-1-6-ASR | UNIFIED END-TO-END SPEECH RECOGNITION AND ENDPOINTING FOR FAST AND EFFICIENT SPEECH SYSTEMS | 269 |
4-1-7-ASR | Learning mask scalars for improved robust automatic speech recognition | 293 |
4-1-8-ASR | An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition | 163 |
4-1-9-ASR | Macro-block dropout for improved regularization in training end-to-end speech recognition models | 348 |
4-1-10-SLP | On the Efficiency of Integrating Self-supervised Learning and Meta-learning for User-defined Few-shot Keyword Spotting | 214 |
4-1-11-SES | End-to-End Multi-speaker ASR with Independent Vector Analysis | 15 |
4-1-12-SES | A Hybrid Acoustic Echo Reduction Approach Using Kalman Filtering and Informed Source Extraction With Improved Training | 184 |
4-1-13-ANA | Efficient dynamic filter for robust and low computational feature extraction | 148 |
4-1-14-SLR | HOW TO BOOST ANTI-SPOOFING WITH X-VECTORS | 78 |
4-1-15-SLR | A COMPREHENSIVE STUDY ON SELF-SUPERVISED DISTILLATION FOR SPEAKER REPRESENTATION LEARNING | 311 |
4-1-16-DIA | Low-Latency Speech Separation Guided Diarization for Telephone Conversations | 241 |
4-1-17-TLP | Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems | 308 |
4-1-18-MMP | An Analysis of Semantically-Aligned Speech-Text Embeddings | 174 |
4-1-19-EMR | Combining Contrastive and Non-Contrastive Losses for Fine-Tuning Pretrained Models in Speech Analysis | 218 |
4-1-20-TTS | Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss | 77 |
4-1-21-TTS | Remap, warp and attend: Non-parallel many-to-many accent conversion with Normalizing Flows | 234 |
4-1-22-RES | MASC: Massive Arabic Speech Corpus | 39 |
4-1-23-MLS | PHONE-LEVEL PRONUNCIATION SCORING FOR L1 USING WEIGHTED-DYNAMIC TIME WARPING | 35 |
4-1-24-MLS | PROFICIENCY ASSESSMENT OF L2 SPOKEN ENGLISH USING WAV2VEC 2.0 | 340 |
4-1-25-SUP | Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations | 250 |
Poster ID | Paper Title | Authors | Session |
1-1-2-ASR | ASBERT: ASR-SPECIFIC SELF-SUPERVISED LEARNING WITH SELF-TRAINING | Hyung Yong Kim (42dot); Byeong-Yeol Kim (42dot); Seung Woo Yu (42dot); Youshin Lim (42dot); Yunkyu Lim (42dot); Hanbin Lee (42dot) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-3-ASR | SUB-8-BIT QUANTIZATION FOR ON-DEVICE SPEECH RECOGNITION: A REGULARIZATION-FREE APPROACH | Kai Zhen (Amazon); Martin Radfar (Amazon); Hieu D Nguyen (Amazon); Grant Strimel (Amazon); Athanasios Mouchtaris (Amazon); Nathan Susanj (Amazon) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-4-ASR | G-AUGMENT: SEARCHING FOR THE META-STRUCTURE OF DATA AUGMENTATION POLICIES FOR ASR | Yuan Wang (Google); Ekin D Cubuk (Google Brain); Andrew Rosenberg (Google LLC); Shuyang Cheng (Waymo LLC); Ron J Weiss (Google, Inc.); Bhuvana Ramabhadran (Google); Pedro Moreno (Google); Quoc Le (Google Brain); Daniel S Park (Google Brain) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-7-ASR | Context-aware Neural Confidence Estimation for Rare Word Speech Recognition | David Qiu (Google); Tsendsuren Munkhdalai (Google LLC); Yanzhang He (Google); Khe C Sim (Google Inc.) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-8-ASR | Flickering reduction with partial hypothesis reranking for streaming ASR | Antoine Bruguier (Google); David Qiu (Google); Trevor Strohman (Google); Yanzhang He (Google) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-9-ASR | InterDecoder: Using Attention Decoders as Intermediate Regularization for CTC-based Speech Recognition | Tatsuya Komatsu (LINE Corporation); Yusuke Fujita (LINE Corporation) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-16-ASR | CCC-WAV2VEC 2.0: CLUSTERING AIDED CROSS CONTRASTIVE SELF-SUPERVISED LEARNING OF SPEECH REPRESENTATIONS | Vasista Sai Lodagala (Indian Institute of Technology, Madras); Sreyan Ghosh (University of Maryland, College Park); S Umesh (IIT Chennai) | Mon 9 Jan - Morning session (10:30-12:30) |
1-2-1-ASR | JOIST: A Joint Speech and Text Streaming Model For ASR | Tara Sainath (Google); Rohit Prabhavalkar (Google); Ankur Bapna (Google Research); Yu Zhang (Google); Zhouyuan Huo (Google); Zhehuai Chen (Google); Bo Li (Google); Weiran Wang (Google); Trevor Strohman (Google, Inc.) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-3-ASR | A CONTEXT-AWARE KNOWLEDGE TRANSFERRING STRATEGY FOR CTC-BASED ASR | Ke-Han Lu (National Taiwan University of Science and Technology); Kuan-Yu CHEN (National Taiwan University of Science and Technology) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-4-ASR | Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR | Zhehuai Chen (Google); Ankur Bapna (Google Research); Andrew Rosenberg (Google LLC); Yu Zhang (Google); Bhuvana Ramabhadran (Google); Pedro Moreno (Google); Nanxin Chen (Google) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-6-ASR | Alternate Intermediate Conditioning with Syllable-level and Character-level Targets for Japanese ASR | Yusuke Fujita (LINE Corporation); Tatsuya Komatsu (LINE Corporation); Yusuke Kida (LINE Corp) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-7-ASR | E-Branchformer: Branchformer with Enhanced merging for speech recognition | Kwangyoun Kim (ASAPP); Felix Wu (ASAPP); Yifan Peng (Carnegie Mellon University); Jing Pan (ASAPP); Prashant Sridhar (ASAPP); Kyu Jeong Han (ASAPP); Shinji Watanabe (Carnegie Mellon University) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-8-ASR | CONFORMER-BASED ON-DEVICE STREAMING SPEECH RECOGNITION WITH KD COMPRESSION AND TWO-PASS ARCHITECTURE | Jinhwan Park (Samsung Research); Sichen Jin (Samsung); Junmo Park (Samsung Research); Sungsoo Kim (Samsung Electronics); Dhairya Sandhyana (Samsung Research); Changheon Lee (Samsung Electronics); Myoungji Han (Samsung Electronics); Jungin Lee (Samsung Electronics); Seokyeong Jung (Samsung Electronics); Chang Woo Han (Samsung Research); Chanwoo Kim (Samsung Electronics) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-9-ASR | Accelerator-Aware Training for Transducer-based Speech Recognition | Rupak Vignesh Swaminathan (Amazon.com); Suhaila Mumtaj Shakiah (Amazon); Hieu D Nguyen (Amazon); Raviteja Chinta (Amazon.com); Tariq Afzal (Amazon.com); Nathan Susanj (Amazon.com); Athanasios Mouchtaris (Amazon.com); Grant Strimel (Amazon.com); Ariya Rastrow (Amazon Alexa) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-1-ASR | Untied Positional Encodings for Efficient Transformer-based Speech Recognition | Lahiru T Samarakoon (Fano Labs, Hong Kong); Ivan Fung (Fano Labs, Hong Kong) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-2-ASR | Match to Win: Analysing Sequences Lengths for Efficient Self-supervised Learning in Speech and Audio | Yan Gao (University of Cambridge); Javier Fernandez-Marques (Samsung AI, Cambridge); Titouan Parcollet (); Pedro Gusmao (University of Cambridge); Nicholas Lane (University of Cambridge and Samsung AI) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-3-ASR | Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition | Peng Shen (NICT); Xugang Lu (NICT); Hisashi Kawai (NICT) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-4-ASR | Damage Control during Domain Adaptation for Transducer Based Automatic Speech Recognition | Somshubra Majumdar (NVIDIA); Shantanu Acharya (NVIDIA); Vitaly Lavrukhin (NVIDIA); Boris Ginsburg (NVIDIA) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-5-ASR | PADA: PRUNING ASSISTED DOMAIN ADAPTATION FOR SELF-SUPERVISED SPEECH REPRESENTATIONS | Vasista Sai Lodagala (Indian Institute of Technology, Madras); Sreyan Ghosh (University of Maryland, College Park); S Umesh (IIT Chennai) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-6-ASR | MFCCA: Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario | Fan Yu (Northwestern Polytechnical University); Shiliang Zhang (Alibaba Group); Pengcheng Guo (Northwestern Polytechnical University); Yuhao Liang (Northwestern Polytechnical University); Zhihao Du (Speech Lab, Alibaba Group); Yuxiao Lin (Zhejiang University); Lei Xie (Northwestern Polytechnical University) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-7-ASR | Fast Entropy-Based Methods of Word-Level Confidence Estimation for End-To-End Automatic Speech Recognition | Aleksandr Laptev (NVIDIA, ITMO University); Boris Ginsburg (NVIDIA) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-9-ASR | Residual Adapters for Targeted Updates in RNN-Transducer Based Speech Recognition System | Sungjun Han (University of Stuttgart); Deepak Baby (Amazon Alexa); Valentin Mendelev (Amazon Alexa) | Tue 10 Jan - Morning session (10:30-12:30) |
2-2-1-ASR | IMPROVED NOISY ITERATIVE PSEUDO-LABELING FOR SEMI-SUPERVISED SPEECH RECOGNITION | Tian Li (Shumei AI Research Institute); Qingliang Meng (Shumei AI Research Institute); Yujian Sun (Shumei AI Research Institute) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-2-ASR | GUIDED CONTRASTIVE SELF-SUPERVISED PRE-TRAINING FOR AUTOMATIC SPEECH RECOGNITION | Aparna Khare (Amazon); Minhua Wu (Amazon Inc.); Saurabhchand Bhati (Johns Hopkins University); Jasha Droppo (Amazon Inc.); Roland Maas (Amazon Inc.) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-3-ASR | Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous Speech Recognition | Jakob Poncelet (KU Leuven); Hugo Van hamme (KU Leuven) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-4-ASR | NAM+: TOWARDS SCALABLE END-TO-END CONTEXTUAL BIASING FOR ADAPTIVE ASR | Zelin Wu (Google LLC); Tsendsuren Munkhdalai (Google LLC); Golan Pundak (Google); Khe C Sim (Google Inc.); David Li (Google LLC); Pat Rondon (Google LLC); Tara Sainath (Google) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-6-ASR | Modular Hybrid Autoregressive Transducer | Zhong Meng (Google LLC); Tongzhou Chen (Google); Rohit Prabhavalkar (Google); Yu Zhang (Google); Yuan Wang (Google); Kartik Audhkhasi (Google); Jesse Emond (Google LLC); Trevor Strohman (Google LLC); Bhuvana Ramabhadran (Google); W. Ronny Huang (Google); Ehsan Variani (Google); Yinghui Huang (Google); Pedro Moreno (Google) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-7-ASR | How Does Pre-trained Wav2Vec 2.0 Perform on Domain-Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications | Juan Pablo Zuluaga Gomez (Idiap Research Institute); Amrutha Prasad (Idiap Research Institute); Iuliia Nigmatulina (Idiap Research Institute); Seyyed Saeed Sarfjoo (Idiap Research Institute); Petr Motlicek (Idiap); Matthias Kleinert (DLR); Hartmut Helmke (DLR); Oliver Ohneiser (DLR); Qingran Zhan (Beijing Institute of Technology) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-9-ASR | Internal Language Model Personalization of E2E Automatic Speech Recognition Using Random Encoder Features | Adam Stooke (Google); Khe C Sim (Google Inc.); Mason Chua (Google); Tsendsuren Munkhdalai (Google LLC); Trevor Strohman (Google) | Tue 10 Jan - Afternoon session (15:30-17:30) |
3-1-1-ASR | Towards End-to-end Unsupervised Speech Recognition | Alexander H Liu (MIT); Wei-Ning Hsu (Massachusetts Institute of Technology); Michael Auli (Facebook); Alexei Baevski (Facebook AI Research) | Wed 11 Jan - Morning session (10:30-12:30) |
3-1-3-ASR | Monotonic segmental attention for automatic speech recognition | Albert Zeyer (RWTH Aachen University); Robin Schmitt (RWTH Aachen University); Wei Zhou (RWTH Aachen University); Ralf Schlüter (RWTH Aachen University); Hermann Ney (RWTH Aachen University) | Wed 11 Jan - Morning session (10:30-12:30) |
3-1-4-ASR | STREAMING, FAST AND ACCURATE ON-DEVICE INVERSE TEXT NORMALIZATION FOR AUTOMATIC SPEECH RECOGNITION | Yashesh Gaur (Microsoft); Nick Kibre (Microsoft); Jian Xue (Microsoft Corporation); Kangyuan Shu (Microsoft); Yuhui Wang (Microsoft); Issac Alphonso (Microsoft); Jinyu Li (Microsoft); Yifan Gong (Microsoft) | Wed 11 Jan - Morning session (10:30-12:30) |
3-1-5-ASR | DUAL LEARNING FOR LARGE VOCABULARY ON-DEVICE ASR | Charles C Peyser (Google Inc.); W. Ronny Huang (Google); Tara Sainath (Google); Rohit Prabhavalkar (Google); Michael Picheny (NYU); Kyunghyun Cho (New York University) | Wed 11 Jan - Morning session (10:30-12:30) |
3-1-6-ASR | STREAMING BILINGUAL END TO END ASR MODEL USING ATTENTION OVER MULTIPLE SOFTMAX | Aditya R Patil (Microsoft); Vikas V Joshi (Microsoft); Purvi Agrawal (Microsoft); Rupesh Mehta (Microsoft) | Wed 11 Jan - Morning session (10:30-12:30) |
3-1-7-ASR | End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation | Yoshiki Masuyama (Tokyo Metropolitan University); Xuankai Chang (Carnegie Mellon University); Samuele Cornell (Università Politecnica delle Marche); Shinji Watanabe (Carnegie Mellon University); Nobutaka Ono (Tokyo Metropolitan University) | Wed 11 Jan - Morning session (10:30-12:30) |
3-1-8-ASR | Fully Unsupervised Training of Few-Shot Keyword Spotting | Minchan Kim (Seoul National University); Dongjune Lee (Seoul National University); Sung Hwan Mun (Seoul National University); Min Hyun Han (Seoul National University); Nam Soo Kim (Seoul National University) | Wed 11 Jan - Morning session (10:30-12:30) |
3-1-9-ASR | Learning a Dual-Mode Speech Recognition Model via Self-Pruning | Chunxi Liu (Meta AI); Yuan Shangguan (Meta AI); Haichuan Yang (Meta); Yangyang Shi (Facebook); Raghuraman Krishnamoorthi (Facebook); Ozlem Kalinli (Meta AI) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-1-ASR | Inter-KD: Intermediate Knowledge Distillation for CTC-Based Automatic Speech Recognition | Ji Won Yoon (Seoul National University); Beom Jun Woo (Seoul National University); Sunghwan Ahn (Seoul National University); Hyeonseung Lee (Seoul National University); Nam Soo Kim (Seoul National University) | Thu 12 Jan - Morning session (10:30-12:30) |
4-1-2-ASR | HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch | Tina Raissi (RWTH Aachen University); Wei Zhou (RWTH Aachen University); Simon Berger (RWTH Aachen University); Ralf Schlüter (RWTH Aachen University); Hermann Ney (RWTH Aachen University) | Thu 12 Jan - Morning session (10:30-12:30) |
4-1-3-ASR | Domain Adaptation of low-resource Target-Domain models using well-trained ASR Conformer Models | Vrunda N Sukhadia (Indian Institute Of Technology Madras); S Umesh (IIT Chennai) | Thu 12 Jan - Morning session (10:30-12:30) |
4-1-4-ASR | Personalization of CTC Speech Recognition Models | Saket Dingliwal (Amazon); Monica Sunkara (Amazon); Sravan Babu Bodapati (Amazon); Srikanth Ronanki (Amazon); Jeff Farris (Amazon); Katrin Kirchhoff (Amazon) | Thu 12 Jan - Morning session (10:30-12:30) |
4-1-6-ASR | UNIFIED END-TO-END SPEECH RECOGNITION AND ENDPOINTING FOR FAST AND EFFICIENT SPEECH SYSTEMS | Shaan Bijwadia (Google); Shuo-yiin Chang (Google); Tara Sainath (Google); Bo Li (Google); Chao Zhang (Google); Yanzhang He (Google) | Thu 12 Jan - Morning session (10:30-12:30) |
4-1-7-ASR | Learning mask scalars for improved robust automatic speech recognition | Arun Narayanan (Google Inc.); James Walker (Google LLC); Sankaran Panchapagesan (Google, LLC); Nathan Howard (Google LLC); Yuma Koizumi (Google) | Thu 12 Jan - Morning session (10:30-12:30) |
4-1-8-ASR | An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition | Niko Moritz (Meta); Frank Seide (Meta); Duc Le (Meta); Jay Mahadeokar (Meta AI); Christian Fuegen (Facebook) | Thu 12 Jan - Morning session (10:30-12:30) |
4-1-9-ASR | Macro-block dropout for improved regularization in training end-to-end speech recognition models | Chanwoo Kim (Samsung Electronics); Sathish Indurti (Samsung Research); Jinhwan Park (Samsung Research); Wonyong Sung (Seoul National University) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-1-10-SLP | Automatic Rating of Spontaneous Speech for Low-Resource Languages | Yaroslav Getman (Aalto University); Ragheb Al-Ghezi (Aalto University); Ekaterina Voskoboinik (Aalto University); Mittul Singh (Silo AI); Mikko Kurimo (Aalto University) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-11-SLP | Mixture of Domain Experts for Language Understanding: An Analysis of Modularity, Task Performance, and Memory Tradeoffs | Benjamin Kleiner (AWS AI Labs); Jack FitzGerald (Amazon Alexa Artificial Intelligence); Haidar Khan (Amazon Alexa AI); Gokhan Tur (Amazon Alexa AI) | Mon 9 Jan - Morning session (10:30-12:30) |
1-2-10-SLP | A DATA-DRIVEN INVESTIGATION OF NOISE-ADAPTIVE UTTERANCE GENERATION WITH LINGUISTIC MODIFICATION | Anupama Chingacham (Saarland University); Vera Demberg (Dept. of Mathematics and Computer Science, Saarland University); Dietrich Klakow (Saarland University) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-11-SLP | On the Use of Semantically-Aligned Speech Representations for Spoken Language Understanding | Gaëlle Laperrière (LIA - Avignon University); Valentin Pelloin (LIUM, Le Mans Université); Mickael Rouvier (LIA - Avignon University); Themos Stafylakis (Omilia - Conversational Intelligence); Yannick Estève (LIA - Avignon University) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-10-SLP | Response Timing Estimation for Spoken Dialog Systems based on Syntactic Completeness Prediction | Jin Sakuma (Waseda University); Shinya Fujie (Chiba Institute of Technology); Tetsunori Kobayashi (Waseda University) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-11-SLP | Weak-Supervised Dysarthria-invariant Features for Spoken Language Understanding using an FHVAE and Adversarial Training | Jinzi Qi (KULeuven); Hugo Van hamme (KU LEUVEN) | Tue 10 Jan - Morning session (10:30-12:30) |
2-2-10-SLP | Building Markovian Generative Architectures over Pretrained LM Backbones for Efficient Task-Oriented Dialog Systems | Hong Liu (Tsinghua University); Yucheng Cai (Tsinghua University); Zhijian Ou (Tsinghua University); Yi Huang (China Mobile Research); Junlan Feng (China Mobile Research) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-11-SLP | NON-AUTOREGRESSIVE END-TO-END APPROACHES FOR JOINT AUTOMATIC SPEECH RECOGNITION AND SPOKEN LANGUAGE UNDERSTANDING | Mohan LI (Toshiba Europe Ltd); Rama S Doddipatla (Toshiba Europe LTD) | Tue 10 Jan - Afternoon session (15:30-17:30) |
3-1-10-SLP | Improving Noise Robustness for Spoken Content Retrieval using semi-supervised ASR and N-best transcripts for BERT-based ranking models | Yasufumi Moriya (Dublin City University); Gareth Jones (Dublin City University) | Wed 11 Jan - Morning session (10:30-12:30) |
3-1-11-SLP | A STUDY ON THE INTEGRATION OF PRE-TRAINED SSL, ASR, LM AND SLU MODELS FOR SPOKEN LANGUAGE UNDERSTANDING | Yifan Peng (Carnegie Mellon University); Siddhant Arora (Carnegie Mellon University); Yosuke Higuchi (Waseda University); Yushi Ueda (Carnegie Mellon University); Sujay Kumar (Carnegie Mellon University); Karthik Ganesan (Carnegie Mellon University); Siddharth Dalmia (Carnegie Mellon University); Xuankai Chang (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-10-SLP | On the Efficiency of Integrating Self-supervised Learning and Meta-learning for User-defined Few-shot Keyword Spotting | Yuan-Kuei Wu (National Taiwan University); Wei-Tsung Kao (National Taiwan University); Hung-yi Lee (National Taiwan University); Chia-Ping Chen (intelliGo Technology inc.); Zhi-Sheng Chen (intelliGo Technology inc.); Yu-Pao Tsai (intelliGo Technology inc.) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-1-12-SES | MULTI-STAGE PROGRESSIVE AUDIO BANDWIDTH EXTENSION | Liang Wen (Samsung Electronics); Lizhong Wang (Samsung); Ying Zhang (Samsung Electronics); Kwang Pyo Choi (Samsung Electronics) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-13-SES | JOINT OPTIMIZATION OF DIFFUSION PROBABILISTIC-BASED MULTICHANNEL SPEECH ENHANCEMENT WITH FAR-FIELD SPEAKER VERIFICATION | Sandipana Dowerah (Inria); Romain Serizel (Université de Lorraine); Denis Jouvet (LORIA); Mohammad Mohammadamini (Laboratoire Informatique d’Avignon, University of Avignon); Driss Matrouf (Laboratoire Informatique d’Avignon, University of Avignon) | Mon 9 Jan - Morning session (10:30-12:30) |
1-2-12-SES | Spatial-DCCRN: DCCRN Equipped with Frame-level Angle Feature and Hybrid Filtering for Multi-channel Speech Enhancement | Shubo Lv (Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University); Yihui Fu (Northwestern Polytechnical University); Yukai Ju (Northwestern Polytechnical University); Lei Xie (NWPU); Weixin Zhu (Tencent); Wei Rao (Tencent); Yannan Wang (Tencent) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-13-SES | IMPROVED NORMALIZING FLOW-BASED SPEECH ENHANCEMENT USING AN ALL-POLE GAMMATONE FILTERBANK FOR CONDITIONAL INPUT REPRESENTATION | Martin Strauss (International Audio Laboratories Erlangen); Matteo Torcoli (International Audio Laboratories Erlangen); Bernd Edler (International Audio Laboratories Erlangen) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-12-SES | Exploring WavLM on Speech Enhancement | Hyungchan Song (Gwangju Institute of Science and Technology); Sanyuan Chen (Harbin Institute of Technology); Zhuo Chen (Microsoft); Yu Wu (Microsoft Research Asia); Takuya Yoshioka (Microsoft); Min Tang (Microsoft); Jong Won Shin (Gwangju Institute of Science and Technology); Shujie Liu (Microsoft Research Asia) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-13-SES | Adaptive-FSN: Integrating full-band extraction and adaptive sub-band encoding for monaural speech enhancement | Yu-Sheng Tsao (National Taiwan Normal University); Kuan-Hsun Ho (NTNU); Jeih-weih Hung (National Chi Nan University); Berlin Chen (National Taiwan Normal University) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-26-SES | AVSE CHALLENGE: AUDIO-VISUAL SPEECH ENHANCEMENT CHALLENGE | Andrea L Aldana (Edinburgh University); Cassia Valentini (University of Edinburgh); Ondrej Klejch (University of Edinburgh); Mandar Gogate (Edinburgh Napier University ); Kia K Dashtipour (Edinburgh Napier University); Amir Hussein (Edinburgh Napier University); Peter Bell (University of Edinburgh ) | Tue 10 Jan - Morning session (10:30-12:30) |
2-2-12-SES | TEA-PSE 2.0: SUB-BAND NETWORK FOR REAL-TIME PERSONALIZED SPEECH ENHANCEMENT | Yukai Ju (Northwestern Polytechnical University); Shimin Zhang (Northwestern Polytechnical University); Wei Rao (Tencent); Yannan Wang (Tencent); Tao Yu (Tencent); Lei Xie (NWPU); Shi-dong Shang (Tencent) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-13-SES | EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers | Soumi Maiti (CMU); Yushi Ueda (CMU); Shinji Watanabe (CMU); Chunlei Zhang (Tencent AI Lab); Meng Yu (Tencent); Shixiong Zhang (Tencent); Yong Xu (Tencent) | Tue 10 Jan - Afternoon session (15:30-17:30) |
3-1-12-SES | LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION | Qinghua Liu (Tianjin University); Yating Huang (Institute of Automation, Chinese Academy of Sciences (CASIA)); Yunzhe Hao (Institute of Automation,Chinese Academy of Science); Jiaming Xu (Institute of Automation Chinese Academy of Sciences); Bo Xu (Institute of Automation, Chinese Academy of Sciences) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-11-SES | End-to-End Multi-speaker ASR with Independent Vector Analysis | Robin Scheibler (LINE Corporation); Wangyou Zhang (Shanghai Jiao Tong University); Xuankai Chang (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Yanmin Qian (Shanghai Jiao Tong University) | Thu 12 Jan - Morning session (10:30-12:30) |
4-1-12-SES | A Hybrid Acoustic Echo Reduction Approach Using Kalman Filtering and Informed Source Extraction With Improved Training | Wolfgang Mack (AudioLabs Erlangen); Emanuel Habets (AudioLabs Erlangen) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-1-14-ANA | Learning Invariant Representation and Risk Minimized for Unsupervised Accent Domain Adaptation | Chendong Zhao (The Shenzhen International Graduate School, Tsinghua University, China); Jianzong Wang (Ping An Technology (Shenzhen) Co., Ltd); Xiaoyang Qu (Ping An Technology (Shenzhen) Co., Ltd); Haoqian Wang (Tsinghua Shenzhen International Graduate School, Tsinghua University); Jing Xiao (Ping An Insurance (Group) Company of China) | Mon 9 Jan - Morning session (10:30-12:30) |
1-2-14-ANA | VSAMETER: EVALUATION OF A NEW OPEN-SOURCE TOOL TO MEASURE VOWEL SPACE AREA AND RELATED METRICS | Tianyu Cao (Johns Hopkins University); Laureano Moro-Velazquez (Johns Hopkins University); Piotr Żelasko (Meaning); Jesús Villalba (Johns Hopkins University); Najim Dehak (Johns Hopkins University) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-14-ANA | INVESTIGATING THE IMPORTANT TEMPORAL MODULATIONS FOR DEEP-LEARNING-BASED SPEECH ACTIVITY DETECTION | Tyler Vuong (Carnegie Mellon University); Nikhil Madaan (Carnegie Mellon University); Rohan Panda (Carnegie Mellon University); Richard M Stern (Carnegie Mellon University) | Tue 10 Jan - Morning session (10:30-12:30) |
3-1-13-ANA | A MULTI-MODAL ARRAY OF INTERPRETABLE FEATURES TO EVALUATE LANGUAGE AND SPEECH PATTERNS IN DIFFERENT NEUROLOGICAL DISORDERS | Anna Favaro (Johns Hopkins University); Chelsie Motley (Johns Hopkins University); Tianyu Cao (Johns Hopkins University); Miguel Iglesias (Johns Hopkins University); Ankur Butala (Johns Hopkins University); Esther S. Oh (Johns Hopkins University); Robert Stevens (Johns Hopkins Hospital); Jesús Villalba (Johns Hopkins University); Najim Dehak (Johns Hopkins University); Laureano Moro-Velazquez (Johns Hopkins University) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-13-ANA | Efficient dynamic filter for robust and low computational feature extraction | Donghyeon Kim (Korea University); Jeong-gi Kwak (Korea University); Hanseok Ko (Korea University) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-2-15-SLR | FREQUENCY AND MULTI-SCALE SELECTIVE KERNEL ATTENTION FOR SPEAKER VERIFICATION | Sung Hwan Mun (Seoul National University); Jee-weon Jung (Naver Corporation); Min Hyun Han (Seoul National University); Nam Soo Kim (Seoul National University) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-15-SLR | AN ATTENTION-BASED BACKEND ALLOWING EFFICIENT FINE-TUNING OF TRANSFORMER MODELS FOR SPEAKER VERIFICATION | Junyi Peng (Brno University of Technology); Oldrich Plchot (Brno University of Technology); Themos Stafylakis (Omilia - Conversational Intelligence); Ladislav Mosner (Brno University of Technology); Lukas Burget (Brno University of Technology); Jan Cernocky (Brno University of Technology) | Tue 10 Jan - Morning session (10:30-12:30) |
2-2-14-SLR | Flow-ER: a Flow-based Embedding Regularization Strategy for Robust Speech Representation Learning | Woo Hyun Kang (Computer Research Institute of Montreal); Jahangir Alam (Computer Research Institute of Montreal (CRIM), Montreal (Quebec) Canada); Abderrahim Fathan (Computer Research Institute of Montreal (CRIM), Montreal, Quebec, Canada) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-15-SLR | UNSUPERVISED DOMAIN ADAPTATION OF NEURAL PLDA USING SEGMENT PAIRS FOR SPEAKER VERIFICATION | İsmail Rasim Ülgen (Sestek - Boğaziçi University); Mustafa Levent Arslan (Sestek - Boğaziçi Üniversitesi) | Tue 10 Jan - Afternoon session (15:30-17:30) |
3-1-14-SLR | THE CLEVER HANS EFFECT IN VOICE SPOOFING DETECTION | Bhusan Chettri (Borac Solutions) | Wed 11 Jan - Morning session (10:30-12:30) |
3-1-15-SLR | INVESTIGATING ACTIVE-LEARNING-BASED TRAINING DATA SELECTION FOR SPEECH SPOOFING COUNTERMEASURE | Xin Wang (National Institute of Informatics); Junichi Yamagishi (National Institute of Informatics) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-14-SLR | HOW TO BOOST ANTI-SPOOFING WITH X-VECTORS | Xinyue Ma (Tsinghua University); Shanshan Zhang (Tencent Research); Shen Huang (Tencent Research); Ji Gao (Tencent Research); Ying Hu (Xinjiang University); Liang HE (Tsinghua University) | Thu 12 Jan - Morning session (10:30-12:30) |
4-1-15-SLR | A COMPREHENSIVE STUDY ON SELF-SUPERVISED DISTILLATION FOR SPEAKER REPRESENTATION LEARNING | Zhengyang Chen (Shanghai Jiao Tong University); Yao Qian (Microsoft); Bing Han (Shanghai Jiao Tong University); Yanmin Qian (Shanghai Jiao Tong University); Michael Zeng (Microsoft) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-2-16-DIA | Joint speaker diarisation and tracking in switching state-space model | Jeremy H. M. Wong (Institute for Infocomm Research); Yifan Gong (Microsoft) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-16-DIA | Diarisation using location tracking with agglomerative clustering | Jeremy H. M. Wong (Institute for Infocomm Research); Igor Abramovski (Microsoft); Xiong Xiao (Microsoft); Yifan Gong (Microsoft) | Tue 10 Jan - Morning session (10:30-12:30) |
2-2-5-DIA | Continual Self-supervised Domain Adaptation for End-to-end Speaker Diarization | Juan Manuel Coria (Université Paris-Saclay CNRS, LISN); Hervé Bredin (CNRS); Sahar Ghannay (Université Paris-Saclay CNRS, LISN); Sophie Rosset (LISN) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-16-DIA | Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization | Shota Horiguchi (Hitachi, Ltd.); Yuki Takashima (Hitachi, Ltd.); Shinji Watanabe (Carnegie Mellon University); Paola Garcia (Johns Hopkins University) | Tue 10 Jan - Afternoon session (15:30-17:30) |
3-1-16-DIA | BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications | Juan Pablo Zuluaga Gomez (Idiap Research Institute); Seyyed Saeed Sarfjoo (Idiap Research Institute); Amrutha Prasad (Idiap Research Institute); Iuliia Nigmatulina (Idiap Research Institute); Petr Motlicek (Idiap); Karel Ondrej (BUT); Oliver Ohneiser (DLR); Hartmut Helmke (DLR) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-16-DIA | Low-Latency Speech Separation Guided Diarization for Telephone Conversations | Giovanni Morrone (Università Politecnica delle Marche); Samuele Cornell (Università Politecnica delle Marche); Desh Raj (Johns Hopkins University); Luca Serafini (Università Politecnica delle Marche); Enrico Zovato (PerVoice S.p.A.); Alessio Brutti (FBK); Stefano Squartini (Università Politecnica delle Marche) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-1-17-TLP | Fine Grained Spoken Document Summarization Through Text Segmentation | Samantha Kotey (Trinity College Dublin); Rozenn Dahyot (Maynooth University); Naomi Harte (Trinity College Dublin) | Mon 9 Jan - Morning session (10:30-12:30) |
1-2-17-TLP | AN ANALYSIS OF THE EFFECTS OF DECODING ALGORITHMS ON FAIRNESS IN OPEN-ENDED LANGUAGE GENERATION | Jwala Dhamala (Amazon Alexa AI); Varun Kumar (Amazon Alexa ); Rahul Gupta (Amazon); Kai-Wei Chang (UCLA); Aram Galstyan (USC Information Sciences Institute) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-17-TLP | N-BEST HYPOTHESES RERANKING FOR TEXT-TO-SQL SYSTEMS | Lu Zeng (Amazon); Sree Hari Krishnan Parthasarathi (Amazon); Dilek Z Hakkani-Tur (Amazon Alexa AI) | Tue 10 Jan - Morning session (10:30-12:30) |
3-1-17-TLP | Efficient Text Analysis with Pre-trained Neural Network Models | Jia Cui (Tencent ); Heng Lu (Tencent AI Lab); Wenjie Wang (Emory University); Shiyin Kang (Tencent); Liqiang He (Tencent); Guangzhi Li (Tencent); Dong Yu (Tencent AI Lab) | Wed 11 Jan - Morning session (10:30-12:30) |
2-1-24-TLP | Four-in-One: A Joint Approach to Inverse Text Normalization, Punctuation, Capitalization, and Disfluency for Automatic Speech Recognition | Sharman W Tan (Microsoft); Piyush Behre (Microsoft); Nick Kibre (Microsoft); Issac Alphonso (Microsoft); Shawn Chang () | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-17-TLP | Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems | Hiroaki Sugiyama (NTT); Masahiro Mizukami (NTT); Tsunehiro Arimoto (NTT); Hiromi Narimatsu (NTT); Yuya Chiba (NTT); Hideharu Nakajima (NTT); Toyomi Meguro (NTT) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-1-18-MMP | Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection | Xuanjun Chen (National Taiwan University); Haibin Wu (National Taiwan University); Hung-yi Lee (National Taiwan University); Helen Meng (The Chinese University of Hong Kong); Roger Jang () | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-19-MMP | Towards visually prompted keyword localisation for zero-resource spoken languages | Leanne Nortje (Stellenbosch University); Herman Kamper (Stellenbosch University) | Mon 9 Jan - Morning session (10:30-12:30) |
1-2-18-MMP | Exploiting information from native data for non-native automatic pronunciation assessment | Binghuai Lin (MIG, Tencent Science and Technology Ltd.); Liyuan Wang (Tencent Technology Co., Ltd) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-8-MMP | TRANSFORMER-BASED LIP-READING WITH REGULARIZED DROPOUT AND RELAXED ATTENTION | Zhengyang Li (Technische Universität Carolo-Wilhelmina Braunschweig); Timo Lohrenz (Technische Universität Carolo-Wilhelmina Braunschweig); Matthias Dunkelberg (Technische Universität Carolo-Wilhelmina Braunschweig); Tim Fingscheidt (Technische Universität Carolo-Wilhelmina Braunschweig) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-18-MMP | SpeechCLIP: Integrating Speech with Pre-trained Vision and Language Model | Yi-Jen Shih (National Taiwan University); Hsuan-Fu Wang (Academia Sinica); Heng-Jui Chang (Massachusetts Institute of Technology); Layne Berry (University of Texas at Austin); Hung-yi Lee (National Taiwan University); David Harwath (The University of Texas at Austin) | Tue 10 Jan - Morning session (10:30-12:30) |
2-2-18-MMP | YFACC: A Yorùbá Speech-Image Dataset for Cross-lingual Keyword Localisation through Visual Grounding | Kayode K Olaleye (University of Stellenbosch); Dan Oneață (University Politehnica of Bucharest); Herman Kamper (Stellenbosch University) | Tue 10 Jan - Afternoon session (15:30-17:30) |
3-1-18-MMP | ON THE USE OF MODALITY-SPECIFIC LARGE-SCALE PRE-TRAINED ENCODERS FOR MULTIMODAL SENTIMENT ANALYSIS | Atsushi Ando (NTT Corporation); Ryo Masumura (NTT Corporation); Akihiko Takashima (NTT); Satoshi Suzuki (NTT Computer and Data Science Laboratories / The University of Electro-Communications); Naoki Makishima (NTT Corporation); Keita Suzuki (NTT Corporation); Takafumi Moriya (NTT Corporation); Takanori Ashihara (NTT Corporation); Hiroshi Sato (NTT Corporation) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-18-MMP | An Analysis of Semantically-Aligned Speech-Text Embeddings | Muhammad Huzaifah (Institute for Infocomm Research, ASTAR); Ivan Kukanov (Institute for Infocomm Research, ASTAR) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-1-1-MLP | Exploration of Language-Specific Self-Attention Parameters for Multilingual End-to-End Speech Recognition | Brady Houston (AWS AI Labs); Katrin Kirchhoff (Amazon) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-5-MLP | How Do Phonological Properties Affect Bilingual Automatic Speech Recognition? | Shelly Jain (International Institute of Information Technology, Hyderabad); Aditya Yadavalli (International Institute of Information Technology, Hyderabad); Sai Ganesh Mirishkar (IIIT Hyderabad); Anil Vuppala (International Institute of Information Technology Hyderabad) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-6-MLP | Scaling Up Deliberation for Multilingual ASR | Ke Hu (Google); Tara Sainath (Google); Bo Li (Google) | Mon 9 Jan - Morning session (10:30-12:30) |
1-2-2-MLP | Code-switched language modelling using a code predictive LSTM in under-resourced South African languages | Joshua Miles Jansen Van Vüren (Stellenbosch University); Thomas Niesler (Stellenbosch University) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-5-MLP | IMPROVING LUXEMBOURGISH SPEECH RECOGNITION WITH CROSS-LINGUAL SPEECH REPRESENTATIONS | Le Minh Nguyen (University of Groningen); Shekhar Nayak (University of Groningen); Matt Coler (University of Groningen) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-19-MLP | Textual Data Augmentation for Arabic-English Code-Switching Speech Recognition | Amir Hussein (Johns Hopkins University); Shammur Chowdhury (QCRI); Ahmed Abdelali (QCRI); Najim Dehak (Johns Hopkins University); Ahmed Ali (Qatar Computing Research Institute, HBKU); Sanjeev Khudanpur (Johns Hopkins University) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-23-MLP | FLEURS: FEW-SHOT LEARNING EVALUATION OF UNIVERSAL REPRESENTATIONS OF SPEECH | Alexis Conneau (FAIR); Min Ma (Google Research); Simran Khanuja (Google); Yu Zhang (Google); Vera Axelrod (Google, Inc); Siddharth Dalmia (Carnegie Mellon University ); Jason Riesa (Google); Clara Rivera (Google); Ankur Bapna (Google Research) | Tue 10 Jan - Morning session (10:30-12:30) |
2-2-8-MLP | Improving Semi-supervised E2E ASR using CycleGAN and Inter-domain Losses | Chia-Yu Li (Institute for Natural Language Processing (IMS), University of Stuttgart); Ngoc Thang Vu (University of Stuttgart) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-19-MLP | MULTILINGUAL SPEECH EMOTION RECOGNITION WITH MULTI-GATING MECHANISM AND NEURAL ARCHITECTURE SEARCH | Zihan Wang (Columbia University); Qi Meng (Columbia University); Haifeng Lan (Columbia University); Xinrui Zhang (Columbia University); Kehao Guo (Columbia University); Akshat Gupta (JPMorgan) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-22-MLP | Disentangled Speech Representation Learning for One-Shot Cross-lingual Voice Conversion Using $\beta$-VAE | Hui Lu (The Chinese University of Hong Kong); Disong Wang (The Chinese University of Hong Kong); Xixin Wu (The Chinese University of Hong Kong); Zhiyong Wu (Tsinghua University); Xunying Liu (The Chinese University of Hong Kong); Helen Meng (The Chinese University of Hong Kong) | Tue 10 Jan - Afternoon session (15:30-17:30) |
3-1-2-MLP | Exploring a unified ASR for multiple south Indian languages leveraging multilingual acoustic and language models | ANOOP C. S. (Indian Institute of Science, Bengaluru); Ramakrishnan A G (INDIAN INSTITUTE OF SCIENCE) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-5-MLP | A Truly Multilingual First Pass and Monolingual Second Pass Streaming On-Device ASR System | Sepand Mavandadi (Google); Bo Li (Google); Chao Zhang (Google); Brian Farris (Google); Tara Sainath (Google); Trevor Strohman (Google) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-1-20-EMR | SPEECH EMOTION RECOGNITION WITH COMPLEMENTARY ACOUSTIC REPRESENTATIONS | Xiaoming Zhang (Nanjing University of Technology); Fan Zhang (IBM Massachusetts Laboratory); Xiaodong Cui (IBM T. J. Watson Research Center); Wei Zhang (Wayfair) | Mon 9 Jan - Morning session (10:30-12:30) |
1-2-20-EMR | A ZERO-SHOT APPROACH TO IDENTIFYING CHILDREN’S SPEECH IN AUTOMATIC GENDER CLASSIFICATION | Amruta Saraf (Pindrop); Ganesh Sivaraman (Pindrop); Elie Khoury (Pindrop) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-19-EMR | Distribution-based Emotion Recognition in Conversation | Wen Wu (University of Cambridge); Chao Zhang (University of Cambridge); Phil Woodland (Machine Intelligence Laboratory, Cambridge University Department of Engineering) | Tue 10 Jan - Morning session (10:30-12:30) |
3-1-19-EMR | Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora | Yuanchao Li (University of Edinburgh); Yumnah Mohamied (University of Edinburgh); Peter Bell (University of Edinburgh ); Catherine Lai (University of Edinburgh) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-19-EMR | Combining Contrastive and Non-Contrastive Losses for Fine-Tuning Pretrained Models in Speech Analysis | Florian Lux (University of Stuttgart); Ching-Yi Chen (University of Stuttgart); Ngoc Thang Vu (University of Stuttgart) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-1-21-TTS | WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration | Yuma Koizumi (Google); Kohei Yatabe (Tokyo University of Agriculture and Technology); Heiga Zen (Google); Michiel Bacchiani (Google) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-22-TTS | On granularity of prosodic representations in expressive text-to-speech | Mikolaj Babianski (Amazon); Kamil Pokora (Amazon); Raahil Shah (Amazon); Rafał Sienkiewicz (Amazon); Daniel Korzekwa (Amazon); Viacheslav Klimkov (Amazon) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-23-TTS | Can we use Common Voice to train a Multi-Speaker TTS system? | Sewade O Ogun (Inria); Vincent Colotte (LORIA); Emmanuel Vincent (Inria) | Mon 9 Jan - Morning session (10:30-12:30) |
1-2-21-TTS | GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models | Matthew Baas (Stellenbosch University); Herman Kamper (Stellenbosch University) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-22-TTS | Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy | Sarina Meyer (University of Stuttgart); Pascal Tilli (University of Stuttgart); Pavel Denisov (University of Stuttgart); Florian Lux (University of Stuttgart); Julia Koch (University of Stuttgart); Ngoc Thang Vu (University of Stuttgart) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-20-TTS | StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models | Yinghao A Li (Columbia University); Cong Han (Columbia University); Nima Mesgarani (Columbia University) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-21-TTS | Learning accent representation with multi-level VAE towards controllable speech synthesis | Jan Melechovsky (Singapore University of Technology and Design); Ambuj Mehrish (SUTD); Dorien Herremans (Singapore University of Technology and Design); Berrak Sisman (Singapore University of Technology and Design (SUTD)) | Tue 10 Jan - Morning session (10:30-12:30) |
2-1-22-TTS | vTTS: visual-text to speech | Yoshifumi Nakano (The University of Tokyo); Takaaki Saeki (The University of Tokyo); Shinnosuke Takamichi (The University of Tokyo); Katsuhito Sudoh (Nara Institute of Science and Technology); Hiroshi Saruwatari (The University of Tokyo) | Tue 10 Jan - Morning session (10:30-12:30) |
2-2-20-TTS | Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech | Dominik Wagner (Technische Hochschule Nuernberg Georg Simon Ohm); Sebastian P Bayerl (Technische Hochschule Nürnberg Georg Simon Ohm); Hector Cordourier (Intel); Tobias Bocklet (TH Nürnberg ) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-21-TTS | Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion | Ding Ma (Nagoya University); Lester Phillip G Violeta (Nagoya University); Kazuhiro Kobayashi (Nagoya University); Tomoki Toda (Nagoya University) | Tue 10 Jan - Afternoon session (15:30-17:30) |
3-1-20-TTS | SIMD-SIZE AWARE WEIGHT REGULARIZATION FOR FAST NEURAL VOCODING ON CPU | Hiroki Kanagawa (NTT Corporation); Yusuke Ijima (NTT Corporation) | Wed 11 Jan - Morning session (10:30-12:30) |
3-1-21-TTS | Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech | Florian Lux (University of Stuttgart); Julia Koch (University of Stuttgart); Ngoc Thang Vu (University of Stuttgart) | Wed 11 Jan - Morning session (10:30-12:30) |
3-1-22-TTS | Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation | Rendi Chevi (Kata.ai); Radityo Eko Prasojo (Kata.ai); Alham Fikri Aji (Amazon); Andros Tjandra (Meta AI, US); Sakriani Sakti (Japan Advanced Institute of Science and Technology) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-20-TTS | Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss | Efthymios Georgiou (National Technical University of Athens); Kosmas Kritsis (Athena Research Center); Georgios Paraskevopoulos (National Technical University of Athens); Athanasios Katsamanis (ATHENA R.C., Behavioral Signal Technologies); Vassilis Katsouros (Athena Research Center); Alexandros Potamianos (National Technical University of Athens) | Thu 12 Jan - Morning session (10:30-12:30) |
4-1-21-TTS | Remap, warp and attend: Non-parallel many-to-many accent conversion with Normalizing Flows | Abdelhamid Ezzerg (Amazon); Thomas Merritt (Amazon); Kayoko Yanagisawa (Amazon); Piotr Bilinski (Amazon); Magdalena Proszewska (Jagiellonian University); Kamil Pokora (Amazon); Renard Korzeniowski (Amazon); Roberto Barra-Chicote (Amazon); Daniel Korzekwa (Amazon) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-2-23-RES | STOP: A DATASET FOR SPOKEN TASK ORIENTED SEMANTIC PARSING | Paden Tomasello (Meta); Akshat Shrivastava (Meta); Daniel A Lazar (Meta); Po-chun Hsu (Meta); Duc Le (Meta); Adithya Sagar (Facebook AI); Ali Elkahky (Meta); Jade Copet (Meta); Wei-Ning Hsu (Massachusetts Institute of Technology); Yossi Adi (Facebook AI Research ); Robin Algayres (Meta); Tu Anh Nguyen (Meta); Emmanuel Dupoux (Facebook AI Research); Luke Zettlemoyer (Facebook); Abdel-rahman Mohamed (Facebook AI Research (FAIR)) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-2-23-RES | Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition | Injy Hamed (New York University Abu Dhabi; Stuttgart University); Amir Hussein (Johns Hopkins University); Oumnia Chellah (Stanford University); Shammur Chowdhury (QCRI); Hamdy Mubarak (Qatar Computing Research Institute, HBKU); Sunayana Sitaram (Microsoft Research); Nizar Habash (); Ahmed Ali (Qatar Computing Research Institute, HBKU) | Tue 10 Jan - Afternoon session (15:30-17:30) |
4-1-22-RES | MASC: Massive Arabic Speech Corpus | Mohammad Al-Fetyani (Appswave); Mohammad AlBarham (Appswave); Gheith A. Abandah (); Adham Alsharkawi (The University of Jordan); Maha Dawas (Planning and Statistics Authority) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
1-1-15-MLS | Speed-Robust Keyword Spotting via Soft Self-Attention on Multi-Scale Features | Chaoyue Ding (SenseTime Group Limited); Jiakui Li (SenseTime Group Limited); Martin Zong (SenseTime Group Limited); Baoxiang Li (SenseTime Group Limited) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-24-MLS | Distilling Sequence-to-Sequence Voice Conversion Models For Streaming Conversion Applications | Kou Tanaka (NTT Corporation); Hirokazu Kameoka (NTT Communication Science Laboratories, NTT Corporation); Takuhiro Kaneko (NTT Corporation); Shogo Seki (NTT Corporation) | Mon 9 Jan - Morning session (10:30-12:30) |
1-1-25-MLS | AUTOMATIC PREDICTION OF INTELLIGIBILITY OF WORDS AND PHONEMES PRODUCED ORALLY BY JAPANESE LEARNERS OF ENGLISH | Nobuaki Minematsu (The University of Tokyo); Chuanbo Zhu (The University of Tokyo); Takuya Kunihara (The University of Tokyo); Daisuke Saito (The University of Tokyo); Noriko Nakanishi (Kobe Gakuin University) | Mon 9 Jan - Morning session (10:30-12:30) |
1-2-24-MLS | SVLDL: Improved Speaker Age Estimation Using Selective Variance Label Distribution Learning | Zuheng Kang (Ping An Technology (Shenzhen) Co., Ltd); Jianzong Wang (Ping An Technology (Shenzhen) Co., Ltd); Junqing Peng (Ping An Technology (Shenzhen) Co., Ltd); Jing Xiao (Ping An Insurance (Group) Company of China) | Mon 9 Jan - Afternoon session (15:00-17:00) |
1-2-25-MLS | PEPPANET: EFFECTIVE MISPRONUNCIATION DETECTION AND DIAGNOSIS LEVERAGING PHONETIC, PHONOLOGICAL, AND ACOUSTIC CUES | Bi-Cheng Yan (National Taiwan Normal University); Hsin-Wei Wang (National Taiwan Normal University); Berlin Chen (National Taiwan Normal University) | Mon 9 Jan - Afternoon session (15:00-17:00) |
2-1-24-MLS | Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection | Samuele Cornell (Università Politecnica delle Marche); Thomas Balestri (Amazon); Thibaud Senechal (Amazon) | Tue 10 Jan - Morning session (10:30-12:30) |
2-2-17-MLS | TDOA ESTIMATION OF SPEECH SOURCE IN NOISY REVERBERANT ENVIRONMENTS | Suliang Bu (University of Missouri); Tuo Zhao (University of Missouri); Yunxin Zhao (University of Missouri) | Tue 10 Jan - Afternoon session (15:30-17:30) |
2-2-24-MLS | Phoneme Segmentation Using Self-Supervised Speech Models | Luke Strgar (University of Texas, Austin); David Harwath (The University of Texas at Austin) | Tue 10 Jan - Afternoon session (15:30-17:30) |
3-1-23-MLS | An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition | Chao-Han Huck Yang (Georgia Institute of Technology); I-Fan Chen (Amazon Inc.); Andreas Stolcke (Amazon); Sabato M Siniscalchi (Kore University of Enna); Chin-hui Lee (Georgia Institute of Technology) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-23-MLS | PHONE-LEVEL PRONUNCIATION SCORING FOR L1 USING WEIGHTED-DYNAMIC TIME WARPING | Aghilas SINI (Univ Rennes, CNRS, IRISA); Antoine Perquin (Univ Rennes, CNRS, IRISA); Damien Lolive (Univ Rennes, CNRS, IRISA); Arnaud Delhay (IRISA) | Thu 12 Jan - Morning session (10:30-12:30) |
4-1-24-MLS | PROFICIENCY ASSESSMENT OF L2 SPOKEN ENGLISH USING WAV2VEC 2.0 | Stefano Bannò (University of Trento); Marco Matassoni (Fondazione Bruno Kessler) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
4-2-1-SUP | SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning | Tzu-hsun Feng (National Taiwan University); Annie Dong (Meta); Ching-Feng Yeh (Facebook); Shu-wen Yang (National Taiwan University); Tzu-Quan Lin (National Taiwan University); Jiatong Shi (Carnegie Mellon University); Kai-Wei Chang (National Taiwan University); Zili Huang (Johns Hopkins University); Haibin Wu (National Taiwan University); Xuankai Chang (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Abdel-rahman Mohamed (Facebook AI Research (FAIR)); Shang-Wen Li (Meta); Hung-yi Lee (National Taiwan University) | Thu 12 Jan - JSALT 2022 Reports and Superb Challenge overview (8:30 - 10:00) |
1-1-26-SUP | On the Utility of Self-supervised Models for Prosody-related Tasks | Guan-Ting Lin (National Taiwan University); Chi Luen Feng (National Taiwan University); Wei-Ping Huang (National Taiwan University); Yuan Tseng (National Taiwan University); Chen An Li (National Taiwan University); Tzu-Han Lin (National Taiwan University); Hung-yi Lee (National Taiwan University); Nigel Ward (UTEP) | Mon 9 Jan - Morning session (10:30-12:30) |
2-1-25-SUP | Improving generalizability of distilled self-supervised speech processing models under distorted settings | Kuan-Po Huang (National Taiwan University); YU-KUAN FU (NTU); Tsu-Yuan Hsu (National Taiwan University); Fabian Alejandro Ritter Gutierrez (National University of Singapore); Fan-Lin Wang (Academia Sinica); Liang-Hsuan Tseng (National Taiwan University); Yu Zhang (Google); Hung-yi Lee (National Taiwan University) | Tue 10 Jan - Morning session (10:30-12:30) |
2-2-25-SUP | Exploring Efficient-tuning Methods in Self-supervised Speech Models | Zih-Ching Chen (National Taiwan University); Chin-Lun Fu (National Taiwan University); Chih Ying Liu (National Taiwan University); Shang-Wen Li (AWS AI); Hung-yi Lee (National Taiwan University) | Tue 10 Jan - Afternoon session (15:30-17:30) |
3-1-25-SUP | On Compressing Sequences for Self-Supervised Speech Models | Yen Meng (National Taiwan University); Hsuan-Jui Chen (National Taiwan University); Jiatong Shi (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Paola Garcia (Johns Hopkins University); Hung-yi Lee (National Taiwan University); Hao Tang (The University of Edinburgh) | Wed 11 Jan - Morning session (10:30-12:30) |
4-1-25-SUP | Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations | Themos Stafylakis (Omilia - Conversational Intelligence); Ladislav Mošner (Brno University of Technology); Sofoklis Kakouros (University of Helsinki); Plchot Oldřich (Brno University of Technology); Lukas Burget (Brno University of Technology); Jan Honza Cernocky (Brno University of Technology) | Thu 12 Jan - Morning session (10:30-12:30) |
Poster ID | Paper Title | Authors | Session |
2-3-1-DEMO | ISPEAK: INTERACTIVE SPOKEN LANGUAGE UNDERSTANDING SYSTEM FOR CHILDREN WITH SPEECH AND LANGUAGE DISORDERS | Baihan Lin (Columbia University Irving Medical Center, New York, US); Xinxin Zhang (Elizabeth Seton Children’s Center, New York, US) | Tue 10 Jan - Demo session (15:30-17:30) |
2-3-2-DEMO | LUX-ASR: BUILDING AN ASR SYSTEM FOR THE LUXEMBOURGISH LANGUAGE | Peter Gilles (University of Luxembourg, Luxembourg); Nina Hosseini-Kivanani (University of Luxembourg, Luxembourg); Leopold Hillah (University of Luxembourg, Luxembourg) | Tue 10 Jan - Demo session (15:30-17:30) |
2-3-3-DEMO | ON-DEVICE STREAMING TARGET-SPEAKER ASR WITH NEURAL TRANSDUCER | Takafumi Moriya (NTT Corporation, Japan); Hiroshi Sato (NTT Corporation, Japan); Tsubasa Ochiai (NTT Corporation, Japan); Marc Delcroix (NTT Corporation, Japan); Taichi Asami (NTT Corporation, Japan) | Tue 10 Jan - Demo session (15:30-17:30) |
2-3-4-DEMO | VOICE-ENABLED AUDIOVISUAL AGENT FOR QUESTION ANSWERING IN ENGLISH AND ARABIC | Oscar Saz (Emotech Ltd, London, UK); Ahmed Abdellah (Emotech Ltd, London, UK); Luca McArthur (Emotech Ltd, London, UK); Daniel McKenna (Emotech Ltd, London, UK); Simon Shelley (Emotech Ltd, London, UK); Xinyue Zhang (Emotech Ltd, London, UK) | Tue 10 Jan - Demo session (15:30-17:30) |