Accepted Papers

  • Flow-ER: a Flow-based Embedding Regularization Strategy for Robust Speech Representation Learning; Kang, Woo Hyun*; Alam, Jahangir; Fathan, Abderrahim
  • Continual Self-supervised Domain Adaptation for End-to-end Speaker Diarization; Coria, Juan Manuel*; Bredin, Hervé; Ghannay, Sahar; Rosset, Sophie
  • Fine Grained Spoken Document Summarization Through Text Segmentation; Kotey, Samantha*; Dahyot, Rozenn; Harte, Naomi
  • Joint speaker diarisation and tracking in switching state-space model; Wong, Jeremy H. M.*; Gong, Yifan
  • Diarisation using location tracking with agglomerative clustering; Wong, Jeremy H. M.*; Abramovski, Igor; Xiao, Xiong; Gong, Yifan
  • Exploration of Language-Specific Self-Attention Parameters for Multilingual End-to-End Speech Recognition; Houston, Brady*; Kirchhoff, Katrin
  • End-to-End Multi-speaker ASR with Independent Vector Analysis; Scheibler, Robin*; Zhang, Wangyou; Chang, Xuankai; Watanabe, Shinji; Qian, Yanmin
  • An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition; Yang, Chao-Han Huck*; Chen, I-Fan; Stolcke, Andreas; Siniscalchi, Sabato M; Lee, Chin-hui
  • The Clever Hans Effect in Voice Spoofing Detection; Chettri, Bhusan*
  • Distribution-based Emotion Recognition in Conversation; Wu, Wen*; Zhang, Chao; Woodland, Phil
  • JOIST: A Joint Speech and Text Streaming Model For ASR; Sainath, Tara*; Prabhavalkar, Rohit; Bapna, Ankur; Zhang, Yu; Huo, Zhouyuan; Chen, Zhehuai; Li, Bo; Wang, Weiran; Strohman, Trevor
  • Mixture of Domain Experts for Language Understanding: An Analysis of Modularity, Task Performance, and Memory Tradeoffs; Kleiner, Benjamin*; FitzGerald, Jack; Khan, Haidar; Tur, Gokhan
  • Dual Learning for Large Vocabulary On-Device ASR; Peyser, Charles C*; Huang, Ronny; Sainath, Tara; Prabhavalkar, Rohit; Picheny, Michael; Cho, Kyunghyun
  • Untied Positional Encodings for Efficient Transformer-based Speech Recognition; Samarakoon, Lahiru T*; Fung, Ivan
  • Phone-Level Pronunciation Scoring for L1 Using Weighted-Dynamic Time Warping; Sini, Aghilas*; Perquin, Antoine; Lolive, Damien; Delhay, Arnaud
  • MASC: Massive Arabic Speech Corpus; Al-Fetyani, Mohammad*; AlBarham, Mohammad; Abandah, Gheith A.; Alsharkawi, Adham; Dawas, Maha
  • Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection; Chen, Xuanjun*; Wu, Haibin; Lee, Hung-yi; Meng, Helen; Jang, Roger
  • StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models; Li, Yinghao A*; Han, Cong; Mesgarani, Nima
  • Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech; Wagner, Dominik*; Bayerl, Sebastian P; Cordourier, Hector; Bocklet, Tobias
  • Improving generalizability of distilled self-supervised speech processing models under distorted settings; Huang, Kuan-Po*; Fu, Yu-Kuan; Hsu, Tsu-Yuan; Ritter Gutierrez, Fabian Alejandro; Wang, Fan-Lin; Tseng, Liang-Hsuan; Zhang, Yu; Lee, Hung-yi
  • An Analysis of the Effects of Decoding Algorithms on Fairness in Open-Ended Language Generation; Dhamala, Jwala*; Kumar, Varun; Gupta, Rahul; Chang, Kai-Wei; Galstyan, Aram
  • SIMD-Size Aware Weight Regularization for Fast Neural Vocoding on CPU; Kanagawa, Hiroki*; Ijima, Yusuke
  • Improved Noisy Iterative Pseudo-Labeling for Semi-Supervised Speech Recognition; Li, Tian*; Meng, Qingliang; Sun, Yujian
  • Towards End-to-end Unsupervised Speech Recognition; Liu, Alexander H*; Hsu, Wei-Ning; Auli, Michael; Baevski, Alexei
  • Inter-KD: Intermediate Knowledge Distillation for CTC-Based Automatic Speech Recognition; Yoon, Ji Won; Woo, Beom Jun; Ahn, Sunghwan; Lee, Hyeonseung; Kim, Nam Soo*
  • Multi-Stage Progressive Audio Bandwidth Extension; Wen, Liang*; Wang, Lizhong; Zhang, Ying; Choi, Kwang Pyo
  • Spatial-DCCRN: DCCRN Equipped with Frame-level Angle Feature and Hybrid Filtering for Multi-channel Speech Enhancement; Lv, Shubo*; Fu, Yihui; Ju, Yukai; Xie, Lei; Zhu, Weixin; Rao, Wei; Wang, Yannan
  • ASBERT: ASR-Specific Self-Supervised Learning with Self-Training; Kim, Hyungyong; Kim, Byeong-Yeol*; Yu, Seung Woo; Lim, Youshin; Lim, Yunkyu; Lee, Hanbin
  • Code-switched language modelling using a code predictive LSTM in under-resourced South African languages; Jansen Van Vuren, Joshua M*; Niesler, Thomas
  • Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss; Georgiou, Efthymios*; Kritsis, Kosmas; Paraskevopoulos, Georgios; Katsamanis, Athanasios; Katsouros, Vassilis; Potamianos, Alexandros
  • How to Boost Anti-Spoofing with X-Vectors; Ma, Xinyue*; Zhang, Shanshan; Huang, Shen; Gao, Ji; Hu, Ying; He, Liang
  • Speed-Robust Keyword Spotting via Soft Self-Attention on Multi-Scale Features; Ding, Chaoyue*; Li, Jiakui; Zong, Martin; Li, Baoxiang
  • Can we use Common Voice to train a Multi-Speaker TTS system?; Ogun, Sewade O*; Colotte, Vincent; Vincent, Emmanuel
  • Transformer-Based Lip-Reading with Regularized Dropout and Relaxed Attention; Li, Zhengyang*; Lohrenz, Timo; Dunkelberg, Matthias; Fingscheidt, Tim
  • A Data-Driven Investigation of Noise-Adaptive Utterance Generation with Linguistic Modification; Chingacham, Anupama*; Demberg, Vera; Klakow, Dietrich
  • Flickering reduction with partial hypothesis reranking for streaming ASR; Bruguier, Antoine*; Qiu, David; Strohman, Trevor; He, Yanzhang
  • Match to Win: Analysing Sequences Lengths for Efficient Self-supervised Learning in Speech and Audio; Gao, Yan*; Fernandez-Marques, Javier; Parcollet, Titouan; Gusmao, Pedro; Lane, Nicholas
  • Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization; Horiguchi, Shota*; Takashima, Yuki; Watanabe, Shinji; Garcia, Paola
  • Guided Contrastive Self-Supervised Pre-Training for Automatic Speech Recognition; Khare, Aparna*; Wu, Minhua; Bhati, Saurabhchand; Droppo, Jasha; Maas, Roland
  • Exploring a unified ASR for multiple South Indian languages leveraging multilingual acoustic and language models; C. S., Anoop*; A G, Ramakrishnan
  • HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch; Raissi, Tina*; Zhou, Wei; Berger, Simon; Schlüter, Ralf; Ney, Hermann
  • Exploring Efficient-tuning Methods in Self-supervised Speech Models; Chen, Zih-Ching; Fu, Chin-Lun; Liu, Chih Ying; Li, Shang-Wen; Lee, Hung-yi*
  • A Multi-Modal Array of Interpretable Features to Evaluate Language and Speech Patterns in Different Neurological Disorders; Favaro, Anna*; Motley, Chelsie; Cao, Tianyu; Iglesias, Miguel; Butala, Ankur; Oh, Esther S.; Stevens, Robert; Villalba, Jesús; Dehak, Najim; Moro-Velazquez, Laureano
  • A Truly Multilingual First Pass and Monolingual Second Pass Streaming On-Device ASR System; Mavandadi, Sepand*; Li, Bo; Zhang, Chao; Farris, Brian; Sainath, Tara; Strohman, Trevor
  • Scaling Up Deliberation for Multilingual ASR; Hu, Ke*; Sainath, Tara; Li, Bo
  • On the Use of Semantically-Aligned Speech Representation for Spoken Language Understanding; Laperrière, Gaëlle; Pelloin, Valentin; Rouvier, Mickael; Stafylakis, Themos; Estève, Yannick*
  • Multilingual Speech Emotion Recognition with Multi-Gating Mechanism and Neural Architecture Search; Wang, Zihan*; Meng, Qi; Lan, Haifeng; Zhang, Xinrui; Guo, Kehao; Gupta, Akshat
  • Improving Semi-supervised E2E ASR using CycleGAN and Inter-domain Losses; Li, Chia-Yu*; Thang, Vu
  • STOP: A Dataset for Spoken Task Oriented Semantic Parsing; Tomasello, Paden*; Shrivastava, Akshat; Lazar, Daniel A; Hsu, Po-chun; Le, Duc; Sagar, Adithya; Elkahky, Ali; Copet, Jade; Hsu, Wei-Ning; Adi, Yossi; Algayres, Robin; Nguyen, Tu Anh; Dupoux, Emmanuel; Zettlemoyer, Luke; Mohamed, Abdel-rahman
  • Alternate Intermediate Conditioning with Syllable-level and Character-level Targets for Japanese ASR; Fujita, Yusuke*; Komatsu, Tatsuya; Kida, Yusuke
  • Exploiting information from native data for non-native automatic pronunciation assessment; Lin, Binghuai; Wang, Liyuan*
  • Fully Unsupervised Training of Few-Shot Keyword Spotting; Kim, Minchan*; Lee, Dongjune; Mun, Sung Hwan; Han, Min Hyun; Kim, Nam Soo
  • FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech; Conneau, Alexis; Ma, Min*; Khanuja, Simran; Zhang, Yu; Axelrod, Vera; Dalmia, Siddharth; Riesa, Jason; Rivera, Clara; Bapna, Ankur
  • Sub-8-Bit Quantization for On-Device Speech Recognition: A Regularization-Free Approach; Zhen, Kai*; Radfar, Martin; Nguyen, Hieu D; Strimel, Grant; Mouchtaris, Athanasios; Susanj, Nathan
  • Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification; Mun, Sung Hwan*; Jung, Jee-weon; Han, Min Hyun; Kim, Nam Soo
  • MFCCA: Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario; Yu, Fan*; 张, 仕良; Guo, Pengcheng; Liang, Yuhao; Du, Zhihao; Lin, Yuxiao; Xie, Lei
  • WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration; Koizumi, Yuma*; Yatabe, Kohei; Zen, Heiga; Bacchiani, Michiel
  • Modular Hybrid Autoregressive Transducer; Meng, Zhong*; Chen, Tongzhou; Prabhavalkar, Rohit; Zhang, Yu; Wang, Yuan; Audhkhasi, Kartik; Emond, Jesse; Strohman, Trevor; Ramabhadran, Bhuvana; Huang, Ronny; Variani, Ehsan; Huang, Yinghui; Moreno, Pedro
  • SpeechCLIP: Integrating Speech with Pre-trained Vision and Language Model; Shih, Yi-Jen*; Wang, Hsuan-Fu; Chang, Heng-Jui; Berry, Layne; Lee, Hung-yi; Harwath, David
  • A Context-Aware Knowledge Transferring Strategy for CTC-Based ASR; Lu, Ke-Han*; Chen, Kuan-Yu
  • Efficient dynamic filter for robust and low computational feature extraction; Kim, Donghyeon*; Kwak, Jeong-gi; Ko, Hanseok
  • Exploring WavLM on Speech Enhancement; Song, Hyungchan*; Chen, Sanyuan; Chen, Zhuo; Wu, Yu; Yoshioka, Takuya; Tang, Min; Shin, Jong Won; Liu, Shujie
  • Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition; Shen, Peng*; Lu, Xugang; Kawai, Hisashi
  • YFACC: A Yoruba Speech-Image Dataset for Cross-lingual Keyword Localisation through Visual Grounding; Olaleye, Kayode K*; Oneață, Dan; Kamper, Herman
  • On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis; Ando, Atsushi*; Masumura, Ryo; Takashima, Akihiko; Suzuki, Satoshi; Makishima, Naoki; Suzuki, Keita; Moriya, Takafumi; Ashihara, Takanori; Sato, Hiroshi
  • Fast Entropy-Based Methods of Word-Level Confidence Estimation for End-To-End Automatic Speech Recognition; Laptev, Aleksandr*; Ginsburg, Boris
  • BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications; Zuluaga Gomez, Juan Pablo*; Sarfjoo, Seyyed Saeed; Prasad, Amrutha; Nigmatulina, Iuliia; Motlicek, Petr; Ondrej, Karel; Ohneiser, Oliver; Helmke, Hartmut
  • An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition; Moritz, Niko*; Seide, Frank; Le, Duc; Mahadeokar, Jay; Fuegen, Christian
  • How Does Pre-trained Wav2Vec2.0 Perform on Domain-Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications; Zuluaga Gomez, Juan Pablo*; Prasad, Amrutha; Nigmatulina, Iuliia; Sarfjoo, Seyyed Saeed; Motlicek, Petr; Kleinert, Matthias; Helmke, Hartmut; Ohneiser, Oliver; Zhan, Qingran
  • Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous Speech Recognition; Poncelet, Jakob*; Van hamme, Hugo
  • GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models; Baas, Matthew*; Kamper, Herman
  • Conformer-Based On-Device Streaming Speech Recognition with KD Compression and Two-Pass Architecture; Park, Jinhwan*; Jin, Sichen; Park, Junmo; Kim, Sungsoo; Sandhyana, Dhairya; Lee, Changheon; Han, Myoungji; Lee, Jungin; Jung, Seokyeong; Han, Chang Woo; Kim, Chanwoo
  • Improving Noise Robustness for Spoken Content Retrieval using semi-supervised ASR and N-best transcripts for BERT-based ranking models; Moriya, Yasufumi*; Jones, Gareth
  • TEA-PSE 2.0: Sub-Band Network for Real-Time Personalized Speech Enhancement; Ju, Yukai*; Zhang, Shimin; Rao, Wei; Wang, Yannan; Yu, Tao; Xie, Lei; Shang, Shi-dong
  • An Analysis of Semantically-Aligned Speech-Text Embeddings; Huzaifah, Muhammad*; Kukanov, Ivan
  • Learning Invariant Representation and Risk Minimized for Unsupervised Accent Domain Adaptation; Zhao, Chendong*; Wang, Jianzong; Qu, Xiaoyang; Wang, Haoqian; Xiao, Jing
  • LIMUSE: Lightweight Multi-Modal Speaker Extraction; Liu, Qinghua; Huang, Yating; Hao, Yunzhe; Xu, Jiaming*; Xu, Bo
  • Towards visually prompted keyword localisation for zero-resource spoken languages; Nortje, Leanne*; Kamper, Herman
  • An Attention-Based Backend Allowing Efficient Fine-Tuning of Transformer Models for Speaker Verification; Peng, Junyi*; Plchot, Oldrich; Stafylakis, Themos; Mosner, Ladislav; Burget, Lukas; Cernocky, Jan
  • Distilling Sequence-to-Sequence Voice Conversion Models For Streaming Conversion Applications; Tanaka, Kou*; Kameoka, Hirokazu; Kaneko, Takuhiro; Seki, Shogo
  • A Hybrid Acoustic Echo Reduction Approach Using Kalman Filtering and Informed Source Extraction With Improved Training; Mack, Wolfgang*; Habets, Emanuel
  • Learning accent representation with multi-level VAE towards controllable speech synthesis; Melechovsky, Jan*; Mehrish, Ambuj; Herremans, Dorien; Sisman, Berrak
  • Inter-Decoder: Using Attention-Decoder Losses as Intermediate Regularization for CTC-Based Speech Recognition; Komatsu, Tatsuya*; Fujita, Yusuke
  • Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora; Li, Yuanchao*; Mohamied, Yumnah; Bell, Peter; Lai, Catherine
  • Streaming Bilingual End to End ASR Model Using Attention over Multiple Softmax; Joshi, Vikas V*; Agrawal, Purvi; Mehta, Rupesh; Patil, Aditya
  • Weak-Supervised Dysarthria-invariant Features for Spoken Language Understanding using an FHVAE and Adversarial Training; Qi, Jinzi*; Van hamme, Hugo
  • Monotonic segmental attention for automatic speech recognition; Zeyer, Albert*; Schmitt, Robin; Zhou, Wei; Schlüter, Ralf; Ney, Hermann
  • Automatic Rating of Spontaneous Speech for Low-Resource Languages; Getman, Yaroslav*; Al-Ghezi, Ragheb; Voskoboinik, Ekaterina; Singh, Mittul; Kurimo, Mikko
  • SVLDL: Improved Speaker Age Estimation Using Selective Variance Label Distribution Learning; Kang, Zuheng*; Wang, Jianzong; Peng, Junqing; Xiao, Jing
  • On the Efficiency of Integrating Self-supervised Learning and Meta-learning for User-defined Few-shot Keyword Spotting; Wu, Yuan-Kuei*; Kao, Wei-Tsung; Lee, Hung-yi; Chen, Chia-Ping; Chen, Zhi-Sheng; Tsai, Yu-Pao
  • Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection; Cornell, Samuele*; Balestri, Thomas; Senechal, Thibaud
  • Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion; Ma, Ding*; Violeta, Lester Phillip G; Kobayashi, Kazuhiro; Toda, Tomoki
  • Combining Contrastive and Non-Contrastive Losses for Fine-Tuning Pretrained Models in Speech Analysis; Lux, Florian*; Chen, Ching-Yi; Thang, Vu
  • Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech; Lux, Florian*; Koch, Julia; Thang, Vu
  • Accelerator-Aware Training for Transducer-based Speech Recognition; Swaminathan, Rupak Vignesh*; Mumtaj Shakiah, Suhaila; Nguyen, Hieu D; Chinta, Raviteja; Afzal, Tariq; Susanj, Nathan; Mouchtaris, Athanasios; Strimel, Grant; Rastrow, Ariya
  • End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation; Masuyama, Yoshiki*; Chang, Xuankai; Cornell, Samuele; Watanabe, Shinji; Ono, Nobutaka
  • Non-Autoregressive End-to-End Approaches for Joint Automatic Speech Recognition and Spoken Language Understanding; Li, Mohan*; Doddipatla, Rama S
  • Residual Adapters for Targeted Updates in RNN-Transducer Based Speech Recognition System; Han, Sungjun; Baby, Deepak; Mendelev, Valentin*
  • Remap, warp and attend: Non-parallel many-to-many accent conversion with Normalizing Flows; Ezzerg, Abdelhamid*; Merritt, Thomas; Yanagisawa, Kayoko; Bilinski, Piotr; Proszewska, Magdalena; Pokora, Kamil; Korzeniowski, Renard; Barra-Chicote, Roberto; Korzekwa, Daniel
  • Domain Adaptation of low-resource Target-Domain models using well-trained ASR Conformer Models; Sukhadia, Vrunda N*; Umesh, S
  • N-Best Hypotheses Reranking for Text-to-SQL Systems; Zeng, Lu*; Parthasarathi, Sree Hari Krishnan; Hakkani-Tur, Dilek Z
  • VSAMETER: Evaluation of a New Open-Source Tool to Measure Vowel Space Area and Related Metrics; Cao, Tianyu*; Moro-Velazquez, Laureano; Żelasko, Piotr; Villalba, Jesús; Dehak, Najim
  • On Compressing Sequences for Self-Supervised Speech Models; Meng, Yen*; Chen, Hsuan-Jui; Shi, Jiatong; Watanabe, Shinji; Garcia, Paola; Lee, Hung-yi; Tang, Hao
  • G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR; Wang, Yuan*; Cubuk, Ekin D; Rosenberg, Andrew; Cheng, Shuyang; Weiss, Ron J; Ramabhadran, Bhuvana; Moreno, Pedro; Le, Quoc; Park, Daniel S
  • Low-Latency Speech Separation Guided Diarization for Telephone Conversations; Morrone, Giovanni*; Cornell, Samuele; Raj, Desh; Serafini, Luca; Zovato, Enrico; Brutti, Alessio; Squartini, Stefano
  • Joint Optimization of Diffusion Probabilistic-Based Multichannel Speech Enhancement with Far-Field Speaker Verification; Dowerah, Sandipana*; Serizel, Romain; Jouvet, Denis; Mohammadamini, Mohammad; Matrouf, Driss
  • Improved Normalizing Flow-Based Speech Enhancement Using an All-Pole Gammatone Filterbank for Conditional Input Representation; Strauss, Martin*; Torcoli, Matteo; Edler, Bernd
  • Adaptive-FSN: Integrating full-band extraction and adaptive sub-band encoding for monaural speech enhancement; Tsao, Yu-Sheng*; Hsun, Ho Kuan; Hung, Jeih-weih; Chen, Berlin
  • Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations; Stafylakis, Themos*; Mošner, Ladislav; Kakouros, Sofoklis; Plchot, Oldřich; Burget, Lukas; Cernocky, Jan Honza
  • Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR; Chen, Zhehuai*; Bapna, Ankur; Rosenberg, Andrew; Zhang, Yu; Ramabhadran, Bhuvana; Moreno, Pedro; Chen, Nanxin
  • Damage Control during Domain Adaptation for Transducer Based Automatic Speech Recognition; Majumdar, Somshubra*; Acharya, Shantanu; Lavrukhin, Vitaly; Ginsburg, Boris
  • Internal Language Model Personalization of E2E Automatic Speech Recognition Using Random Encoder Features; Stooke, Adam*; Sim, Khe C; Chua, Mason; Munkhdalai, Tsendsuren; Strohman, Trevor
  • On granularity of prosodic representations in expressive text-to-speech; Babiański, Mikołaj*; Pokora, Kamil; Shah, Raahil; Sienkiewicz, Rafał; Korzekwa, Daniel; Klimkov, Viacheslav
  • A Study on the Integration of Pre-Trained SSL, ASR, LM and SLU Models for Spoken Language Understanding; Peng, Yifan*; Arora, Siddhant; Higuchi, Yosuke; Ueda, Yushi; Kumar, Sujay; Ganesan, Karthik; Dalmia, Siddharth; Chang, Xuankai; Watanabe, Shinji
  • Phoneme Segmentation Using Self-Supervised Speech Models; Strgar, Luke*; Harwath, David
  • Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems; Bijwadia, Shaan*; Chang, Shuo-yiin; Sainath, Tara; Li, Bo; Zhang, Chao; He, Yanzhang
  • Textual Data Augmentation for Arabic-English Code-Switching Speech Recognition; Hussain, Amir*; Chowdhury, Shammur; Abdelali, Ahmed; Dehak, Najim; Ali, Ahmed; Khudanpur, Sanjeev
  • Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy; Meyer, Sarina*; Tilli, Pascal; Denisov, Pavel; Lux, Florian; Koch, Julia; Thang, Vu
  • Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition; Hamed, Injy*; Hussain, Amir; Chellah, Oumnia; Chowdhury, Shammur; Mubarak, Hamdy; Sitaram, Sunayana; Habash, Nizar; Ali, Ahmed
  • Four-in-One: A Joint Approach to Inverse Text Normalization, Punctuation, Capitalization, and Disfluency for Automatic Speech Recognition; Tan, Sharman W; Behre, Piyush*; Kibre, Nick; Alphonso, Issac; Chang, Shawn
  • Investigating the Important Temporal Modulations for Deep-Learning-Based Speech Activity Detection; Vuong, Tyler*; Madaan, Nikhil; Panda, Rohan; Stern, Richard M
  • Unsupervised Domain Adaptation of Neural PLDA Using Segment Pairs for Speaker Verification; Ülgen, İsmail Rasim*; Arslan, Mustafa Levent
  • Context-aware Neural Confidence Estimation for Rare Word Speech Recognition; Qiu, David*; Munkhdalai, Tsendsuren; He, Yanzhang; Sim, Khe C
  • NAM+: Towards Scalable End-to-End Contextual Biasing for Adaptive ASR; Wu, Zelin*; Munkhdalai, Tsendsuren; Pundak, Golan; Sim, Khe C; Li, David; Rondon, Pat; Sainath, Tara
  • Investigating Active-Learning-Based Training Data Selection for Speech Spoofing Countermeasure; Wang, Xin*; Yamagishi, Junichi
  • Learning a Dual-Mode Speech Recognition Model via Self-Pruning; Liu, Chunxi*; Shangguan, Yuan; Yang, Haichuan; Shi, Yangyang; Krishnamoorthi, Raghuraman; Kalinli, Ozlem
  • Learning mask scalars for improved robust automatic speech recognition; Narayanan, Arun*; Walker, James; Panchapagesan, Sankaran; Howard, Nathan; Koizumi, Yuma
  • Efficient Text Analysis with Pre-trained Neural Network Models; Cui, Jia*; Lu, Heng; Wang, Wenjie; Kang, Shiyin; He, Liqiang; Li, Guangzhi; Yu, Dong
  • Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems; Sugiyama, Hiroaki*; Mizukami, Masahiro; Arimoto, Tsunehiro; Narimatsu, Hiromi; Chiba, Yuya; Nakajima, Hideharu; Meguro, Toyomi
  • Response Timing Estimation for Spoken Dialog Systems based on Syntactic Completeness Prediction; Sakuma, Jin*; Fujie, Shinya; Kobayashi, Tetsunori
  • E-Branchformer: Branchformer with Enhanced merging for speech recognition; Kim, Kwangyoun*; Wu, Felix; Peng, Yifan; Pan, Jing; Sridhar, Prashant; Han, Kyu Jeong; Watanabe, Shinji
  • A comprehensive study on self-supervised distillation for speaker representation learning; Chen, Zhengyang*; Qian, Yao; Han, Bing; Qian, Yanmin; Zeng, Michael
  • TDOA Estimation of Speech Source in Noisy Reverberant Environments; Bu, Suliang; Zhao, Tuo*; Zhao, Yunxin
  • On the Utility of Self-supervised Models for Prosody-related Tasks; Lin, Guan-Ting*; Feng, Chi Luen; Huang, Wei-Ping; Tseng, Yuan; Li, Chen An; Lin, Tzu-Han; Lee, Hung-yi; Ward, Nigel
  • vTTS: visual-text to speech; Nakano, Yoshifumi; Saeki, Takaaki; Takamichi, Shinnosuke*; Sudoh, Katsuhito; Saruwatari, Hiroshi
  • Speech Emotion Recognition with Complementary Acoustic Representations; Zhang, Xiaoming*; Zhang, Fan; Cui, Xiaodong; Zhang, Wei
  • EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers; Maiti, Soumi*; Ueda, Yushi; Watanabe, Shinji; Zhang, Chunlei; Yu, Meng; Zhang, Shixiong; Xu, Yong
  • A Zero-Shot Approach to Identifying Children’s Speech in Automatic Gender Classification; Saraf, Amruta; Sivaraman, Ganesh*; Khoury, Elie
  • Streaming, Fast and Accurate On-Device Inverse Text Normalization for Automatic Speech Recognition; Gaur, Yashesh*; Kibre, Nick; Xue, Jian; Shu, Kangyuan; Wang, Yuhui; Alphonso, Issac; Li, Jinyu; Gong, Yifan
  • Personalization of CTC Speech Recognition Models; Dingliwal, Saket*; Sunkara, Monica; Bodapati, Sravan Babu; Ronanki, Srikanth; Farris, Jeff; Kirchhoff, Katrin
  • How Do Phonological Properties Affect Bilingual Automatic Speech Recognition?; Jain, Shelly*; Yadavalli, Aditya; Mirishkar, Sai Ganesh; Vuppala, Anil
  • Building Markovian Generative Architectures over Pretrained LM Backbones for Efficient Task-Oriented Dialog Systems; Liu, Hong*; Cai, Yucheng; Ou, Zhijian; Huang, Yi; Feng, Junlan
  • Disentangled Speech Representation Learning for One-Shot Cross-lingual Voice Conversion Using β-VAE; Lu, Hui*; Wang, Disong; Wu, Xixin; Wu, Zhiyong; Liu, Xunying; Meng, Helen
  • Proficiency Assessment of L2 Spoken English Using wav2vec 2.0; Bannò, Stefano*; Matassoni, Marco
  • Improving Luxembourgish Speech Recognition with Cross-Lingual Speech Representations; Nguyen, Le Minh*; Nayak, Shekhar; Coler, Matt
  • Macro-block dropout for improved regularization in training end-to-end speech recognition models; Kim, Chanwoo*; Indurti, Sathish; Park, Jinhwan; Sung, Wonyong
  • Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation; Chevi, Rendi*; Prasojo, Radityo Eko; Aji, Alham Fikri; Tjandra, Andros; Sakti, Sakriani
  • Automatic Prediction of Intelligibility of Words and Phonemes Produced Orally by Japanese Learners of English; Minematsu, Nobuaki*; Zhu, Chuanbo; Kunihara, Takuya; Saito, Daisuke; Nakanishi, Noriko
  • PADA: Pruning Assisted Domain Adaptation for Self-Supervised Speech Representations; Lodagala, Vasista Sai*; Ghosh, Sreyan; Umesh, S
  • CCC-wav2vec 2.0: Clustering Aided Cross Contrastive Self-Supervised Learning of Speech Representations; Lodagala, Vasista Sai*; Ghosh, Sreyan; Umesh, S
  • Effective Mispronunciation Detection and Diagnosis Leveraging Heterogeneous Information Cues; Yan, Bi-Cheng*; Wang, Hsin-Wei; Chen, Berlin
  • SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning; Feng, Tzu-hsun*; Dong, Annie; Yeh, Ching-Feng; Yang, Shu-wen; Lin, Tzu-Quan; Shi, Jiatong; Chang, Kai-Wei; Huang, Zili; Wu, Haibin; Chang, Xuankai; Watanabe, Shinji; Mohamed, Abdel-rahman; Li, Shang-Wen; Lee, Hung-yi
  • AVSE Challenge: Audio-Visual Speech Enhancement Challenge; Aldana, Andrea L*; Valentini, Cassia; Klejch, Ondrej; Gogate, Mandar; Dashtipour, Kia K; Hussain, Amir; Bell, Peter

Demo Papers

  • ISPEAK: Interactive Spoken Language Understanding System for Children with Speech and Language Disorders; Lin, Baihan; Zhang, Xinxin
  • On-Device Streaming Target-Speaker ASR with Neural Transducer; Moriya, Takafumi; Sato, Hiroshi; Ochiai, Tsubasa; Delcroix, Marc; Asami, Taichi
  • Voice-Enabled Audiovisual Agent for Question Answering in English and Arabic; Saz, Oscar; Abdellah, Ahmed; McArthur, Luca; McKenna, Daniel; Shelley, Simon; Zhang, Xinyue
  • LUX-ASR: Building an ASR System for the Luxembourgish Language; Gilles, Peter; Hosseini-Kivanani, Nina; Hillah, Leopold