Flow-ER: a Flow-based Embedding Regularization Strategy for Robust Speech Representation Learning; |
Kang, Woo Hyun*; Alam, Jahangir; Fathan, Abderrahim |
Continual Self-supervised Domain Adaptation for End-to-end Speaker Diarization; |
Coria, Juan Manuel*; Bredin, Hervé; Ghannay, Sahar; Rosset, Sophie |
Fine Grained Spoken Document Summarization Through Text Segmentation; |
Kotey, Samantha*; Dahyot, Rozenn; Harte, Naomi |
Joint speaker diarisation and tracking in switching state-space model; |
Wong, Jeremy H. M.*; Gong, Yifan |
Diarisation using location tracking with agglomerative clustering; |
Wong, Jeremy H. M.*; Abramovski, Igor; Xiao, Xiong; Gong, Yifan |
Exploration of Language-Specific Self-Attention Parameters for Multilingual End-to-End Speech Recognition; |
Houston, Brady*; Kirchhoff, Katrin |
End-to-End Multi-speaker ASR with Independent Vector Analysis; |
Scheibler, Robin*; Zhang, Wangyou; Chang, Xuankai; Watanabe, Shinji; Qian, Yanmin |
An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition; |
Yang, Chao-Han Huck*; Chen, I-Fan; Stolcke, Andreas; Siniscalchi, Sabato M; Lee, Chin-hui |
THE CLEVER HANS EFFECT IN VOICE SPOOFING DETECTION; |
Chettri, Bhusan* |
Distribution-based Emotion Recognition in Conversation; |
Wu, Wen*; Zhang, Chao; Woodland, Phil |
JOIST: A Joint Speech and Text Streaming Model For ASR; |
Sainath, Tara*; Prabhavalkar, Rohit; Bapna, Ankur; Zhang, Yu; Huo, Zhouyuan; Chen, Zhehuai; Li, Bo; Wang, Weiran; Strohman, Trevor |
Mixture of Domain Experts for Language Understanding: An Analysis of Modularity, Task Performance, and Memory Tradeoffs; |
Kleiner, Benjamin*; FitzGerald, Jack; Khan, Haidar; Tur, Gokhan |
DUAL LEARNING FOR LARGE VOCABULARY ON-DEVICE ASR; |
Peyser, Charles C*; Huang, Ronny; Sainath, Tara; Prabhavalkar, Rohit; Picheny, Michael; Cho, Kyunghyun |
Untied Positional Encodings for Efficient Transformer-based Speech Recognition; |
Samarakoon, Lahiru T*; Fung, Ivan |
PHONE-LEVEL PRONUNCIATION SCORING FOR L1 USING WEIGHTED-DYNAMIC TIME WARPING; |
SINI, Aghilas*; Perquin, Antoine; Lolive, Damien; Delhay, Arnaud |
MASC: Massive Arabic Speech Corpus; |
Al-Fetyani, Mohammad*; AlBarham, Mohammad; Abandah, Gheith A.; Alsharkawi, Adham; Dawas, Maha |
Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection; |
Chen, Xuanjun*; Wu, Haibin; Lee, Hung-yi; Meng, Helen; Jang, Roger |
StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models; |
Li, Yinghao A*; Han, Cong; Mesgarani, Nima |
Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech; |
Wagner, Dominik*; Bayerl, Sebastian P; Cordourier, Hector; Bocklet, Tobias |
Improving generalizability of distilled self-supervised speech processing models under distorted settings; |
Huang, Kuan-Po*; Fu, Yu-Kuan; Hsu, Tsu-Yuan; Ritter Gutierrez, Fabian Alejandro; Wang, Fan-Lin; Tseng, Liang-Hsuan; Zhang, Yu; Lee, Hung-yi |
AN ANALYSIS OF THE EFFECTS OF DECODING ALGORITHMS ON FAIRNESS IN OPEN-ENDED LANGUAGE GENERATION; |
Dhamala, Jwala*; Kumar, Varun; Gupta, Rahul; Chang, Kai-Wei; Galstyan, Aram |
SIMD-SIZE AWARE WEIGHT REGULARIZATION FOR FAST NEURAL VOCODING ON CPU; |
Kanagawa, Hiroki*; Ijima, Yusuke |
IMPROVED NOISY ITERATIVE PSEUDO-LABELING FOR SEMI-SUPERVISED SPEECH RECOGNITION; |
Li, Tian*; Meng, Qingliang; Sun, Yujian |
Towards End-to-end Unsupervised Speech Recognition; |
Liu, Alexander H*; Hsu, Wei-Ning; Auli, Michael; Baevski, Alexei |
Inter-KD: Intermediate Knowledge Distillation for CTC-Based Automatic Speech Recognition; |
Yoon, Ji Won; Woo, Beom Jun; Ahn, Sunghwan; Lee, Hyeonseung; Kim, Nam Soo* |
MULTI-STAGE PROGRESSIVE AUDIO BANDWIDTH EXTENSION; |
Wen, Liang*; Wang, Lizhong; Zhang, Ying; Choi, Kwang Pyo |
Spatial-DCCRN: DCCRN Equipped with Frame-level Angle Feature and Hybrid Filtering for Multi-channel Speech Enhancement; |
Lv, Shubo*; Fu, Yihui; Ju, Yukai; Xie, Lei; Zhu, Weixin; Rao, Wei; Wang, Yannan |
ASBERT: ASR-SPECIFIC SELF-SUPERVISED LEARNING WITH SELF-TRAINING; |
Kim, Hyungyong; Kim, Byeong-Yeol*; Yu, Seung Woo; Lim, Youshin; Lim, Yunkyu; Lee, Hanbin |
Code-switched language modelling using a code predictive LSTM in under-resourced South African languages; |
Jansen Van Vuren, Joshua M*; Niesler, Thomas |
Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss; |
Georgiou, Efthymios*; Kritsis, Kosmas; Paraskevopoulos, Georgios; Katsamanis, Athanasios; Katsouros, Vassilis; Potamianos, Alexandros |
HOW TO BOOST ANTI-SPOOFING WITH X-VECTORS; |
Ma, Xinyue*; Zhang, Shanshan; Huang, Shen; Gao, Ji; Hu, Ying; He, Liang |
Speed-Robust Keyword Spotting via Soft Self-Attention on Multi-Scale Features; |
Ding, Chaoyue*; Li, Jiakui; Zong, Martin; Li, Baoxiang |
Can we use Common Voice to train a Multi-Speaker TTS system?; |
Ogun, Sewade O*; Colotte, Vincent; Vincent, Emmanuel |
TRANSFORMER-BASED LIP-READING WITH REGULARIZED DROPOUT AND RELAXED ATTENTION; |
Li, Zhengyang*; Lohrenz, Timo; Dunkelberg, Matthias; Fingscheidt, Tim |
A DATA-DRIVEN INVESTIGATION OF NOISE-ADAPTIVE UTTERANCE GENERATION WITH LINGUISTIC MODIFICATION; |
Chingacham, Anupama*; Demberg, Vera; Klakow, Dietrich |
Flickering reduction with partial hypothesis reranking for streaming ASR; |
Bruguier, Antoine*; Qiu, David; Strohman, Trevor; He, Yanzhang |
Match to Win: Analysing Sequences Lengths for Efficient Self-supervised Learning in Speech and Audio; |
Gao, Yan*; Fernandez-Marques, Javier; Parcollet, Titouan; Gusmao, Pedro; Lane, Nicholas |
Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization; |
Horiguchi, Shota*; Takashima, Yuki; Watanabe, Shinji; Garcia, Paola |
GUIDED CONTRASTIVE SELF-SUPERVISED PRE-TRAINING FOR AUTOMATIC SPEECH RECOGNITION; |
Khare, Aparna*; Wu, Minhua; Bhati, Saurabhchand; Droppo, Jasha; Maas, Roland |
Exploring a unified ASR for multiple south Indian languages leveraging multilingual acoustic and language models; |
C. S., ANOOP*; A G, Ramakrishnan |
HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch; |
Raissi, Tina*; Zhou, Wei; Berger, Simon; Schlüter, Ralf; Ney, Hermann |
Exploring Efficient-tuning Methods in Self-supervised Speech Models; |
Chen, Zih-Ching; Fu, Chin-Lun; Liu, Chih Ying; Li, Shang-Wen; Lee, Hung-yi* |
A MULTI-MODAL ARRAY OF INTERPRETABLE FEATURES TO EVALUATE LANGUAGE AND SPEECH PATTERNS IN DIFFERENT NEUROLOGICAL DISORDERS; |
Favaro, Anna*; Motley, Chelsie; Cao, Tianyu; Iglesias, Miguel; Butala, Ankur; Oh, Esther S.; Stevens, Robert; Villalba, Jesús; Dehak, Najim; Moro-Velazquez, Laureano |
A Truly Multilingual First Pass and Monolingual Second Pass Streaming On-Device ASR System; |
Mavandadi, Sepand*; Li, Bo; Zhang, Chao; Farris, Brian; Sainath, Tara; Strohman, Trevor |
Scaling Up Deliberation for Multilingual ASR; |
Hu, Ke*; Sainath, Tara; Li, Bo |
On the Use of Semantically-Aligned Speech Representation for Spoken Language Understanding; |
Laperrière, Gaëlle; Pelloin, Valentin; Rouvier, Mickael; Stafylakis, Themos; Estève, Yannick* |
MULTILINGUAL SPEECH EMOTION RECOGNITION WITH MULTI-GATING MECHANISM AND NEURAL ARCHITECTURE SEARCH; |
Wang, Zihan*; Meng, Qi; Lan, Haifeng; Zhang, Xinrui; Guo, Kehao; Gupta, Akshat |
Improving Semi-supervised E2E ASR using CycleGAN and Inter-domain Losses; |
Li, Chia-Yu*; Vu, Thang |
STOP: A DATASET FOR SPOKEN TASK ORIENTED SEMANTIC PARSING; |
Tomasello, Paden*; Shrivastava, Akshat; Lazar, Daniel A; Hsu, Po-chun; Le, Duc; Sagar, Adithya; Elkahky, Ali; Copet, Jade; Hsu, Wei-Ning; Adi, Yossi; Algayres, Robin; Nguyen, Tu Anh; Dupoux, Emmanuel; Zettlemoyer, Luke; Mohamed, Abdel-rahman |
Alternate Intermediate Conditioning with Syllable-level and Character-level Targets for Japanese ASR; |
Fujita, Yusuke*; Komatsu, Tatsuya; Kida, Yusuke |
Exploiting information from native data for non-native automatic pronunciation assessment; |
Lin, Binghuai; Wang, Liyuan* |
Fully Unsupervised Training of Few-Shot Keyword Spotting; |
Kim, Minchan*; Lee, Dongjune; Mun, Sung Hwan; Han, Min Hyun; Kim, Nam Soo |
FLEURS: FEW-SHOT LEARNING EVALUATION OF UNIVERSAL REPRESENTATIONS OF SPEECH; |
Conneau, Alexis; Ma, Min*; Khanuja, Simran; Zhang, Yu; Axelrod, Vera; Dalmia, Siddharth; Riesa, Jason; Rivera, Clara; Bapna, Ankur |
SUB-8-BIT QUANTIZATION FOR ON-DEVICE SPEECH RECOGNITION: A REGULARIZATION-FREE APPROACH; |
Zhen, Kai*; Radfar, Martin; Nguyen, Hieu D; Strimel, Grant; Mouchtaris, Athanasios; Susanj, Nathan |
FREQUENCY AND MULTI-SCALE SELECTIVE KERNEL ATTENTION FOR SPEAKER VERIFICATION; |
Mun, Sung Hwan*; Jung, Jee-weon; Han, Min Hyun; Kim, Nam Soo |
MFCCA: Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario; |
Yu, Fan*; 张, 仕良; Guo, Pengcheng; Liang, Yuhao; Du, Zhihao; Lin, Yuxiao; Xie, Lei |
WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration; |
Koizumi, Yuma*; Yatabe, Kohei; Zen, Heiga; Bacchiani, Michiel |
Modular Hybrid Autoregressive Transducer; |
Meng, Zhong*; Chen, Tongzhou; Prabhavalkar, Rohit; Zhang, Yu; Wang, Yuan; Audhkhasi, Kartik; Emond, Jesse; Strohman, Trevor; Ramabhadran, Bhuvana; Huang, Ronny; Variani, Ehsan; Huang, Yinghui; Moreno, Pedro |
SpeechCLIP: Integrating Speech with Pre-trained Vision and Language Model; |
Shih, Yi-Jen*; Wang, Hsuan-Fu; Chang, Heng-Jui; Berry, Layne; Lee, Hung-yi; Harwath, David |
A CONTEXT-AWARE KNOWLEDGE TRANSFERRING STRATEGY FOR CTC-BASED ASR; |
Lu, Ke-Han*; Chen, Kuan-Yu |
Efficient dynamic filter for robust and low computational feature extraction; |
Kim, Donghyeon*; Kwak, Jeong-gi; Ko, Hanseok |
Exploring WavLM on Speech Enhancement; |
Song, Hyungchan*; Chen, Sanyuan; Chen, Zhuo; Wu, Yu; Yoshioka, Takuya; Tang, Min; Shin, Jong Won; Liu, Shujie |
Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition; |
Shen, Peng*; Lu, Xugang; Kawai, Hisashi |
YFACC: A Yoruba Speech-Image Dataset for Cross-lingual Keyword Localisation through Visual Grounding; |
Olaleye, Kayode K*; Oneață, Dan; Kamper, Herman |
ON THE USE OF MODALITY-SPECIFIC LARGE-SCALE PRE-TRAINED ENCODERS FOR MULTIMODAL SENTIMENT ANALYSIS; |
Ando, Atsushi*; Masumura, Ryo; Takashima, Akihiko; Suzuki, Satoshi; Makishima, Naoki; Suzuki, Keita; Moriya, Takafumi; Ashihara, Takanori; Sato, Hiroshi |
Fast Entropy-Based Methods of Word-Level Confidence Estimation for End-To-End Automatic Speech Recognition; |
Laptev, Aleksandr*; Ginsburg, Boris |
BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications; |
Zuluaga Gomez, Juan Pablo*; Sarfjoo, Seyyed Saeed; Prasad, Amrutha; Nigmatulina, Iuliia; Motlicek, Petr; Ondrej, Karel; Ohneiser, Oliver; Helmke, Hartmut |
An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition; |
Moritz, Niko*; Seide, Frank; Le, Duc; Mahadeokar, Jay; Fuegen, Christian |
How Does Pre-trained Wav2Vec2.0 Perform on Domain-Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications; |
Zuluaga Gomez, Juan Pablo*; Prasad, Amrutha; Nigmatulina, Iuliia; Sarfjoo, Seyyed Saeed; Motlicek, Petr; Kleinert, Matthias; Helmke, Hartmut; Ohneiser, Oliver; Zhan, Qingran |
Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous Speech Recognition; |
Poncelet, Jakob*; Van hamme, Hugo |
GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models; |
Baas, Matthew*; Kamper, Herman |
CONFORMER-BASED ON-DEVICE STREAMING SPEECH RECOGNITION WITH KD COMPRESSION AND TWO-PASS ARCHITECTURE; |
Park, Jinhwan*; Jin, Sichen; Park, Junmo; Kim, Sungsoo; Sandhyana, Dhairya; Lee, Changheon; Han, Myoungji; Lee, Jungin; Jung, Seokyeong; Han, Chang Woo; Kim, Chanwoo |
Improving Noise Robustness for Spoken Content Retrieval using semi-supervised ASR and N-best transcripts for BERT-based ranking models; |
Moriya, Yasufumi*; Jones, Gareth |
TEA-PSE 2.0: SUB-BAND NETWORK FOR REAL-TIME PERSONALIZED SPEECH ENHANCEMENT; |
Ju, Yukai*; Zhang, Shimin; Rao, Wei; Wang, Yannan; Yu, Tao; Xie, Lei; Shang, Shi-dong |
An Analysis of Semantically-Aligned Speech-Text Embeddings; |
Huzaifah, Muhammad*; Kukanov, Ivan |
Learning Invariant Representation and Risk Minimized for Unsupervised Accent Domain Adaptation; |
Zhao, Chendong*; Wang, Jianzong; Qu, Xiaoyang; Wang, Haoqian; Xiao, Jing |
LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION; |
Liu, Qinghua; Huang, Yating; Hao, Yunzhe; Xu, Jiaming*; Xu, Bo |
Towards visually prompted keyword localisation for zero-resource spoken languages; |
Nortje, Leanne*; Kamper, Herman |
AN ATTENTION-BASED BACKEND ALLOWING EFFICIENT FINE-TUNING OF TRANSFORMER MODELS FOR SPEAKER VERIFICATION; |
Peng, Junyi*; Plchot, Oldrich; Stafylakis, Themos; Mosner, Ladislav; Burget, Lukas; Cernocky, Jan |
Distilling Sequence-to-Sequence Voice Conversion Models For Streaming Conversion Applications; |
Tanaka, Kou*; Kameoka, Hirokazu; Kaneko, Takuhiro; Seki, Shogo |
A Hybrid Acoustic Echo Reduction Approach Using Kalman Filtering and Informed Source Extraction With Improved Training; |
Mack, Wolfgang*; Habets, Emanuel |
Learning accent representation with multi-level VAE towards controllable speech synthesis; |
Melechovsky, Jan*; Mehrish, Ambuj; Herremans, Dorien; Sisman, Berrak |
INTER-DECODER: USING ATTENTION-DECODER LOSSES AS INTERMEDIATE REGULARIZATION FOR CTC-BASED SPEECH RECOGNITION; |
Komatsu, Tatsuya*; Fujita, Yusuke |
Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora; |
Li, Yuanchao*; Mohamied, Yumnah; Bell, Peter; Lai, Catherine |
STREAMING BILINGUAL END TO END ASR MODEL USING ATTENTION OVER MULTIPLE SOFTMAX; |
Joshi, Vikas V*; Agrawal, Purvi; Mehta, Rupesh; Patil, Aditya |
Weak-Supervised Dysarthria-invariant Features for Spoken Language Understanding using an FHVAE and Adversarial Training; |
Qi, Jinzi*; Van hamme, Hugo |
Monotonic segmental attention for automatic speech recognition; |
Zeyer, Albert*; Schmitt, Robin; Zhou, Wei; Schlüter, Ralf; Ney, Hermann |
Automatic Rating of Spontaneous Speech for Low-Resource Languages; |
Getman, Yaroslav*; Al-Ghezi, Ragheb; Voskoboinik, Ekaterina; Singh, Mittul; Kurimo, Mikko |
SVLDL: Improved Speaker Age Estimation Using Selective Variance Label Distribution Learning; |
Kang, Zuheng*; Wang, Jianzong; Peng, Junqing; Xiao, Jing |
On the Efficiency of Integrating Self-supervised Learning and Meta-learning for User-defined Few-shot Keyword Spotting; |
Wu, Yuan-Kuei*; Kao, Wei-Tsung; Lee, Hung-yi; Chen, Chia-Ping; Chen, Zhi-Sheng; Tsai, Yu-Pao |
Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection; |
Cornell, Samuele*; Balestri, Thomas; Senechal, Thibaud |
Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion; |
Ma, Ding*; Violeta, Lester Phillip G; Kobayashi, Kazuhiro; Toda, Tomoki |
Combining Contrastive and Non-Contrastive Losses for Fine-Tuning Pretrained Models in Speech Analysis; |
Lux, Florian*; Chen, Ching-Yi; Vu, Thang |
Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech; |
Lux, Florian*; Koch, Julia; Vu, Thang |
Accelerator-Aware Training for Transducer-based Speech Recognition; |
Swaminathan, Rupak Vignesh*; Mumtaj Shakiah, Suhaila; Nguyen, Hieu D; Chinta, Raviteja; Afzal, Tariq; Susanj, Nathan; Mouchtaris, Athanasios; Strimel, Grant; Rastrow, Ariya |
End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation; |
Masuyama, Yoshiki*; Chang, Xuankai; Cornell, Samuele; Watanabe, Shinji; Ono, Nobutaka |
NON-AUTOREGRESSIVE END-TO-END APPROACHES FOR JOINT AUTOMATIC SPEECH RECOGNITION AND SPOKEN LANGUAGE UNDERSTANDING; |
Li, Mohan*; Doddipatla, Rama S |
Residual Adapters for Targeted Updates in RNN-Transducer Based Speech Recognition System; |
Han, Sungjun; Baby, Deepak; Mendelev, Valentin* |
Remap, warp and attend: Non-parallel many-to-many accent conversion with Normalizing Flows; |
Ezzerg, Abdelhamid*; Merritt, Thomas; Yanagisawa, Kayoko; Bilinski, Piotr; Proszewska, Magdalena; Pokora, Kamil; Korzeniowski, Renard; Barra-Chicote, Roberto; Korzekwa, Daniel |
Domain Adaptation of low-resource Target-Domain models using well-trained ASR Conformer Models; |
Sukhadia, Vrunda N*; Umesh, S |
N-BEST HYPOTHESES RERANKING FOR TEXT-TO-SQL SYSTEMS; |
Zeng, Lu*; Parthasarathi, Sree Hari Krishnan; Hakkani-Tur, Dilek Z |
VSAMETER: EVALUATION OF A NEW OPEN-SOURCE TOOL TO MEASURE VOWEL SPACE AREA AND RELATED METRICS; |
Cao, Tianyu*; Moro-Velazquez, Laureano; Żelasko, Piotr; Villalba, Jesús; Dehak, Najim |
On Compressing Sequences for Self-Supervised Speech Models; |
Meng, Yen*; Chen, Hsuan-Jui; Shi, Jiatong; Watanabe, Shinji; Garcia, Paola; Lee, Hung-yi; Tang, Hao |
G-AUGMENT: SEARCHING FOR THE META-STRUCTURE OF DATA AUGMENTATION POLICIES FOR ASR; |
Wang, Yuan*; Cubuk, Ekin D; Rosenberg, Andrew; Cheng, Shuyang; Weiss, Ron J; Ramabhadran, Bhuvana; Moreno, Pedro; Le, Quoc; Park, Daniel S |
Low-Latency Speech Separation Guided Diarization for Telephone Conversations; |
Morrone, Giovanni*; Cornell, Samuele; Raj, Desh; Serafini, Luca; Zovato, Enrico; Brutti, Alessio; Squartini, Stefano |
JOINT OPTIMIZATION OF DIFFUSION PROBABILISTIC-BASED MULTICHANNEL SPEECH ENHANCEMENT WITH FAR-FIELD SPEAKER VERIFICATION; |
Dowerah, Sandipana*; Serizel, Romain; Jouvet, Denis; Mohammadamini, Mohammad; Matrouf, Driss |
IMPROVED NORMALIZING FLOW-BASED SPEECH ENHANCEMENT USING AN ALL-POLE GAMMATONE FILTERBANK FOR CONDITIONAL INPUT REPRESENTATION; |
Strauss, Martin*; Torcoli, Matteo; Edler, Bernd |
Adaptive-FSN: Integrating full-band extraction and adaptive sub-band encoding for monaural speech enhancement; |
Tsao, Yu-Sheng*; Hsun, Ho Kuan; Hung, Jeih-weih; Chen, Berlin |
Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations; |
Stafylakis, Themos*; Mošner, Ladislav; Kakouros, Sofoklis; Plchot, Oldřich; Burget, Lukas; Cernocky, Jan Honza |
Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR; |
Chen, Zhehuai*; Bapna, Ankur; Rosenberg, Andrew; Zhang, Yu; Ramabhadran, Bhuvana; Moreno, Pedro; Chen, Nanxin |
Damage Control during Domain Adaptation for Transducer Based Automatic Speech Recognition; |
Majumdar, Somshubra*; Acharya, Shantanu; Lavrukhin, Vitaly; Ginsburg, Boris |
Internal Language Model Personalization of E2E Automatic Speech Recognition Using Random Encoder Features; |
Stooke, Adam*; Sim, Khe C; Chua, Mason; Munkhdalai, Tsendsuren; Strohman, Trevor |
On granularity of prosodic representations in expressive text-to-speech; |
Babiański, Mikołaj*; Pokora, Kamil; Shah, Raahil; Sienkiewicz, Rafał; Korzekwa, Daniel; Klimkov, Viacheslav |
A STUDY ON THE INTEGRATION OF PRE-TRAINED SSL, ASR, LM AND SLU MODELS FOR SPOKEN LANGUAGE UNDERSTANDING; |
Peng, Yifan*; Arora, Siddhant; Higuchi, Yosuke; Ueda, Yushi; Kumar, Sujay; Ganesan, Karthik; Dalmia, Siddharth; Chang, Xuankai; Watanabe, Shinji |
Phoneme Segmentation Using Self-Supervised Speech Models; |
Strgar, Luke*; Harwath, David |
UNIFIED END-TO-END SPEECH RECOGNITION AND ENDPOINTING FOR FAST AND EFFICIENT SPEECH SYSTEMS; |
Bijwadia, Shaan*; Chang, Shuo-yiin; Sainath, Tara; Li, Bo; Zhang, Chao; He, Yanzhang |
Textual Data Augmentation for Arabic-English Code-Switching Speech Recognition; |
Hussain, Amir*; Chowdhury, Shammur; Abdelali, Ahmed; Dehak, Najim; Ali, Ahmed; Khudanpur, Sanjeev |
Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy; |
Meyer, Sarina*; Tilli, Pascal; Denisov, Pavel; Lux, Florian; Koch, Julia; Vu, Thang |
Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition; |
Hamed, Injy*; Hussain, Amir; Chellah, Oumnia; Chowdhury, Shammur; Mubarak, Hamdy; Sitaram, Sunayana; Habash, Nizar; Ali, Ahmed |
Four-in-One: A Joint Approach to Inverse Text Normalization, Punctuation, Capitalization, and Disfluency for Automatic Speech Recognition; |
Tan, Sharman W; Behre, Piyush*; Kibre, Nick; Alphonso, Issac; Chang, Shawn |
INVESTIGATING THE IMPORTANT TEMPORAL MODULATIONS FOR DEEP-LEARNING-BASED SPEECH ACTIVITY DETECTION; |
Vuong, Tyler*; Madaan, Nikhil; Panda, Rohan; Stern, Richard M |
UNSUPERVISED DOMAIN ADAPTATION OF NEURAL PLDA USING SEGMENT PAIRS FOR SPEAKER VERIFICATION; |
Ülgen, İsmail Rasim*; Arslan, Mustafa Levent |
Context-aware Neural Confidence Estimation for Rare Word Speech Recognition; |
Qiu, David*; Munkhdalai, Tsendsuren; He, Yanzhang; Sim, Khe C |
NAM+: TOWARDS SCALABLE END-TO-END CONTEXTUAL BIASING FOR ADAPTIVE ASR; |
Wu, Zelin*; Munkhdalai, Tsendsuren; Pundak, Golan; Sim, Khe C; Li, David; Rondon, Pat; Sainath, Tara |
INVESTIGATING ACTIVE-LEARNING-BASED TRAINING DATA SELECTION FOR SPEECH SPOOFING COUNTERMEASURE; |
Wang, Xin*; Yamagishi, Junichi |
Learning a Dual-Mode Speech Recognition Model via Self-Pruning; |
Liu, Chunxi*; Shangguan, Yuan; Yang, Haichuan; Shi, Yangyang; Krishnamoorthi, Raghuraman; Kalinli, Ozlem |
Learning mask scalars for improved robust automatic speech recognition; |
Narayanan, Arun*; Walker, James; Panchapagesan, Sankaran; Howard, Nathan; Koizumi, Yuma |
Efficient Text Analysis with Pre-trained Neural Network Models; |
Cui, Jia*; Lu, Heng; Wang, Wenjie; Kang, Shiyin; He, Liqiang; Li, Guangzhi; Yu, Dong |
Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems; |
Sugiyama, Hiroaki*; Mizukami, Masahiro; Arimoto, Tsunehiro; Narimatsu, Hiromi; Chiba, Yuya; Nakajima, Hideharu; Meguro, Toyomi |
Response Timing Estimation for Spoken Dialog Systems based on Syntactic Completeness Prediction; |
Sakuma, Jin*; Fujie, Shinya; Kobayashi, Tetsunori |
E-Branchformer: Branchformer with Enhanced merging for speech recognition; |
Kim, Kwangyoun*; Wu, Felix; Peng, Yifan; Pan, Jing; Sridhar, Prashant; Han, Kyu Jeong; Watanabe, Shinji |
A comprehensive study on self-supervised distillation for speaker representation learning; |
Chen, Zhengyang*; Qian, Yao; Han, Bing; Qian, Yanmin; Zeng, Michael |
TDOA ESTIMATION OF SPEECH SOURCE IN NOISY REVERBERANT ENVIRONMENTS; |
Bu, Suliang; Zhao, Tuo*; Zhao, Yunxin |
On the Utility of Self-supervised Models for Prosody-related Tasks; |
Lin, Guan-Ting*; Feng, Chi Luen; Huang, Wei-Ping; Tseng, Yuan; Li, Chen An; Lin, Tzu-Han; Lee, Hung-yi; Ward, Nigel |
vTTS: visual-text to speech; |
Nakano, Yoshifumi; Saeki, Takaaki; Takamichi, Shinnosuke*; Sudoh, Katsuhito; Saruwatari, Hiroshi |
SPEECH EMOTION RECOGNITION WITH COMPLEMENTARY ACOUSTIC REPRESENTATIONS; |
Zhang, Xiaoming*; Zhang, Fan; Cui, Xiaodong; Zhang, Wei |
EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers; |
Maiti, Soumi*; Ueda, Yushi; Watanabe, Shinji; Zhang, Chunlei; Yu, Meng; Zhang, Shixiong; Xu, Yong |
A ZERO-SHOT APPROACH TO IDENTIFYING CHILDREN’S SPEECH IN AUTOMATIC GENDER CLASSIFICATION; |
Saraf, Amruta; Sivaraman, Ganesh*; Khoury, Elie |
STREAMING, FAST AND ACCURATE ON-DEVICE INVERSE TEXT NORMALIZATION FOR AUTOMATIC SPEECH RECOGNITION; |
Gaur, Yashesh*; Kibre, Nick; Xue, Jian; Shu, Kangyuan; Wang, Yuhui; Alphonso, Issac; Li, Jinyu; Gong, Yifan |
Personalization of CTC Speech Recognition Models; |
Dingliwal, Saket*; Sunkara, Monica; Bodapati, Sravan Babu; Ronanki, Srikanth; Farris, Jeff; Kirchhoff, Katrin |
How Do Phonological Properties Affect Bilingual Automatic Speech Recognition?; |
Jain, Shelly*; Yadavalli, Aditya; Mirishkar, Sai Ganesh; Vuppala, Anil |
Building Markovian Generative Architectures over Pretrained LM Backbones for Efficient Task-Oriented Dialog Systems; |
Liu, Hong*; Cai, Yucheng; Ou, Zhijian; Huang, Yi; Feng, Junlan |
Disentangled Speech Representation Learning for One-Shot Cross-lingual Voice Conversion Using $\beta$-VAE; |
Lu, Hui*; Wang, Disong; Wu, Xixin; Wu, Zhiyong; Liu, Xunying; Meng, Helen |
PROFICIENCY ASSESSMENT OF L2 SPOKEN ENGLISH USING WAV2VEC 2.0; |
Bannò, Stefano*; Matassoni, Marco |
IMPROVING LUXEMBOURGISH SPEECH RECOGNITION WITH CROSS-LINGUAL SPEECH REPRESENTATIONS; |
Nguyen, Le Minh*; Nayak, Shekhar; Coler, Matt |
Macro-block dropout for improved regularization in training end-to-end speech recognition models; |
Kim, Chanwoo*; Indurti, Sathish; Park, Jinhwan; Sung, Wonyong |
Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation; |
Chevi, Rendi*; Prasojo, Radityo Eko; Aji, Alham Fikri; Tjandra, Andros; Sakti, Sakriani |
AUTOMATIC PREDICTION OF INTELLIGIBILITY OF WORDS AND PHONEMES PRODUCED ORALLY BY JAPANESE LEARNERS OF ENGLISH; |
Minematsu, Nobuaki*; Zhu, Chuanbo; Kunihara, Takuya; Saito, Daisuke; Nakanishi, Noriko |
PADA: PRUNING ASSISTED DOMAIN ADAPTATION FOR SELF-SUPERVISED SPEECH REPRESENTATIONS; |
Lodagala, Vasista Sai*; Ghosh, Sreyan; Umesh, S |
CCC-WAV2VEC 2.0: CLUSTERING AIDED CROSS CONTRASTIVE SELF-SUPERVISED LEARNING OF SPEECH REPRESENTATIONS; |
Lodagala, Vasista Sai*; Ghosh, Sreyan; Umesh, S |
Effective Mispronunciation Detection and Diagnosis Leveraging Heterogeneous Information Cues; |
Yan, Bi-Cheng*; Wang, Hsin-Wei; Chen, Berlin |
SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning; |
Feng, Tzu-hsun*; Dong, Annie; Yeh, Ching-Feng; Yang, Shu-wen; Lin, Tzu-Quan; Shi, Jiatong; Chang, Kai-Wei; Huang, Zili; Wu, Haibin; Chang, Xuankai; Watanabe, Shinji; Mohamed, Abdel-rahman; Li, Shang-Wen; Lee, Hung-yi |
AVSE CHALLENGE: AUDIO-VISUAL SPEECH ENHANCEMENT CHALLENGE; |
Aldana, Andrea L*; Valentini, Cassia; Klejch, Ondrej; Gogate, Mandar; Dashtipour, Kia K; Hussain, Amir; Bell, Peter |