Nvidia has announced the release of an open dataset and models for multilingual speech AI. The dataset includes over 15,000 hours of audio data in 30 languages, while the models are designed to support a range of applications, including voice recognition and language translation. The announcement is significant in the field of artificial intelligence, as it enables the development of more accurate and robust speech recognition systems.
The Canary-1b-v2 model is a 1-billion-parameter model built for high-quality speech transcription and translation across 25 European languages. It handles both automatic speech recognition (ASR) and speech translation (AST), supporting transcription in all 25 languages and translation between English and the other 24 languages in both directions. The model is available for commercial and non-commercial use under the CC-BY-4.0 license [1].
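For illustration, here is a minimal sketch of how the model might be loaded and run with the NeMo toolkit. It assumes the EncDecMultiTaskModel API and the source_lang/target_lang transcribe arguments documented for the earlier canary-1b release carry over to v2; the audio file names are placeholders.

```python
# Minimal sketch, assuming the canary-1b NeMo API also applies to v2.
from nemo.collections.asr.models import EncDecMultiTaskModel

# Download the checkpoint from Hugging Face and instantiate the model.
model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2")

# ASR: transcribe English speech. Inputs are paths to 16 kHz mono WAVs.
transcripts = model.transcribe(audio=["sample_en.wav"], batch_size=1)
print(transcripts[0])

# AST: translate English speech into German text. The source_lang and
# target_lang arguments follow the canary-1b model card (assumption).
translations = model.transcribe(
    audio=["sample_en.wav"], source_lang="en", target_lang="de"
)
print(translations[0])
```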
Key features of the Canary-1b-v2 model include support for 25 European languages, state-of-the-art performance among models of similar size, quality comparable to models three times larger while running up to ten times faster, automatic punctuation and capitalization, and accurate word- and segment-level timestamps, including segment-level timestamps for translated outputs. The model is the first from the NeMo team to leverage Nvidia's Granary dataset, showcasing its multitask and multilingual capabilities [2].
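The timestamp feature can be exercised with the timestamps=True flag that NeMo documents for its other ASR models (such as Parakeet); whether canary-1b-v2 exposes exactly this interface is an assumption, so the sketch below should be checked against the model card.

```python
# Hedged sketch: word- and segment-level timestamps via the
# timestamps=True flag documented for other NeMo ASR models.
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2")
hyps = model.transcribe(audio=["sample_en.wav"], timestamps=True)

# Each hypothesis carries a timestamp dict keyed by granularity.
for word in hyps[0].timestamp["word"]:        # one entry per word
    print(word["word"], word["start"], word["end"])

for seg in hyps[0].timestamp["segment"]:      # one entry per segment
    print(seg["segment"], seg["start"], seg["end"])
```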
The Canary-1b-v2 model is an encoder-decoder architecture featuring a FastConformer encoder and a Transformer decoder. It uses a unified SentencePiece tokenizer with a vocabulary of 16,384 tokens, optimized across all 25 supported languages. The model was trained with the NeMo toolkit on a massive multilingual speech recognition and translation dataset combining Nvidia's newly published Granary dataset and the in-house NeMo ASR Set 3.0. The training data includes human-labeled transcriptions from corpora such as Multilingual LibriSpeech, Mozilla Common Voice, and Fleurs [3].
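To make the unified-tokenizer idea concrete, here is an illustrative SentencePiece training call that produces a single 16,384-token vocabulary from a pooled multilingual corpus. This is not Nvidia's actual recipe: the corpus path and the BPE model type are assumptions.

```python
import sentencepiece as spm

# Illustrative only: train one shared tokenizer over text pooled from
# all languages, mirroring the 16,384-token unified vocabulary described
# for canary-1b-v2. The corpus file and BPE choice are assumptions.
spm.SentencePieceTrainer.train(
    input="pooled_25_language_corpus.txt",   # one sentence per line
    model_prefix="unified_tokenizer",
    vocab_size=16384,
    character_coverage=1.0,                  # keep accented letters
    model_type="bpe",                        # assumption
)

tok = spm.SentencePieceProcessor(model_file="unified_tokenizer.model")
print(tok.encode("Guten Morgen, wie geht es dir?", out_type=str))
```

A single shared vocabulary lets one decoder serve all 25 languages without per-language output heads, which is what makes the multitask setup practical.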
Evaluation results show that the Canary-1b-v2 model achieves strong performance across multiple tasks. For automatic speech recognition, the model reaches a Word Error Rate (WER) of 7.15% on the AMI dataset and 10.82% on the LibriSpeech Clean test set. For speech translation, it achieves a COMET score of 79.30 for X → English and 84.56 for English → X. The model also demonstrates robustness to noise and resistance to hallucination: a WER of 2.18% at a signal-to-noise ratio (SNR) of 100 dB, and a hallucination rate of 134.7 characters per minute when decoding non-speech audio from the MUSAN dataset [4].
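As background on the headline metric, WER counts word-level substitutions, insertions, and deletions against a reference transcript. A small worked example using the jiwer library (not Nvidia's evaluation harness):

```python
# WER = (substitutions + insertions + deletions) / reference word count.
from jiwer import wer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Two substitutions (jumps->jumped, the->a) over nine reference words.
print(f"WER: {wer(reference, hypothesis):.2%}")  # WER: 22.22%
```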
The release of the Canary-1b-v2 model and the Granary dataset is significant for the AI industry, as it provides developers, researchers, and academics with powerful tools for building applications that require speech-to-text capabilities. The model is designed to run on Nvidia GPU-accelerated systems, leveraging Nvidia's hardware and software frameworks to achieve faster training and inference than CPU-only solutions [5].
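A short sketch of what GPU-accelerated inference might look like: NeMo models are standard PyTorch modules, so device placement follows the usual pattern (the loading API is the same assumption as in the earlier snippets).

```python
import torch
from nemo.collections.asr.models import EncDecMultiTaskModel

# Place the model on a GPU when one is available; fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2")
model = model.to(device).eval()

# Larger batches amortize per-call overhead on the GPU.
with torch.inference_mode():
    hyps = model.transcribe(audio=["a.wav", "b.wav"], batch_size=16)
```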
References:
[1] https://huggingface.co/nvidia/canary-1b-v2
[2] https://github.com/NVIDIA/NeMo
[3] https://github.com/NVIDIA/NeMo/blob/main/docs/asr.md
[4] https://github.com/NVIDIA/NeMo/blob/main/docs/evaluation.md
[5] https://github.com/NVIDIA/NeMo/blob/main/docs/hardware.md