Nvidia Unveils Open Dataset and Multilingual Speech AI Models
By Ainvest
Friday, August 15, 2025, 3:13 am ET · 2 min read
Nvidia has announced the release of an open dataset and open models for multilingual speech AI. The dataset includes over 15,000 hours of audio spanning 25 European languages, while the models are designed to support a range of applications, including speech recognition and speech translation. The release is significant for the field of artificial intelligence because open, large-scale speech data enables the development of more accurate and robust speech recognition systems.
The Canary-1b-v2 model is a 1-billion-parameter model built for high-quality speech transcription and translation across 25 European languages. It handles both automatic speech recognition (ASR) and automatic speech translation (AST), supporting transcription in all 25 languages and translation between English and the other 24 in both directions. The model is available for commercial and non-commercial use under the CC-BY-4.0 license [1].
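For developers who want to try the model, the sketch below shows how earlier Canary checkpoints are typically loaded and invoked through the NeMo toolkit. The EncDecMultiTaskModel class and the source_lang/target_lang keyword arguments follow the published canary-1b workflow and are assumptions here; the exact canary-1b-v2 interface may differ.

# Minimal sketch of ASR and AST with a Canary checkpoint via NeMo.
# Assumes `pip install "nemo_toolkit[asr]"`; the class name and the
# source_lang/target_lang keywords mirror the earlier canary-1b workflow
# and may differ for canary-1b-v2.
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2")

# English transcription (ASR): source and target languages match.
transcript = model.transcribe(
    ["meeting.wav"],        # hypothetical 16 kHz mono WAV file
    source_lang="en",
    target_lang="en",
)

# English -> German speech translation (AST): languages differ.
translation = model.transcribe(
    ["meeting.wav"],
    source_lang="en",
    target_lang="de",
)

# Returned items may be strings or Hypothesis objects depending on NeMo version.
print(transcript[0], translation[0])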
Key features of the Canary-1b-v2 model include support for 25 European languages, state-of-the-art accuracy among models of similar size, quality comparable to models three times larger while running up to ten times faster, automatic punctuation and capitalization, accurate word-level and segment-level timestamps, and segment-level timestamps for translated outputs. It is the first model from the NeMo team trained on Nvidia's Granary dataset, and it showcases that dataset's multitask and multilingual coverage [2].
The Canary-1b-v2 model uses an encoder-decoder architecture featuring a FastConformer encoder and a Transformer decoder. It relies on a unified SentencePiece tokenizer with a vocabulary of 16,384 tokens shared across all 25 supported languages. The model was trained with the NeMo toolkit on a large multilingual speech recognition and translation dataset combining Nvidia's newly published Granary corpus with the in-house NeMo ASR Set 3.0. The training data includes human-labeled transcriptions from corpora such as Multilingual LibriSpeech, Mozilla Common Voice, Fleurs, and LibriSpeech [3].
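A shared subword vocabulary is what lets a single decoder serve all 25 languages. The sketch below uses the open sentencepiece library to train and apply a unified tokenizer of the same size; the input file, model prefix, and choice of BPE are illustrative assumptions and do not reproduce Nvidia's actual tokenizer.

# Sketch: training and using a unified multilingual SentencePiece tokenizer,
# analogous in spirit to Canary-1b-v2's shared 16,384-token vocabulary.
# `multilingual_text.txt` is a hypothetical file of text pooled from all
# 25 languages; this does NOT reproduce Nvidia's actual tokenizer.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="multilingual_text.txt",   # one sentence per line, all languages mixed
    model_prefix="unified_tokenizer",
    vocab_size=16384,                # matches the vocabulary size cited above
    model_type="bpe",                # assumption; the actual subword algorithm is not stated
    character_coverage=1.0,          # keep accented European characters intact
)

sp = spm.SentencePieceProcessor(model_file="unified_tokenizer.model")
print(sp.encode("Guten Morgen, wie geht es dir?", out_type=str))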
Evaluation results show that the Canary-1b-v2 model performs strongly across tasks. For automatic speech recognition (ASR), it achieves a Word Error Rate (WER) of 7.15% on the AMI dataset and 10.82% on the LibriSpeech Clean test set. For speech translation (AST), it reaches a COMET score of 79.30 for X → English and 84.56 for English → X. The model is also robust to noise and hallucination, posting a WER of 2.18% at a signal-to-noise ratio (SNR) of 100 dB and emitting only 134.7 characters per minute on MUSAN non-speech audio, a common proxy for hallucination [4].
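For context on the headline numbers, Word Error Rate is the fraction of reference words a system gets wrong, counting substitutions, deletions, and insertions. A minimal reference implementation is sketched below; Nvidia's reported figures come from its own evaluation harness, not from this snippet.

# Sketch: computing Word Error Rate (WER), the metric quoted above.
# WER = (substitutions + deletions + insertions) / reference word count,
# computed here with standard dynamic-programming edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 0.1667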
The release of the Canary-1b-v2 model and dataset is significant for the AI industry, as it provides developers, researchers, and academics with powerful tools for building applications that require speech-to-text capabilities. The model is designed to run on Nvidia GPU-accelerated systems, leveraging Nvidia's hardware and software frameworks to achieve faster training and inference times compared to CPU-only solutions [5].
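Because NeMo models are standard PyTorch modules, the usual device-placement and precision controls apply. The sketch below moves the model onto a GPU and runs inference under bfloat16 autocast; the checkpoint ID and the precision choice are illustrative assumptions rather than Nvidia's recommended settings.

# Sketch: moving a loaded NeMo model onto a GPU for faster inference.
# NeMo models are PyTorch modules, so standard device/precision controls apply.
import torch
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Autocast runs matrix-heavy layers in bfloat16 on supported hardware,
# which typically cuts inference latency relative to full fp32.
with torch.inference_mode(), torch.autocast(device_type=device, dtype=torch.bfloat16):
    hypotheses = model.transcribe(["sample.wav"])  # hypothetical audio file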
References:
[1] https://huggingface.co/nvidia/canary-1b-v2
[2] https://github.com/NVIDIA/NeMo
[3] https://github.com/NVIDIA/NeMo/blob/main/docs/asr.md
[4] https://github.com/NVIDIA/NeMo/blob/main/docs/evaluation.md
[5] https://github.com/NVIDIA/NeMo/blob/main/docs/hardware.md

Editorial Disclosure and AI Transparency: Ainvest News uses advanced Large Language Model (LLM) technology to synthesize and analyze market data in real time. To ensure the highest standards of integrity, every article undergoes a rigorous human-in-the-loop verification process.
While AI assists with data processing and initial drafting, a professional member of the Ainvest editorial team independently reviews, verifies, and approves all content to ensure accuracy and compliance with the editorial standards of Ainvest Fintech Inc. This human oversight is designed to mitigate AI hallucinations and ensure proper financial context.
Investment Disclaimer: This content is provided for informational purposes only and does not constitute professional investment, legal, or financial advice. Markets carry inherent risks. Users are encouraged to conduct independent research or consult a certified financial advisor before making any decisions. Ainvest Fintech Inc. disclaims all liability for actions taken on the basis of this information.
