Kling 2.1 Adds Audio Feature, Undercuts Google Veo 3 by 20 Times

Coin World
Monday, Jun 16, 2025 8:51 pm ET

Kling 2.1, an AI-powered video creation tool developed by the Chinese short video platform Kuaishou, has recently added an audio generation feature. This new capability allows users to produce video clips with synchronized sound effects, such as footsteps, rainfall, and ambient noise. The feature is available in Kling's image-to-video mode, where users can upload a still image and the platform will animate it with both motion and audio generated by artificial intelligence.

The launch of this feature positions Kling 2.1 in direct competition with Google's Veo 3, which has had integrated audio capabilities since its inception. Early users have praised Kling 2.1 for its seamless audio-visual synchronization, with some creators calling it one of the most useful models on the market for producing generative video content. The feature is currently free during its initial rollout and is accessible through Kling's website and mobile app.

Kling 2.1 generates 5- to 10-second clips at up to 1080p resolution, using what the company describes as "3D spatiotemporal attention mechanisms" to synchronize sounds with visuals. However, the audio tool currently generates only sound effects, and its output is unintelligible when a scene calls for speech. Despite this limitation, Kling 2.1 offers a significant cost advantage over Veo 3. At $9 per month, against the $250-per-month subscription that Veo 3 requires, Kling undercuts Google's model by more than 20 times, making it a more accessible option for creators who need to experiment with different audio approaches.

In a head-to-head test, Kling 2.1's new audio features were compared against Google's Veo 3. The price gap between the two platforms is massive, with Kling 2.1 offering more than 20 videos for every single Veo 3 creation. Google's model runs exclusively through its $250-per-month Ultra subscription, while Kling is available on its official site, offering some free generations, with subscriptions starting at around $9 per month. Even with Google's current promotional pricing, under which one Veo 3 generation is on sale for 4,000 credits against 300 credits for a Kling 2.1 video, Veo 3 remains more than ten times as expensive as Kling.
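The ratios behind these price claims can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this article and assumes, as the comparison implies but does not state, that the two credit prices are denominated in the same units. The function and variable names are illustrative and do not come from either platform.

```python
# Figures quoted in the article; nothing here queries either service.
VEO3_CREDITS_PER_VIDEO = 4000   # promotional per-generation price
KLING_CREDITS_PER_VIDEO = 300   # assumed to be in the same credit units
VEO3_MONTHLY_USD = 250          # Ultra subscription
KLING_MONTHLY_USD = 9           # entry-level subscription


def price_ratio(expensive: float, cheap: float) -> float:
    """Return how many times the first price is of the second."""
    if cheap <= 0:
        raise ValueError("cheap price must be positive")
    return expensive / cheap


def attempts_per_budget(budget_credits: int, cost_per_video: int) -> int:
    """Return how many whole generations fit in a credit budget."""
    return budget_credits // cost_per_video


credit_ratio = price_ratio(VEO3_CREDITS_PER_VIDEO, KLING_CREDITS_PER_VIDEO)
subscription_ratio = price_ratio(VEO3_MONTHLY_USD, KLING_MONTHLY_USD)
kling_attempts = attempts_per_budget(VEO3_CREDITS_PER_VIDEO, KLING_CREDITS_PER_VIDEO)

print(f"Per-video credit ratio: {credit_ratio:.1f}x")           # 13.3x
print(f"Subscription price ratio: {subscription_ratio:.1f}x")   # 27.8x
print(f"Kling videos per Veo 3 generation: {kling_attempts}")   # 13
```

Under these assumptions, the subscription prices give the "more than 20 times" gap, while the promotional per-video credit prices give the "more than ten times" figure.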

For creators who know that video generation involves plenty of trial and error, Kling's economics make experimentation feasible. The Premium plan on Kling unlocks 1080p resolution, improving overall video quality while still maintaining the cost advantage. Veo 3, however, offers more sophisticated sound generation, accurately synthesizing speech and matching complex audio elements to visual scenes. Its understanding of spatial audio and contextual sounds surpasses Kling's by a wide margin. While Kling 2.1 can't compete on dialogue and music, it excels at generating atmospheric audio for scenes that need ambient sound.

Kling 2.1's new ability to add sound effects to existing silent videos gives it an edge that Veo 3 can't match. Users can upload finished videos and retrofit them with appropriate soundscapes, a workflow that Google's model doesn't support. Additionally, Kling offers a lip-syncing feature: users upload a photo and a separate speech or dialogue recording, and the model produces a video in which the subjects appear to speak the uploaded audio and interact naturally with each other.

In terms of video generation quality, Kling 2.1's standard version outperformed both Veo 3 and its own Master edition in a test scene featuring a woman fleeing from a giant spider. The standard model accurately represented the scene dynamics, exhibiting fluid motion and proper directional movement. Veo 3 inexplicably generated the woman running toward the spider instead of away from it. The Master edition typically produces sharper, crisper visuals, but the standard version demonstrated superior scene comprehension and more fluid movement.

Platform limitations shape each tool's workflow differently. Kling 2.1's audio feature works only with image-to-video generation. Text-to-video remains exclusive to the Master edition, which has no audio support. The best workaround is to use Kolors, Kuaishou's image generator, to create starting frames and then convert them to video with synchronized audio. Kolors produces highly realistic images that serve as excellent starting points for video generation. Veo 3 takes the opposite approach, offering only text-to-video generation without any image-to-video option. This forces users to rely entirely on prompt engineering, with no way to control the starting visual.

Content moderation revealed contrasting philosophies between the two platforms. Veo 3 employs aggressive keyword filtering and post-generation checks, blocking content that violates Google's policies. The system flags potentially problematic prompts before generation and analyzes completed videos for policy violations. In contrast, Kling applies more liberal restrictions, allowing content that Veo blocks outright. However, Kling's training data excluded explicit content, so the model generates figures without anatomical details and violence without gore. As a result, users can generate some content that a keyword filter would reject, while the model itself still keeps the output within safety boundaries. Both platforms refund credits when post-generation censorship blocks a video, but Kling's lighter touch allows more creative freedom within those boundaries.

In conclusion, while Veo 3 may still be the king of dialogue and sound design quality, Kling 2.1 is the populist challenger out to overthrow the monarchy. Its audio feature is striking when you consider that it's a $9 tool competing against a $250 subscription. The atmospheric sounds work, the rain sounds like rain, footsteps match the movement most of the time, and you can generate twenty attempts while Veo users carefully craft their single shot. The retrofit feature, which adds sound to finished videos, is something Google doesn't offer, and it's genuinely useful for salvaging silent clips. For specific requirements like speech, Google Veo 3 is the only real choice. For creators looking for a more affordable and flexible option, however, Kling 2.1 is a strong contender.


Disclaimer: The news articles available on this platform are generated in whole or in part by artificial intelligence and may not have been reviewed or fact checked by human editors. While we make reasonable efforts to ensure the quality and accuracy of the content, we make no representations or warranties, express or implied, as to the truthfulness, reliability, completeness, or timeliness of any information provided. It is your sole responsibility to independently verify any facts, statements, or claims prior to acting upon them. Ainvest Fintech Inc expressly disclaims all liability for any loss, damage, or harm arising from the use of or reliance on AI-generated content, including but not limited to direct, indirect, incidental, or consequential damages.