Chinese Video Generation Models Revolutionize AI Landscape with Creative Advancements
Chinese video generation models have been developed and publicly released at a rapid pace in recent months. Industry experts view the technology as a major focus within the AI sector, one that is advancing quickly and poised to make a substantial impact on fields such as film production and advertising design.
Recently, Volcano Engine, a subsidiary of ByteDance, unveiled the Doubao video generation model. The model is noted for generating consistent multi-shot scenes and dynamic camera movements, and for supporting 3D animation. The team highlighted its innovative diffusion model training approach, which addresses the challenge of maintaining consistency across multi-shot transitions without compromising the subject, style, or atmosphere.
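To make the multi-shot consistency idea concrete, here is a minimal PyTorch sketch of one generic approach: denoising every shot's latent against a single shared subject/style embedding, so transitions cannot drift away from a common condition. This is an illustrative assumption, not Doubao's published architecture; every class and variable name here is hypothetical.

```python
import torch
import torch.nn as nn

class SharedConditionDenoiser(nn.Module):
    """Toy denoiser: every shot is conditioned on one shared
    subject/style embedding (hypothetical, not Doubao's design)."""

    def __init__(self, latent_dim=64, cond_dim=128):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, latent_dim)
        self.backbone = nn.Sequential(
            nn.Linear(latent_dim * 2, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_latent, shared_cond):
        # Fuse the shot latent with the projected shared condition, so
        # each shot is denoised toward the same subject and style.
        c = self.cond_proj(shared_cond)
        return self.backbone(torch.cat([noisy_latent, c], dim=-1))

denoiser = SharedConditionDenoiser()
shared = torch.randn(1, 128)                     # one subject/style embedding
shots = [torch.randn(1, 64) for _ in range(3)]   # noisy latents, one per shot
denoised = [denoiser(s, shared) for s in shots]  # all shots share the anchor
```

Production systems typically inject such conditions through cross-attention rather than concatenation, but the design choice is the same: a single conditioning signal shared across shots keeps subject, style, and atmosphere aligned.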
Another notable development is Tongyi Wanxiang's release of a video generation model capable of producing detailed animations from textual descriptions. The model addresses known challenges in motion generation and physical simulation, offering realistic footage suited to film creation, animation design, and advertising.
The rise of these video generation models has drawn considerable attention within the global AI industry. Companies such as Kuaishou, Shensu Technology, and Zhipu AI are racing to launch products of their own, underscoring the industry's momentum.
According to Deng Daozheng, Deputy Director of Saizhi Industry Research Institute, these developments are expected to significantly influence industries such as media, advertising, education, and the metaverse by reducing costs and production times in short video, live streaming, and film production.
However, while many models have emerged, experts emphasize the need to evolve from quantity to quality. Tang Jiayu, Co-Founder and CEO of Shensu Technology, points out a common issue: insufficient controllability and consistency, particularly in maintaining subject coherence during complex interactions.
Despite significant technical progress, Deng notes that video quality and continuity still leave room for improvement. Models struggle with complex scenes, often producing disjointed or flawed visuals, and their grasp of natural language prompts remains limited, leading to outputs that can be arbitrary or incoherent.
In response, companies are accelerating model iterations. For instance, Vidu, developed by Shensu Technology in collaboration with Tsinghua University, has updated its "subject reference" feature, which lets users maintain subject consistency by uploading a single image of the subject. The enhanced capability supports seamless scene transitions driven by descriptive prompts.
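As a rough illustration of how a "subject reference" feature can work, the sketch below encodes the single uploaded image once into an embedding that then conditions every generated scene. It is a hypothetical PyTorch example, not Vidu's actual pipeline; all names are assumptions.

```python
import torch
import torch.nn as nn

class SubjectEncoder(nn.Module):
    """Toy image encoder: turns one reference photo into a reusable
    subject embedding (hypothetical, not Vidu's implementation)."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=4),
            nn.SiLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, embed_dim),
        )

    def forward(self, image):
        return self.net(image)

encoder = SubjectEncoder()
ref_image = torch.randn(1, 3, 256, 256)   # the single uploaded subject image
subject_emb = encoder(ref_image)          # computed once, reused everywhere

for prompt in ["walking on a beach", "reading in a library"]:
    # A real system would inject subject_emb (e.g. via cross-attention)
    # into the video model alongside each text prompt, so the same
    # subject appears across otherwise unrelated scenes.
    print(prompt, "->", tuple(subject_emb.shape))
```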
Looking ahead, Deng suggests fostering innovation and collaboration among enterprises, universities, and research institutions. Investment in core algorithms, comprehensive datasets, and broader application scenarios will be crucial to raising the quality of video generation and securing its wide adoption and commercialization.