Intelligent Audio-Visual: Application Prospects and Challenges of AIGC in Short Video Production
Zhao Yiying (Southwest University, Chongqing 400715)
Abstract
[Purpose] This paper aims to explore the specific applications and challenges of AIGC technology in the short video domain, examine pathways for enhancing audio-visual creation in the artificial intelligence era, assess the risks that AIGC poses to short video production, and propose potential technical solutions. [Method] This study delves into the operational mechanisms of AIGC technology and analyzes its prospects, problems, and challenges for short video creation through concrete practice. [Results/Conclusion] The author argues that the introduction of AIGC will unlock potential in short video creation across creative discovery, visual evaluation, character shaping, scene depiction, virtual-real integration, visual restoration, and human-computer interaction, while simultaneously confronting challenges arising from model bias, deepfakes, and digital infringement.
Keywords: Generative Artificial Intelligence; Short Video; AIGC; Large Models
Classification Code: G202
Document Code: A
Article ID: 1671-0134(2025)03-137-05
DOI: 10.19483/j.cnki.11-4653/n.2025.03.030
Citation Format: Zhao Yiying. Intelligent Audio-Visual: Application Prospects and Challenges of AIGC in Short Video Production [J]. China Media Technology, 2025, 32(3): 137-140, 158.
AIGC (Artificial Intelligence Generated Content) refers to technology that generates creative text, images, audio, video, and other multimodal AI products based on massive data, algorithms, and models [1]. The surge of attention that AIGC drew in 2024 is closely linked to continuous breakthroughs in its technical capabilities, which have gradually demonstrated diverse application potential. In February 2024, OpenAI released its text-to-video model Sora along with 48 sample text-to-video clips, achieving an industry leap in quality and duration [2]. This milestone marked a sector-wide breakthrough for AI technology in the text-to-video domain and drew global attention from the AI media industry toward the potential of AIGC in audio-visual content creation. However, text-to-video models have also revealed technical weaknesses: first, the consistency challenge in handling moving and long shots; second, spatial logic errors in generated content; and third, computational resource constraints limiting production efficiency [3]. Consequently, AIGC has found readier adoption in short video creation than in industrial-grade audio-visual production. For instance, Kuaishou's text-to-video product "Keling" attracted over 3.6 million users within six months of its release, generating 37 million videos [4], fully demonstrating AIGC's enormous potential and market value in the short video domain.
On a deeper level, as AI technology permeates the short video field, issues of mutual adaptation between technology and society will become increasingly prominent and cannot be ignored. Therefore, this paper will delve into the technical fabric of artificial intelligence to analyze AIGC's application prospects in short video audio-visual creation, while also examining the resulting socio-technical risks and ethical concerns, attempting to provide valuable insights for building a harmonious and sustainable human-machine-society symbiotic relationship in the audio-visual domain.
1. Key Technical Breakthroughs of AIGC in the Audio-Visual Domain
AIGC's recent progress in the audio-visual field rests on four key technical pillars: generative adversarial networks, variational autoencoders, sequence-to-sequence models, and conditional neural networks. This section introduces each in turn.
1.1 Generative Adversarial Networks (GANs) Enhancing Audio-Visual Content Authenticity
Generative Adversarial Networks (GANs) represent a core architecture in AI audio-visual applications, comprising two fundamental components: a Generator and a Discriminator [5]. The Generator learns to simulate the distribution of real-world audio-visual data through deep learning models, thereby producing high-quality, realistic audio-visual content. The Discriminator analyzes video details such as texture, color, and motion coherence to judge whether content is real or generated. During training, the Discriminator's evaluation criteria continuously improve, forcing the Generator to refine its generation strategy and thereby enhancing the realism of generated content. The "adversarial" process in GANs thus seeks an equilibrium between the two components: the Generator attempts to "fool" the Discriminator with realistic audio-visual content, while the Discriminator's judgments supply the learning signal that drives the Generator's optimization.
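To make the adversarial loop concrete, the following is a minimal PyTorch sketch of the Generator-Discriminator game; the toy dimensions, flattened "frames," and placeholder data are illustrative assumptions rather than a production video model.

```python
import torch
import torch.nn as nn

# Toy generator: maps a random latent vector to a flattened "frame".
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 64 * 64), nn.Tanh())
# Toy discriminator: scores how "real" a flattened frame looks.
D = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_frames = torch.rand(32, 64 * 64)  # placeholder batch standing in for real footage

for step in range(100):
    # 1) Train the discriminator to separate real frames from generated ones.
    fake = G(torch.randn(32, 100)).detach()
    d_loss = bce(D(real_frames), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Train the generator to "fool" the discriminator into scoring its output as real.
    g_loss = bce(D(G(torch.randn(32, 100))), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```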
1.2 Variational Autoencoders (VAEs) Enhancing Audio-Visual Content Coherence
Variational Autoencoders (VAEs) are an advanced method specifically designed to optimize audio-visual content generation. By constructing a latent probability model of audio-visual data, VAEs enable the model to balance diversity and coherence during generation, creating rich yet fluid audio-visual experiences [6]. The VAE architecture comprises two core components: an encoder and a decoder. The encoder compresses video content into vector representations in a latent space, while the decoder reconstructs audio-visual content from these latent vectors. The essence of the technique lies in jointly optimizing the reconstruction error and the KL divergence, which ensures that audio-visual sequences in the latent space are not only diverse but also smooth and natural. In addition, fine-tuning the dimensionality and distribution of audio-visual data in the latent space can further improve the quality and efficiency of generation.
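A minimal sketch of this trade-off, again on toy flattened frames with illustrative dimensions, shows how the reconstruction error and KL divergence are optimized together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameVAE(nn.Module):
    def __init__(self, frame_dim=64 * 64, latent_dim=32):
        super().__init__()
        self.enc = nn.Linear(frame_dim, 256)
        self.mu = nn.Linear(256, latent_dim)       # mean of the latent distribution
        self.logvar = nn.Linear(256, latent_dim)   # log-variance of the latent distribution
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error keeps frames faithful to the data;
    # KL divergence keeps the latent space smooth, so nearby codes decode to coherent frames.
    recon_err = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_err + kl
```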
1.3 Sequence-to-Sequence (Seq2Seq) Models Enhancing Audio-Visual Content Consistency
Sequence-to-Sequence (Seq2Seq) models address temporal issues in audio-visual content generation through Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) architectures, efficiently managing temporal dependencies between video frames and audio segments. The model's distinguishing feature is its ability to receive a series of inputs (such as previous video frames) and predict the next series of outputs (such as future video frames), thereby ensuring visual and narrative continuity in generated videos. For example, when generating specific action scenes, Seq2Seq models can keep action sequences plausible and plot development coherent. In addition, the training process of Seq2Seq models allows conditional encodings to be incorporated, such as indicators of emotional change or scene transitions, further strengthening the model's command of complex narrative structures. This makes Seq2Seq models a valuable tool for audio-visual projects requiring tight temporal control and narrative depth.
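The frame-to-frame prediction idea can be sketched as follows, assuming each frame has already been reduced to a feature vector; the LSTM architecture and dimensions are illustrative choices, not a specific production model.

```python
import torch
import torch.nn as nn

class FrameSeq2Seq(nn.Module):
    """Encode a sequence of frame feature vectors and predict the next ones."""
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, past_frames, future_len=8):
        # Summarize the observed frames into the encoder's hidden state.
        _, state = self.encoder(past_frames)
        # Decode from the last observed frame, feeding each prediction back in,
        # so consecutive predicted frames stay temporally consistent.
        step = past_frames[:, -1:, :]
        outputs = []
        for _ in range(future_len):
            dec_out, state = self.decoder(step, state)
            step = self.out(dec_out)
            outputs.append(step)
        return torch.cat(outputs, dim=1)

# Example: predict 8 future frame-feature vectors from 16 observed ones.
model = FrameSeq2Seq()
future = model(torch.randn(4, 16, 128))   # -> shape (4, 8, 128)
```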
1.4 Conditional Neural Networks Enhancing Audio-Visual Content Uniqueness
Conditional Neural Networks integrate additional conditional information—such as user preferences, scene descriptions, or specific topic labels—into the video generation process, significantly improving content customization and precision of control. The technique first encodes the conditional information, for example by vectorizing text descriptions or extracting features so that they can be processed by the network [7]. The encoded conditions are then injected into the input or into multiple layers of the generation network, allowing the network to consider these guiding factors throughout generation. For instance, when generating videos about "natural landscapes," the network draws on images and scenes tagged with "nature"; when producing videos for a specific cultural festival, it incorporates symbols and elements of that culture. This approach not only improves a video's visual appeal and content relevance but also allows videos to meet specific scene requirements precisely, enhancing adaptability and personalized experiences.
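A minimal sketch of conditional injection, assuming discrete scene tags and toy dimensions, illustrates how the encoded condition is fed into the generator alongside the latent code.

```python
import torch
import torch.nn as nn

class ConditionalFrameGenerator(nn.Module):
    """Generator whose output is steered by a scene/topic tag condition."""
    def __init__(self, latent_dim=100, num_tags=10, frame_dim=64 * 64):
        super().__init__()
        self.tag_embed = nn.Embedding(num_tags, 32)   # learned vector per tag, e.g. "nature"
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 32, 256), nn.ReLU(),
            nn.Linear(256, frame_dim), nn.Tanh(),
        )

    def forward(self, z, tag_ids):
        cond = self.tag_embed(tag_ids)                    # encode the conditional information
        return self.net(torch.cat([z, cond], dim=1))      # inject it alongside the latent code

gen = ConditionalFrameGenerator()
# Same kind of noise, different tags -> differently themed outputs.
frames = gen(torch.randn(4, 100), torch.tensor([3, 3, 7, 7]))
```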
2. Application Prospects of AIGC in Short Video Production
2.1 Inspiration Activation: Creative Discovery and Visual Evaluation
Creativity is the lifeblood of short videos, directly determining whether content can attract audiences and stand out in a competitive market. In traditional content creation workflows, creative topics typically originate from producers' personal experiences and observations of social reality—a method that relies on creators' sensitivity and experience but is limited by individual perspectives and cognitive scope, potentially leading to insufficient content diversity and innovation [8]. The introduction of AIGC technology will profoundly transform the creative discovery and topic planning process for short videos. This transformation is first reflected in the diversification of creative sources. AI algorithms can analyze social media trending topics, search engine trends, user online behavior patterns, and other data to identify novel topic perspectives overlooked by the market, thereby generating content ideas with broader coverage and fresher angles. Second, AIGC technology can also provide more efficient decision-making support in creative screening. Through machine learning models, AI systems can predict the potential audience size and popularity of different creative topics, helping producers quickly and accurately select more market-viable options among numerous possibilities.
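As a rough illustration of such decision support, the sketch below fits a simple text-to-popularity regressor on a handful of placeholder topics; the data, model choice, and scale are purely illustrative assumptions, not an account of any particular platform's system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Placeholder history: past topic descriptions and their (illustrative) view counts.
past_topics = ["street food tour", "budget travel tips", "home workout routine", "phone camera tricks"]
past_views = [120_000, 95_000, 40_000, 180_000]

# Fit a simple text-to-popularity regressor.
model = make_pipeline(TfidfVectorizer(), Ridge())
model.fit(past_topics, past_views)

# Score new candidate topics and surface the most promising ones for producers.
candidates = ["night market food challenge", "desk stretching routine"]
for topic, score in zip(candidates, model.predict(candidates)):
    print(topic, round(score))
```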
Once creativity is determined, the key lies in effectively translating these ideas into visual content to form short video products. Visual evaluation plays a critical role in this process. With AIGC's assistance, creators can use natural language instruction fine-tuning to simulate different visual expression schemes, obtaining various video content alternatives at low cost and high efficiency, and optimizing through comparison. This not only helps creators conduct rapid experimentation and selection among different creative options, significantly shortening the time from concept to product, but also ensures that the final audio-visual content can precisely express creative intentions and possess market appeal.
2.2 Image Construction: Character Shaping and Scene Depiction
In short video production, character and scene design are core elements for capturing audience attention. The integration of AIGC technology injects unprecedented personalization and creative space into character shaping and scene depiction, opening new artistic expression pathways for creators. In character generation, video production teams can use AIGC technology to automatically generate virtual characters with unique personalities and appearances. This process is typically implemented through deep learning-based generative adversarial networks, where the "Generator" gradually fits realistic virtual images through training on massive character data (such as specific skin tones, genders, ethnicities, etc.), while the "Discriminator" continuously feeds back evaluations that guide parameter fine-tuning, constructing character images that are both logical and full of personality. AIGC can also endow characters with more distinctive personality traits and behavior patterns based on the video's theme and context provided by creators. For instance, in a series of short videos showcasing different ethnic cultures, AIGC can generate virtual characters that conform to ethnic cultural characteristics—possessing both "form and spirit"—using only a few keyword prompts, thereby enhancing content authenticity and relatability to resonate with audiences.
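As one possible realization of keyword-driven character design, the sketch below uses an open text-to-image diffusion pipeline in place of the GAN workflow described above; the library, model name, and prompt are illustrative assumptions rather than the production tools discussed here.

```python
import torch
from diffusers import StableDiffusionPipeline

# Open-source diffusion model as a stand-in for a character concept generator
# (model identifier is an assumption; any comparable text-to-image checkpoint works).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A few keyword prompts are enough to sketch a culturally specific virtual character.
prompt = "portrait of a young woman in traditional silver headdress, cinematic lighting"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("character_concept.png")
```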
Complementing character generation is scene creation and expression. AIGC technology can automatically generate complex and finely detailed three-dimensional scenes according to script requirements. Using the same generative techniques, production teams can preview and select the most suitable backgrounds and settings for video content without building physical sets, greatly reducing short video creation's dependence on external environments. At the same time, dynamic scene generation for specific times or atmospheres plays an important role in constructing a video's emotional appeal. Whether flowing seasonal landscapes, passing traffic and crowds, or rapidly changing weather phenomena, all can be precisely rendered and integrated into videos, not only increasing emotional tension but also creating an immersive experience for audiences.
2.3 Integration of Virtual and Real: Bidirectional Empowerment Expanding Realistic Boundaries
The introduction of AIGC technology significantly reduces production costs and brings unprecedented creative freedom and efficiency to short video creation. However, videos generated purely through AIGC technology are often criticized for their lack of authenticity. Here the advantages of traditional shooting methods become apparent: real footage can compensate for gaps in AI training data, provide feedback for refining technical parameters, and inject more vitality and vividness into short videos.
On one hand, AIGC technology demonstrates enormous potential in enhancing real-shot footage. For example, in natural landscape shooting, AIGC algorithms can skillfully render natural phenomena like sunrise and sunset or add dynamic weather effects such as storms and lightning, making videos more lifelike and accurately conveying shooting intentions and emotional atmosphere. Additionally, AIGC technology excels at fine-tuning picture color, lighting, and details to meet specific narrative needs or artistic styles. Using advanced physics-based simulation and particle system algorithms, AI technology can simulate realistic weather effects based on lighting and other environmental parameters at shooting locations, ensuring seamless integration between virtual elements and real-shot content.
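A rough sketch of this kind of grading and compositing, using generic OpenCV operations with placeholder file names, might look like the following; real pipelines would use physically based simulation rather than a simple overlay blend.

```python
import cv2
import numpy as np

# Load a real-shot frame and a pre-rendered effect layer (e.g., rain); paths are placeholders.
frame = cv2.imread("real_shot_frame.png")
rain_layer = cv2.imread("rain_overlay.png")
rain_layer = cv2.resize(rain_layer, (frame.shape[1], frame.shape[0]))

# 1) Grade the footage: darken it slightly and push it toward blue to suggest storm lighting.
graded = cv2.convertScaleAbs(frame, alpha=0.8, beta=-10)   # overall exposure/contrast
b, g, r = cv2.split(graded)
b = np.clip(b.astype(np.int16) + 15, 0, 255).astype(np.uint8)
r = np.clip(r.astype(np.int16) - 10, 0, 255).astype(np.uint8)
graded = cv2.merge([b, g, r])

# 2) Blend the virtual weather layer onto the graded real footage.
composite = cv2.addWeighted(graded, 1.0, rain_layer, 0.35, 0)
cv2.imwrite("composited_frame.png", composite)
```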
On the other hand, real shooting provides richer material and more authentic texture for AIGC content. Taking motion capture technology as an example, real-shot action data (such as details of complex human body movements, facial expressions, and environmental interactions) constitutes the foundation of motion models in AIGC technology [9]. By analyzing real human movements, AIGC technology can learn how to simulate natural motion laws, making virtual character movements appear natural visually and consistent with the real world in physical behavior, thereby feeding back into virtual character motion generation and enhancing the authenticity and viewability of short video production.
2.4 Value Reshaping: Visual Restoration and Content Reuse
The reuse of video footage can inspire new creative perspectives, reconstruct narrative structures, and significantly improve the efficiency and flexibility of short video creation. Many early short video materials, despite their high content value, often suffer from poor image quality due to technical limitations, preventing these precious materials from fully realizing their potential [10]. Using advanced spatial enhancement technology, AIGC can apply semantic feature-based video super-resolution methods to transform originally low-resolution video materials into higher-definition, more delicate versions. This technology not only substantially improves image clarity but also preserves the unique style and flavor of the original video while presenting more vivid and realistic visual effects in the restored footage.
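One way to approximate such learned super-resolution is OpenCV's dnn_superres module with a pretrained EDSR network; this is an illustrative stand-in for semantic feature-based methods, and the model file path is an assumption (it requires opencv-contrib-python and a downloaded weights file).

```python
import cv2

# Learned super-resolution sketch using OpenCV's dnn_superres module.
sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("EDSR_x4.pb")       # pretrained 4x EDSR weights (placeholder path)
sr.setModel("edsr", 4)           # algorithm name and upscale factor

low_res = cv2.imread("old_clip_frame.png")
high_res = sr.upsample(low_res)  # 4x upscaled frame
cv2.imwrite("restored_frame.png", high_res)
```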
In addition to spatial enhancement technology, temporal enhancement technology is another highlight of AIGC in short video visual restoration. This technology focuses on smooth transitions between video sequences, ensuring that restored videos display coherent and natural picture effects during playback. Through the organic combination of AIGC optimization networks and traditional enhancement technologies, video temporal enhancement effects are significantly improved. Even complex motion scenes and rapidly switching frames in old videos can be accurately and effectively repaired and restored [11].
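A crude illustration of temporal smoothing is optical-flow-based frame interpolation: the sketch below synthesizes a rough in-between frame from two neighbors, whereas production restoration relies on learned interpolation models. File names are placeholders.

```python
import cv2
import numpy as np

f1 = cv2.imread("frame_010.png")
f2 = cv2.imread("frame_011.png")
g1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY)
g2 = cv2.cvtColor(f2, cv2.COLOR_BGR2GRAY)

# Dense optical flow from frame 1 to frame 2.
flow = cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 15, 3, 5, 1.2, 0)

h, w = g1.shape
grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
# Sample frame 2 halfway along each flow vector to approximate the midpoint frame.
map_x = (grid_x + 0.5 * flow[..., 0]).astype(np.float32)
map_y = (grid_y + 0.5 * flow[..., 1]).astype(np.float32)
mid = cv2.remap(f2, map_x, map_y, cv2.INTER_LINEAR)
cv2.imwrite("frame_010_5.png", mid)
```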
2.5 Interaction Enhancement: Human-Computer Interaction and Interactive Short Videos
Interactive video is a new media form that combines video with interactive elements, aiming to create a richer and more immersive viewing experience for audiences through diversified means such as enhanced somatosensory feedback, deepened plot participation, and broadened content exploration [12]. With the vigorous development of AI technology, the field of human-computer interaction is undergoing unprecedented profound transformation, and the rise of AIGC technology has opened up new possibilities for the creation and dissemination of interactive short videos.
On one hand, AIGC technology can help achieve real-time dialogue between video characters and audiences. Through advanced technologies such as deep learning and natural language processing, AIGC can endow video characters with "intelligence." These characters can recognize and understand audience voice or text input and generate corresponding responses to achieve real-time dialogue with viewers. For example, in travel short videos, audiences can easily converse with characters in the video to obtain detailed destination information and personalized travel recommendations; in educational short videos, students can interact with teachers in real-time to promptly resolve learning questions and deepen knowledge understanding; in entertainment short videos, audiences can even interact intimately with virtual idols to enjoy unprecedented unique entertainment experiences. This technology not only greatly enhances video interactivity but also allows audiences to more deeply participate in short video content and enjoy immersive viewing pleasure.
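As a minimal sketch of such a "talking character," the loop below pairs a persona prompt with a small open text-generation model; the model choice is a toy assumption, and a real interactive video system would use a far more capable dialogue engine plus speech recognition and synthesis.

```python
from transformers import pipeline

# Small open model standing in for a production dialogue engine (model name is an assumption).
chat = pipeline("text-generation", model="distilgpt2")

persona = "You are a friendly local tour guide appearing in a travel short video."
while True:
    question = input("Viewer: ")
    if not question:
        break
    reply = chat(f"{persona}\nViewer: {question}\nGuide:", max_new_tokens=60)[0]["generated_text"]
    # Keep only the text generated after the prompt.
    print("Guide:", reply.split("Guide:")[-1].strip())
```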
On the other hand, AIGC technology can add rich interactive elements to video content, deepening audience participation. Through precise analysis of backend data and audience interaction signals, AIGC technology can fully understand audience needs and viewing contexts, and then dynamically insert or adjust interactive segments in video content according to preset interaction logic or real-time audience input. These interactive elements are not limited to simple click choices but encompass diversified forms including voice recognition and response, facial recognition and expression interaction, gesture recognition and control, and personalized content adjustment based on audience emotional feedback, providing audiences with an immersive, multi-sensory interactive experience. This new interaction model not only gives audiences more enjoyment and sense of participation during viewing but also enables creators to fully exercise their imagination and produce more creative and attractive interactive video works.
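The underlying interaction logic can be sketched as a simple branching map from viewer choices to the next clip; the segment names and choices below are illustrative placeholders, and a real system would also handle voice or gesture input and fall back gracefully when recognition fails.

```python
# Branching map: which video segment to play next, given the current segment and a viewer choice.
story_graph = {
    "intro":        {"visit market": "market_tour", "see temple": "temple_visit"},
    "market_tour":  {"try street food": "food_closeup", "back": "intro"},
    "temple_visit": {"hear the legend": "legend_story", "back": "intro"},
}

def next_segment(current: str, viewer_choice: str) -> str:
    """Return the next video segment for a viewer's choice, staying put on unknown input."""
    return story_graph.get(current, {}).get(viewer_choice, current)

segment = "intro"
for choice in ["visit market", "try street food"]:
    segment = next_segment(segment, choice)
    print("now playing:", segment)   # -> market_tour, then food_closeup
```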
3. Challenges
3.1 Model Bias and Deepfakes
Text-to-video models, as dynamic presentations of images, may further exacerbate cultural bias, gender bias, socioeconomic bias, and racial and ethnic bias in AIGC short video content [13]. Bias in AIGC-generated content primarily stems from two aspects. First is data-driven bias. Text-to-video models generate new content by learning from massive video datasets. If these training data themselves contain biases—for example, underrepresentation or overrepresentation of certain groups—the model-generated content will likely reflect and even amplify these biases. Second, bias in algorithm design cannot be ignored. During model design and development, algorithm configuration and parameter selection entirely depend on engineers' personal judgments, and individual perspectives and sociocultural backgrounds of algorithm engineers will significantly influence model bias generation.
Therefore, solving bias issues in text-to-video requires addressing these two key points. First is ensuring comprehensive and diverse training data. This includes extensively collecting data from diverse cultures, regions, genders, and social groups, and ensuring the data contains diverse perspectives and narratives. This will create a more balanced learning foundation for models and generate more comprehensive and inclusive video content. Second is enhancing algorithm transparency. Developers should adopt explainable AI technology to reveal model decision-making bases, allow users and stakeholders to participate in algorithm fairness testing, understand model working principles and reasons for specific outputs, and conduct fairness-based algorithm adjustments on this foundation.
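A first step toward the data-side remedy is simply auditing representation in the training corpus. The sketch below counts group labels and flags thin coverage; the labels and threshold are placeholders for whatever demographic or cultural attributes a team actually tracks.

```python
from collections import Counter

# Placeholder labels attached to training clips (one attribute shown for simplicity).
clip_labels = ["east_asian", "east_asian", "south_asian", "african", "european",
               "east_asian", "latin_american", "east_asian"]

counts = Counter(clip_labels)
total = sum(counts.values())
for group, n in counts.most_common():
    share = n / total
    flag = "  <-- underrepresented?" if share < 0.10 else ""
    print(f"{group:>16}: {n:3d} clips ({share:.1%}){flag}")
```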
Additionally, deepfake videos are increasingly becoming a focus of attention across society. With their astonishingly realistic effects, these videos can fabricate seemingly irrefutable news or historical events, thereby misleading public understanding of major social and political issues, and in severe cases may even trigger social unrest and political crises [14]. Developing technologies that can effectively identify deepfake videos is key to addressing this challenge. For example, using anomaly detection technology during video inspection to precisely identify flaws in video processing (such as distorted facial expressions, unnatural lighting inconsistent with physical phenomena). Another approach is using comparative analysis technology to conduct detailed comparisons between video or audio samples to be tested and known authentic samples, accurately judging authenticity by carefully examining possible subtle differences.
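A minimal sketch of the comparative-analysis route measures how far a suspect frame deviates from a known authentic reference using structural similarity; the threshold and file names are illustrative assumptions, and production detectors combine many more temporal and facial cues in trained classifiers.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

# Compare a suspect frame against a known authentic reference frame (paths are placeholders).
suspect = cv2.cvtColor(cv2.imread("suspect_frame.png"), cv2.COLOR_BGR2GRAY)
reference = cv2.cvtColor(cv2.imread("reference_frame.png"), cv2.COLOR_BGR2GRAY)
reference = cv2.resize(reference, (suspect.shape[1], suspect.shape[0]))

score = ssim(suspect, reference)
print(f"structural similarity: {score:.3f}")
if score < 0.6:   # illustrative threshold
    print("large deviation from the authentic reference -- flag for manual review")
```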
3.2 Digital Infringement and Copyright Protection
In 2024, the Beijing Internet Court heard the first AI text-to-video copyright infringement case, where the creator of the "Shanhai Qijing" trailer accused the defendant of using AI to generate animations highly similar to their work, sparking discussion about the originality and copyright protection of AI-generated content.
The originality issue of AIGC content is a major challenge currently facing digital copyright protection. On one hand, AI technology can generate works similar to or even indistinguishable from human creations through learning and imitation. On the other hand, the intelligent generation process often involves complex algorithms and data processing, making originality judgment particularly difficult [15]. Data rights confirmation, as a foundational process for digital copyright circulation and protection, plays an important role in protecting short video digital copyrights. First, short video creators can build data rights confirmation platforms using blockchain technology to achieve clear ownership and orderly circulation in data usage and trading systems. Second, combining frontier technologies such as big data, cloud computing, and intelligent algorithms can fully model potential usage scenarios, modalities, and objects of data, more precisely identifying data value and providing a scientific basis for data rights confirmation and pricing. Finally, AI technology can also help mainstream media achieve intelligent monitoring and enforcement of digital copyrights, promptly discovering and handling infringement to protect the legitimate rights and interests of data owners and content creators.
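A minimal sketch of such blockchain-style rights confirmation chains content hashes together so that later tampering with the record is detectable; this is an illustrative stand-in for a real rights-confirmation platform, with file paths and owner names as placeholders.

```python
import hashlib
import json
import time

def content_hash(path: str) -> str:
    """Fingerprint the video file itself."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

ledger = []  # in-memory stand-in for a distributed ledger

def register(video_path: str, owner: str) -> dict:
    # Each record commits to the video's content hash and to the previous record,
    # so altering any earlier entry breaks every later record_hash.
    prev_hash = ledger[-1]["record_hash"] if ledger else "0" * 64
    record = {"owner": owner, "video_hash": content_hash(video_path),
              "timestamp": time.time(), "prev_hash": prev_hash}
    record["record_hash"] = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    ledger.append(record)
    return record

# register("my_short_video.mp4", "Studio A")  # path and owner are placeholders
```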
References
[1] Weng Yujun. How AIGC Can Be Embedded in Normalized News Production [J]. Media Review, 2023(8): 31-33.
[2] Bai Daoxin, Zhao Su. Reflections on Enhancement Paths for Film and Television Production Based on AIGC [J]. China Media Technology, 2024(8): 138-141.
[3] Xu Bo, Li Kuangyi. Integration and Symbiosis of AIGC and Micro-Dramas: Audio-Visual Art Exploration Driven by Technology [J]. Contemporary TV, 2024(12): 22-27.
[4] Jiemian News. Kuaishou: Keling AI Exceeds 3.6 Million Users, Standalone App to Launch Soon [EB/OL]. (2024-10-24) [2024-12-13]. https://www.jiemian.com/article/11876919.html.
[5] Yuan Bin. Research and Practice in Building Video Generation Models [J]. Radio and Television Network, 2024(S1): 13-17.
[6] Wang Yanwen, Lei Weimin, Zhang Wei, et al. A Survey of Video Image Reconstruction Methods Based on Generative Models [J]. Journal of Communications, 2022, 43(9): 194-208.
[7] Shao Yu. Exploration of Research Ideas on Generative Adversarial Neural Networks [J]. Communications and Information Technology, 2024(1): 117-122.
[8] Chen Changfeng, Yuan Yuqing. Intelligent Journalism: Generative AI as Infrastructure [J]. Inner Mongolia Social Sciences, 2024, 45(1): 40-48.
[9] Wu Jianmei. Virtual Digital Human Motion Capture Technology Based on Multi-Feature Fusion [J]. Journal of Heilongjiang University of Technology (Comprehensive Edition), 2024, 24(1): 10-15, 21.
[10] Lin Song. Video Image Key Frame Extraction and Restoration Method Based on Computer Vision [J]. Journal of Chongqing University of Science and Technology (Natural Science Edition), 2022, 24(6): 10-15, 21.
[11] Wang Yong, Chen Zanwei. Real-Time Rendering 3D Engine Technology Application Development and Effect Innovation [J]. Popular Literature and Art, 2021(23): 66-68.
[12] Li Yulin. Research on Interactive Video Application and Future Development [J]. Radio & TV Broadcast Engineering, 2013, 40(5): 30-32.
[13] Qin Shengyun, Li Xingyi. From ChatGPT to Sora: Production Process Reshaping and Trust Crisis Response in the AIGC Transformation of the Film and Television Industry [J]. Radio & TV Journal, 2024(11): 3-8.
[14] Wu Jing. New Applications and Alienation Risks of "Deepfake" Technology in the Media Field [J]. Media, 2023(3): 51-54.
[15] Liu Haiming, Tao Penghui. Imitation Ethics in AIGC Copyright Practice for Media Digital Content: Controversies, Boundaries, and Principles [J]. Journalism Lover, 2024(7): 22-27.
Author Bio: Zhao Yiying (2004—), female, Han ethnicity, from Yuncheng, Shanxi, undergraduate student, research interests include intelligent communication, computational communication, radio and television.
(Editor in Charge: Li Yansong)