Research and Application of Key Technical Stages in Film Virtual Digital Human Production (Postprint)
Cheng Xiangyi
Submitted 2025-07-09 | ChinaXiv: chinaxiv-202507.00247

Abstract

Objective: In recent years, with the digital evolution of high-tech formats such as mobile internet, artificial intelligence, and virtual reality, "virtual digital humans" have emerged as a novel technological format, demonstrating unique value in the film industry. However, the conceptual boundaries of "cinematic virtual digital humans" remain ambiguous, and their production technical workflows have yet to be systematically organized.

Methods: To address the film industry's application demands for virtual digital humans, this study clarifies the boundaries of "cinematic virtual digital humans" and delineates their essential concepts, while systematically organizing production workflows to better serve high-quality industrialized film production. This paper takes the virtual digital human character "Li Bai" produced in our research project as a case study to conduct in-depth research on the key technical stages of cinematic virtual digital human production.

Results / Conclusion: This paper defines the concept and characteristics of "cinematic virtual digital humans," breaks down the key production stages and key technologies for film-grade virtual digital humans in detail, and identifies existing problems in current film-grade virtual digital humans, providing technical references for future large-scale industrialized applications.

Full Text

Research and Application of Key Technical Processes in Movie Virtual Digital Human Production
China Film Science and Technology Institute (Publicity Department Film Technology Quality Testing Institute), Beijing 100086

Abstract

In recent years, with the digital evolution of high-tech formats such as mobile internet, artificial intelligence, and virtual reality, "virtual digital humans" have emerged as a novel technological paradigm and demonstrated unique value in the film industry. However, the conceptual boundaries of "movie virtual digital humans" remain ambiguous, and their production technical workflows have yet to be systematically organized. Addressing the application demands of the film industry for virtual digital humans, this paper aims to clarify the boundaries and essential concepts of movie virtual digital humans while systematically organizing their production processes to better serve high-quality industrialized film production. Using the virtual digital human character "Li Bai" produced for our research project as a case study, we conduct an in-depth investigation into the key technical processes of movie virtual digital human production. This paper defines the concept and characteristics of movie virtual digital humans, breaks down the critical production stages and key technologies for cinema-grade virtual digital humans, and identifies existing problems in current cinema-grade virtual digital human production, providing technical references for future large-scale industrialized applications.

Keywords: Movie Virtual Digital Human; CG Technology; Artificial Intelligence; Deep Learning; Rendering Technology
CLC Number: G202
Document Code: A
Article ID: 1671-0134(2025)02-141-06
DOI: 10.19483/j.cnki.11-4653/n.2025.02.028
Citation Format: Cheng Xiangyi. Research and Application of Key Technical Processes in Movie Virtual Digital Human Production [J]. China Media Technology, 2025, 32(2): 141-145, 154.

1. Concept and Characteristics of Movie Virtual Digital Humans

Virtual digital humans originated in the film and television domain and have been applied in mature form since the early 21st century in major productions such as The Lord of the Rings and Avatar [1]; in the film industry, they are digital characters created with CG (Computer Graphics) technology. Their characteristics can be summarized in two primary aspects. First, they are "hyper-realistic": special effects and artistic techniques produce human physical features that meet cinema-grade standards, typically for use as digital doubles in film productions. Virtual digital humans in films must possess appearances similar to real-world individuals, down to facial expressions, skin texture, and other details, while exhibiting movements, lighting, and visual effects fluid and natural enough to be visually indistinguishable from reality. In the 2015 film Furious 7, after actor Paul Walker passed away during production, the filmmakers used facial replacement technology to combine his face and hair from existing footage with another actor's body, creating a virtual character to complete the unfinished scenes.

[FIGURE:1] Furious 7 Paul Walker (Project Production)

Second, "interactive" authentic emotional experiences enabled through intelligent technology that can recognize user intentions and drive digital humans to initiate subsequent voice and action interactions, typically serving as digital avatars in film productions. Artistic works utilizing cinema-grade virtual digital humans can deliver authentic visual experiences and provide audiences with natural and realistic effects through highly anthropomorphic production quality, while offering creators greater creative space—representing a crucial criterion for virtual digital humans to replace real actors in film production [2]. In the 2019 film Alita: Battle Angel, the character Alita marked Weta Digital's first fully CG humanoid character created through facial capture. Her distinctive features of large eyes and a small mouth emphasized her origins as a manga character, representing a breakthrough in fully CG humanoid characters.

[FIGURE:2] Alita: Battle Angel Alita Character Facial Capture (Project Production)

[FIGURE:3] "Li Bai" Character Image (Project Production)

2. Production Process of Movie Virtual Digital Humans

Whether existing as "digital doubles" or "digital avatars," the basic production workflow for movie virtual digital humans can be divided into four stages: digital human modeling, model binding, driving/motion capture, and rendering.

Step 1: Digital Human Modeling. This process uses computer technology and related tools to digitize the shape, structure, and movement information of real human bodies, generating a virtual human model that can be simulated, emulated, and analyzed in a computer. Virtual digital human modeling can employ static reconstruction or high-visual-fidelity dynamic light field 3D reconstruction to construct the basic virtual human image, with emphasis on the detailed production or restoration of the character's appearance. Current modeling methods fall into three main categories: manual modeling, image acquisition modeling, and instrument acquisition modeling. Manual modeling, though widely applied, involves relatively long production cycles. Image acquisition modeling can recover 3D facial structure from a handful of photographs, but its precision is insufficient for high-quality models. Instrument acquisition modeling is the current focus of modeling technology development, achieving precision down to 0.1 mm, though at relatively high cost. Among instrument-based approaches, camera array scanning reconstruction has become the mainstream approach for character modeling.
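To make the geometric core of camera array reconstruction concrete, the following is a minimal sketch, not the production pipeline used for "Li Bai": given several calibrated views of the same surface point, the point's 3D position is recovered with the standard direct linear transform (DLT). The camera matrices and observations are hypothetical toy values.

```python
import numpy as np

def triangulate_point(proj_mats, pixels):
    """Triangulate one 3D point from N calibrated camera views (DLT).

    proj_mats: list of 3x4 projection matrices P_i = K_i [R_i | t_i]
    pixels:    list of (u, v) observations of the same point in each view
    """
    rows = []
    for P, (u, v) in zip(proj_mats, pixels):
        # Each view adds two linear constraints on the homogeneous point X.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution of A @ X = 0: last right singular vector of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize to (x, y, z)

# Two toy cameras observing the point (0.5, 0.2, 4.0).
P1 = np.hstack([np.eye(3), [[0.0], [0.0], [0.0]]])
P2 = np.hstack([np.eye(3), [[-1.0], [0.0], [0.0]]])
point = np.array([0.5, 0.2, 4.0, 1.0])
obs = [(P @ point)[:2] / (P @ point)[2] for P in (P1, P2)]
print(triangulate_point([P1, P2], obs))  # ~ [0.5  0.2  4. ]
```

A real camera array runs this jointly over millions of matched points after dense stereo matching; the sketch shows only the per-point geometry.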

The "Li Bai" virtual digital human character was created using a combination of traditional manual digital sculpting and real human light field acquisition, specifically by applying manual digital sculpting based on the muscle and bone structure of a real human face to precisely control every detail of the facial structure and obtain realistic human face texture maps.

Step 2: Model Binding. Digital human model binding connects the digital human's skeletal system and animation control system to external controllers or data sources so that the model can perform motion. Binding methods fall into four main types: manual binding, motion capture binding, physics simulation binding, and morphological binding. Manual binding connects the skeleton and animation controls to hand-operated controllers and requires meticulous adjustment and optimization to achieve realistic, natural movement. Motion capture binding connects them to motion capture equipment and typically requires post-processing to correct noise and errors in the captured data. Physics simulation binding connects them to a physics engine and requires appropriate physical parameters and constraints so that the digital human's movements and physical effects meet requirements. Morphological binding connects them to morphological data and requires complex deformation algorithms and optimization to achieve changes in facial expressions, muscles, and skin.
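Most of these binding schemes ultimately deform the mesh through linear blend skinning, in which each vertex follows a weighted blend of the bones it is bound to. A minimal numpy sketch of that deformation step follows, with illustrative array shapes; it is not the binding system used in this project.

```python
import numpy as np

def linear_blend_skinning(rest_verts, weights, bone_mats):
    """Deform a mesh with linear blend skinning (LBS).

    rest_verts: (V, 3) vertex positions in the rest pose
    weights:    (V, B) per-vertex bone weights, each row summing to 1
    bone_mats:  (B, 4, 4) transforms mapping rest pose to current pose
    """
    V = rest_verts.shape[0]
    rest_h = np.concatenate([rest_verts, np.ones((V, 1))], axis=1)  # homogeneous
    per_bone = np.einsum('bij,vj->bvi', bone_mats, rest_h)  # every bone moves every vertex
    blended = np.einsum('vb,bvi->vi', weights, per_bone)    # weight and sum per vertex
    return blended[:, :3]

# Sanity check: identity bone transforms leave the mesh unchanged.
verts = np.random.rand(5, 3)
w = np.full((5, 2), 0.5)                 # two bones, equal weights
mats = np.stack([np.eye(4), np.eye(4)])
assert np.allclose(linear_blend_skinning(verts, w, mats), verts)
```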

The virtual digital human "Li Bai" utilizes the industry-advanced MetaHuman facial binding system, employing complex expression control panels, dynamic normal maps, and hundreds of expression blend shapes to present delicate, realistic, controllable, and rich expression variations.

[FIGURE:4] "Li Bai" Expression Binding (Project Production)

Step 3: Driving/Motion Capture. The primary method of 3D virtual human motion generation is high-precision motion capture combined with skeletal binding of the 3D model: motion capture equipment, or specialized cameras with image recognition, captures variations in body shape, expression, eye gaze, gesture, and joint movement via motion capture sensors. The basic actions of virtual digital humans are thus determined either by live-action motion capture (human-driven) or by trained driving models (algorithm-driven). Current motion capture can be implemented through optical, inertial, electromagnetic, or computer vision-based methods.

Currently, optical motion capture generally relies on marker points: infrared-reflective markers attached to the actor are tracked by optical sensor cameras, so the placement of the markers and their reflective sensitivity determine capture precision. Inertial motion capture relies primarily on Inertial Measurement Units (IMUs): IMUs integrating accelerometers, gyroscopes, and magnetometers are strapped to specific skeletal nodes on the body, and their measurements are processed by algorithms to recover the motion. Computer vision-based motion capture recovers movement by collecting and computing depth information, and has become a frequently used solution due to its simplicity, ease of use, and low cost [3].
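The core computation of inertial capture is orientation integration: each IMU's gyroscope rate is integrated into an orientation estimate, which the skeleton solver then consumes. A minimal sketch of one integration step follows, assuming a unit quaternion state; real systems additionally fuse the accelerometer and magnetometer to correct the drift this naive integration accumulates.

```python
import numpy as np

def integrate_gyro(q, omega, dt):
    """One step of gyro integration: q_dot = 0.5 * q x (0, omega).

    q:     current orientation as a unit quaternion (w, x, y, z)
    omega: gyroscope reading in rad/s, sensor frame, shape (3,)
    dt:    sample interval in seconds
    """
    w, x, y, z = q
    ox, oy, oz = omega
    q_dot = 0.5 * np.array([
        -x * ox - y * oy - z * oz,
         w * ox + y * oz - z * oy,
         w * oy - x * oz + z * ox,
         w * oz + x * oy - y * ox,
    ])
    q_new = q + q_dot * dt
    return q_new / np.linalg.norm(q_new)  # renormalize to stay a rotation

# One second of a 90 deg/s spin about z, sampled at 100 Hz.
q = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(100):
    q = integrate_gyro(q, np.array([0.0, 0.0, np.pi / 2]), 0.01)
print(q)  # ~ (0.707, 0, 0, 0.707): a 90 degree rotation about z
```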

"Li Bai" uses Optitrack optical motion capture, attaching markers to various joints of the actor to bind the actor's movements with the skeletal model.

[FIGURE:5] "Li Bai" Character Motion Capture and Binding (Project Production)

Step 4: Rendering. This stage composites scenes, characters, special effects, and voice in post-production, then renders the result into finished footage. Based on the presentation effects and the elements each scene requires, components and models are rendered to achieve optimal visual quality [4]. Rendering technology enhances the realism of virtual humans, and real-time interaction additionally requires real-time rendering, which determines the final quality and style of the work. Rendering divides into offline rendering (pre-rendering) and real-time rendering; the essential difference is whether immediate interaction is possible. Offline rendering is used primarily in film and television animation, where high authenticity and detail demand substantial computational resources. Real-time rendering emphasizes interactivity and immediacy, suiting scenarios with frequent user interaction such as games, virtual customer service, and virtual anchors. Advances in graphics hardware and the pre-computation of available information have improved real-time rendering performance, but quality remains constrained by rendering time and computational resources. Further gains in hardware computing power and algorithmic capability will bring substantial improvements in rendering speed, effects, and image realism, particularly in real-time rendering, making digital virtual images viable substitutes for real people and achieving realism indistinguishable from actual footage [5].
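Both rendering modes approximate the same underlying physical model, the rendering equation; offline renderers spend many samples per pixel integrating it accurately, while real-time renderers approximate it within a per-frame time budget:

```latex
% Rendering equation (Kajiya, 1986): outgoing radiance at surface point x in
% direction \omega_o is emitted radiance plus all reflected incoming radiance.
L_o(x,\omega_o) = L_e(x,\omega_o)
  + \int_{\Omega} f_r(x,\omega_i,\omega_o)\, L_i(x,\omega_i)\,
    (\omega_i \cdot n)\, \mathrm{d}\omega_i
```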

"Li Bai" plays offline animation data and renders in Unreal Engine to complete high-quality video production.

[FIGURE:6] "Li Bai" Offline Rendering (Project Production)

3. Key Technologies in Movie Virtual Digital Human Production

Grounded in the fundamental definition of movie virtual digital humans and supported by a technology stack including image recognition, visual technology, 3D modeling, CG rendering, motion capture, artificial intelligence, computer voice technology, and natural language processing, movie virtual digital human production spans many fields of hardware equipment and software algorithms. Its foundation is "high fidelity at the visual level plus real-time rendering," shaped chiefly by key technologies such as CG modeling/image transfer technology, speech synthesis (TTS), natural language processing (NLP), speech recognition (ASR), and CV deep learning models.

3.1 CG Modeling/Image Transfer Technology Affects Presentation: Embodied in Anthropomorphic Appearance of Virtual Digital Humans

Movie virtual digital humans exhibit high anthropomorphism, particularly in appearance, visual effects, behavior, and interactive capability; external presentation and interaction effects have become critical development paths. A virtual digital human's appearance comprises its facial features and overall image, generally shaped by factors such as its category (e.g., direct use of real human imagery, high-fidelity modeling, stylization), production detail (modeling of fine hair strands, skin, and similar details), rendering level, and design aesthetics. Its behavior covers facial expressions, body movements, and voice, shaped by the driving method (human-driven, computation-driven, pre-adjusted, etc.), driving model category (fine facial muscle driving, handling of interjections and prosody in speech synthesis models, etc.), training data, and driving model precision.

Alita: Battle Angel used a camera capture system in place of the specialized Light Stage and Medusa rig technologies to create realistic human face models. The production team deployed 60 cameras covering a 180° arc around the actor to capture 4D data. For key dramatic performances, the actors re-performed in this system, providing reference and supplementary data for the deep learning process.

[FIGURE:7] Alita: Battle Angel Camera Capture (Project Production)

3.2 Speech Synthesis TTS/Speech Recognition ASR Technology Affects Language Processing: Embodied in External Language Conversion

ASR (Automatic Speech Recognition) technology converts human speech into text. Speech recognition is a multidisciplinary field closely connected with acoustics, phonetics, linguistics, digital signal processing, information theory, computer science, and numerous other disciplines [6]. TTS (Text-to-Speech) technology belongs to speech synthesis, transforming text into speech output through intelligent computer synthesis, with output computed digitally from recorded speech databases. Currently, iFLYTEK has conducted in-depth research in speech synthesis, speech conversion, and speech translation, and has brought mature speech large-model products to market. TTS primarily solves the conversion of text into audible speech, enabling machines to speak like humans.

Movie virtual digital humans use ASR to extract features from external sound signals, improving speech recognition and enabling voice interaction devices to distinguish target sounds, ultimately achieving voice interaction. TTS helps virtual digital humans output natural, fluent speech, making them more natural and vivid during interaction (see the sketch following Figure 8).

[FIGURE:8] External Language Conversion (Project Production)
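Taken together with the dialogue capability discussed in the next subsection, the interaction pipeline is a loop: speech in, text understanding, speech out. A minimal sketch with stub stand-ins follows; recognize, understand, and synthesize are hypothetical placeholders, not any particular vendor's API.

```python
def recognize(audio_in: bytes) -> str:
    """ASR stand-in: a real system runs acoustic and language models here."""
    return "who are you"

def understand(text: str) -> str:
    """NLP stand-in: intent recognition and response generation."""
    return "I am Li Bai." if "who" in text else "Could you repeat that?"

def synthesize(reply: str) -> bytes:
    """TTS stand-in: a real system returns synthesized audio samples."""
    return reply.encode("utf-8")

def interaction_turn(audio_in: bytes) -> bytes:
    text = recognize(audio_in)    # ASR: speech -> text
    reply = understand(text)      # NLP: intent -> response text
    return synthesize(reply)      # TTS: response text -> speech
```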

3.3 NLP Interaction Technology Affects Interactive Experience: Centered on Dialogue Capability

NLP interaction technology, following on from text dialogue assistants and voice AI assistants, continues to play a core role in digital virtual humans and can be regarded as their brain. Among AI interactive assistants, companies such as Xiaoice have achieved strong results, endowing digital humans with good general interactive capability [7]. Companies like Zhuiyi Technology enhance the business interaction capabilities of digital virtual humans through knowledge graphs, business Q&A databases, and dialogue engineering engines.

Through NLP interaction technology, movie virtual digital humans can communicate across different languages and implement intelligent Q&A. They can perform sentiment analysis on input text to understand the emotional information it carries, such as positive, negative, or neutral wording, and provide language feedback accordingly.
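As a toy illustration of the positive/negative/neutral classification just described, a lexicon-based sketch follows; production systems use trained sentiment models rather than word lists.

```python
# Hypothetical mini-lexicons; real systems learn sentiment from data.
POSITIVE = {"great", "wonderful", "love", "beautiful"}
NEGATIVE = {"bad", "terrible", "hate", "ugly"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this beautiful scene"))  # -> positive
```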

3.4 CV Deep Learning Models Affect Driving Effects: Whether Natural Facial Expressions and Body Movements Can Be Presented

Whether natural facial expression changes and body movements can be presented depends largely on the effectiveness of the speech-driven deep models, and more fundamentally on factors such as data volume, the computational framework, and the key feature points used. Whether emotions and similar factors receive dedicated design also has an important impact.

Movie virtual digital humans can use this technology to import actors' facial expression information as training data: through deep learning, the computer learns how markers placed on different facial muscles move. The trained model decomposes an actor's facial expressions, activates the corresponding facial muscles, and drives the facial model to generate expressions autonomously, an approach well suited to reconstructing digital human expressions.
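As an illustration of what a speech-driven facial model can look like, here is a minimal PyTorch sketch mapping a window of audio features to blend shape weights. The layer sizes and the 52-shape output (an ARKit-style count) are illustrative assumptions, not the project's actual model.

```python
import torch
import torch.nn as nn

class SpeechToBlendshapes(nn.Module):
    """Map a window of mel-spectrogram frames to blend shape activations."""

    def __init__(self, n_mels: int = 80, window: int = 16, n_shapes: int = 52):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                    # (B, window, n_mels) -> (B, window*n_mels)
            nn.Linear(window * n_mels, 256),
            nn.ReLU(),
            nn.Linear(256, n_shapes),
            nn.Sigmoid(),                    # blend shape weights in [0, 1]
        )

    def forward(self, mel_window: torch.Tensor) -> torch.Tensor:
        return self.net(mel_window)

# Training would pair captured audio with tracked marker/expression data;
# here we only check tensor shapes with random input.
model = SpeechToBlendshapes()
weights = model(torch.randn(4, 16, 80))      # a batch of 4 audio windows
print(weights.shape)                         # torch.Size([4, 52])
```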

4. Challenges in Movie Virtual Digital Human Production

Among the highest-grossing films worldwide, many have passed $1 billion at the box office, and titles such as Titanic, Avatar, and Avengers: Infinity War have exceeded $2 billion, with Furious 7 grossing over $1.5 billion; all of these films made extensive use of virtual digital human technology. While movie virtual digital humans have brought unparalleled visual impact to audiences, the technical and capital thresholds behind such films also constrain their large-scale application and promotion. With the advent of blockchain technology, metaverse concepts, and LED virtual production, movie virtual digital humans have been pushed back into the spotlight. The deep integration of new technologies, art, and commerce has brought new economic growth and visual impact to the film industry, but it also faces new problems and challenges.

4.1 Lack of Unified Authoritative Definition and Industry Standards

The virtual digital human industry is in its early stage of development, with production, technology, operation, and application companies and capital of all kinds flooding into the market. Significant differences exist among enterprises in technology and product quality, however, and relatively unified standards for service production and evaluation are lacking [8], hindering the rapid, high-quality development of the industry. Traditional film and television digital character production companies keep their technologies relatively closed, seeking to secure their advantages through technical barriers and professional experience, which in turn slows the overall technological development of virtual digital humans. Establishing a standard system for virtual digital human technology, products, and evaluation will effectively promote healthy industry development [9].

Regarding international standards, the IEEE Draft Standard for Digital Human Quality Assessment was officially promulgated on March 21, 2023; it provides a digital human quality assessment framework for evaluating the authenticity of digital humans used in immersive content services. The Cloud Computing and Big Data Research Institute of the China Academy of Information and Communications Technology (CAICT) was the first in the world to propose evaluation standards for digital humans, officially releasing two standards, Basic Framework and Evaluation Indicators for Digital Human Application Systems and Requirements and Evaluation Methods for Non-Interactive 2D Real Human Image Digital Human Application Systems, on July 29, 2022. These standards focus on digital human application systems, first clarifying the definition of "digital human" and then proposing a reference framework for digital human application systems.

Regarding domestic standards, searches of the "National Standard Information Public Service Platform" show that no formally issued national or industry standards for "virtual digital humans" currently exist, though relevant technical specifications and group standards are being actively advanced. CAICT, together with industry partners, has researched and formulated Basic Capability Requirements and Evaluation Methods for Digital Human Systems and has initiated technical specifications and industry standards such as Technical Requirements for Trusted Virtual Human Generated Content Management Systems. The Shenzhen Artificial Intelligence Industry Association, the Zhongguancun Internet of Things Industry Alliance Standards Committee, and others have led the release of group standards including Technical Specifications for Virtual Digital Humans Supporting Voice and Visual Interaction and Metaverse Virtual Digital Human Full-Process Developer.

According to this survey of standards, no industry standards or definitions for movie virtual digital humans have yet been proposed; authoritative, unified definitions and specifications are still needed.

4.2 Technical Barriers and High Costs

From a technical perspective, the rise of digital virtual humans has opened channels for integrating the real and virtual worlds, but truly intelligent and anthropomorphic digital virtual humans will need further advances in the era of massive computing power. To achieve fully industrialized film production content and genuinely intelligent virtual humans, a gap remains in both technical implementation and computing power support, requiring deeper research and breakthroughs [10].

On one hand, although movie virtual digital human technology is developing toward intelligence, the nodes of the production industry chain remain fragmented, with little collaborative work or complementary specialization. This creates technical barriers in the production and adjustment stages, and service scenarios and performance scenarios are not effectively connected: performance-oriented digital humans lack the business capabilities clients require, while service-oriented digital humans lack character design and struggle to exchange emotion with users. Market surveys indicate that most companies in the industry handle only one or a few segments of the full digital human production and operation process [11].

On the other hand, the cost of meeting demands for high mobility and high frequency remains very high: movie virtual digital human production involves long cycles, high costs, and insufficient capacity. Most virtual digital human production still uses the traditional method of 3D modeling plus motion capture. Although this has yielded finely crafted, proprietary cinema-grade production workflows that can deliver exquisite, delicate virtual humans meeting high aesthetic and technical standards, the cost remains very high, preventing mass production and failing to meet the demands of today's diversified film production scenarios and applications.

4.3 Talent Shortage Brings Homogenized Competition

Currently, technical talent in the film industry is relatively scarce, and films correspondingly lack technology-driven imagination. Many virtual digital humans are undifferentiated: most virtual digital human companies build applications on publicly available technologies such as UE5, very few possess decisive technical advantages, and most can make only superficial changes in content, operations, and creativity [12]. Owing to the shortage of professional film and television technical personnel, specialized subfields such as rendering, color grading and other post-production visual effects, scene art, and lighting still lack professional staff. Many companies subcontract and outsource production, lack experience and awareness in cultivating talent pipelines, and consequently suffer from an insufficient reserve of professional film and television talent and weak motivation in production.

Conclusion and Outlook

With the development of new technologies such as AIGC and real-time rendering, movie virtual digital humans, as an entirely new technological form, are currently developing toward "hyper-realistic, interactive, and intelligent" directions. The acquisition and production processes of movie virtual digital humans will be gradually simplified, production cycles may be significantly shortened, and production costs will be gradually reduced. Particularly, the continuous improvement of vertical domain AI large models and the advancement of cross-modal research capabilities will significantly enhance the perception capabilities of movie virtual digital humans, enabling more realistic and delicate expressions across different scenes, emotions, and dialogues in film production. Additionally, with the continuous increase in training data, movie virtual digital humans may possess perception and decision-making capabilities in the future, enabling film production to develop toward greater industrialization. In the future, virtual digital humans could become our avatars in the virtual digital world, interacting in the digital realm and bringing us more intelligent and immersive experiences. Facing the challenges and problems in movie virtual digital human production, we hope that future technology fusion research and applications will bring new business forms to different industries and fields, collectively providing technical support for high-quality film development.

References

[2] Gu Xiaoqing, Wan Ping, Wang Gong. Educational Metaverse: Making Every Learner the Protagonist [J]. Journal of East China Normal University (Educational Science Edition), 2023(11).
[3] Luo Dao. AI Becomes Real: Virtual Digital Humans Enter Life [EB/OL]. Computer News, (2023-05-08) [2024-10-26]. https://www.163.com/dy/article/I47C83CE05562GMS.html
[4] Qin Beibei, Liu Weidong, Shi Liang. Analysis of Key Links in Metaverse Industry Chain and Audio-Visual Application Scenarios [J]. Radio and Television Network, 2022(9): 118-120.
[5] Fan Wen, Sun Hongyue, Lu Lin. Integration of CG and Modern Art [J]. Heilongjiang Science and Technology Information, 2012(15): 63.
[6] Xu Kaiwei, Peng Fei. Research on Language Education System Based on ASR and TTS [J]. Agriculture Network Information, 2006(6): 132-133.
[7] Xiang Xu. Cross-Modal Person Re-Identification Based on Modal Alignment [D]. Hefei: Anhui University, 2021.
[8] Peng Jiayao. Research on Virtual Dynamic Sculpture Based on 5G Technology [D]. Jingdezhen: Jingdezhen Ceramic University, 2021.
[9] Wang Jia. Research on Application of Virtual Digital Human Technology in Media Field [J]. Modern Television Technology, 2023(4): 102-105.
[10] Chen Yuxuan, Ma Xiaocheng. Digital Virtual Humans Frequently Become Popular, To What Extent Can They Replace Real People? [EB/OL]. Xinhua Daily Telegraph, (2020-01-12) [2024-11-30]. https://baijiahao.baidu.com/s?id=1721733245237381970&wfr=spider&for=pc
[11] Shen Hao, Yuan Lu. Artificial Intelligence: Meeting "New Film" [J]. Modern Film Technology, 2019(8): 31-34.
[12] Fang Jiexin. Research on Film Digital Distribution Framework Technology and Content Delivery Standardization [J]. Modern Film Technology, 2021(7): 13-17.

Author Biography: Cheng Xiangyi (1994—), female, from Yuncheng, Shanxi, holds a master's degree and is an engineer. Her research focuses on digital film technology and film standardization research.
(Responsible Editor: Li Jing)
