OmniAvatar

Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation

Qijun Gan¹, Ruizi Yang¹, Jianke Zhu¹, Shaofei Xue², Steven Hoi²
¹Zhejiang University, ²Alibaba Group

Significant progress has been made in audio-driven human animation, but most existing methods focus mainly on facial movements, which limits their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an audio-driven full-body video generation model that improves lip-sync accuracy and produces natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the foundation model's capability for prompt-driven control while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing.
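As a rough illustration of the LoRA idea mentioned above, the sketch below shows a generic low-rank adapter on a single linear layer: the pretrained weight stays frozen and only two small matrices are trained. This is not the OmniAvatar implementation; the class name `LoRALinear`, the rank and scaling values, and the layer sizes are all assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, d_in, d_out, rank=4, alpha=8.0, seed=0):
        rng = random.Random(seed)
        # Frozen pretrained weight (d_out x d_in); random here as a stand-in.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable down-projection (rank x d_in), small random init.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # Trainable up-projection (d_out x rank), zero init so training starts from W.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def effective_weight(self):
        """Return W + scale * (B @ A) without modifying W."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.W, delta)]

    def forward(self, x):
        """Apply the adapted layer to a vector x of length d_in."""
        column = [[v] for v in x]
        return [row[0] for row in matmul(self.effective_weight(), column)]

    def trainable_params(self):
        """Count only the adapter parameters (A and B)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear(d_in=16, d_out=16, rank=4)
output = layer.forward([1.0] * 16)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen layer, which is one reason this style of adapter tends to preserve a base model's existing behavior (here, prompt-driven control) at the start of training.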

Generated Videos

OmniAvatar can generate lifelike speaking-avatar videos in which the characters' actions and expressions are natural and rich, with lip movements closely synchronized to the audio. OmniAvatar also supports controlling movement amplitude through prompts.

Human-object Interaction

Avatars generated by OmniAvatar can interact with objects while speaking, significantly broadening the application scenarios for audio-driven digital avatars.

Background Control

OmniAvatar can control the background through prompts, adapting to a variety of scenes.

Emotion Control

OmniAvatar can control the avatar's emotions through prompts, such as happy, angry, surprised, and sad.

Podcast

Singing

Architecture Overview

Full Body

BibTeX


@misc{gan2025omniavatar,
      title={OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation}, 
      author={Qijun Gan and Ruizi Yang and Jianke Zhu and Shaofei Xue and Steven Hoi},
      year={2025},
      eprint={2506.18866},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.18866}, 
}