Scalable Real-Time AI 3D Lip Sync

A scalable real-time facial animation pipeline integrating AI-driven audio processing with Unreal Engine MetaHumans.
Project Year
2026
Software/Frameworks Used
Unreal Engine (MetaHumans, Pixel Streaming), NVIDIA Audio2Face, Python, WebSockets, Linux (containerized environment), TTS APIs (Cartesia/Minimax), Docker, Cloud GPU Service (GKE/Modal)

Core Pipeline

  • Architected an end-to-end real-time pipeline for generating facial animation from streaming audio inputs
  • Integrated multiple TTS providers (Cartesia, Minimax) with dynamic switching and voice configuration
  • Implemented audio chunk streaming and real-time processing for low-latency lip-sync generation
  • Built WebSocket-based communication layer between TTS services and Unreal Engine
  • Integrated NVIDIA Audio2Face for real-time facial solving and animation driving
  • Developed Pixel Streaming setup for remote avatar interaction via WebRTC
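The low-latency lip sync above depends on streaming audio in small fixed-duration chunks rather than whole utterances, so the facial solver can start as soon as the first chunk arrives. A minimal sketch of that chunking step; the function name, sample rate, and chunk duration are illustrative, not the production values.

```python
def chunk_pcm(pcm: bytes, sample_rate: int = 24000, sample_width: int = 2,
              chunk_ms: int = 40):
    """Split raw mono PCM audio into fixed-duration chunks for streaming.

    Small chunks (e.g. 40 ms) keep end-to-end lip-sync latency low:
    downstream stages can begin processing before TTS synthesis finishes.
    """
    bytes_per_chunk = sample_rate * sample_width * chunk_ms // 1000
    for i in range(0, len(pcm), bytes_per_chunk):
        yield pcm[i:i + bytes_per_chunk]
```

In the real pipeline each chunk would be forwarded over the WebSocket layer as it is produced; here the generator simply yields the byte slices.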

Scalability & Infrastructure

  • Containerized Unreal Engine and supporting services using Docker
  • Deployed pipeline on cloud infrastructure (Modal, GKE)
  • Designed system to handle multiple avatar instances and concurrent sessions
  • Implemented session handling and avatar registry system for dynamic avatar loading
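The avatar registry mentioned above can be pictured as a small mapping from avatar IDs to loadable assets, so each new session resolves its avatar at runtime instead of relying on a hardcoded setup. A minimal sketch under assumed names; the asset-path format is a placeholder, not the actual Unreal paths.

```python
from dataclasses import dataclass, field

@dataclass
class AvatarRegistry:
    """Maps avatar IDs to asset paths for dynamic loading.

    Illustrative sketch: real entries would point at MetaHuman assets
    inside the Unreal project.
    """
    _avatars: dict = field(default_factory=dict)

    def register(self, avatar_id: str, asset_path: str) -> None:
        self._avatars[avatar_id] = asset_path

    def resolve(self, avatar_id: str) -> str:
        # Fail loudly on unknown IDs so a bad session request
        # surfaces immediately instead of loading a default avatar.
        if avatar_id not in self._avatars:
            raise KeyError(f"unknown avatar: {avatar_id}")
        return self._avatars[avatar_id]
```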

Performance Optimization

  • Achieved ~60 FPS streaming at 720p using L40S GPUs
  • Reduced instance cold-start time to ~5–6 seconds
  • Optimized audio resampling pipeline to eliminate cracking and stuttering issues
  • Improved runtime performance by addressing FPS drops during speech
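Crackling and stuttering in a streaming audio path often trace back to sample-rate mismatches between the TTS output and what the solver expects. The sketch below shows the general shape of a linear-interpolation resampling step, assuming float samples; it illustrates the idea, not the optimized implementation used in the pipeline.

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample mono audio via linear interpolation between neighbors.

    A rate mismatch (e.g. 24 kHz TTS output fed to a stage expecting
    16 kHz) produces audible artifacts; resampling each chunk to the
    expected rate removes them.
    """
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```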

Pipeline Evolution & Architecture Improvements

  • Transitioned from a signalling-server-based architecture to direct WebSocket communication
  • Built relay system for efficient data flow between frontend and Unreal instances
  • Implemented dynamic avatar loading (removing hardcoded setups)
  • Enabled real-time parameter control (voice ID, TTS provider, background, avatar selection)
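Real-time parameter control amounts to merging small JSON control messages from the frontend into the session state. A minimal sketch of that merge; the field names and wire schema here are assumptions for illustration, not the actual protocol.

```python
import json
from dataclasses import dataclass

@dataclass
class SessionParams:
    """Runtime-tunable session parameters (illustrative field names)."""
    voice_id: str = "default"
    tts_provider: str = "cartesia"   # or "minimax"
    background: str = "#000000"      # color or image reference
    avatar: str = "metahuman_a"

def apply_control_message(params: SessionParams, raw: str) -> SessionParams:
    """Merge a JSON control message into the session, ignoring unknown keys.

    Ignoring unrecognized keys lets the frontend and Unreal side evolve
    independently without breaking older sessions.
    """
    update = json.loads(raw)
    for key, value in update.items():
        if hasattr(params, key):
            setattr(params, key, value)
    return params
```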

Avatar System Enhancements

  • Supported multiple avatars (full-body and half-body variants)
  • Added dynamic background customization (color & image-based)
  • Integrated control rigs for facial expressions, blinks, and realism improvements

Monitoring & Debugging

  • Implemented latency reporting across pipeline stages (TTS, streaming, inference)
  • Identified and resolved issues related to:
    • WebRTC connectivity (ICE candidates)
    • Session disconnections
    • Audio pipeline inconsistencies
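Per-stage latency reporting of the kind described above can be built from a small timing context manager that records each stage's wall-clock duration into a shared report. A minimal sketch; the stage names ("tts", "streaming", "inference") mirror the stages listed above, and the helper name is illustrative.

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(report: dict, stage: str):
    """Record the wall-clock duration of one pipeline stage into `report`.

    Wrapping each stage (TTS request, chunk streaming, inference) in this
    timer yields a per-session latency breakdown for debugging.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        report[stage] = time.perf_counter() - start
```

Usage: `with stage_timer(report, "tts"): audio = synthesize(text)` populates `report["tts"]` with the elapsed seconds, even if the stage raises.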

Link to GitHub