Scalable Real-Time AI 3D Lip Sync

A scalable real-time facial animation pipeline integrating AI-driven audio processing with Unreal Engine MetaHumans.
Project Year
2026
Software/Frameworks Used
Unreal Engine (MetaHumans, Pixel Streaming), NVIDIA Audio2Face, Python, WebSockets, Linux (containerized environment), TTS APIs (Cartesia/Minimax), Docker, Cloud GPU Service (GKE/Modal)

Core Pipeline

  • Architected an end-to-end real-time pipeline for generating facial animation from streaming audio inputs
  • Integrated multiple TTS providers (Cartesia, Minimax) with dynamic switching and voice configuration
  • Implemented audio chunk streaming and real-time processing for low-latency lip-sync generation
  • Built WebSocket-based communication layer between TTS services and Unreal Engine
  • Integrated NVIDIA Audio2Face for real-time facial solving and animation driving
  • Developed Pixel Streaming setup for remote avatar interaction via WebRTC
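The low-latency lip sync above depends on streaming audio in small fixed-duration chunks rather than whole utterances, so the facial solver can start as soon as the first chunk arrives. A minimal sketch of that chunking step; the function name, sample rate, and chunk duration are illustrative, not the production values.

```python
def chunk_pcm(pcm: bytes, sample_rate: int = 24000, sample_width: int = 2,
              chunk_ms: int = 40):
    """Split raw mono PCM audio into fixed-duration chunks for streaming.

    Small chunks (e.g. 40 ms) keep end-to-end lip-sync latency low:
    downstream stages can begin processing before TTS synthesis finishes.
    """
    bytes_per_chunk = sample_rate * sample_width * chunk_ms // 1000
    for i in range(0, len(pcm), bytes_per_chunk):
        yield pcm[i:i + bytes_per_chunk]
```

In the real pipeline each chunk would be forwarded over the WebSocket layer as it is produced; here the generator simply yields the byte slices.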

Scalability & Infrastructure

  • Containerized Unreal Engine and supporting services using Docker
  • Deployed pipeline on cloud infrastructure (Modal, GKE)
  • Designed system to handle multiple avatar instances and concurrent sessions
  • Implemented session handling and avatar registry system for dynamic avatar loading
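The avatar registry mentioned above can be pictured as a small mapping from avatar IDs to loadable assets, so each new session resolves its avatar at runtime instead of relying on a hardcoded setup. A minimal sketch under assumed names; the asset-path format is a placeholder, not the actual Unreal paths.

```python
from dataclasses import dataclass, field

@dataclass
class AvatarRegistry:
    """Maps avatar IDs to asset paths for dynamic loading.

    Illustrative sketch: real entries would point at MetaHuman assets
    inside the Unreal project.
    """
    _avatars: dict = field(default_factory=dict)

    def register(self, avatar_id: str, asset_path: str) -> None:
        self._avatars[avatar_id] = asset_path

    def resolve(self, avatar_id: str) -> str:
        # Fail loudly on unknown IDs so a bad session request
        # surfaces immediately instead of loading a default avatar.
        if avatar_id not in self._avatars:
            raise KeyError(f"unknown avatar: {avatar_id}")
        return self._avatars[avatar_id]
```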

Performance Optimization

  • Achieved ~60 FPS streaming at 720p using L40S GPUs
  • Reduced instance cold-start time to ~5–6 seconds
  • Optimized audio resampling pipeline to eliminate cracking and stuttering issues
  • Improved runtime performance by addressing FPS drops during speech
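Crackling and stuttering in a streaming audio path often trace back to sample-rate mismatches between the TTS output and what the solver expects. The sketch below shows the general shape of a linear-interpolation resampling step, assuming float samples; it illustrates the idea, not the optimized implementation used in the pipeline.

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample mono audio via linear interpolation between neighbors.

    A rate mismatch (e.g. 24 kHz TTS output fed to a stage expecting
    16 kHz) produces audible artifacts; resampling each chunk to the
    expected rate removes them.
    """
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```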

Pipeline Evolution & Architecture Improvements

  • Transitioned from a signalling-server-based architecture to direct WebSocket communication
  • Built relay system for efficient data flow between frontend and Unreal instances
  • Implemented dynamic avatar loading (removing hardcoded setups)
  • Enabled real-time parameter control (voice ID, TTS provider, background, avatar selection)
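Real-time parameter control amounts to merging small JSON control messages from the frontend into the session state. A minimal sketch of that merge; the field names and wire schema here are assumptions for illustration, not the actual protocol.

```python
import json
from dataclasses import dataclass

@dataclass
class SessionParams:
    """Runtime-tunable session parameters (illustrative field names)."""
    voice_id: str = "default"
    tts_provider: str = "cartesia"   # or "minimax"
    background: str = "#000000"      # color or image reference
    avatar: str = "metahuman_a"

def apply_control_message(params: SessionParams, raw: str) -> SessionParams:
    """Merge a JSON control message into the session, ignoring unknown keys.

    Ignoring unrecognized keys lets the frontend and Unreal side evolve
    independently without breaking older sessions.
    """
    update = json.loads(raw)
    for key, value in update.items():
        if hasattr(params, key):
            setattr(params, key, value)
    return params
```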

Avatar System Enhancements

  • Supported multiple avatars (full-body and half-body variants)
  • Added dynamic background customization (color & image-based)
  • Integrated control rigs for facial expressions, blinks, and realism improvements

Monitoring & Debugging

  • Implemented latency reporting across pipeline stages (TTS, streaming, inference)
  • Identified and resolved issues related to:
    • WebRTC connectivity (ICE candidates)
    • Session disconnections
    • Audio pipeline inconsistencies
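Per-stage latency reporting of the kind described above can be built from a small timing context manager that records each stage's wall-clock duration into a shared report. A minimal sketch; the stage names ("tts", "streaming", "inference") mirror the stages listed above, and the helper name is illustrative.

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(report: dict, stage: str):
    """Record the wall-clock duration of one pipeline stage into `report`.

    Wrapping each stage (TTS request, chunk streaming, inference) in this
    timer yields a per-session latency breakdown for debugging.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        report[stage] = time.perf_counter() - start
```

Usage: `with stage_timer(report, "tts"): audio = synthesize(text)` populates `report["tts"]` with the elapsed seconds, even if the stage raises.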

Link to GitHub