Building a Local Video Avatar Generator Using Ollama and Open-Source Tools

Creating a video avatar generator that runs completely locally, without cloud services or API keys, is challenging but possible. Here's a step-by-step guide to building a system that generates talking video avatars using locally run models.

Prerequisites

  • A Linux machine (or macOS, or Windows via WSL) with Python 3 and the venv module
  • Enough free RAM to run the chosen LLM locally (llama3 typically needs around 8 GB)
  • A CUDA-capable GPU (strongly recommended for Wav2Lip, though not strictly required)
  • Several gigabytes of free disk space for the model files

Step 1: Set Up Ollama for Local LLM

Ollama allows you to run large language models locally for text generation.

  1. Install Ollama:
    # For macOS/Linux
    curl -fsSL https://ollama.com/install.sh | sh
       
    # For Windows (via WSL)
    # First install WSL, then run the Linux command above
    
  2. Pull a suitable model (Llama3 recommended for better performance):
    ollama pull llama3
    
  3. Test your Ollama installation:
    ollama run llama3 "Write a short 30-second script about climate change"
    
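The pipeline later in this guide shells out to the ollama CLI, but Ollama also exposes a local HTTP API (on port 11434 by default). If you prefer to call it from Python directly, here is a minimal sketch using only the standard library; the ollama_generate helper name is just for illustration:

    #!/usr/bin/env python3
    import json
    import urllib.request

    def ollama_generate(prompt, model="llama3"):
        """Send a single non-streaming generation request to the local Ollama server."""
        payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    if __name__ == "__main__":
        print(ollama_generate("Write a short 30-second script about climate change"))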

Step 2: Install Local Text-to-Speech Engine

We’ll use Piper, a fast local TTS system:

  1. Install the system dependencies (Debian/Ubuntu shown; adapt for other distributions):
    sudo apt-get update
    sudo apt-get install -y build-essential python3-pip python3-venv
    
  2. Set up a Python virtual environment:
    python3 -m venv ~/venv-tts
    source ~/venv-tts/bin/activate
    
  3. Install Piper:
    pip install piper-tts
    
  4. Download a voice model:
    mkdir -p ~/.local/share/piper-tts/voices
    cd ~/.local/share/piper-tts/voices
    wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
    wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
    
  5. Test Piper:
    echo "This is a test of the text to speech system." | piper \
      --model ~/.local/share/piper-tts/voices/en_US-lessac-medium.onnx \
      --output-raw | aplay -r 22050 -f S16_LE -c 1
    
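When synthesizing to a file instead of piping raw audio to aplay (for example with piper --output_file test.wav), you can sanity-check the result with Python's built-in wave module. A small sketch, assuming a file named test.wav (a filename chosen here for illustration):

    import wave

    # Inspect the WAV produced by Piper; the lessac-medium voice outputs 22050 Hz mono
    with wave.open("test.wav", "rb") as w:
        duration = w.getnframes() / w.getframerate()
        print(f"{duration:.2f} s of audio at {w.getframerate()} Hz, {w.getnchannels()} channel(s)")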

Step 3: Install Wav2Lip for Avatar Animation

  1. Clone the repository:
    git clone https://github.com/Rudrabha/Wav2Lip.git
    cd Wav2Lip
    
  2. Set up the environment (note that the pinned requirements are several years old, so you may need an older Python version or relaxed version pins):
    python3 -m venv ~/venv-wav2lip
    source ~/venv-wav2lip/bin/activate
    pip install -r requirements.txt
    
  3. Download the pre-trained models. The checkpoints are distributed via links in the Wav2Lip README rather than as GitHub release assets; save the lip-sync checkpoint as checkpoints/wav2lip.pth and the face-detection model as face_detection/detection/sfd/s3fd.pth:
    mkdir -p checkpoints
    # Download wav2lip.pth (or wav2lip_gan.pth) from the README links into checkpoints/
    # Download s3fd.pth from the README links into face_detection/detection/sfd/
    
  4. Prepare a reference face image or video clip to be animated (save as “face.jpg” or “face.mp4”)
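
Before wiring everything together, it can save time to confirm that the checkpoint and the reference face are readable. A quick pre-flight sketch, run from the Wav2Lip directory inside its virtual environment (the file names below match the earlier steps but are otherwise assumptions):

    import os
    import cv2  # installed by Wav2Lip's requirements.txt

    checkpoint = "checkpoints/wav2lip.pth"   # from step 3 above
    face = "face.jpg"                        # your reference image

    assert os.path.isfile(checkpoint), f"missing checkpoint: {checkpoint}"
    img = cv2.imread(face)
    assert img is not None, f"could not read reference image: {face}"
    print(f"Checkpoint found; reference image is {img.shape[1]}x{img.shape[0]} pixels")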

Step 4: Install FFmpeg for Video Processing

sudo apt-get install -y ffmpeg
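
Before moving on, it is worth confirming that everything the pipeline will invoke is actually in place. A minimal check, using the same paths as the steps above:

    import os
    import shutil

    # Command-line tools the pipeline calls directly from PATH
    for tool in ("ollama", "ffmpeg"):
        print(f"{tool}: {shutil.which(tool) or 'NOT FOUND'}")

    # Executables installed inside the virtual environments from steps 2 and 3
    for path in ("~/venv-tts/bin/piper", "~/venv-wav2lip/bin/python"):
        expanded = os.path.expanduser(path)
        print(f"{path}: {'ok' if os.path.exists(expanded) else 'NOT FOUND'}")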

Step 5: Create the Pipeline Script

Create a file called avatar_generator.py:

#!/usr/bin/env python3
import os
import subprocess
import argparse
import tempfile

def generate_script(prompt):
    """Generate script text using Ollama"""
    print("Generating script with Ollama...")
    # Pass the prompt as an argument list to avoid shell-quoting issues
    result = subprocess.run(["ollama", "run", "llama3", prompt],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

def text_to_speech(text, output_wav):
    """Convert text to speech using Piper"""
    print("Converting text to speech...")
    voice_model = os.path.expanduser("~/.local/share/piper-tts/voices/en_US-lessac-medium.onnx")
    # Use the piper executable from the TTS virtual environment created in step 2
    piper_bin = os.path.expanduser("~/venv-tts/bin/piper")
    # Feed the text to Piper on stdin instead of routing it through a temp file and a shell pipe
    subprocess.run([piper_bin, "--model", voice_model, "--output_file", output_wav],
                   input=text, text=True, check=True)

def animate_avatar(face_path, audio_path, output_video):
    """Animate the avatar using Wav2Lip"""
    print("Animating avatar...")
    wav2lip_path = os.path.expanduser("~/Wav2Lip")
    # Use the interpreter from the Wav2Lip virtual environment created in step 3
    python_bin = os.path.expanduser("~/venv-wav2lip/bin/python")
    # Resolve to absolute paths because inference.py is executed from the Wav2Lip directory
    face_path, audio_path, output_video = map(os.path.abspath, (face_path, audio_path, output_video))
    cmd = [python_bin, "inference.py", "--checkpoint_path", "checkpoints/wav2lip.pth",
           "--face", face_path, "--audio", audio_path, "--outfile", output_video, "--nosmooth"]
    subprocess.run(cmd, cwd=wav2lip_path, check=True)

def main():
    parser = argparse.ArgumentParser(description='Generate a talking avatar video locally')
    parser.add_argument('--prompt', required=True, help='Prompt for script generation')
    parser.add_argument('--face', required=True, help='Path to face image or video')
    parser.add_argument('--output', default='output.mp4', help='Output video path')
    args = parser.parse_args()
    
    # Create temporary directory for intermediate files
    with tempfile.TemporaryDirectory() as tmpdir:
        script_file = os.path.join(tmpdir, 'script.txt')
        audio_file = os.path.join(tmpdir, 'speech.wav')
        
        # Generate script
        script = generate_script(args.prompt)
        with open(script_file, 'w') as f:
            f.write(script)
        print(f"Generated script:\n{script}\n")
        
        # Convert script to speech
        text_to_speech(script, audio_file)
        
        # Animate avatar
        animate_avatar(args.face, audio_file, args.output)
        
        print(f"Video generated and saved to {args.output}")

if __name__ == "__main__":
    main()

Make the script executable:

chmod +x avatar_generator.py

Step 6: Run the Avatar Generator

  1. Prepare a face image or short video clip of the avatar you want to animate
  2. Run the script:
    ./avatar_generator.py --prompt "Write a short introduction about renewable energy" --face face.jpg --output avatar_video.mp4
    

Step 7: Add Optional Captions (Using Local Whisper)

  1. Install Whisper for local transcription:
    python3 -m venv ~/venv-whisper
    source ~/venv-whisper/bin/activate
    pip install openai-whisper
    
  2. Create a caption script:
    #!/usr/bin/env python3
    import whisper
    import subprocess
    import os
    import argparse
    
    def format_timestamp(seconds):
        """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)"""
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    
    def generate_captions(audio_file):
        """Transcribe the audio with Whisper and return its timed segments"""
        print("Generating captions...")
        model = whisper.load_model("base")
        result = model.transcribe(audio_file)
        return result["segments"]
    
    def add_captions_to_video(video_file, segments, output_file):
        """Write an SRT file from the Whisper segments and burn it in with FFmpeg"""
        print("Adding captions to video...")
        with open("captions.srt", "w") as f:
            for i, seg in enumerate(segments, start=1):
                start = format_timestamp(seg["start"])
                end = format_timestamp(seg["end"])
                f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")
        cmd = ["ffmpeg", "-y", "-i", video_file, "-vf", "subtitles=captions.srt", output_file]
        subprocess.run(cmd, check=True)
        os.remove("captions.srt")
    
    def main():
        parser = argparse.ArgumentParser(description='Add captions to a video')
        parser.add_argument('--video', required=True, help='Input video file')
        parser.add_argument('--audio', required=True, help='Input audio file for transcription')
        parser.add_argument('--output', default='captioned_video.mp4', help='Output video path')
        args = parser.parse_args()
           
        captions = generate_captions(args.audio)
        add_captions_to_video(args.video, captions, args.output)
        print(f"Captioned video saved to {args.output}")
    
    if __name__ == "__main__":
        main()
    
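If you save the script above as, say, add_captions.py (a name chosen here for illustration) alongside avatar_generator.py, its functions can also be reused directly from Python rather than through the command line:

    # Hypothetical one-off use of the captioning functions, assuming the script
    # above was saved as add_captions.py and the files below exist
    from add_captions import generate_captions, add_captions_to_video

    segments = generate_captions("speech.wav")            # the audio used for the avatar
    add_captions_to_video("avatar_video.mp4", segments, "captioned_video.mp4")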

Troubleshooting Tips

  1. Memory Issues: If you encounter memory errors with Ollama or Wav2Lip, try a smaller Ollama model or reduce Wav2Lip's batch sizes (inference.py accepts --wav2lip_batch_size and --face_det_batch_size).

  2. GPU Problems: Some models require CUDA. Make sure your GPU drivers are properly installed (a PyTorch check is sketched after this list):
    # Check CUDA installation
    nvidia-smi
    
  3. Video Quality: For better results:
    • Use high-quality reference faces
    • Ensure good lighting in the reference image
    • Try different models or parameters in Wav2Lip
  4. Integration Issues: If components don’t work together, ensure all paths are correctly specified in the scripts.
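
For the GPU check in point 2, you can also confirm that PyTorch (pulled in by Wav2Lip's requirements) actually sees your card. A minimal sketch, run inside the Wav2Lip virtual environment:

    import torch

    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))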

Conclusion

You now have a completely local video avatar generator pipeline that:

  • Generates a script with a locally run LLM via Ollama
  • Converts the script to speech with Piper
  • Animates a reference face with Wav2Lip
  • Optionally transcribes the audio with Whisper and burns in captions

All components run locally without any cloud services, online dependencies, or API keys. This approach gives you full control over the process and protects your privacy, though it requires more computational resources than cloud-based alternatives.