My homelab that generates AI videos while I sleep

How I built a fully automated pipeline for iartshorts — with a custom Python orchestrator.

🔒 100% offline, zero cloud. Everything runs locally: LLMs, image generation, video, TTS, Wikipedia search. No external API calls, no data leaves my server.

The Project: Bringing Paintings to Life

iartshorts (and aiartshorts in English) is a vertical shorts channel that transforms portrait paintings into talking videos. The person in the painting animates, tells their story, and references the artist who painted them — all in a slightly cheeky tone, spoken in first person.

The goal: automate everything. From historical research to uploading on YouTube Shorts, TikTok, and Instagram Reels. The only human intervention in the whole process is validation.

But for that to work, you need hardware.

🖥️ Homelab AI Workstation

3 GPUs · 192 GB VRAM · ~1050W Total Homelab running on Proxmox — GPU passthrough to Ubuntu LXC
  • Motherboard: ASRock Rack ROMED8-2T (AMD EPYC single CPU, PCIe 4.0)
  • CPU: AMD EPYC 7763 (64 cores, chosen for PCIe lanes)
  • RAM: 128 GB DDR4, ECC Registered

GPUs (192 GB VRAM total)

  • LLM inference: 2× RTX 4090D · modded to 48 GB VRAM each · 300W limit · vLLM/SGLANG TP2
  • Image & video generation: 1× RTX 6000 Pro Blackwell WS · 96 GB VRAM · 450W limit · silent and efficient · ComfyUI

💡 Why the EPYC? Not for compute — for the PCIe lanes needed to drive 3 GPUs simultaneously.

💡 LLM perf: 2× 4090D in Tensor Parallel 2 ≈ single RTX 6000 Pro. But the 6000 Pro shines on image/video generation and runs much quieter.

The Software Stack

Everything runs on Proxmox since it’s also my homelab. A single Ubuntu LXC container hosts everything that needs GPU access.

The LLMs

Model | Usage | Infrastructure
Qwen3.6-27B-FP8 | Daily driver (coding + pipeline) | vLLM / SGLANG TP2
Qwen3.5-122B-AWQ | Former daily driver | vLLM TP2
Gemma4 31B | Occasional | SGLANG

Models are managed by llama-swap + a custom Python script that generates the config from my model storage folder. Services run in Docker containers (vLLM, SGLANG, llama.cpp).

ComfyUI

All image/video generation goes through ComfyUI, with:

  • Flux Klein 9b for painting restoration and realistic transformation
  • Qwen TTS for speech synthesis
  • LTX2.3 for video lip-sync generation (AudioSync)
  • Wan2.2 for transitions and animations

The Pipeline: 10 Steps

From a portrait painting to a ready-to-publish short video, fully automated

🖼️ Random selection from ~30,000 paintings → ~20 per night
🔍
Step 0
Prepare
Vision AI analyzes portrait: gender, nudity, child detection → SFW filtering. Resize to 1280p via FFmpeg.
Qwen3.6-27B Vision · FFmpeg
📝
Step 1
Script
Offline Wikipedia ZIM research on the painting’s author and the portrait subject (if identifiable). LLM extracts key facts, anecdotes, and context → then generates the 1st-person narrative script (max 150 words, cheeky tone). Structured output via Pydantic.
Qwen3.6-27B · libzim · Pydantic
🎙️
Step 2
Audio
Text-to-Speech in French AND English. Natural voice from the portrait’s perspective.
Qwen TTS · ComfyUI
🎨
Step 3
Clean Pic
AI restoration of the painting. Fix cracks, revive colors, keep pose unchanged.
Flux Klein · ComfyUI
📸
Step 4
Realistic Pic
Transform the painting into a photo-realistic portrait — same model as Step 3.
Flux Klein · ComfyUI
Step 5
Transition
Golden sparkle/morph video — the painting transforms into the realistic version.
Wan2.2 · ComfyUI
👄
Step 6
Script Video
Lip-sync: the realistic portrait speaks the generated audio. Head motion + facial expressions.
LTX2.3 · ComfyUI
💬
Step 7
Subtitles
Generate and burn-in subtitles with custom styling.
Whisper · FFmpeg
👋
Step 8
Loop / Outro
Loop back to the original painting image to generate a seamless looping video.
Wan2.2 · ComfyUI
🎬
Step 9
Final Video
Concatenate all clips, add overlays (title, artist, IArtShorts watermark), background music, extract thumbnail.
FFmpeg
Runs nightly → generates videos → stores in a ready-to-publish pool
~20 paintings / night · ~10 videos / morning
📤 Separate Upload Process
Background task · 5 paintings/day · posted at optimal times across all platforms
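
Step 1's structured output can be pictured as a small schema with a word-count check. The sketch below uses a stdlib dataclass as a stand-in for the Pydantic model, so that it runs without dependencies. The field names `painting_title`, `artist`, and `narration` are hypothetical; only the 150-word cap comes from the step description.

```python
import json
from dataclasses import dataclass

MAX_WORDS = 150  # cap taken from the Step 1 description


@dataclass(frozen=True)
class Script:
    # Hypothetical fields; the real schema is not shown in this post.
    painting_title: str
    artist: str
    narration: str

    def __post_init__(self):
        # Count whitespace-separated words and enforce the cap.
        words = len(self.narration.split())
        if words == 0:
            raise ValueError("narration is empty")
        if words > MAX_WORDS:
            raise ValueError(f"narration has {words} words, limit is {MAX_WORDS}")


def parse_script(raw_json: str) -> Script:
    """Parse the LLM's JSON answer.

    Raises TypeError on missing or unexpected keys and ValueError on an
    empty or over-long narration, so the caller can retry the generation.
    """
    return Script(**json.loads(raw_json))
```

Rejecting an over-long script at parse time keeps a bad generation from reaching the much more expensive TTS and lip-sync steps.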

The Batch Processing Trick

Instead of processing each painting end-to-end, I process Step 1 for ALL paintings, then Step 2 for all, etc.

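This drastically reduces model loading time: each step's model is warmed up once per batch (10 warmups in total) instead of once per painting (10 × N warmups for N paintings). The saving on total processing time is substantial.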
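
To make the difference concrete, here is a minimal sketch of the two loop orders. The `load_model` and `run_step` callbacks are hypothetical placeholders, not the orchestrator's real functions; the stub versions only count how often a model would be loaded.

```python
from collections import Counter


def run_per_painting(paintings, steps, load_model, run_step):
    """End-to-end order: each step's model is loaded again for every painting."""
    for painting in paintings:
        for step in steps:
            model = load_model(step)
            run_step(model, painting)


def run_batch_by_step(paintings, steps, load_model, run_step):
    """Batch order: load each step's model once, then run it on all paintings."""
    for step in steps:
        model = load_model(step)
        for painting in paintings:
            run_step(model, painting)


def count_loads(runner, n_paintings, n_steps=10):
    """Run a scheduler with stub callbacks and return how many model loads it made."""
    loads = Counter()

    def load_model(step):
        loads[step] += 1
        return step

    def run_step(model, painting):
        pass  # the real work (LLM call, ComfyUI workflow...) would go here

    runner(range(n_paintings), range(n_steps), load_model, run_step)
    return sum(loads.values())


print(count_loads(run_per_painting, 20))   # 200 loads (10 x N)
print(count_loads(run_batch_by_step, 20))  # 10 loads (one per step)
```

The trade-off is that no video is finished until the last step has run for the whole batch, which is fine for a nightly job.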

The State Machine

Each painting is tracked in a SQLite database with a precise state: PENDING → PROCESSING → COMPLETE / FAILED / ABORTED / REDO.

The system supports automatic retries (×2) and resuming from any step. A crash at step 7 on 50 videos doesn’t mean redoing everything.
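
A minimal sketch of what such a state table could look like, using Python's built-in `sqlite3`. The table layout, the column names, and the helpers (`enqueue`, `record_result`, `next_step`) are assumptions for illustration; only the state names and the two automatic retries come from the description above.

```python
import sqlite3

MAX_RETRIES = 2  # "automatic retries (x2)": one first attempt plus two retries


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               painting TEXT NOT NULL,
               step     INTEGER NOT NULL,
               state    TEXT NOT NULL DEFAULT 'PENDING',
               attempts INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (painting, step))"""
    )
    return db


def enqueue(db, painting, n_steps=10):
    """Create one PENDING row per step; existing rows are left untouched."""
    db.executemany(
        "INSERT OR IGNORE INTO jobs (painting, step) VALUES (?, ?)",
        [(painting, step) for step in range(n_steps)],
    )
    db.commit()


def record_result(db, painting, step, ok):
    """Mark a step COMPLETE, or count the failure and give up after the retries."""
    if ok:
        db.execute(
            "UPDATE jobs SET state = 'COMPLETE' WHERE painting = ? AND step = ?",
            (painting, step),
        )
    else:
        db.execute(
            """UPDATE jobs
                  SET attempts = attempts + 1,
                      state = CASE WHEN attempts + 1 > ? THEN 'FAILED' ELSE 'PENDING' END
                WHERE painting = ? AND step = ?""",
            (MAX_RETRIES, painting, step),
        )
    db.commit()


def state_of(db, painting, step):
    row = db.execute(
        "SELECT state FROM jobs WHERE painting = ? AND step = ?", (painting, step)
    ).fetchone()
    return row[0] if row else None


def next_step(db, painting):
    """First step that is not COMPLETE, i.e. where to resume (None if all done)."""
    row = db.execute(
        "SELECT MIN(step) FROM jobs WHERE painting = ? AND state != 'COMPLETE'",
        (painting,),
    ).fetchone()
    return row[0]
```

After a crash, the orchestrator only has to ask `next_step` for each painting instead of replaying the earlier steps.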

Custom ComfyUI Client

My pipeline uses a custom HTTP + WebSocket client to interact with ComfyUI, with real-time progress tracking and polling fallback. Between batches, ComfyUI is automatically restarted via SSH (systemctl restart comfyui) to prevent memory leaks.
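
The polling fallback can be sketched with the standard library alone. This is not the author's client: it assumes ComfyUI is listening on its default local address and that `GET /history/<prompt_id>` returns an empty object until the prompt has finished. The `fetch` and `sleep` parameters are hypothetical hooks that make the loop testable without a running server.

```python
import json
import time
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # assumption: default local ComfyUI address


def fetch_history(prompt_id, base_url=COMFY_URL):
    """Ask ComfyUI for the history entry of one prompt (empty dict if not done)."""
    with urllib.request.urlopen(f"{base_url}/history/{prompt_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_prompt(prompt_id, fetch=fetch_history, interval=2.0,
                    timeout=3600.0, sleep=time.sleep):
    """Poll until the prompt appears in the history, then return its outputs."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        entry = fetch(prompt_id).get(prompt_id)
        if entry is not None:
            return entry.get("outputs", {})
        sleep(interval)
    raise TimeoutError(f"prompt {prompt_id} did not finish within {timeout} s")
```

In a real client this loop would only kick in when the WebSocket connection drops, since the socket gives finer-grained progress events.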


The Human-in-the-Loop: Honest Numbers

Contrary to what you might think, the pipeline doesn't produce "plug-and-play" content. Every day, I manually review each video, one by one.

Rejection Rates

Stage | Rejection rate | Reasons
Automatic pre-filtering (step 0) | ~30% | Blurry painting, multiple people, nudity
Manual post-processing review (after 10 steps) | ~30% | Incorrect facial animation, visual artifacts, failed lip-sync, ugly transitions, scripts with errors
Global success rate | ~49% |

Each evening, a random draw picks ~20 paintings out of the 30,000 in my database (royalty-free photos). The draw is restricted to a predefined list that is meant to contain only portraits.

In the morning: ~10 validated videos out of ~20 started.
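
The ~49% and the "~10 out of ~20" figures follow from chaining the two rejection stages. A tiny worked example, where the function name is just for illustration:

```python
def expected_validated(started, auto_reject, manual_reject):
    """Return (overall success rate, expected validated count) for two chained filters."""
    success_rate = (1 - auto_reject) * (1 - manual_reject)
    return success_rate, started * success_rate


rate, validated = expected_validated(started=20, auto_reject=0.30, manual_reject=0.30)
print(f"success rate: {rate:.0%}, validated: about {validated:.0f} of 20")
```

Because the two stages multiply, cutting either rejection rate by a few points moves the final yield noticeably.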

💡 Pre-filtering is as critical as the pipeline itself. Better to reject 30% before wasting GPU time. Vision AI analysis covers part of the filtering (nudity, child, single portrait), but some criteria (blur, character size, head orientation) still require the human eye.


Why Local? The Energy Argument

Cloud (closed) models are better. But here, I’m only limited by electricity costs.

In France:

  • ✅ Electricity is cheap
  • ✅ Low carbon (nuclear)
  • ✅ GPUs run at night when energy is even cheaper
  • ✅ Full independence: no API credits, no request limits
  • ✅ Zero marginal cost per generated video

💰 Average cost per video: €0.0193 (~1.9 ct) at the off-peak rate of €0.1579/kWh, or 2.9 ct once the 30% reject rate is taken into account.
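
As a sanity check, the quoted cost and tariff imply an energy budget per video. The snippet below only divides the two published numbers; it makes no assumption about how the power draw was measured.

```python
PRICE_EUR_PER_KWH = 0.1579   # off-peak rate quoted above
COST_EUR_PER_VIDEO = 0.0193  # average cost quoted above


def implied_energy_kwh(cost_eur, price_eur_per_kwh):
    """Energy that a given cost buys at a given tariff."""
    return cost_eur / price_eur_per_kwh


energy = implied_energy_kwh(COST_EUR_PER_VIDEO, PRICE_EUR_PER_KWH)
print(f"{energy * 1000:.0f} Wh per video")
```

That works out to roughly 0.12 kWh per video.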


Lessons Learned

  1. SQLite state management is essential. Without it, a crash = start over. With it, you resume where you left off.
  2. Batch-by-step = big speedup. One GPU warmup per step instead of one warmup per painting per step.
  3. Automatic ComfyUI restart. Via SSH between batches — it’s the only reliable way to avoid memory leaks over multi-hour runs.
  4. Offline search via ZIM. No network dependency for Wikipedia data. Faster, more reliable, more private.
  5. 30% rejection is the current reality. Video models still generate too many artifacts for 100% automation. Human-in-the-loop is non-negotiable for quality content.

Licenses

Model | License | Commercial use
Qwen3.6-27B | Apache 2.0 | ✅ Free
Qwen3.5-122B | Apache 2.0 | ✅ Free
Gemma4-31B | Apache 2.0 | ✅ Free
Qwen TTS | Apache 2.0 | ✅ Free
Wan2.2-I2V-A14B | Apache 2.0 | ✅ Free
Flux.2 Klein 9B | FLUX Non-Commercial License v2.1 | ✅ Free usage of outputs
LTX-2.3 | LTX-2 License | ⚠️ If revenue < $10M

Pipeline Execution Report

Batch #0 · iartshorts · 2026-05-22 02:58 → 07:54 · 20 paintings selected, 30 videos produced (15 FR + 15 EN)

Total wall-clock: 4h 56m · GPU compute: 4h 24m · GPU efficiency: 89.4% · Videos output: 30

15 paintings completed → 30 videos (French + English each). 5 paintings were aborted at step 0 (Vision Analysis); all were rejected in under 3 seconds.


GPU Efficiency Breakdown

Across all 15 completed paintings, the GPU was actively computing 89.4% of the total wall-clock time. Only 31 minutes were spent waiting in queues — excellent utilization for a serial pipeline.

GPU active: 89.4% (GPU compute: 4h 24m · queue wait: 31m)

Time Breakdown by Pipeline Step

Each painting goes through 10 steps. Every step with a ComfyUI workflow is executed for both French and English versions. Here’s how time is distributed across the 15 completed paintings:

Wall-Clock Time per Step (15 paintings)

Step | Stage | Wall-clock | Share of total
0 | Prepare (Vision) | 0.9m | 0.3%
1 | Script (LLM + ZIM) | 21.6m | 7.3%
2 | Audio (TTS × 2) | 28.2m | 9.6%
3 | Clean Pic (Flux) | 2.3m | 0.8%
4 | Realistic Pic (Flux) | 1.6m | 0.5%
5 | Transition Vid (Wan) | 22.5m | 7.6%
6 | LipSync Vid (LTX × 2) | 164.1m | 55.6%
7 | Subtitles (Whisper) | 7.5m | 2.5%
8 | Outro Vid (Wan × 2) | 42.8m | 14.5%
9 | Final Video (FFmpeg) | 3.4m | 1.2%

Compute vs Queue Time per Step

Step | Stage | Wall-clock | GPU compute | Queue wait | GPU %
0 | Prepare (Vision) | 0.9m | 0.9m | - | -
1 | Script (LLM + ZIM) | 21.6m | 21.6m | - | -
2 | Audio (TTS × 2) | 28.2m | 27.5m | 0.8m | 97.3%
3 | Clean Pic (Flux) | 2.3m | 1.5m | 0.8m | 66.2%
4 | Realistic Pic (Flux) | 1.6m | 0.8m | 0.8m | 51.6%
5 | Transition Vid (Wan) | 22.5m | 21.8m | 0.8m | 96.6%
6 | LipSync Vid (LTX × 2) | 164.1m | 163.3m | 0.8m | 99.5%
7 | Subtitles (Whisper) | 7.5m | 6.7m | 0.8m | 89.3%
8 | Outro Vid (Wan × 2) | 42.8m | 42.0m | 0.8m | 98.1%
9 | Final Video (FFmpeg) | 3.4m | 3.4m | - | -
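
The headline figures can be rebuilt from this table. One reading that reproduces them, and it is only an assumption about how the report was computed, is that "GPU compute" in the summary counts only the steps that report a GPU % (steps 2 to 8), while steps 0, 1, and 9 fall into the remaining time.

```python
# (wall_clock_min, compute_min, reports_gpu_pct) per step, copied from the table above
STEPS = {
    0: (0.9, 0.9, False),
    1: (21.6, 21.6, False),
    2: (28.2, 27.5, True),
    3: (2.3, 1.5, True),
    4: (1.6, 0.8, True),
    5: (22.5, 21.8, True),
    6: (164.1, 163.3, True),
    7: (7.5, 6.7, True),
    8: (42.8, 42.0, True),
    9: (3.4, 3.4, False),
}


def summarize(steps):
    """Return (total wall-clock, GPU compute of steps with a GPU %, their ratio)."""
    wall = sum(w for w, _, _ in steps.values())
    gpu = sum(c for _, c, has_pct in steps.values() if has_pct)
    return wall, gpu, gpu / wall


wall, gpu, efficiency = summarize(STEPS)
print(f"wall={wall:.1f}m gpu={gpu:.1f}m other={wall - gpu:.1f}m efficiency={efficiency:.1%}")
```

Under that reading, the totals come to roughly 294.9 min of wall-clock and 263.6 min of GPU compute, about 89.4%, with the remaining ~31 min matching the reported queue wait.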

Top 3 Bottlenecks

These three steps account for 88.3% of all GPU compute time:

# | Step | Model | GPU compute | % of total | Avg / painting
1 | LipSync Video | LTX2.3-AudioSync × 2 | 2h 43m | 62.0% | 10.9 min
2 | Outro Video | Wan2.2 × 2 | 42 min | 15.9% | 2.8 min
3 | Audio (TTS) | Qwen TTS × 2 | 28 min | 10.4% | 1.8 min

🐌 LipSync (LTX2.3) is the dominant bottleneck at 62% of GPU time. Each painting requires 2 lip-sync videos (FR + EN), each averaging ~5.4 minutes. Reducing this step’s duration would have the biggest impact on overall throughput.
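
How much a faster lip-sync step would help can be estimated with Amdahl's law: if a step takes a share p of the total and is sped up by a factor s, the total shrinks to (1 - p) + p / s of its original value. The sketch below applies this to the 62% share reported above; the 2× speedup is a hypothetical value, not a measured one.

```python
def remaining_compute(share, speedup):
    """Fraction of the original compute time left after speeding up one step."""
    return (1.0 - share) + share / speedup


LIPSYNC_SHARE = 0.62      # share of GPU compute reported above
TOTAL_GPU_MIN = 4 * 60 + 24  # 4h 24m of GPU compute for the batch

left = remaining_compute(LIPSYNC_SHARE, speedup=2.0)  # hypothetical 2x faster lip-sync
print(f"{left:.0%} of the original compute, about {TOTAL_GPU_MIN * left:.0f} min")
```

By the same formula, even an infinitely fast lip-sync step would leave 38% of the compute time, so the outro and TTS steps become the next targets.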


Per-Painting Breakdown (15 Completed)

Painting | Artist | Wall-clock | GPU compute | GPU % | Longest step
Young_Girl_Holding_a_Basket | Berthe Morisot | 22.5m | 21.9m | 97.3% | LipSync 7.6m
Thomas_Howard_2nd_Earl_of_Arundel | Anthony van Dyck | 23.2m | 22.7m | 97.8% | LipSync 7.9m
The_Spanish_Guitarist | Pierre-Auguste Renoir | 25.5m | 25.2m | 98.8% | LipSync 10.0m
Woman_in_a_Flowered_Hat | Pierre-Auguste Renoir | 26.2m | 25.9m | 98.8% | LipSync 10.3m
Woman_in_a_Garden | Berthe Morisot | 26.5m | 26.2m | 98.9% | LipSync 10.6m
Young_Girl_with_an_Apron | Berthe Morisot | 26.9m | 26.5m | 98.5% | LipSync 10.8m
Woman_with_Red_Hair | Alice Pike Barney | 27.2m | 26.9m | 98.9% | LipSync 10.9m
Young_Woman_with_a_Water_Pitcher | Johannes Vermeer | 27.2m | 26.8m | 98.5% | LipSync 11.1m
Woman_in_Tulle_Blouse | Pierre-Auguste Renoir | 27.5m | 27.1m | 98.6% | LipSync 11.2m
The_Embroiderer | Jean Siméon Chardin | 27.6m | 27.2m | 98.6% | LipSync 11.3m
The_Stroller_Suzanne_Hoschedé | Claude Monet | 27.8m | 27.4m | 98.6% | LipSync 11.4m
Young_Woman_with_Roses | Alice Pike Barney | 29.0m | 28.6m | 98.6% | LipSync 12.2m
Woman_Seated_under_the_Willows | Claude Monet | 29.3m | 28.9m | 98.6% | LipSync 12.5m
Young_Girl_in_a_Pink-and-Black_Hat | Pierre-Auguste Renoir | 29.7m | 29.3m | 98.7% | LipSync 12.7m
The_Red_Kerchief | Claude Monet | 31.6m | 31.1m | 98.4% | LipSync 13.6m

💡 Fastest painting: "Young Girl Holding a Basket" by Berthe Morisot, 22.5 min total. Slowest: "The Red Kerchief" by Claude Monet, 31.6 min total. The spread is about 9 minutes, primarily driven by differences in lip-sync duration.