How I built a fully automated pipeline for iartshorts — with a custom Python orchestrator.
🔒 100% offline, zero cloud. Everything runs locally: LLMs, image generation, video, TTS, Wikipedia search. No external API calls, no data leaves my server.
The Project: Bringing Paintings to Life
iartshorts (and aiartshorts in English) is a vertical shorts channel that transforms portrait paintings into talking videos. The person in the painting animates, tells their story, and references the artist who painted them — all in a slightly cheeky tone, spoken in first person.
The goal: automate everything, from historical research to uploading on YouTube Shorts, TikTok, and Instagram Reels. Only one human intervention remains during processing: validation.
But for that to work, you need hardware.
🖥️ Homelab AI Workstation
💡 LLM perf: 2× 4090D in Tensor Parallel 2 ≈ single RTX 6000 Pro. But the 6000 Pro shines on image/video generation and runs much quieter.
The Software Stack
Everything runs on Proxmox since it’s also my homelab. A single Ubuntu LXC container hosts everything that needs GPU access.
The LLMs
| Model | Usage | Infrastructure |
|---|---|---|
| Qwen3.6-27B-FP8 | Daily driver (coding + pipeline) | vLLM / SGLang TP2 |
| Qwen3.5-122B-AWQ | Former daily driver | vLLM TP2 |
| Gemma4 31B | Occasional | SGLang |
Models are managed by llama-swap plus a custom Python script that generates its config from my model storage folder. Services run in Docker containers (vLLM, SGLang, llama.cpp).
ComfyUI
All image/video generation goes through ComfyUI, with:
- Flux.2 Klein 9B for painting restoration and realistic transformation
- Qwen TTS for speech synthesis
- LTX2.3 for video lip-sync generation (AudioSync)
- Wan2.2 for transitions and animations
The Pipeline: 10 Steps
From a portrait painting to a ready-to-publish short video — fully automated
The Batch Processing Trick
Instead of processing each painting end-to-end, I process Step 1 for ALL paintings, then Step 2 for all, etc.
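This drastically reduces model loading time on the GPU: each step’s model is warmed up once per batch (10 warmups in total) instead of once per painting per step (10 × N for N paintings). That is a big gain in total processing time.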
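To make the difference concrete, here is a minimal sketch that counts model loads for both orderings. It assumes, as a simplification, that every step needs its own model, so any change of step forces a reload; the function names are illustrative and not taken from the actual orchestrator.

```python
def count_model_loads(order):
    """Count how often the active step (and so the loaded model) changes."""
    loads = 0
    current_step = None
    for _painting, step in order:
        if step != current_step:  # switching step means loading another model
            loads += 1
            current_step = step
    return loads


def painting_major(paintings, n_steps):
    """Finish every step of one painting before starting the next painting."""
    return [(p, s) for p in paintings for s in range(n_steps)]


def step_major(paintings, n_steps):
    """Run one step for every painting before moving on to the next step."""
    return [(p, s) for s in range(n_steps) for p in paintings]


paintings = [f"painting_{i:02d}" for i in range(20)]
loads_end_to_end = count_model_loads(painting_major(paintings, 10))
loads_batched = count_model_loads(step_major(paintings, 10))
print(loads_end_to_end, loads_batched)  # 200 10
```

With 20 paintings and 10 steps, the end-to-end order triggers 200 loads while the step-by-step order needs only 10. In practice some steps share a model, so the real saving may be somewhat smaller.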
The State Machine
Each painting is tracked in a SQLite database with a precise state: PENDING → PROCESSING → COMPLETE / FAILED / ABORTED / REDO.
The system supports automatic retries (×2) and resuming from any step. A crash at step 7 on 50 videos doesn’t mean redoing everything.
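A state table of this kind could look like the sketch below. The schema, the `jobs` table name and the helper functions are assumptions made for illustration (the real database layout is not shown here), and the ABORTED and REDO states are left out for brevity. The key idea is to commit every transition, so a crash leaves a trace of exactly where each painting stopped.

```python
import sqlite3

MAX_RETRIES = 2  # the text mentions automatic retries (x2)
N_STEPS = 10


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               painting TEXT NOT NULL,
               step     INTEGER NOT NULL,
               state    TEXT NOT NULL DEFAULT 'PENDING',
               attempts INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (painting, step))"""
    )
    return conn


def enqueue(conn, painting):
    rows = [(painting, step) for step in range(N_STEPS)]
    conn.executemany("INSERT OR IGNORE INTO jobs (painting, step) VALUES (?, ?)", rows)
    conn.commit()


def set_state(conn, painting, step, state, bump=0):
    conn.execute(
        "UPDATE jobs SET state = ?, attempts = attempts + ? WHERE painting = ? AND step = ?",
        (state, bump, painting, step),
    )
    conn.commit()  # commit every transition so a crash never loses progress


def run_step(conn, painting, step, work):
    """Try one step up to 1 + MAX_RETRIES times; return True on success."""
    for _ in range(1 + MAX_RETRIES):
        set_state(conn, painting, step, "PROCESSING", bump=1)
        try:
            work(painting, step)
        except Exception:
            set_state(conn, painting, step, "FAILED")
        else:
            set_state(conn, painting, step, "COMPLETE")
            return True
    return False


def resume_point(conn, painting):
    """First step that is not COMPLETE, or None when the painting is done."""
    row = conn.execute(
        "SELECT MIN(step) FROM jobs WHERE painting = ? AND state != 'COMPLETE'",
        (painting,),
    ).fetchone()
    return row[0]


def run_painting(conn, painting, work):
    """Resume a painting from its first unfinished step."""
    step = resume_point(conn, painting)
    while step is not None:
        if not run_step(conn, painting, step, work):
            return False
        step = resume_point(conn, painting)
    return True
```

Calling `run_painting` again after a failure skips the completed steps and restarts at the one that failed. A step left in PROCESSING by a hard crash is also picked up again, because anything other than COMPLETE counts as unfinished.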
Custom ComfyUI Client
My pipeline uses a custom HTTP + WebSocket client to interact with ComfyUI, with real-time progress tracking and polling fallback. Between batches, ComfyUI is automatically restarted via SSH (systemctl restart comfyui) to prevent memory leaks.
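The polling fallback can be sketched roughly as follows. This is not the actual client: it only covers the fallback path, and it assumes ComfyUI's `/history/<prompt_id>` HTTP route, which returns an empty object until the prompt has finished and then an entry keyed by the prompt id. The function names, the injectable `fetch` callable and the default values are illustrative choices.

```python
import json
import time
import urllib.request


def fetch_history(base_url, prompt_id):
    """GET <base_url>/history/<prompt_id> and decode the JSON response."""
    url = f"{base_url}/history/{prompt_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_outputs(prompt_id, fetch, interval_s=2.0, max_polls=1800, sleep=time.sleep):
    """Poll until the prompt shows up in the history, then return its outputs.

    Returns None if the prompt is still missing after max_polls attempts.
    """
    for _ in range(max_polls):
        entry = fetch(prompt_id).get(prompt_id)
        if entry is not None:  # finished prompts are keyed by their id
            return entry.get("outputs", {})
        sleep(interval_s)
    return None


# Example wiring (assumes a ComfyUI server on its default local port):
# outputs = wait_for_outputs(pid, lambda p: fetch_history("http://127.0.0.1:8188", p))
```

Passing the fetch function in as a parameter keeps the loop easy to test without a running server, and lets the same loop serve as a fallback when the WebSocket connection drops.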
The Human-in-the-Loop: Honest Numbers
Contrary to what you might think, the pipeline doesn’t produce "plug-and-play" content. Every day, I manually review each video, one by one.
Rejection Rates
| Stage | Rejection Rate | Reasons |
|---|---|---|
| Automatic pre-filtering (step 0) | ~30% | Blurry painting, multiple people, nudity |
| Manual review (after 10 steps) | ~30% | Incorrect facial animation, visual artifacts, failed lip-sync, ugly transitions, scripts with errors |
| Both stages combined | ~51% | Overall success rate is ~49% (0.70 × 0.70) |
My database holds 30,000 paintings (royalty-free photos). Each evening, a random draw picks ~20 of them from a predefined list of paintings that are likely to be portraits.
In the morning: ~10 validated videos out of ~20 started.
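The morning count follows directly from the two rejection rates, as this small worked example shows. It treats the two ~30% figures as exact and independent, which is an approximation.

```python
PRE_FILTER_REJECT = 0.30  # automatic pre-filtering at step 0
MANUAL_REJECT = 0.30      # manual review after the 10 steps

# A painting must survive both stages to become a validated video.
success_rate = (1 - PRE_FILTER_REJECT) * (1 - MANUAL_REJECT)

started = 20
expected_validated = started * success_rate

print(f"{success_rate:.0%} success, ~{expected_validated:.0f} of {started} paintings validated")
# 49% success, ~10 of 20 paintings validated
```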
💡 Pre-filtering is as critical as the pipeline itself. Better to reject 30% before wasting GPU time. Vision AI analysis covers part of the filtering (nudity, child, single portrait), but some criteria (blur, character size, head orientation) still require the human eye.
Why Local? The Energy Argument
Cloud (closed) models are better. But here, I’m only limited by electricity costs.
In France:
- ✅ Electricity is cheap
- ✅ Low carbon (nuclear)
- ✅ GPUs run at night when energy is even cheaper
- ✅ Full independence: no API credits, no request limits
- ✅ Zero marginal cost per generated video
💰 Average cost per video: €0.0193 (~1.9 ct) at the off-peak rate of €0.1579/kWh, or about 2.9 ct per published video once the ~30% rejection rate is taken into account.
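As a rough cross-check, the figures above can be turned into energy per video and cost per kept video. The sketch assumes the €0.0193 is pure electricity cost and that rejected videos cost as much to generate as accepted ones. With a rejection rate of exactly 30% it yields about 2.8 ct, so the quoted 2.9 ct corresponds to a rate slightly above 30% (around 33%).

```python
COST_PER_VIDEO_EUR = 0.0193    # average cost per generated video
OFF_PEAK_EUR_PER_KWH = 0.1579  # off-peak electricity rate
REJECT_RATE = 0.30             # approximate manual rejection rate

# Energy implied by the cost figure, assuming it is electricity only.
energy_kwh = COST_PER_VIDEO_EUR / OFF_PEAK_EUR_PER_KWH

# Each kept video also pays for its share of the rejected ones.
cost_per_kept_video_eur = COST_PER_VIDEO_EUR / (1 - REJECT_RATE)

print(f"{energy_kwh * 1000:.0f} Wh per video")
print(f"{cost_per_kept_video_eur * 100:.2f} ct per kept video")
```

That is roughly 122 Wh per generated video.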
Lessons Learned
- SQLite state management is essential. Without it, a crash = start over. With it, you resume where you left off.
- Batch-by-step = big speedup. One GPU warmup per step instead of one warmup per painting per step.
- Automatic ComfyUI restart. Via SSH between batches — it’s the only reliable way to avoid memory leaks over multi-hour runs.
- Offline search via ZIM. No network dependency for Wikipedia data. Faster, more reliable, more private.
- 30% rejection is the current reality. Video models still generate too many artifacts for 100% automation. Human-in-the-loop is non-negotiable for quality content.
Licenses
| Model | License | Commercial use |
|---|---|---|
| Qwen3.6-27B | Apache 2.0 | ✅ Free |
| Qwen3.5-122B | Apache 2.0 | ✅ Free |
| Gemma4-31B | Apache 2.0 | ✅ Free |
| Qwen TTS | Apache 2.0 | ✅ Free |
| Wan2.2-I2V-A14B | Apache 2.0 | ✅ Free |
| Flux.2 Klein 9B | FLUX Non-Commercial License v2.1 | ✅ Free usage of outputs |
| LTX-2.3 | LTX-2 License | ⚠️ Free if revenue < $10M |
Pipeline Execution Report
Batch #0 · iartshorts · 2026-05-22 02:58 → 07:54 · 20 paintings selected, 30 videos produced (15 FR + 15 EN)
✅ 15 paintings completed → 30 videos (French + English each). 5 paintings were aborted at step 0 (Vision Analysis), each rejected in under 3 seconds.
GPU Efficiency Breakdown
Across all 15 completed paintings, the GPU was actively computing 89.4% of the total wall-clock time. Only 31 minutes were spent waiting in queues — excellent utilization for a serial pipeline.
Time Breakdown by Pipeline Step
Each painting goes through 10 steps. Every step with a ComfyUI workflow is executed for both French and English versions. Here’s how time is distributed across the 15 completed paintings:
Wall-Clock Time per Step (15 paintings)
Compute vs Queue Time per Step
| Step | Stage | Wall-Clock | GPU Compute | Queue Wait | GPU % |
|---|---|---|---|---|---|
| 0 | Prepare (Vision) | 0.9m | — | 0.9m | — |
| 1 | Script (LLM + ZIM) | 21.6m | — | 21.6m | — |
| 2 | Audio (TTS × 2) | 28.2m | 27.5m | 0.8m | 97.3% |
| 3 | Clean Pic (Flux) | 2.3m | 1.5m | 0.8m | 66.2% |
| 4 | Realistic Pic (Flux) | 1.6m | 0.8m | 0.8m | 51.6% |
| 5 | Transition Vid (Wan) | 22.5m | 21.8m | 0.8m | 96.6% |
| 6 | LipSync Vid (LTX × 2) | 164.1m | 163.3m | 0.8m | 99.5% |
| 7 | Subtitles (Wan) | 7.5m | 6.7m | 0.8m | 89.3% |
| 8 | Outro Vid (Wan × 2) | 42.8m | 42.0m | 0.8m | 98.1% |
| 9 | Final Video (FFmpeg) | 3.4m | — | 3.4m | — |
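The 89.4% utilization figure can be recomputed from the table above. The sketch below copies the Wall-Clock and GPU Compute columns, with `None` standing for the "—" cells. Note that the steps without a GPU Compute value (vision, LLM script, FFmpeg) count entirely as non-compute time in this accounting.

```python
# step name -> (wall-clock minutes, GPU compute minutes or None)
steps = {
    "Prepare (Vision)": (0.9, None),
    "Script (LLM + ZIM)": (21.6, None),
    "Audio (TTS x 2)": (28.2, 27.5),
    "Clean Pic (Flux)": (2.3, 1.5),
    "Realistic Pic (Flux)": (1.6, 0.8),
    "Transition Vid (Wan)": (22.5, 21.8),
    "LipSync Vid (LTX x 2)": (164.1, 163.3),
    "Subtitles (Wan)": (7.5, 6.7),
    "Outro Vid (Wan x 2)": (42.8, 42.0),
    "Final Video (FFmpeg)": (3.4, None),
}

total_wall = sum(wall for wall, _ in steps.values())
total_gpu = sum(gpu for _, gpu in steps.values() if gpu is not None)
utilization = total_gpu / total_wall

print(f"wall-clock {total_wall:.1f} min, GPU compute {total_gpu:.1f} min")
print(f"utilization {utilization:.1%}, other time {total_wall - total_gpu:.1f} min")
```

The totals come out at about 295 minutes of wall-clock time and 264 minutes of GPU compute, which matches the 02:58 → 07:54 batch window.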
Top 3 Bottlenecks
These three steps account for 88.3% of all GPU compute time:
| # | Step | Model | GPU Compute | % of Total | Avg / Painting |
|---|---|---|---|---|---|
| 1 | LipSync Video | LTX2.3-AudioSync × 2 | 2h 43m | 62.0% | 10.9 min |
| 2 | Outro Video | Wan2.2 × 2 | 42 min | 15.9% | 2.8 min |
| 3 | Audio (TTS) | Qwen TTS × 2 | 28 min | 10.4% | 1.8 min |
🐌 LipSync (LTX2.3) is the dominant bottleneck at 62% of GPU time. Each painting requires 2 lip-sync videos (FR + EN), each averaging ~5.4 minutes. Reducing this step’s duration would have the biggest impact on overall throughput.
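To gauge how much a faster lip-sync step could help, here is a small what-if sketch based on the GPU compute times reported above. The `wall_clock_if_faster` helper and the 2× speedup are hypothetical, and the estimate assumes all other steps stay unchanged.

```python
gpu_minutes = {
    "Audio (TTS)": 27.5,
    "Clean Pic": 1.5,
    "Realistic Pic": 0.8,
    "Transition Vid": 21.8,
    "LipSync Vid": 163.3,
    "Subtitles": 6.7,
    "Outro Vid": 42.0,
}
TOTAL_WALL_MIN = 294.9  # sum of the Wall-Clock column for the batch

total_gpu = sum(gpu_minutes.values())
top3 = sorted(gpu_minutes, key=gpu_minutes.get, reverse=True)[:3]
top3_share = sum(gpu_minutes[name] for name in top3) / total_gpu


def wall_clock_if_faster(step, speedup):
    """Batch wall-clock time if one step's GPU compute ran `speedup` times faster."""
    saved = gpu_minutes[step] * (1 - 1 / speedup)
    return TOTAL_WALL_MIN - saved


print(top3, f"{top3_share:.1%}")
print(f"{wall_clock_if_faster('LipSync Vid', 2):.1f} min with a 2x faster lip-sync")
```

Under these assumptions, halving the lip-sync time would shorten the batch from about 295 to about 213 minutes, a reduction of roughly 28%. No other single step can save that much even if it were eliminated entirely.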
Per-Painting Breakdown (15 Completed)
| Painting | Artist | Wall-Clock | GPU Compute | GPU % | Longest Step |
|---|---|---|---|---|---|
| Young_Girl_Holding_a_Basket | Berthe Morisot | 22.5m | 21.9m | 97.3% | LipSync 7.6m |
| Thomas_Howard_2nd_Earl_of_Arundel | Anthony van Dyck | 23.2m | 22.7m | 97.8% | LipSync 7.9m |
| The_Spanish_Guitarist | Pierre-Auguste Renoir | 25.5m | 25.2m | 98.8% | LipSync 10.0m |
| Woman_in_a_Flowered_Hat | Pierre-Auguste Renoir | 26.2m | 25.9m | 98.8% | LipSync 10.3m |
| Woman_in_a_Garden | Berthe Morisot | 26.5m | 26.2m | 98.9% | LipSync 10.6m |
| Young_Girl_with_an_Apron | Berthe Morisot | 26.9m | 26.5m | 98.5% | LipSync 10.8m |
| Woman_with_Red_Hair | Alice Pike Barney | 27.2m | 26.9m | 98.9% | LipSync 10.9m |
| Young_Woman_with_a_Water_Pitcher | Johannes Vermeer | 27.2m | 26.8m | 98.5% | LipSync 11.1m |
| Woman_in_Tulle_Blouse | Pierre-Auguste Renoir | 27.5m | 27.1m | 98.6% | LipSync 11.2m |
| The_Embroiderer | Jean Siméon Chardin | 27.6m | 27.2m | 98.6% | LipSync 11.3m |
| The_Stroller_Suzanne_Hoschedé | Claude Monet | 27.8m | 27.4m | 98.6% | LipSync 11.4m |
| Young_Woman_with_Roses | Alice Pike Barney | 29.0m | 28.6m | 98.6% | LipSync 12.2m |
| Woman_Seated_under_the_Willows | Claude Monet | 29.3m | 28.9m | 98.6% | LipSync 12.5m |
| Young_Girl_in_a_Pink-and-Black_Hat | Pierre-Auguste Renoir | 29.7m | 29.3m | 98.7% | LipSync 12.7m |
| The_Red_Kerchief | Claude Monet | 31.6m | 31.1m | 98.4% | LipSync 13.6m |
💡 Fastest painting: "Young Girl Holding a Basket" by Berthe Morisot, at 22.5 min total. Slowest: "The Red Kerchief" by Claude Monet, at 31.6 min total. The spread is about 9 minutes, primarily driven by differences in lip-sync duration.

