How I built a fully automated pipeline for iartshorts — with a custom Python orchestrator.

🔒 100% offline, zero cloud. Everything runs locally: LLMs, image generation, video, TTS, Wikipedia search. No external API calls, no data leaves my server.

The Project: Bringing Paintings to Life

iartshorts (and aiartshorts in English) is a vertical shorts channel that transforms portrait paintings into talking videos. The person in the painting animates, tells their story, and references the artist who painted them — all in a slightly cheeky tone, spoken in first person.

The goal: automate everything, from historical research to uploading on YouTube Shorts, TikTok, and Instagram Reels, with a single human intervention during processing: validation.

But for that to work, you need hardware.

🖥️ Homelab AI Workstation

3 GPUs · 192 GB VRAM · ~1050W total

Homelab running on Proxmox · GPU passthrough to an Ubuntu LXC

Component	Model	Notes
Motherboard	ASRock Rack ROMED8-2T	AMD EPYC single CPU, PCIe 4.0
CPU	AMD EPYC 7763	64 cores, chosen for PCIe lanes
RAM	128 GB DDR4	ECC Registered

GPUs (192 GB VRAM total)

Role	GPU	Notes
LLM Inference	2× RTX 4090D	Modded to 48 GB VRAM each · 300W limit · vLLM/SGLANG TP2
Image & Video Gen	1× RTX 6000 Pro Blackwell WS	96 GB VRAM · 450W limit · Silent & Efficient · ComfyUI

💡 Why the EPYC? Not for compute — for the PCIe lanes needed to drive 3 GPUs simultaneously.

💡 LLM perf: 2× 4090D in Tensor Parallel 2 ≈ single RTX 6000 Pro. But the 6000 Pro shines on image/video generation and runs much quieter.

The Software Stack

Everything runs on Proxmox since it’s also my homelab. A single Ubuntu LXC container hosts everything that needs GPU access.

The LLMs

Model	Usage	Infrastructure
Qwen3.6-27B-FP8	Daily driver (coding + pipeline)	vLLM / SGLANG TP2
Qwen3.5-122B-AWQ	Former daily driver	vLLM TP2
Gemma4 31B	Occasional	SGLANG

Models are managed by llama-swap + a custom Python script that generates the config from my model storage folder. Services run in Docker containers (vLLM, SGLANG, llama.cpp).

ComfyUI

All image/video generation goes through ComfyUI, with:

Flux Klein 9b for painting restoration and realistic transformation
Qwen TTS for speech synthesis
LTX2.3 for video lip-sync generation (AudioSync)
Wan2.2 for transitions and animations

The Pipeline: 10 Steps

From a portrait painting to a ready-to-publish short video — fully automated

🖼️ Random selection from ~30,000 paintings → ~20 per night

🔍 Step 0: Prepare

Vision AI analyzes portrait: gender, nudity, child detection → SFW filtering. Resize to 1280p via FFmpeg.

Qwen3.6-27B Vision · FFmpeg
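As a rough illustration of the Step 0 gate, here is a minimal sketch that assumes the vision model's answer has already been parsed into a few flags. The field names (`person_count`, `nudity`, `child`) and both helper functions are hypothetical, and `scale=-2:1280` is only one reading of "1280p" (height 1280, width kept even).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VisionReport:
    """Hypothetical shape of the vision model's answer for one painting."""
    person_count: int
    nudity: bool
    child: bool


def rejection_reason(report: VisionReport) -> Optional[str]:
    """Return why a painting is rejected, or None if it passes the SFW gate."""
    if report.person_count != 1:
        return "not a single-person portrait"
    if report.nudity:
        return "nudity"
    if report.child:
        return "child"
    return None


def resize_command(src: str, dst: str, height: int = 1280) -> List[str]:
    """Build an FFmpeg call that scales to `height` and keeps the aspect ratio.

    A width of -2 lets FFmpeg pick the matching width, rounded to an even number.
    """
    return ["ffmpeg", "-y", "-i", src, "-vf", f"scale=-2:{height}", dst]
```

A painting with a non-`None` reason would be marked as aborted before any GPU-heavy step runs.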

↓

📝 Step 1: Script

Offline Wikipedia ZIM research on the painting’s author and the portrait subject (if identifiable). LLM extracts key facts, anecdotes, and context → then generates the 1st-person narrative script (max 150 words, cheeky tone). Structured output via Pydantic.

Qwen3.6-27B · libzim · Pydantic
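The structured-output check can be sketched as follows. To keep the example dependency-free it uses a standard-library dataclass instead of Pydantic, which would express the same constraint declaratively. The field names (`title`, `artist`, `narration`) are assumptions, only the 150-word limit comes from the step description.

```python
import json
from dataclasses import dataclass

MAX_WORDS = 150  # limit stated for the narrative script


@dataclass
class Script:
    """Hypothetical structured output for one painting."""
    title: str
    artist: str
    narration: str


def parse_script(raw_json: str) -> Script:
    """Parse the LLM's JSON reply and enforce the 150-word limit."""
    data = json.loads(raw_json)
    script = Script(
        title=str(data["title"]),
        artist=str(data["artist"]),
        narration=str(data["narration"]).strip(),
    )
    word_count = len(script.narration.split())
    if word_count == 0:
        raise ValueError("empty narration")
    if word_count > MAX_WORDS:
        raise ValueError(f"narration has {word_count} words, limit is {MAX_WORDS}")
    return script
```

A `ValueError` or `KeyError` here is a natural trigger for asking the LLM again rather than passing a broken script on to TTS.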

↓

🎙️ Step 2: Audio

Text-to-Speech in French AND English. Natural voice from the portrait’s perspective.

Qwen TTS · ComfyUI

↓

🎨 Step 3: Clean Pic

AI restoration of the painting. Fix cracks, revive colors, keep pose unchanged.

Flux Klein · ComfyUI

↓

📸 Step 4: Realistic Pic

Transform the painting into a photo-realistic portrait — same model as Step 3.

Flux Klein · ComfyUI

↓

✨ Step 5: Transition

Golden sparkle/morph video — the painting transforms into the realistic version.

Wan2.2 · ComfyUI

↓

👄 Step 6: Script Video

Lip-sync: the realistic portrait speaks the generated audio. Head motion + facial expressions.

LTX2.3 · ComfyUI

↓

💬 Step 7: Subtitles

Generate and burn-in subtitles with custom styling.

Whisper · FFmpeg
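One way the subtitle step could be wired, assuming the Whisper output has been reduced to `(start, end, text)` segments. The function names and the `FontSize=24` style are placeholders, not the pipeline's actual styling. The FFmpeg `subtitles` filter and its `force_style` option do exist, though paths with special characters would need extra escaping.

```python
from typing import List, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Tuple[float, float, str]]) -> str:
    """Turn (start, end, text) segments into the body of an .srt file."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


def burn_in_command(video: str, srt: str, out: str, style: str = "FontSize=24") -> List[str]:
    """FFmpeg call that renders the subtitles into the picture (re-encodes the video)."""
    video_filter = f"subtitles={srt}:force_style='{style}'"
    return ["ffmpeg", "-y", "-i", video, "-vf", video_filter, "-c:a", "copy", out]
```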

↓

👋 Step 8: Loop / Outro

Loop back to the original painting image to generate a seamless looping video.

Wan2.2 · ComfyUI

↓

🎬 Step 9: Final Video

Concatenate all clips, add overlays (title, artist, IArtShorts watermark), background music, extract thumbnail.

FFmpeg
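The concatenation part of this step can be sketched with FFmpeg's concat demuxer. This covers only the joining of clips: overlays, music and thumbnail extraction are left out, and the clip names in the usage line are made up.

```python
from typing import List


def concat_list(clips: List[str]) -> str:
    """Body of the list file read by FFmpeg's concat demuxer, one clip per line."""
    lines = []
    for clip in clips:
        escaped = clip.replace("'", "'\\''")  # close quote, escaped quote, reopen
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_file: str, out: str) -> List[str]:
    """Join the listed clips without re-encoding (they must share codec settings)."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", out]
```

For example, `concat_list(["transition.mp4", "lipsync_fr.mp4", "outro.mp4"])` would be written to a text file and passed to `concat_command`. Stream copy only works if all clips were rendered with the same resolution and codecs; otherwise a re-encode is needed.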

Runs nightly → generates videos → stores in a ready-to-publish pool


~20 paintings / night · ~10 videos / morning

📤 Separate Upload Process

Background task · 5 paintings/day · posted at optimal times across all platforms

The Batch Processing Trick

Instead of processing each painting end-to-end, I process Step 1 for ALL paintings, then Step 2 for all, etc.

This drastically reduces model loading times: one warmup per step instead of one per step per painting (10 warmups instead of 10 × N for N paintings). It’s a big gain on total processing time.
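The effect of the loop order can be shown with a toy model. It assumes, as a simplification, that every step needs its own model and that only one model fits in VRAM at a time; in the real pipeline some steps share a model, so the absolute numbers would differ.

```python
from typing import List, Tuple


def count_model_loads(order: List[Tuple[str, int]]) -> int:
    """Count how often the active model changes while walking (painting, step) jobs."""
    loads, current = 0, None
    for _painting, step in order:
        if step != current:
            loads += 1
            current = step
    return loads


paintings = [f"painting_{i}" for i in range(20)]
steps = range(10)

# Painting-major: finish one painting before starting the next.
per_painting = [(p, s) for p in paintings for s in steps]
# Step-major: run one step for every painting, then move on.
per_step = [(p, s) for s in steps for p in paintings]

print(count_model_loads(per_painting))  # 200
print(count_model_loads(per_step))      # 10
```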

The State Machine

Each painting is tracked in a SQLite database with a precise state: PENDING → PROCESSING → COMPLETE / FAILED / ABORTED / REDO.

The system supports automatic retries (×2) and resuming from any step. A crash at step 7 on 50 videos doesn’t mean redoing everything.
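A minimal sketch of such a state table with the standard `sqlite3` module. The schema, the `run_step` helper and the `work` callback are assumptions made for illustration; the ABORTED and REDO states are not modelled here.

```python
import sqlite3

MAX_RETRIES = 2  # automatic retries mentioned above

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    painting TEXT NOT NULL,
    step     INTEGER NOT NULL,
    state    TEXT NOT NULL DEFAULT 'PENDING',
    attempts INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (painting, step)
)
"""

WHERE = "WHERE painting = ? AND step = ?"


def run_step(db, painting, step, work):
    """Run one step unless it is already COMPLETE; retry on failure; return the final state."""
    key = (painting, step)
    db.execute("INSERT OR IGNORE INTO jobs (painting, step) VALUES (?, ?)", key)
    state = db.execute(f"SELECT state FROM jobs {WHERE}", key).fetchone()[0]
    if state == "COMPLETE":
        return state  # resume: skip work that finished before a crash
    for _ in range(1 + MAX_RETRIES):
        db.execute(f"UPDATE jobs SET state = 'PROCESSING', attempts = attempts + 1 {WHERE}", key)
        db.commit()  # persist progress so a crash leaves a visible trace
        try:
            work(painting)
            state = "COMPLETE"
            break
        except Exception:
            state = "FAILED"
    db.execute(f"UPDATE jobs SET state = ? {WHERE}", (state, painting, step))
    db.commit()
    return state
```

Because the state is committed before and after each attempt, a row left in PROCESSING after a crash is simply picked up again on the next run, while COMPLETE rows are skipped.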

Custom ComfyUI Client

My pipeline uses a custom HTTP + WebSocket client to interact with ComfyUI, with real-time progress tracking and polling fallback. Between batches, ComfyUI is automatically restarted via SSH (systemctl restart comfyui) to prevent memory leaks.
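The polling fallback might look like the sketch below. It relies on ComfyUI's `/history/<prompt_id>` HTTP endpoint and assumes the prompt's entry appears there once execution has finished; the function names, the 2-second interval and the one-hour timeout are illustrative choices, not the author's client.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def fetch_history(base_url: str, prompt_id: str) -> dict:
    """GET /history/<prompt_id> from a ComfyUI server and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/history/{prompt_id}") as resp:
        return json.loads(resp.read())


def wait_for_prompt(
    prompt_id: str,
    fetch: Callable[[str], dict],
    interval: float = 2.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until the prompt shows up in the history; return its entry, or None on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        history = fetch(prompt_id)
        if prompt_id in history:  # assumption: present only once the run has finished
            return history[prompt_id]
        sleep(interval)
    return None
```

A call could look like `wait_for_prompt(pid, lambda p: fetch_history("http://127.0.0.1:8188", p))`, where the address is an assumed local ComfyUI instance. Passing `fetch` and `sleep` in as parameters keeps the loop testable without a running server.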

The Human-in-the-Loop: Honest Numbers

Contrary to what you might think, the pipeline doesn’t produce “plug-and-play” content. Every day, I manually review each video, one by one.

Rejection Rates

Stage	Rejection Rate	Reasons
Automatic pre-filtering (step 0)	~30%	Blurry painting, multiple people, nudity
Manual review (after the 10 steps)	~30%	Incorrect facial animation, visual artifacts, failed lip-sync, ugly transitions, scripts with errors
Overall success rate	~49%	0.70 × 0.70: a painting must pass both stages

My database holds 30,000 paintings (free-rights photos). Each evening, a random draw picks ~20 of them from a predefined list of entries that are meant to be portraits.

In the morning: ~10 validated videos out of ~20 started.
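The overall figure follows from multiplying the two pass rates, as this short calculation shows. It uses the rounded ~30% rates from the table, so the result is only as precise as those estimates.

```python
pre_filter_reject = 0.30   # automatic rejection at step 0
manual_reject = 0.30       # manual rejection after the 10 steps

# A painting must survive both stages, so the pass rates multiply.
success_rate = (1 - pre_filter_reject) * (1 - manual_reject)
started = 20

print(f"{success_rate:.0%}")              # 49%
print(round(started * success_rate, 1))   # 9.8 paintings validated per night
```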

💡 Pre-filtering is as critical as the pipeline itself. Better to reject 30% before wasting GPU time. Vision AI analysis covers part of the filtering (nudity, child, single portrait), but some criteria (blur, character size, head orientation) still require the human eye.

Why Local? The Energy Argument

Closed cloud models are better. But running locally, my only limit is the electricity bill.

In France:

✅ Electricity is cheap
✅ Low carbon (nuclear)
✅ GPUs run at night when energy is even cheaper
✅ Full independence: no API credits, no request limits
✅ Zero marginal cost per generated video

💰 Average Cost Per Video: €0.0193 (~1.9 ct) — at off-peak rate of €0.1579/kWh (2.9 ct if we take into account the 30% reject rate)
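Dividing the quoted cost by the quoted rate gives the energy behind one video. This is only a back-calculation from the two figures above, not a measured value.

```python
cost_per_video_eur = 0.0193   # average cost quoted above
rate_eur_per_kwh = 0.1579     # off-peak rate quoted above

energy_kwh = cost_per_video_eur / rate_eur_per_kwh
print(f"{energy_kwh * 1000:.0f} Wh per video")  # 122 Wh per video
```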

Lessons Learned

SQLite state management is essential. Without it, a crash = start over. With it, you resume where you left off.
Batch-by-step = big speedup. One GPU warmup per step instead of one warmup per painting per step.
Automatic ComfyUI restart. Via SSH between batches — it’s the only reliable way to avoid memory leaks over multi-hour runs.
Offline search via ZIM. No network dependency for Wikipedia data. Faster, more reliable, more private.
30% rejection is the current reality. Video models still generate too many artifacts for 100% automation. Human-in-the-loop is non-negotiable for quality content.

Licenses

Model	License	Commercial use
Qwen3.6-27B	Apache 2.0	✅ Free
Qwen3.5-122B	Apache 2.0	✅ Free
Gemma4-31B	Apache 2.0	✅ Free
Qwen TTS	Apache 2.0	✅ Free
Wan2.2-I2V-A14B	Apache 2.0	✅ Free
Flux.2 Klein 9B	FLUX Non-Commercial License v2.1	✅ Free usage of outputs
LTX-2.3	LTX-2 License	⚠️ Free if revenue < $10M

Pipeline Execution Report

Batch #0 · iartshorts · 2026-05-22 02:58 → 07:54 · 20 paintings selected, 30 videos produced (15 FR + 15 EN)

Total Wall-Clock: 4h 56m · GPU Compute: 4h 24m · GPU Efficiency: 89.4%

Videos Output

✅ 15 paintings completed → 30 videos (French + English each). 5 paintings aborted at step 0 (Vision Analysis) — all were rejected in under 3 seconds

GPU Efficiency Breakdown

Across all 15 completed paintings, the GPU was actively computing 89.4% of the total wall-clock time. The remaining 31 minutes are logged as queue wait; most of that is Steps 0, 1 and 9 (vision, LLM and FFmpeg), which run outside ComfyUI. That is excellent utilization for a serial pipeline.

GPU Active: 89.4% · GPU Compute: 4h 24m · Queue Wait: 31m

Time Breakdown by Pipeline Step

Each painting goes through 10 steps. The language-dependent steps (TTS, lip-sync and outro, marked × 2) run twice, once for French and once for English. Here’s how time is distributed across the 15 completed paintings:

Wall-Clock Time per Step (15 paintings)

Step	Stage	Wall-Clock	Share
0	Prepare (Vision)	0.9m	0.3%
1	Script (LLM + ZIM)	21.6m	7.3%
2	Audio (TTS × 2)	28.2m	9.6%
3	Clean Pic (Flux)	2.3m	0.8%
4	Realistic Pic (Flux)	1.6m	0.5%
5	Transition Vid (Wan)	22.5m	7.6%
6	LipSync Vid (LTX × 2)	164.1m	55.6%
7	Subtitles (Whisper)	7.5m	2.5%
8	Outro Vid (Wan × 2)	42.8m	14.5%
9	Final Video (FFmpeg)	3.4m	1.2%

Compute vs Queue Time per Step

Step	Stage	Wall-Clock	GPU Compute	Queue Wait	GPU %
0	Prepare (Vision)	0.9m	—	0.9m	—
1	Script (LLM + ZIM)	21.6m	—	21.6m	—
2	Audio (TTS × 2)	28.2m	27.5m	0.8m	97.3%
3	Clean Pic (Flux)	2.3m	1.5m	0.8m	66.2%
4	Realistic Pic (Flux)	1.6m	0.8m	0.8m	51.6%
5	Transition Vid (Wan)	22.5m	21.8m	0.8m	96.6%
6	LipSync Vid (LTX × 2)	164.1m	163.3m	0.8m	99.5%
7	Subtitles (Whisper)	7.5m	6.7m	0.8m	89.3%
8	Outro Vid (Wan × 2)	42.8m	42.0m	0.8m	98.1%
9	Final Video (FFmpeg)	3.4m	—	3.4m	—
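The 89.4% efficiency figure can be reproduced from the table by summing the two time columns. In this sketch, steps without a GPU Compute entry are counted as 0 GPU minutes, which is how the report appears to treat them.

```python
# (wall-clock minutes, GPU compute minutes) per step, copied from the table above.
steps = {
    0: (0.9, 0.0), 1: (21.6, 0.0), 2: (28.2, 27.5), 3: (2.3, 1.5), 4: (1.6, 0.8),
    5: (22.5, 21.8), 6: (164.1, 163.3), 7: (7.5, 6.7), 8: (42.8, 42.0), 9: (3.4, 0.0),
}

wall = sum(w for w, _ in steps.values())
gpu = sum(g for _, g in steps.values())

print(f"wall-clock {wall:.1f} min, GPU {gpu:.1f} min, efficiency {gpu / wall:.1%}")
# wall-clock 294.9 min, GPU 263.6 min, efficiency 89.4%
```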

Top 3 Bottlenecks

These three steps account for 88.3% of all GPU compute time:

#	Step	Model	GPU Compute	% of Total	Avg / Painting
1	LipSync Video	LTX2.3-AudioSync × 2	2h 43m	62.0%	10.9 min
2	Outro Video	Wan2.2 × 2	42 min	15.9%	2.8 min
3	Audio (TTS)	Qwen TTS × 2	28 min	10.4%	1.8 min

🐌 LipSync (LTX2.3) is the dominant bottleneck at 62% of GPU time. Each painting requires 2 lip-sync videos (FR + EN), each averaging ~5.4 minutes. Reducing this step’s duration would have the biggest impact on overall throughput.
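To get a feel for that impact, here is a small what-if calculation in the spirit of Amdahl's law. It assumes only the lip-sync step gets faster and everything else stays as measured; the speedup factors are hypothetical.

```python
total_wall_min = 294.9   # sum of the per-step wall-clock times in the report
lipsync_min = 164.1      # Step 6 wall-clock time


def batch_time(lipsync_speedup: float) -> float:
    """Batch wall-clock in minutes if only the lip-sync step got faster."""
    return total_wall_min - lipsync_min + lipsync_min / lipsync_speedup


for factor in (1.0, 1.5, 2.0):
    print(f"{factor}x lip-sync -> {batch_time(factor):.0f} min")
```

With these numbers, a lip-sync step twice as fast would bring the batch from about 295 minutes down to about 213.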

Per-Painting Breakdown (15 Completed)

Painting	Artist	Wall-Clock	GPU Compute	GPU %	Longest Step
Young_Girl_Holding_a_Basket	Berthe Morisot	22.5m	21.9m	97.3%	LipSync 7.6m
Thomas_Howard_2nd_Earl_of_Arundel	Anthony van Dyck	23.2m	22.7m	97.8%	LipSync 7.9m
The_Spanish_Guitarist	Pierre-Auguste Renoir	25.5m	25.2m	98.8%	LipSync 10.0m
Woman_in_a_Flowered_Hat	Pierre-Auguste Renoir	26.2m	25.9m	98.8%	LipSync 10.3m
Woman_in_a_Garden	Berthe Morisot	26.5m	26.2m	98.9%	LipSync 10.6m
Young_Girl_with_an_Apron	Berthe Morisot	26.9m	26.5m	98.5%	LipSync 10.8m
Woman_with_Red_Hair	Alice Pike Barney	27.2m	26.9m	98.9%	LipSync 10.9m
Young_Woman_with_a_Water_Pitcher	Johannes Vermeer	27.2m	26.8m	98.5%	LipSync 11.1m
Woman_in_Tulle_Blouse	Pierre-Auguste Renoir	27.5m	27.1m	98.6%	LipSync 11.2m
The_Embroiderer	Jean Siméon Chardin	27.6m	27.2m	98.6%	LipSync 11.3m
The_Stroller_Suzanne_Hoschedé	Claude Monet	27.8m	27.4m	98.6%	LipSync 11.4m
Young_Woman_with_Roses	Alice Pike Barney	29.0m	28.6m	98.6%	LipSync 12.2m
Woman_Seated_under_the_Willows	Claude Monet	29.3m	28.9m	98.6%	LipSync 12.5m
Young_Girl_in_a_Pink-and-Black_Hat	Pierre-Auguste Renoir	29.7m	29.3m	98.7%	LipSync 12.7m
The_Red_Kerchief	Claude Monet	31.6m	31.1m	98.4%	LipSync 13.6m

💡 Fastest painting: “Young Girl Holding a Basket” by Berthe Morisot, 22.5 min total. Slowest: “The Red Kerchief” by Claude Monet, 31.6 min total. The spread is about 9 minutes, roughly 6 of which come from lip-sync duration (7.6 vs 13.6 min).

My homelab that generates AI videos while I sleep