Built a CLI to reverse-engineer viral TikToks with AI. Here's the architecture.

TL;DR: Built a pipeline that downloads viral TikToks, analyzes them frame-by-frame with Gemini, and regenerates them with custom branding using AI video models. Currently at ~$2-4 per video depending on length.

The Problem

I run content for a few brands. The playbook everyone uses: find viral content, recreate it manually with your spin. Works, but brutal to scale.

Manual recreation meant:

  • Watching videos 10+ times to catch the vibe
  • Writing shot-by-shot scripts
  • Finding/creating matching visuals
  • Editing, re-editing, hoping for similar energy

One 30-second video = 4-6 hours of work. Didn't scale.

The Architecture

Built a CLI pipeline in TypeScript. Four stages:

1. Ingest: Download the TikTok via the ScrapeCreators API, then calculate segment boundaries (I break videos into 2-4 second chunks for generation).
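
The boundary math is nothing fancy; roughly this (a simplified sketch, names illustrative rather than the CLI's actual code):

```typescript
// Split a video's duration into generation-sized chunks. Targets ~3s,
// which keeps every chunk inside the 2-4s window for typical lengths.
interface Segment {
  start: number; // seconds
  end: number;
}

function segmentBoundaries(durationSec: number, targetSec = 3): Segment[] {
  const count = Math.max(1, Math.round(durationSec / targetSec));
  const chunk = durationSec / count; // even chunks, so none drift out of range
  return Array.from({ length: count }, (_, i) => ({
    start: i * chunk,
    end: Math.min((i + 1) * chunk, durationSec),
  }));
}

// A 10s video -> three ~3.33s segments
console.log(segmentBoundaries(10));
```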

2. Analyze: Feed the full video to Gemini 2.5 Flash and extract three things:

  • vibe_check — overall energy, pacing, hook structure
  • character_bible — who's in it, their role, visual style
  • segment_analysis — per-segment descriptions, camera angles, motion

This is where the magic happens. Gemini's video understanding is genuinely impressive.
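
If you want to wire this up yourself, here's a minimal sketch using Google's @google/genai SDK. The prompt and schema below are placeholders, not my production prompt (the real vibe prompt took weeks; more on that at the end):

```typescript
import {
  GoogleGenAI,
  createUserContent,
  createPartFromUri,
  FileState,
} from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function analyzeVideo(videoPath: string) {
  // Videos go through the Files API and need server-side processing
  // before they can be referenced in a prompt.
  let file = await ai.files.upload({
    file: videoPath,
    config: { mimeType: "video/mp4" },
  });
  while (file.state === FileState.PROCESSING) {
    await new Promise((r) => setTimeout(r, 2000));
    file = await ai.files.get({ name: file.name! });
  }

  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: createUserContent([
      createPartFromUri(file.uri!, file.mimeType!),
      // Placeholder prompt -- the real one is far more detailed.
      "Analyze this video. Return JSON with: vibe_check (energy, pacing, " +
        "hook structure), character_bible (people, roles, visual style), " +
        "segment_analysis (per-segment description, camera angle, motion).",
    ]),
    config: { responseMimeType: "application/json" },
  });

  return JSON.parse(response.text!);
}
```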

3. Synthesize: For each segment (see the sketch after this list):

  • Generate image prompt from analysis (Gemini)
  • Create seed image (Flux via FAL)
  • Generate video clip (Kling, Hailuo, or Wan — each has tradeoffs)
  • Optional: clone voice for narration (ElevenLabs)
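
A stripped-down sketch of one segment's image-to-video hop with @fal-ai/client. The endpoint IDs and input fields here follow fal's model docs as I remember them and may have changed, so check the current versions:

```typescript
import { fal } from "@fal-ai/client";

fal.config({ credentials: process.env.FAL_KEY });

async function synthesizeSegment(imagePrompt: string): Promise<string> {
  // 1. Seed image with Flux.
  const image = await fal.subscribe("fal-ai/flux/dev", {
    input: { prompt: imagePrompt },
  });
  const imageUrl = image.data.images[0].url;

  // 2. Animate the seed image with Kling (image-to-video).
  const video = await fal.subscribe(
    "fal-ai/kling-video/v1/standard/image-to-video",
    { input: { prompt: imagePrompt, image_url: imageUrl, duration: "5" } },
  );
  return video.data.video.url; // URL of the generated clip
}
```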

Two modes: parallel (faster) or chained (pass last frame to next segment for continuity).
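
In code, the two modes boil down to something like this (a simplified sketch; the generation and frame-grab steps are injected as callbacks so it stays self-contained):

```typescript
// generateClip would wrap the FAL calls above; lastFrame would be an
// ffmpeg frame grab (sketched further down).
type GenerateClip = (prompt: string, refImage?: string) => Promise<string>;
type LastFrame = (clipPath: string) => Promise<string>;

// Parallel: every segment kicks off at once. Fast, but no continuity.
async function runParallel(prompts: string[], generateClip: GenerateClip) {
  return Promise.all(prompts.map((p) => generateClip(p)));
}

// Chained: each segment waits on the previous clip and seeds from its
// final frame. Slower, but shots flow into each other.
async function runChained(
  prompts: string[],
  generateClip: GenerateClip,
  lastFrame: LastFrame,
) {
  const clips: string[] = [];
  let ref: string | undefined;
  for (const p of prompts) {
    const clip = await generateClip(p, ref);
    ref = await lastFrame(clip);
    clips.push(clip);
  }
  return clips;
}
```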

4. Stitch: FFmpeg merges the clips and adds music/narration. Output: your "twinned" video.
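
The merge itself is a thin wrapper around ffmpeg's concat demuxer, roughly:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { writeFile } from "node:fs/promises";

const run = promisify(execFile);

async function stitch(clips: string[], audioPath: string, outPath: string) {
  // The concat demuxer reads a text file listing the inputs. Stream
  // copy works because all clips come from the same generator with
  // matching codec parameters.
  await writeFile("list.txt", clips.map((c) => `file '${c}'`).join("\n"));

  await run("ffmpeg", [
    "-f", "concat", "-safe", "0", "-i", "list.txt", // merged clips
    "-i", audioPath,                                // music/narration
    "-map", "0:v", "-map", "1:a",                   // video from clips, audio from the track
    "-c:v", "copy", "-shortest", "-y", outPath,
  ]);
}
```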

Costs (Honest Numbers)

Per 30-second video (3 segments):

  • Gemini analysis: ~$0.02
  • Flux images (3x): ~$0.15
  • Video generation (3x Kling): ~$2.50
  • ElevenLabs (optional): ~$0.30
  • Total: $2.97

Hailuo is cheaper (~$1.80 total) but slower. Wan is experimental.

What Actually Surprised Me

  1. Segmentation matters more than model quality. Bad segment boundaries = jarring output regardless of video model.
  2. Chained mode is worth the time penalty. Passing the last frame as reference to the next segment creates 10x better continuity (the frame grab is sketched after this list). Only Kling supports this well.
  3. Vibe extraction is the hardest part. Getting Gemini to capture the energy of a video, not just describe what's in it, took weeks of prompt iteration.
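
For reference, grabbing that final frame is a single ffmpeg call, something like:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// -sseof -0.1 seeks to 0.1s before the end of the clip; -update 1
// overwrites the output as frames decode, so the file ends up holding
// the final frame.
async function lastFrame(clipPath: string, outPath = "last.jpg") {
  await run("ffmpeg", [
    "-sseof", "-0.1", "-i", clipPath,
    "-update", "1", "-q:v", "1", "-y", outPath,
  ]);
  return outPath;
}
```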

What's Next

Considering adding trend discovery (scrape trending sounds/formats) and batch processing. Right now it's one video at a time.

Happy to share more about specific parts. The Gemini prompt for vibe extraction took forever to get right — can share that if useful.

Author: Mindless_Swimming315