Leo Huang grew tired of watching large language models pretend to understand video. Paste a YouTube link into ChatGPT and it reads the transcript. Claude refuses video files outright. Gemini uploads everything to Google’s servers and grabs frames at a rigid one-per-second pace. Fast cuts vanish. Static slides flood the context window with duplicates.
So Huang built a fix. His open-source project, hosted at https://github.com/HUANGCHIHHUNGLeo/claude-real-video, processes videos locally. It detects actual scene changes. It discards near-identical frames. It produces a clean folder of meaningful images, a plain-text transcript, and a manifest file any model can consume. No cloud uploads. No fixed sampling. Just smarter input.
The command is simple. Run crv "https://www.youtube.com/watch?v=...". Out comes a directory with selected JPEGs, a transcript.txt, and MANIFEST.txt. Drop those into Claude, ChatGPT, or Gemini. Ask what happened on screen. The model finally sees.
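For readers who would rather script the hand-off than drag files around, here is a minimal sketch of gathering that output folder in Python. The file names transcript.txt and MANIFEST.txt come from the description above. The function name, the returned dictionary layout, and the assumption that frames end in .jpg or .jpeg are illustrative, not part of the tool.

```python
from pathlib import Path


def collect_output(out_dir):
    """Gather frames, transcript, and manifest from an output folder.

    Hypothetical helper: the layout (JPEGs plus transcript.txt and
    MANIFEST.txt in one directory) follows the article's description.
    """
    root = Path(out_dir)
    result = {"frames": [], "transcript": None, "manifest": None}
    if not root.is_dir():
        return result

    # Assumption: frames use a .jpg or .jpeg suffix; sort for stable order.
    frames = [p for p in root.iterdir() if p.suffix.lower() in (".jpg", ".jpeg")]
    result["frames"] = sorted(frames)

    transcript = root / "transcript.txt"
    if transcript.is_file():
        result["transcript"] = transcript.read_text(encoding="utf-8")

    manifest = root / "MANIFEST.txt"
    if manifest.is_file():
        result["manifest"] = manifest.read_text(encoding="utf-8")

    return result
```

A missing directory yields empty fields rather than an exception, so a pipeline can check whether the run produced anything before calling a model.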
Smarter Sampling Changes What Models Can Do
Fixed-interval frame extraction wastes tokens and misses action. A 10-minute static presentation yields 600 near-duplicates. A frenetic TikTok reel loses key moments between samples. Huang’s approach uses ffmpeg to catch every scene transition above a sensitivity threshold, then adds a density floor so slow scenes still get at least one frame every few seconds.
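The density floor is easy to picture in code. The sketch below is not Huang's implementation. It assumes the scene-change timestamps have already been detected (for instance from ffmpeg's scene score) and shows one way to merge them with a floor. The function name and the five-second max_gap default are hypothetical.

```python
def select_timestamps(scene_cuts, duration, max_gap=5.0):
    """Merge detected scene cuts with a density floor.

    scene_cuts: timestamps in seconds where a scene change was detected.
    duration:   total video length in seconds.
    max_gap:    hypothetical floor; no stretch longer than this goes unsampled.
    """
    cuts = sorted(t for t in set(scene_cuts) if 0.0 < t < duration)
    kept = [0.0]  # always sample the opening frame
    for boundary in cuts + [duration]:
        # Fill long quiet stretches with evenly spaced samples.
        while boundary - kept[-1] > max_gap:
            kept.append(kept[-1] + max_gap)
        # Keep the cut itself; the end of the video is only a boundary.
        if boundary < duration:
            kept.append(boundary)
    return kept
```

With a single cut at 12 seconds in a 20-second clip, the sketch keeps the cut and fills the quiet stretches on either side. With rapid cuts, every cut survives and no filler is added.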
Next comes deduplication. The tool compares each candidate frame against a sliding window of previously kept images, measuring raw pixel differences after downscaling. Hashes often fail on flat colors or equal-luma shifts. Pixel math does not. When a cutaway returns to an earlier shot, that shot is sent only once. The result: fewer frames, higher relevance, lower cost.
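A rough sketch of that sliding-window comparison follows. It is an illustration under stated assumptions, not the project's code: frames are assumed to be already downscaled to small grayscale images and flattened into lists of 0 to 255 values, and the threshold and window defaults are invented for the example.

```python
from collections import deque


def mean_abs_diff(a, b):
    """Mean absolute pixel difference as a fraction of full range (0.0 to 1.0)."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / (255.0 * len(a))


def dedupe(frames, threshold=0.05, window=3):
    """Return indices of frames that differ from every recently kept frame.

    threshold and window are hypothetical defaults chosen for illustration.
    """
    recent = deque(maxlen=window)  # sliding window of kept frames
    kept = []
    for i, frame in enumerate(frames):
        # Keep the frame only if it is far from all frames in the window.
        if all(mean_abs_diff(frame, prev) > threshold for prev in recent):
            recent.append(frame)
            kept.append(i)
    return kept
```

The window is what handles the cutaway case. With a window of one, a return to an earlier shot looks new because only the most recent kept frame is checked. With a wider window, the earlier shot is still remembered and the repeat is dropped.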
Audio handling proves equally thoughtful. If a local file already has subtitles, whether as a .srt or .vtt file or as an embedded track, the tool uses them, which is both faster and more accurate than transcribing from scratch. Absent subtitles, it falls back to Whisper. Users can also save the full original soundtrack as audio.m4a for models that accept audio input. The transcript captures words. The file preserves music, tone, and effects.
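The subtitles-first logic can be sketched in a few lines. This is an illustration only: it assumes a sidecar subtitle file shares the video's base name, it ignores embedded tracks, and both function names are hypothetical. When no sidecar is found, a real pipeline would hand the audio to a speech-to-text model such as Whisper instead.

```python
import re
from pathlib import Path

SUBTITLE_SUFFIXES = (".srt", ".vtt")


def find_sidecar_subtitles(video_path):
    """Return a subtitle file sitting next to the video, or None.

    Assumption: the sidecar shares the video's base name (clip.mp4 -> clip.srt).
    """
    video = Path(video_path)
    for suffix in SUBTITLE_SUFFIXES:
        candidate = video.with_suffix(suffix)
        if candidate.is_file():
            return candidate
    return None


def subtitles_to_text(raw):
    """Strip cue numbers, timestamps, and the WEBVTT header from subtitle text."""
    lines = []
    for line in raw.splitlines():
        line = line.strip()
        # Simplification: a spoken line made only of digits is also dropped.
        if not line or line.isdigit() or line.startswith("WEBVTT") or "-->" in line:
            continue
        lines.append(re.sub(r"<[^>]+>", "", line))  # drop inline tags like <i>
    return "\n".join(lines)
```

The parser is deliberately loose so that one function covers both formats. A production version would use a proper subtitle parser to preserve timing.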
Options give control. Adjust scene sensitivity. Set a hard cap at 150 frames. Tune the deduplication threshold. Generate an HTML report that visualizes every keep-or-drop decision with exact difference percentages. Developers tune once. Production runs clean.
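How the tool enforces its frame cap is not described here, so the following is only one plausible strategy: if more frames survive than the cap allows, keep an evenly spaced subset that always includes the first and last. The function name is hypothetical, and the 150 default simply mirrors the figure mentioned above.

```python
def cap_frames(items, max_frames=150):
    """Thin a list down to at most max_frames entries, spread evenly."""
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    n = len(items)
    if n <= max_frames:
        return list(items)
    if max_frames == 1:
        return [items[0]]
    # Step is greater than 1 here, so rounded indices never repeat.
    step = (n - 1) / (max_frames - 1)
    return [items[round(i * step)] for i in range(max_frames)]
```

Even spacing is a blunt choice. An alternative would be to drop the frames with the smallest difference scores first, which keeps more of the fast-moving sections.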
Installation stays straightforward. Install the package with pip. Add the whisper extra for transcription. Install ffmpeg through brew, apt, winget, or direct download. The tool runs on macOS, Windows, and Linux with Python 3.10 or newer. Cookie support lets the tool reach gated content for authorized personal use. Huang warns clearly: download only what you own the rights to. Don’t embed credentials in repositories.
Python integration fits agentic workflows. One import and a single function call returns an object with frame count and file paths. Chain it with video generation tools or analysis scripts. Recent creator discussions on X show exactly that pattern. Builders connect Claude Code to Higgsfield or other video models, use this preprocessing step, then direct full production pipelines from a single chat. One post described turning a story prompt into consistent multi-shot video with character references and clean dialogue in under 40 minutes.
The timing matters. Video generation has exploded. ByteDance’s Seedance 2.0, Google’s Veo, Kling, and others produce ever longer and more realistic clips. Yet understanding incoming video remains a bottleneck. A recent guide from Ryan Doser explains how creators route multiple video models through OpenRouter inside Claude Code setups. Preprocessing with accurate, deduplicated frames raises the quality of prompts and analysis that feed those generators.
Anthropic itself has not shipped native video understanding in Claude at the level some competitors offer. The company focuses on safety, reasoning, and coding strength. That leaves room for community tools. Huang’s repository, with its MIT license and 23 stars as of this week, fills a practical gap. It doesn’t claim to replace native multimodal models. It makes existing ones dramatically more effective.
Limitations exist. Processing a long feature film still demands time and disk space. Whisper transcription quality varies with accents and background noise. Scene detection can over-trigger on certain lighting shifts. Yet the manifest file and report.html let users audit and adjust. Reruns overwrite the output directory, so copy or rename any results worth keeping before running again.
But consider the shift. Video no longer sits behind a wall of transcripts or fixed sampling rates. A local script extracts what matters. Models receive curated vision. Analysis becomes precise. Automation follows. Content creators repurpose long videos faster. Researchers annotate footage without manual frame selection. Security teams review surveillance without drowning in duplicates.
And the project keeps evolving. Recent updates added full audio preservation so models can hear music and effects alongside the transcript. Threads posts from the author, shared just days ago, highlight the addition. Builders already layer it with Claude Code toolkits that orchestrate script writing, voiceover, and rendering.
The broader pattern feels familiar. Frontier labs push model scale. Practitioners build the glue that makes those models useful in specific domains. Video understanding was one stubborn corner. Huang’s small tool just sanded down the rough edge. Expect more forks. Expect tighter integration with agent frameworks. Expect workflows where one prompt triggers video intake, analysis, then new generation without a human touching a frame.
Right now the barrier is low. Install ffmpeg. Run two pip commands. Point at a URL. Feed the output to your model of choice. The difference appears immediately. Static slides collapse to single frames. Action sequences deliver every beat. The transcript aligns. The manifest explains the sequence. The model finally watches. And the results speak for themselves.


WebProNews is an iEntry Publication