Claude / AI2026-05-13 · 6 min

Video-analyzer: a Claude skill for any spoken video

Video-analyzer is a Claude Code skill that turns any spoken-word video — a YouTube link, a Reels, a local .mp4 — into a transcript, a summary, a tutorial, a list of quotes, or whatever you ask for. The deterministic part (download, audio extraction, scene-keyframe sampling, speech-to-text) runs as bundled scripts. The synthesis is done by Claude using the transcript and a few sampled frames.

I use it daily — for podcast notes, lecture summaries, "what does this 90-min video actually say", and pulling timestamps out of long calls.

Install in 10 seconds

Paste this into your terminal. It downloads the bundle and unpacks it into the right folder for Claude Code to pick up automatically.

One-liner install

mkdir -p ~/.claude/skills && curl -L https://alfayzullaev.com/skills/video-analyzer.tar.gz | tar -xz -C ~/.claude/skills/

That's it — open Claude Code, the skill is now discoverable. No restart needed.

Or download manually

↓video-analyzer.tar.gz(28 KB)

Unpack into ~/.claude/skills/ so the final path is ~/.claude/skills/video-analyzer/. The bundle contains SKILL.md, the scripts/ folder, and references/ — no venv, no caches.

How to use it

Once installed, just talk to Claude Code naturally. The skill triggers on phrases like:

"transcribe this video: /path/to/file.mp4"
"make a tutorial out of this Reels: https://..."
"summarize this YouTube link in 5 bullets"
"what does the speaker say about pricing?"
"pull the best quotes with timestamps"
"list every video on this YouTube channel" (no transcription, just titles + links)

You don't have to say "use the video-analyzer skill" — Claude picks it up from the input pattern (path, URL, or channel/playlist link).

Output modes

The same pipeline runs for everything; only the final synthesis changes based on what you asked for.

| Mode | When it triggers | What you get | |---|---|---| | tutorial | "make a tutorial / breakdown / detailed notes" | Multi-section markdown with timestamps | | brief | "TLDR / short summary / в двух словах" | 1 paragraph + 5–10 bullets | | transcript | "just the subtitles / SRT / raw text" | Plain .srt and .txt from the workspace | | quotes | "best lines / memorable quotes / highlights" | 5–15 quotes with [MM:SS] timestamps | | qa | "what does it say about X" | Direct answer with cited timestamps | | topics | "what topics / table of contents" | Hierarchical breakdown with timestamps |

What's under the hood

The pipeline is scripts/pipeline.sh, which chains independent stages:

download_video.sh — yt-dlp for any URL, with --cookies-from-browser support for login-walled platforms (Instagram, etc.)
extract_audio.sh — ffmpeg to 16 kHz mono WAV
extract_keyframes.sh — scene-change detection → JPG frames + timestamps
transcribe.py — pluggable Whisper backend (mlx-whisper on Mac, faster-whisper on Linux/CUDA)

For URL inputs, the skill asks you upfront whether you want audio-only (cheapest, no frames), low-quality video + frames (default for tutorials), medium, or best (for slide-by-slide visual analysis). Audio-only saves 5–10× the bandwidth when you just want a transcript.

For YouTube channel/playlist links, there's a separate list_channel.sh that just dumps titles + URLs — no transcription.

Extending it

The pipeline is modular. To add a stage (e.g. OCR on keyframes, diarization, translation), drop a script into scripts/ that follows the workspace contract documented in references/pipeline_architecture.md. To add an output mode (e.g. slides, tweet-thread), append a row to the table in SKILL.md and write a template into references/.

Caveats

Needs ffmpeg and yt-dlp installed (the pipeline auto-installs them via brew on macOS; on Linux install from your package manager).
First run sets up a Python venv inside the skill folder for Whisper — adds ~500 MB.
Silent video doesn't work — there's no speech to transcribe. The skill will tell you.

If you want the design rationale for skills like this — why a deterministic pipeline beats letting Claude orchestrate ffmpeg from scratch every time — ping me.