Video-analyzer: a Claude skill for any spoken video
Video-analyzer is a Claude Code skill that turns any spoken-word video — a YouTube link, a Reels, a local .mp4 — into a transcript, a summary, a tutorial, a list of quotes, or whatever you ask for. The deterministic part (download, audio extraction, scene-keyframe sampling, speech-to-text) runs as bundled scripts. The synthesis is done by Claude using the transcript and a few sampled frames.
I use it daily — for podcast notes, lecture summaries, "what does this 90-min video actually say", and pulling timestamps out of long calls.
Install in 10 seconds
Paste this into your terminal. It downloads the bundle and unpacks it into the right folder for Claude Code to pick up automatically.
mkdir -p ~/.claude/skills && curl -L https://alfayzullaev.com/skills/video-analyzer.tar.gz | tar -xz -C ~/.claude/skills/That's it — open Claude Code, the skill is now discoverable. No restart needed.
Or download manually
↓video-analyzer.tar.gz(28 KB)Unpack into ~/.claude/skills/ so the final path is ~/.claude/skills/video-analyzer/. The bundle contains SKILL.md, the scripts/ folder, and references/ — no venv, no caches.
How to use it
Once installed, just talk to Claude Code naturally. The skill triggers on phrases like:
- "transcribe this video:
/path/to/file.mp4" - "make a tutorial out of this Reels:
https://..." - "summarize this YouTube link in 5 bullets"
- "what does the speaker say about pricing?"
- "pull the best quotes with timestamps"
- "list every video on this YouTube channel" (no transcription, just titles + links)
You don't have to say "use the video-analyzer skill" — Claude picks it up from the input pattern (path, URL, or channel/playlist link).
Output modes
The same pipeline runs for everything; only the final synthesis changes based on what you asked for.
| Mode | When it triggers | What you get |
|---|---|---|
| tutorial | "make a tutorial / breakdown / detailed notes" | Multi-section markdown with timestamps |
| brief | "TLDR / short summary / в двух словах" | 1 paragraph + 5–10 bullets |
| transcript | "just the subtitles / SRT / raw text" | Plain .srt and .txt from the workspace |
| quotes | "best lines / memorable quotes / highlights" | 5–15 quotes with [MM:SS] timestamps |
| qa | "what does it say about X" | Direct answer with cited timestamps |
| topics | "what topics / table of contents" | Hierarchical breakdown with timestamps |
What's under the hood
The pipeline is scripts/pipeline.sh, which chains independent stages:
download_video.sh—yt-dlpfor any URL, with--cookies-from-browsersupport for login-walled platforms (Instagram, etc.)extract_audio.sh—ffmpegto 16 kHz mono WAVextract_keyframes.sh— scene-change detection → JPG frames + timestampstranscribe.py— pluggable Whisper backend (mlx-whisper on Mac, faster-whisper on Linux/CUDA)
For URL inputs, the skill asks you upfront whether you want audio-only (cheapest, no frames), low-quality video + frames (default for tutorials), medium, or best (for slide-by-slide visual analysis). Audio-only saves 5–10× the bandwidth when you just want a transcript.
For YouTube channel/playlist links, there's a separate list_channel.sh that just dumps titles + URLs — no transcription.
Extending it
The pipeline is modular. To add a stage (e.g. OCR on keyframes, diarization, translation), drop a script into scripts/ that follows the workspace contract documented in references/pipeline_architecture.md. To add an output mode (e.g. slides, tweet-thread), append a row to the table in SKILL.md and write a template into references/.
Caveats
- Needs
ffmpegandyt-dlpinstalled (the pipeline auto-installs them viabrewon macOS; on Linux install from your package manager). - First run sets up a Python venv inside the skill folder for Whisper — adds ~500 MB.
- Silent video doesn't work — there's no speech to transcribe. The skill will tell you.
If you want the design rationale for skills like this — why a deterministic pipeline beats letting Claude orchestrate ffmpeg from scratch every time — ping me.