Transcription that stays on your machine
The Podcast Idiot Transcriber is a desktop application I built to solve a simple problem: I needed accurate transcripts of my podcast episodes without paying for a cloud service every month or handing my audio files off to someone else’s server.
It uses OpenAI Whisper — the same AI transcription engine that powers some of the best commercial services — running entirely on your local computer. After the one-time setup, it works completely offline. Your audio never leaves your machine.
It also supports speaker diarization — the ability to tell speakers apart — so you get a labeled transcript showing which person said what. Essential for interview-format podcasts.
Completely Private
Your audio files never leave your computer. No uploads, no cloud processing, no data collection. Ever.
Free Forever
No subscriptions, no API keys to pay for, no usage limits. Download once, use as much as you want.
Speaker Labels
Automatically identifies different voices and labels them in the transcript. Perfect for interviews.
All Formats
Outputs TXT, SRT, VTT, JSON, and Podcasting 2.0 JSON, ready to link from your RSS feed.
GPU Accelerated
Automatically uses your NVIDIA, Apple Silicon, or AMD GPU for dramatically faster transcription.
Cross-Platform
One download works on Windows, macOS, and Linux. Installers and uninstallers for all three included.
Simple from start to transcript
Download & Install
Download the zip, extract it, and run the installer for your OS. It checks for Python, sets up ffmpeg (automatically on macOS, with included instructions on Windows and Linux), installs all other dependencies, and downloads the base Whisper model (about 145 MB) one time.
Open the App
Launch Podcast Idiot Transcriber from your desktop icon, Start Menu, or application launcher. A clean branded interface shows all your options at a glance.
Choose Your Audio
Browse for your MP3, WAV, M4A, FLAC, OGG, or AAC file. Select an output folder, or let the app save transcripts beside the original audio.
Configure Options
Pick your Whisper model, choose language or auto-detect, toggle speaker labels, choose output formats, and set CPU priority so transcription runs quietly in the background.
Click Transcribe
Hit the big red Transcribe button and watch the progress log. When done, all your chosen output files are ready in the output folder.
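For the curious, the steps above can be sketched in a few lines of Python. This is a rough illustration, not the app's actual source: the helper names (`lower_cpu_priority`, `output_paths`, `transcribe`) are made up for this example, and the priority trick shown only covers macOS and Linux. The Whisper calls (`whisper.load_model` and `model.transcribe`) come from the openai-whisper package.

```python
import os
from pathlib import Path
from typing import Dict, List, Optional


def lower_cpu_priority(increment: int = 10) -> bool:
    """Ask the OS to schedule this process less eagerly (macOS/Linux only)."""
    if hasattr(os, "nice"):
        os.nice(increment)
        return True
    return False  # Windows needs a different mechanism, not shown here


def output_paths(audio_path: str, formats: List[str],
                 out_dir: Optional[str] = None) -> Dict[str, Path]:
    """Map each chosen format to a file in out_dir, or beside the audio."""
    audio = Path(audio_path)
    folder = Path(out_dir) if out_dir else audio.parent
    return {fmt: folder / f"{audio.stem}.{fmt}" for fmt in formats}


def transcribe(audio_path: str, model_name: str = "base",
               language: Optional[str] = None) -> dict:
    """Run Whisper locally; language=None lets Whisper auto-detect."""
    import whisper  # openai-whisper, imported lazily so the helpers work without it

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, language=language)
```

The returned dictionary carries the full text under `"text"` and a list of timestamped `"segments"`, which is what the output formats are built from.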
Every format podcasters need
The app creates all your transcript files in a single pass. Choose which ones you want — or grab all of them.
| Format | File | Best For |
|---|---|---|
| TXT | .txt | Plain readable transcript for your website, show notes, or personal reference |
| SRT | .srt | Subtitle file for video versions of your podcast, YouTube, or video editors |
| VTT | .vtt | WebVTT captions for HTML5 players and web-based podcast players |
| JSON | .json | Full timestamped segment data for developers or custom integrations |
| Podcasting 2.0 | _podcast20.json | Ready for the <podcast:transcript> tag in your RSS feed |
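If you want to see what separates the SRT and VTT rows in the table above, here is a minimal sketch that renders both from Whisper-style segments (dictionaries with `start`, `end`, and `text`). The function names are my own for illustration; the app's internals may differ. The main difference is small: SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header and uses a period.

```python
def timestamp(seconds: float, sep: str = ",") -> str:
    """Format seconds as HH:MM:SS<sep>mmm (comma for SRT, period for VTT)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{ms:03d}"


def to_srt(segments: list) -> str:
    """Numbered cues, each followed by a blank line."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{timestamp(seg['start'])} --> {timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


def to_vtt(segments: list) -> str:
    """WEBVTT header, then unnumbered cues with period-separated milliseconds."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{timestamp(seg['start'], '.')} --> {timestamp(seg['end'], '.')}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)
```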
Podcasting 2.0 ready: The Podcasting 2.0 JSON format follows the official podcast namespace transcript spec. Upload it to your server and point to it from your RSS feed, and any Podcasting 2.0 app that supports transcripts can display it.
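As a rough sketch of what that file and feed entry look like: the field names below (`version`, `segments`, `startTime`, `endTime`, `body`, `speaker`) follow my reading of the podcast namespace JSON transcript format, and the helper names and example URL are placeholders for illustration. Check the spec before relying on the details.

```python
import json
from xml.sax.saxutils import quoteattr


def to_podcast20(segments: list) -> dict:
    """Convert Whisper-style segments to the namespace's JSON transcript shape."""
    out = []
    for seg in segments:
        entry = {
            "startTime": seg["start"],
            "endTime": seg["end"],
            "body": seg["text"].strip(),
        }
        if seg.get("speaker"):  # speaker is only present when labels are on
            entry["speaker"] = seg["speaker"]
        out.append(entry)
    return {"version": "1.0.0", "segments": out}


def render_podcast20(segments: list) -> str:
    """Serialize the transcript for upload."""
    return json.dumps(to_podcast20(segments), indent=2)


def transcript_tag(url: str) -> str:
    """Build the RSS element that points a feed item at the uploaded file."""
    return f'<podcast:transcript url={quoteattr(url)} type="application/json" />'
```

The tag goes inside the episode's `<item>` element, with the URL pointing at wherever you uploaded the `_podcast20.json` file.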
Pick your speed vs. accuracy tradeoff
Whisper comes in five sizes. The app defaults to base — a great balance for most podcasts. Switch to a larger model any time for more accurate results on difficult audio.
| Model | Size | Time for 1 hr of audio (CPU) | Accuracy |
|---|---|---|---|
| tiny | 75 MB | ~5 min | Good |
| base ★ | 145 MB | ~10–15 min | Very Good — recommended |
| small | 465 MB | ~20–30 min | Great |
| medium | 1.5 GB | ~45–60 min | Excellent |
| large | 2.9 GB | ~90–120 min | Best |
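To turn the table into a quick estimate for your own episode length, here is a small back-of-the-envelope helper. It assumes CPU time scales linearly with audio length, which is a simplification, and the numbers are simply copied from the table above. Real timings depend on your hardware.

```python
# Minutes of CPU time per hour of audio (low, high), copied from the table.
CPU_MINUTES_PER_HOUR = {
    "tiny": (5, 5),
    "base": (10, 15),
    "small": (20, 30),
    "medium": (45, 60),
    "large": (90, 120),
}


def estimate_cpu_minutes(model: str, audio_minutes: float) -> tuple:
    """Return a (low, high) estimate in minutes, assuming linear scaling."""
    low, high = CPU_MINUTES_PER_HOUR[model]
    hours = audio_minutes / 60
    return (low * hours, high * hours)
```

For example, `estimate_cpu_minutes("base", 30)` gives `(5.0, 7.5)`, so a half-hour episode on the default model should take roughly five to eight minutes on CPU.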
Have a GPU? The app automatically detects and uses your NVIDIA (CUDA), Apple Silicon (Metal), or AMD GPU. A mid-range NVIDIA GPU can be 8–15× faster than CPU, which brings a one-hour episode on the base model down to under two minutes.
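One common way to do this kind of detection is through PyTorch, which Whisper runs on. The sketch below is an assumption about the general approach rather than the app's exact logic, and `pick_device` is a name invented for this example. It falls back to CPU when PyTorch is missing or no supported GPU is found.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, so nothing to accelerate with

    # NVIDIA GPUs show up here; ROCm builds of PyTorch also report
    # AMD GPUs through the same torch.cuda interface.
    if torch.cuda.is_available():
        return "cuda"

    # Apple Silicon uses the Metal Performance Shaders (MPS) backend.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"
```

The result can be passed to Whisper as `whisper.load_model("base", device=pick_device())`.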
Know who said what
Speaker diarization automatically detects different voices and labels them in the transcript. Instead of a wall of text, you get something like this:
[SPEAKER_01] Thanks for having me. I’ve been looking for something like this for a while.
[SPEAKER_00] Let’s start with why you think transcripts matter for podcasters.
Speaker labeling uses pyannote.audio and requires a free HuggingFace account and token — a two-minute one-time setup. You can also turn it off entirely for faster, label-free transcription.
Speaker labeling works best with two clearly distinct voices and minimal crosstalk, so typical interview podcasts are labeled accurately.
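Under the hood, labeling like this usually means combining two results: Whisper's timestamped text segments and the diarizer's "who spoke when" turns. A simple way to merge them, sketched below, is to give each segment the speaker whose turn overlaps it the most. This is an illustration of the general technique, not a description of the app's code. The `(start, end, label)` tuples are a hypothetical stand-in for the diarizer's output, and the function names are my own.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds that two time ranges share (0.0 if they don't touch)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments: list, turns: list) -> list:
    """Attach to each segment the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_label, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, label in turns:
            shared = overlap(seg["start"], seg["end"], turn_start, turn_end)
            if shared > best_overlap:
                best_label, best_overlap = label, shared
        labeled.append({**seg, "speaker": best_label})
    return labeled


def render(labeled: list) -> str:
    """Produce the bracketed transcript style shown above."""
    return "\n".join(f"[{s['speaker']}] {s['text'].strip()}" for s in labeled)
```

Heavy crosstalk is where this breaks down: when two turns overlap one segment almost equally, whichever speaker wins by a fraction of a second gets the whole line.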
What you need to run it
Python 3.10+
The installer checks for Python and guides you if it’s missing. Mac and Linux often have it pre-installed.
ffmpeg
Required for audio processing. The Mac installer gets it via Homebrew. Windows and Linux instructions are included.
~500 MB Disk
For the base Whisper model and Python environment. Larger models need up to 3 GB extra.
4 GB RAM
8 GB or more recommended. Larger Whisper models need more RAM; the model table above lists their sizes.
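If you would like to check these requirements yourself before installing, here is a small preflight sketch using only the Python standard library. The thresholds mirror the list above, and the function names are invented for this example. RAM is left out because the standard library has no portable way to read it.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)
MIN_FREE_BYTES = 500 * 1024 * 1024  # roughly 500 MB for the base model and environment


def python_ok(version=None) -> bool:
    """True if the given (major, minor) version, or the running one, is 3.10+."""
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    return version >= MIN_PYTHON


def ffmpeg_on_path() -> bool:
    """True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def disk_ok(path: str = ".") -> bool:
    """True if the drive holding `path` has at least ~500 MB free."""
    return shutil.disk_usage(path).free >= MIN_FREE_BYTES


if __name__ == "__main__":
    print("Python 3.10+:", python_ok())
    print("ffmpeg found:", ffmpeg_on_path())
    print("~500 MB free:", disk_ok())
```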