
Whisper Speech Recognition (Speech-to-Text)

1. Overview

Whisper is an AI speech recognition model developed by OpenAI.
It converts audio files into text (speech-to-text, STT) and supports multiple languages.
In VoiceScriptPlayer, Whisper is used for automatic subtitle generation, script extraction, and real-time voice command recognition.
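
As a quick illustration of what the model does, here is a minimal transcription sketch using the reference openai-whisper Python package (VoiceScriptPlayer itself runs Whisper through WhisperNet, so this is not the application's own code; audio.mp3 is a placeholder file name):

```python
# Minimal sketch with the reference "openai-whisper" Python package.
# VoiceScriptPlayer runs Whisper through WhisperNet; this only illustrates
# the underlying model. "audio.mp3" is a placeholder file name.
import whisper

model = whisper.load_model("base")        # tiny / base / small / medium / large
result = model.transcribe("audio.mp3")    # language is auto-detected by default
print(result["text"])                     # the full transcript as plain text
```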


2. Installation & Setup

VoiceScriptPlayer already includes WhisperNet, so no additional installation is required.
WhisperNet is a .NET implementation of Whisper, which lets the model run directly inside VoiceScriptPlayer.
- WhisperNet GitHub

🔽 Automatic Model Download

In the AI / Whisper Settings tab of VoiceScriptPlayer, you can choose a model size
(tiny, base, small, medium, large); the model is then downloaded and applied automatically.
If you are connected to the internet, no manual download is necessary.

If you prefer, you can also download models manually from the links below:

Model     Size      Download
tiny      ~75 MB    Download
base      ~142 MB   Download
small     ~466 MB   Download
medium    ~1.5 GB   Download
large     ~2.9 GB   Download

⚠️ Larger models provide higher accuracy but slower processing and increased memory usage.
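
If you need to script a manual download (for example, on a machine that will later run offline), the sketch below fetches a model with Python. It assumes the ggml model files are published in the ggerganov/whisper.cpp repository on Hugging Face, the usual source of whisper.cpp / WhisperNet-compatible weights, and uses a placeholder target directory:

```python
# Sketch: download a ggml Whisper model for offline use.
# Assumes weights are hosted in the "ggerganov/whisper.cpp" Hugging Face repo,
# the usual source for whisper.cpp / WhisperNet-compatible models.
import urllib.request
from pathlib import Path

MODEL = "base"  # tiny / base / small / medium / large
url = f"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-{MODEL}.bin"
dest = Path("models") / f"ggml-{MODEL}.bin"   # placeholder target directory

dest.parent.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(url, dest)
print(f"Saved {dest} ({dest.stat().st_size / 1e6:.0f} MB)")
```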


3. Configuration

  • Choose a model via WhisperNet from within VoiceScriptPlayer.
  • Set the default model (e.g., base, medium).
  • Configure language detection (automatic vs. manual).
  • Performance options:
    • Accuracy Priority / Speed Priority
    • CPU / GPU mode selection
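
To make these options concrete, the sketch below shows how they map onto the underlying model's transcription parameters, again using the reference Python package rather than WhisperNet (whose settings UI handles this for you); the file name and language are placeholders:

```python
# Sketch: how the configuration choices map to Whisper transcription
# parameters (reference Python package, not WhisperNet's actual API).
import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"  # CPU / GPU mode
model = whisper.load_model("medium", device=device)      # default model choice

result = model.transcribe(
    "audio.mp3",                # placeholder input file
    language="en",              # manual language; omit for auto-detection
    fp16=(device == "cuda"),    # half precision is only useful on GPU
)
print(result["language"], result["text"][:80])
```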

4. Usage

  1. Load an audio file (MP3, WAV, MP4, etc.)
  2. Export subtitles as .srt or .vtt (see the sketch after this list)
  3. Extract plain text
  4. Use real-time speech recognition
  5. Workflow example:
    File → Whisper Processing → Display Result
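
For step 2, subtitle export boils down to walking Whisper's timed segments and writing them in SRT format. Here is a minimal sketch with the reference Python package (VoiceScriptPlayer's built-in exporter does this for you; audio.mp3 is a placeholder):

```python
# Sketch: turn Whisper's timed segments into a minimal .srt file
# (reference Python package; VoiceScriptPlayer's exporter is built in).
import whisper

def srt_time(t: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")    # placeholder input file

with open("audio.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")
```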

5. Notes & Limitations

  • Processing time and memory usage vary by model size.
  • Long recordings may take more time to process.
  • Performance will be slower without GPU acceleration.
  • Whisper is open source, but you must check the license terms before commercial use.
  • Whisper works offline after models are downloaded (internet is only required for initial download).

⚡ Performance Benchmark

Environment                 Model   Processing Time (10-minute audio)
CPU (Desktop i5/i7)         base    ~7–10 minutes
CPU (Low-end Laptop)        base    ~12–15 minutes
GPU (RTX 3060 or higher)    base    ~1–2 minutes
GPU (High-end RTX 4090)     large   ~30 seconds–1 minute

💡 Larger models improve transcription accuracy but increase processing time.
Once downloaded, Whisper can be used completely offline.
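
Actual throughput depends heavily on your hardware; to compare against the table above, a timing sketch like the following (reference Python package, placeholder file name) gives a quick measurement:

```python
# Sketch: time a transcription on your own hardware to compare with
# the benchmark table above ("audio.mp3" is a placeholder file).
import time
import whisper

model = whisper.load_model("base")
start = time.perf_counter()
model.transcribe("audio.mp3")
print(f"Transcribed in {time.perf_counter() - start:.1f} s")
```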


6. License & Credits

  • Whisper (original): MIT License
  • Whisper.cpp: MIT License
  • WhisperNet: MIT License
  • Official GitHub repositories:
    • Whisper
    • Whisper.cpp
    • WhisperNet
  • Commercial use allowed (ownership of transcribed text belongs to the user).

7. Troubleshooting / FAQ

  • "Model file not found."
    → Models can be automatically downloaded in the Whisper settings tab.
    For manual download, visit the Whisper.cpp GitHub page.

  • "Processing is too slow."
    → Use a smaller model (tiny, base) or enable GPU acceleration.
    On a typical CPU, a 10-minute file takes about 7–10 minutes;
    with GPU, it completes in about 1–2 minutes.

  • "Language is incorrectly detected."
    → Disable automatic detection and specify the language manually.

  • "Out of memory error."
    → Use a smaller model or split the audio file into shorter segments.
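
One way to split a long file into shorter segments is ffmpeg's segment muxer; the sketch below wraps it in Python (ffmpeg must be installed separately, and the file names are placeholders):

```python
# Sketch: split a long recording into 10-minute chunks with ffmpeg's
# segment muxer before transcribing (requires ffmpeg on the PATH;
# file names are placeholders).
import subprocess

subprocess.run(
    [
        "ffmpeg", "-i", "long_recording.mp3",
        "-f", "segment", "-segment_time", "600",  # 600 s = 10 minutes
        "-c", "copy",                             # copy streams, no re-encode
        "chunk_%03d.mp3",
    ],
    check=True,
)
```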