Transcribe and Correct with AI

Reading time: approx. 10 min

What You Will Learn

This is the core of the process. We will use two different AI models. First, KB-Whisper performs the actual speech-to-text conversion on our audio files. Then a text correction model automatically adds punctuation (periods, commas) and corrects capitalization, which makes the text much more readable.

The Basics: Two-Stage Rocket

Transcription: KBLab/kb-whisper-large listens to the audio and writes down the words it hears. The result is raw, unpunctuated text, which our script labels with a timestamp range for each segment.
Punctuation: sdadas/byt5-text-correction reads the raw text and uses its understanding of grammar and sentence structure to add periods, commas, and capital letters.

How We Do It: The Scripts That Do the Job

Step 1: Transcribe the Segments (`transcribe.py`)

Create the file transcribe.py in your project folder and paste in the code below. The code loads the KB-Whisper model into your graphics card memory (or into ordinary RAM if you have no GPU) and then feeds it one audio file at a time.

import os, warnings
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import torch

warnings.filterwarnings("ignore", message=".*deprecated.*")

# --- Settings ---
MODEL    = "KBLab/kb-whisper-large"
DEVICE   = "cuda:0" if torch.cuda.is_available() else "cpu"
DTYPE    = torch.float16 if torch.cuda.is_available() else torch.float32
DIR      = "chunks"
OUTFILE  = "transcript_with_timestamps.txt"
CHUNK_DURATION_S = 30

# Load the model and processor
print(f"Loading model {MODEL} to {DEVICE}...")
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL, torch_dtype=DTYPE, use_safetensors=True).to(DEVICE)
processor = AutoProcessor.from_pretrained(MODEL)
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=DEVICE,
    torch_dtype=DTYPE
)

# Helper function to format time
def format_time(seconds):
    h, r = divmod(seconds, 3600)
    m, s = divmod(r, 60)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d}"

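# Fetch the audio files and sort them by segment number,
# so that unpadded names (chunk_2, chunk_10) also come out in order
files = sorted(
    (f for f in os.listdir(DIR) if f.endswith(".wav")),
    key=lambda f: int(f.split("_")[1].split(".")[0]),
)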

# Open a file to write the result to
with open(OUTFILE, "w", encoding="utf-8") as out:
    for filename in files:
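        # File names are expected to look like chunk_000.wav, chunk_001.wav, ...
        chunk_index = int(filename.split("_")[1].split(".")[0])
        start_time = chunk_index * CHUNK_DURATION_S
        # Note: the last segment may be shorter than CHUNK_DURATION_S
        end_time = start_time + CHUNK_DURATION_S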

        filepath = os.path.join(DIR, filename)
        print(f"Transcribing {filename}...")

        result = asr_pipeline(filepath, generate_kwargs={"language": "sv"})
        text = result["text"].strip()

        out.write(f"[{format_time(start_time)} - {format_time(end_time)}]\n{text}\n\n")

print(f"Transcription complete! Result saved in: {OUTFILE}")
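The timestamp arithmetic in the loop above can be checked without loading the model. The following is a minimal sketch, not part of transcribe.py. It assumes the segments are named like chunk_000.wav (the pattern the script's filename parsing implies) and that every segment is 30 seconds long. The helper names `chunk_index` and `time_range` are hypothetical and only used for illustration.

```python
CHUNK_DURATION_S = 30  # same segment length as in transcribe.py


def format_time(seconds):
    """Format a number of seconds as HH:MM:SS."""
    h, r = divmod(int(seconds), 3600)
    m, s = divmod(r, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def chunk_index(filename):
    """Extract the number from a name like 'chunk_007.wav' (assumed naming)."""
    return int(filename.split("_")[1].split(".")[0])


def time_range(filename):
    """Return the '[start - end]' label for one segment."""
    start = chunk_index(filename) * CHUNK_DURATION_S
    return f"[{format_time(start)} - {format_time(start + CHUNK_DURATION_S)}]"


# Sorting on the numeric index keeps 'chunk_10' after 'chunk_2'
names = ["chunk_10.wav", "chunk_2.wav", "chunk_125.wav"]
ordered = sorted(names, key=chunk_index)
for name in ordered:
    print(name, time_range(name))
```

Segment 125 starts at 125 × 30 = 3750 seconds, so its label is `[01:02:30 - 01:03:00]`.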

Step 2: Add Punctuation and Capitalization (`punctuate.py`)

Create the file punctuate.py. This script reads the raw text file, sends the text blocks to the correction model, and then writes a new, cleaner text file.

import re
import torch
from transformers import pipeline

# --- Settings ---
INPUT_FILE  = "transcript_with_timestamps.txt"
OUTPUT_FILE = "transcript_punctuated.txt"
MODEL = "sdadas/byt5-text-correction"
BATCH_SIZE = 8 # Process 8 text blocks at a time for efficiency
DEVICE = 0 if torch.cuda.is_available() else -1 # First GPU if available, otherwise CPU

# Make sure each sentence starts with a capital letter.
# Only the first character is changed, so names and acronyms keep their casing.
def capitalize_sentences(text):
    parts = re.split(r'([.?!]\s+)', text)
    return "".join(p[:1].upper() + p[1:] for p in parts)

print(f"Loading punctuation model {MODEL}...")
punctuation_pipeline = pipeline(
    "text2text-generation",
    model=MODEL,
    tokenizer=MODEL,
    device=DEVICE,
    batch_size=BATCH_SIZE
)

# Read in the raw text
with open(INPUT_FILE, 'r', encoding='utf-8') as f:
    lines = f.read().splitlines()

# Separate timestamps from text to be processed
timestamps, texts_to_process = [], []
for line in lines:
    if line.startswith("[") or not line.strip():
        timestamps.append(line)
    else:
        timestamps.append(None) # Marker for text
        texts_to_process.append(line)

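print(f"Correcting {len(texts_to_process)} text segments...")
# Assumption based on the model card's examples (verify there): each input is
# prefixed with a language tag, "<sv>" for Swedish, and max_length is set to 512.
# ByT5 counts bytes, so very long text blocks may still be cut off.
prefixed_texts = [f"<sv> {t}" for t in texts_to_process]
corrected_texts = punctuation_pipeline(prefixed_texts, max_length=512)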

# Assemble the final text
output_lines = []
text_index = 0
for ts in timestamps:
    if ts is not None:
        output_lines.append(ts)
    else:
        # Fetch corrected text and apply capitalization
        corrected_text = corrected_texts[text_index]['generated_text'].strip()
        final_text = capitalize_sentences(corrected_text)
        output_lines.append(final_text)
        text_index += 1

with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
    f.write("\n".join(output_lines))

print(f"Punctuation complete! Result saved in: {OUTPUT_FILE}")
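The separate-and-reassemble pattern used in the script can be tried without downloading the model. This is a minimal sketch under stated assumptions: `fake_correct` is a hypothetical stand-in that only mimics the list-of-dicts shape the script reads from the pipeline, and the sample text is made up for illustration.

```python
def fake_correct(texts):
    """Stand-in for the correction model: capitalizes and adds a period.
    Returns the same list-of-dicts shape the script reads from the pipeline."""
    return [{"generated_text": t[:1].upper() + t[1:] + "."} for t in texts]


raw = """[00:00:00 - 00:00:30]
hej och välkomna till dagens avsnitt

[00:00:30 - 00:01:00]
idag pratar vi om lokal ai
"""

lines = raw.splitlines()

# Keep timestamps and blank lines as they are; mark text lines with None
slots, texts = [], []
for line in lines:
    if line.startswith("[") or not line.strip():
        slots.append(line)
    else:
        slots.append(None)
        texts.append(line)

corrected = iter(fake_correct(texts))

# Fill each None slot with the next corrected text, in the original order
output = [s if s is not None else next(corrected)["generated_text"] for s in slots]
print("\n".join(output))
```

Because only the text lines are sent to the model, the timestamps cannot be altered by it, and the output keeps the same number of lines as the input.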

How to Run the Scripts

Make sure your virtual environment is active (`source .venv/bin/activate`). Then run the scripts in order:

Run the transcription:
```
python transcribe.py
```
This can take a while depending on your computer's power and the length of the audio file.

Run the punctuation:
```
python punctuate.py
```
This usually goes much faster.

You now have a file called transcript_punctuated.txt with a clean and timestamped transcript.
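Before moving on, it can help to check that the result has the expected structure. The sketch below assumes the layout written by the scripts above: a timestamp line in square brackets, followed by the text, with blank lines between blocks. The function name `parse_transcript` is hypothetical, and the sample string stands in for the real file contents.

```python
import re

# Matches a line like "[00:00:30 - 00:01:00]"
TIMESTAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2}) - (\d{2}:\d{2}:\d{2})\]$")


def parse_transcript(content):
    """Return a list of (start, end, text) tuples from the transcript text."""
    records = []
    current = None
    for line in content.splitlines():
        match = TIMESTAMP.match(line.strip())
        if match:
            current = [match.group(1), match.group(2), ""]
            records.append(current)
        elif line.strip() and current is not None:
            # Append text lines to the most recent timestamp block
            current[2] = (current[2] + " " + line.strip()).strip()
    return [tuple(r) for r in records]


sample = (
    "[00:00:00 - 00:00:30]\nHej och välkomna.\n\n"
    "[00:00:30 - 00:01:00]\nIdag pratar vi om lokal AI.\n"
)
for start, end, text in parse_transcript(sample):
    print(start, end, text)
```

To check the real file, read it with `open("transcript_punctuated.txt", encoding="utf-8").read()` and pass the contents to the function.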

Next Step

The text is ready! But a .txt file is not always the most useful format. In the final lesson we learn to convert our text into polished documents in formats like Markdown, Word (.docx), and HTML.

Local AI in Practice: Transcribe YouTube with KB-Whisper
