Transcribe and Correct with AI

Reading time: approx. 10 min

What You Will Learn

This is the core of the process. We will use two different AI models: first KB-Whisper, which performs the actual speech-to-text conversion on our audio files, and then a text correction model, which automatically adds punctuation (periods, commas) and corrects capitalization. The second step makes the text much more readable.

The Basics: Two-Stage Rocket

  1. Transcription: KBLab/kb-whisper-large listens to the audio and writes down the words it hears. The result is raw, unpunctuated text with timestamps.
  2. Punctuation: sdadas/byt5-text-correction reads the raw text and uses its understanding of grammar and sentence structure to add periods, commas, and capital letters.

How We Do It: The Scripts That Do the Job

Step 1: Transcribe the Segments (transcribe.py)

Create the file transcribe.py in your project folder and paste in the code below. The code loads the KB-Whisper model into your graphics card's memory (or into ordinary RAM on the CPU if you have no GPU) and then feeds it one audio file at a time from the chunks folder.

import os, warnings
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import torch

warnings.filterwarnings("ignore", message=".*deprecated.*")

# --- Settings ---
MODEL    = "KBLab/kb-whisper-large"
DEVICE   = "cuda:0" if torch.cuda.is_available() else "cpu"
DTYPE    = torch.float16 if torch.cuda.is_available() else torch.float32
DIR      = "chunks"
OUTFILE  = "transcript_with_timestamps.txt"
CHUNK_DURATION_S = 30

# Load the model and processor
print(f"Loading model {MODEL} to {DEVICE}...")
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL, torch_dtype=DTYPE, use_safetensors=True).to(DEVICE)
processor = AutoProcessor.from_pretrained(MODEL)
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=DEVICE,
    torch_dtype=DTYPE
)

# Helper function to format time
def format_time(seconds):
    h, r = divmod(seconds, 3600)
    m, s = divmod(r, 60)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d}"

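# Fetch the audio files and sort them by segment number. A plain alphabetical
# sort would put segment 10 before segment 2 if the numbers are not zero-padded.
def chunk_number(filename):
    return int(filename.split("_")[1].split(".")[0])

files = sorted((f for f in os.listdir(DIR) if f.endswith(".wav")), key=chunk_number)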

# Open a file to write the result to
with open(OUTFILE, "w", encoding="utf-8") as out:
    for filename in files:
        chunk_index = int(filename.split("_")[1].split(".")[0])
        start_time = chunk_index * CHUNK_DURATION_S
        end_time = start_time + CHUNK_DURATION_S

        filepath = os.path.join(DIR, filename)
        print(f"Transcribing {filename}...")

        result = asr_pipeline(filepath, generate_kwargs={"language": "sv"})
        text = result["text"].strip()

        out.write(f"[{format_time(start_time)} - {format_time(end_time)}]\n{text}\n\n")

print(f"Transcription complete! Result saved in: {OUTFILE}")
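
The timestamp arithmetic in the script can be checked on its own, without loading the model. The sketch below assumes the segments are named chunk_<number>.wav (the pattern the script's filename parsing implies) and that every segment is exactly 30 seconds long. The helper names chunk_index and chunk_window are hypothetical and exist only for this illustration.

```python
CHUNK_DURATION_S = 30  # assumed fixed segment length, as in transcribe.py


def format_time(seconds):
    """Format a number of seconds as HH:MM:SS."""
    h, r = divmod(int(seconds), 3600)
    m, s = divmod(r, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def chunk_index(filename):
    """Extract the number from a name like 'chunk_12.wav' (assumed pattern)."""
    return int(filename.split("_")[1].split(".")[0])


def chunk_window(filename):
    """Return the (start, end) timestamps covered by one segment."""
    start = chunk_index(filename) * CHUNK_DURATION_S
    return format_time(start), format_time(start + CHUNK_DURATION_S)


files = ["chunk_10.wav", "chunk_2.wav", "chunk_0.wav"]
# Sort by the numeric index rather than alphabetically.
ordered = sorted(files, key=chunk_index)
for name in ordered:
    print(name, *chunk_window(name))
```

Sorting by the numeric index matters when the segment numbers are not zero-padded. If the last segment is shorter than 30 seconds, its end timestamp will overshoot slightly, since the end time is computed rather than measured.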

Step 2: Add Punctuation and Capitalization (punctuate.py)

Create the file punctuate.py. This script reads the raw text file, sends the text blocks to the correction model, and then writes a new, cleaner text file.

import re
import torch
from transformers import pipeline

# --- Settings ---
INPUT_FILE  = "transcript_with_timestamps.txt"
OUTPUT_FILE = "transcript_punctuated.txt"
MODEL = "sdadas/byt5-text-correction"
BATCH_SIZE = 8 # Process 8 text blocks at a time for efficiency
DEVICE = 0 if torch.cuda.is_available() else -1 # First GPU if available, otherwise CPU

# Function to ensure a capital letter at the start of each sentence.
# Only that first letter is changed; str.capitalize() would also lowercase names.
def capitalize_sentences(text):
    return re.sub(r'(^\s*|[.?!]\s+)(\w)', lambda m: m.group(1) + m.group(2).upper(), text)

print(f"Loading punctuation model {MODEL}...")
punctuation_pipeline = pipeline(
    "text2text-generation",
    model=MODEL,
    tokenizer=MODEL,
    device=DEVICE,
    batch_size=BATCH_SIZE
)

# Read in the raw text
with open(INPUT_FILE, 'r', encoding='utf-8') as f:
    lines = f.read().splitlines()

# Separate timestamps from text to be processed
timestamps, texts_to_process = [], []
for line in lines:
    if line.startswith("[") or not line.strip():
        timestamps.append(line)
    else:
        timestamps.append(None) # Marker for text
        texts_to_process.append(line)

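print(f"Correcting {len(texts_to_process)} text segments...")
# The model works on bytes, so the default output length limit would cut the
# text short. max_length=512 is a starting point; raise it if long segments
# come out truncated. As far as the model card describes, each input should be
# prefixed with a language code (<sv> for Swedish); verify this against the card.
corrected_texts = punctuation_pipeline([f"<sv> {t}" for t in texts_to_process], max_length=512)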

# Assemble the final text
output_lines = []
text_index = 0
for ts in timestamps:
    if ts is not None:
        output_lines.append(ts)
    else:
        # Fetch corrected text and apply capitalization
        corrected_text = corrected_texts[text_index]['generated_text'].strip()
        final_text = capitalize_sentences(corrected_text)
        output_lines.append(final_text)
        text_index += 1

with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
    f.write("\n".join(output_lines))

print(f"Punctuation complete! Result saved in: {OUTPUT_FILE}")
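
The way the script separates timestamps from text and then puts the corrected text back can be tried out without downloading the model. The following is a minimal sketch of that bookkeeping. The functions split_transcript, merge_transcript, and fake_corrector are hypothetical names, and fake_corrector is only a stand-in for the real correction model.

```python
def split_transcript(lines):
    """Separate lines to keep as-is from lines that need correction."""
    layout, texts = [], []
    for line in lines:
        if line.startswith("[") or not line.strip():
            layout.append(line)   # timestamp or blank line: keep unchanged
        else:
            layout.append(None)   # placeholder for a corrected text line
            texts.append(line)
    return layout, texts


def merge_transcript(layout, corrected):
    """Put corrected text back into the placeholder positions, in order."""
    it = iter(corrected)
    return [next(it) if slot is None else slot for slot in layout]


def fake_corrector(texts):
    """Hypothetical stand-in for the model: capitalize and add a period."""
    return [t[:1].upper() + t[1:] + "." for t in texts]


raw = [
    "[00:00:00 - 00:00:30]",
    "hej och välkomna",
    "",
    "[00:00:30 - 00:01:00]",
    "idag pratar vi om arkiv",
    "",
]
layout, texts = split_transcript(raw)
result = merge_transcript(layout, fake_corrector(texts))
print("\n".join(result))
```

Collecting all text lines first is what allows the pipeline to process them in batches. The placeholder list guarantees that each corrected line lands under its original timestamp.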

How to Run the Scripts

Make sure your virtual environment is active (source .venv/bin/activate). Then run the scripts in order:

  1. Run the transcription:

    python transcribe.py

    This can take a while, depending on your computer's power and the length of the audio file.

  2. Run the punctuation:

    python punctuate.py

    This usually goes much faster.

You now have a file called transcript_punctuated.txt with a clean and timestamped transcript.
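
To sanity-check the result, or to prepare it for further processing, the transcript can be read back into structured records. This is a sketch under the assumption that the file follows the layout the scripts write: a [start - end] line followed by one or more lines of text. The function name parse_transcript is hypothetical, and the sample string stands in for the contents of the real file.

```python
import re

# Matches a timestamp line such as "[00:00:30 - 00:01:00]".
TIMESTAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2}) - (\d{2}:\d{2}:\d{2})\]$")


def parse_transcript(content):
    """Turn timestamped text blocks into a list of dicts."""
    segments = []
    for line in content.splitlines():
        line = line.strip()
        match = TIMESTAMP.match(line)
        if match:
            # A timestamp line starts a new segment.
            segments.append({"start": match.group(1), "end": match.group(2), "text": ""})
        elif line and segments:
            # Any other non-empty line belongs to the most recent segment.
            seg = segments[-1]
            seg["text"] = (seg["text"] + " " + line).strip()
    return segments


sample = (
    "[00:00:00 - 00:00:30]\n"
    "Hej och välkomna.\n"
    "\n"
    "[00:00:30 - 00:01:00]\n"
    "Idag pratar vi om arkiv.\n"
    "Det blir spännande.\n"
)
segments = parse_transcript(sample)
for seg in segments:
    print(seg["start"], seg["end"], seg["text"])
```

With the real output, the sample string would be replaced by the contents of transcript_punctuated.txt, read with open(..., encoding="utf-8"). Segments with empty text usually indicate silence or a segment the model could not transcribe.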

Next Step

The text is ready! But a .txt file is not always the most useful format. In the final part, we will learn to convert our text into polished documents in formats like Markdown, Word (.docx), and HTML.