AI Speech-to-Text Web App: Creating the Application

In this guide, we'll build all the necessary files for our AI-powered speech-to-text web application. We'll create a Flask backend that interfaces with Whisper and a web frontend that allows users to record and transcribe audio.

Understanding the Application Architecture

Our application consists of several components:

  1. Flask Backend: Handles HTTP requests and interfaces with the Whisper model
  2. Whisper Model: Processes audio and generates transcriptions
  3. Web Frontend: Provides an interface for recording audio and displaying results
  4. Audio Processing: Records, processes, and sends audio to the backend

Let's create each component step by step.
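
Before creating the files, here's the layout we're working toward. The root folder name is arbitrary; the templates/ and static/ directories follow Flask's default conventions:

speech-to-text-app/
├── app.py              # Flask backend
├── recorder.py         # Optional recording helper
├── templates/
│   └── index.html      # Web interface
└── static/
    ├── css/
    │   └── style.css   # Custom styling
    └── js/
        └── main.js     # Recording and transcription logic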

Building the Flask Backend (app.py)

First, let's create our main application file:

# app.py
import os
import tempfile
from flask import Flask, render_template, request, jsonify
import whisper

app = Flask(__name__)

# Load the Whisper model once at startup. "base" balances accuracy and speed;
# other sizes include "tiny", "small", "medium", and "large"
model = whisper.load_model("base")

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/transcribe', methods=['POST'])
def transcribe():
    if 'audio' not in request.files:
        return jsonify({'error': 'No audio file provided'}), 400

    # Get the audio file from the request
    audio_file = request.files['audio']

    # Create a temporary file to store the audio. Whisper decodes it with
    # ffmpeg, which detects the actual container from the file contents, so
    # this works even when the browser uploads WebM/Opus rather than WAV
    with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as temp_audio:
        audio_file.save(temp_audio.name)
        temp_audio_path = temp_audio.name

    try:
        # Transcribe the audio using Whisper
        result = model.transcribe(temp_audio_path)
        transcription = result['text']

        # Return the transcription
        return jsonify({'transcription': transcription})
    except Exception as e:
        return jsonify({'error': str(e)}), 500
    finally:
        # Clean up the temporary file
        os.unlink(temp_audio_path)

if __name__ == '__main__':
    # debug=True enables auto-reload and tracebacks; use it only in development
    app.run(debug=True)

This code does the following:

  • Imports necessary libraries
  • Creates a Flask application
  • Loads the Whisper model
  • Defines a route for the main page
  • Creates an API endpoint (/transcribe) that accepts audio files and returns transcriptions
  • Handles file saving, transcription, and cleanup
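
Before building the frontend, you can sanity-check the /transcribe endpoint on its own. The snippet below is a minimal sketch using Python's requests library; the script name and the sample.wav file are placeholders, and it assumes the Flask server is already running on the default port:

# test_transcribe.py -- quick sanity check for the /transcribe endpoint
import requests

with open('sample.wav', 'rb') as f:
    response = requests.post(
        'http://127.0.0.1:5000/transcribe',
        files={'audio': ('sample.wav', f, 'audio/wav')},
    )

print(response.status_code)
print(response.json())  # {'transcription': '...'} on success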

Creating the HTML Template (templates/index.html)

Now, let's create our main HTML template:

<!-- templates/index.html -->
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>AI Speech-to-Text App</title>
    <link
      href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css"
      rel="stylesheet"
    />
    <link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}" />
  </head>
  <body>
    <div class="container mt-5">
      <div class="row justify-content-center">
        <div class="col-md-8">
          <div class="card">
            <div class="card-header text-center">
              <h1>AI Speech-to-Text</h1>
              <p class="text-muted">Powered by Whisper running locally</p>
            </div>
            <div class="card-body">
              <div class="mb-4 text-center">
                <button id="recordButton" class="btn btn-primary btn-lg">
                  <span id="recordIcon">🎤</span> <span id="recordText">Start Recording</span>
                </button>
              </div>

              <div class="alert alert-info" id="status" style="display: none;">
                Recording... Speak now.
              </div>

              <div class="progress mb-3" style="display: none;" id="progressContainer">
                <div
                  id="progressBar"
                  class="progress-bar progress-bar-striped progress-bar-animated"
                  role="progressbar"
                  aria-valuenow="0"
                  aria-valuemin="0"
                  aria-valuemax="100"
                ></div>
              </div>

              <div class="mb-3">
                <label for="transcription" class="form-label fw-bold">Transcription:</label>
                <div id="transcriptionContainer">
                  <textarea id="transcription" class="form-control" rows="6" readonly></textarea>
                </div>
              </div>

              <div class="d-flex justify-content-between">
                <button id="copyButton" class="btn btn-outline-secondary" disabled>
                  Copy to Clipboard
                </button>
                <button id="clearButton" class="btn btn-outline-danger" disabled>Clear</button>
              </div>
            </div>
            <div class="card-footer text-muted text-center">
              <small>
                All processing happens locally on your device. No data is sent to external
                servers.
              </small>
            </div>
          </div>
        </div>
      </div>
    </div>

    <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script>
    <script src="{{ url_for('static', filename='js/main.js') }}"></script>
  </body>
</html>

This HTML file creates a responsive interface with:

  • A button to start/stop recording
  • A status indicator
  • A progress bar for transcription
  • A text area to display the transcribed text
  • Buttons to copy and clear the transcription
  • Informative messages about privacy

Adding CSS Styles (static/css/style.css)

Let's add some custom styles to enhance our user interface:

/* static/css/style.css */
body {
  background-color: #f5f5f5;
}

.card {
  box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
  border: none;
}

.card-header {
  background-color: #f8f9fa;
  border-bottom: 1px solid #eaeaea;
  padding: 1.5rem;
}

.card-header h1 {
  margin-bottom: 0.5rem;
  font-weight: 600;
  color: #333;
}

#recordButton {
  transition: all 0.3s ease;
  min-width: 180px;
}

#recordButton.recording {
  background-color: #dc3545;
  border-color: #dc3545;
  animation: pulse 1.5s infinite;
}

@keyframes pulse {
  0% {
    transform: scale(1);
  }
  50% {
    transform: scale(1.05);
  }
  100% {
    transform: scale(1);
  }
}

#transcription {
  background-color: #fff;
  font-size: 1.1rem;
}

#status {
  font-weight: 500;
}

#progressContainer {
  height: 10px;
}

.btn-primary {
  background-color: #007bff;
}

.btn-primary:hover {
  background-color: #0069d9;
}

.btn-outline-secondary:hover {
  background-color: #6c757d;
  color: white;
}

.btn-outline-danger:hover {
  background-color: #dc3545;
  color: white;
}

@media (max-width: 768px) {
  .container {
    padding: 10px;
  }

  .card-header {
    padding: 1rem;
  }

  .card-header h1 {
    font-size: 1.5rem;
  }
}

These styles create a clean, modern interface with visual feedback during recording.

Implementing JavaScript Functionality (static/js/main.js)

Finally, let's create the JavaScript code to handle audio recording and communication with the backend:

// static/js/main.js
document.addEventListener('DOMContentLoaded', function () {
  // Get UI elements
  const recordButton = document.getElementById('recordButton')
  const recordText = document.getElementById('recordText')
  const recordIcon = document.getElementById('recordIcon')
  const status = document.getElementById('status')
  const transcription = document.getElementById('transcription')
  const progressContainer = document.getElementById('progressContainer')
  const progressBar = document.getElementById('progressBar')
  const copyButton = document.getElementById('copyButton')
  const clearButton = document.getElementById('clearButton')

  // Audio recording variables
  let mediaRecorder
  let audioChunks = []
  let isRecording = false
  let stream

  // Add event listeners
  recordButton.addEventListener('click', toggleRecording)
  copyButton.addEventListener('click', copyTranscription)
  clearButton.addEventListener('click', clearTranscription)

  // Function to toggle recording state
  async function toggleRecording() {
    if (!isRecording) {
      // Start recording
      try {
        stream = await navigator.mediaDevices.getUserMedia({ audio: true })
        startRecording()
      } catch (err) {
        alert('Error accessing microphone: ' + err.message)
        console.error('Error accessing microphone:', err)
      }
    } else {
      // Stop recording
      stopRecording()
    }
  }

  // Function to start recording (uses the shared `stream` captured in toggleRecording)
  function startRecording() {
    audioChunks = []
    mediaRecorder = new MediaRecorder(stream)

    mediaRecorder.addEventListener('dataavailable', (event) => {
      audioChunks.push(event.data)
    })

    mediaRecorder.addEventListener('stop', sendAudioForTranscription)

    // Start recording
    mediaRecorder.start()
    isRecording = true

    // Update UI (reset the status text in case a previous cycle changed it)
    recordButton.classList.add('recording')
    recordText.textContent = 'Stop Recording'
    recordIcon.textContent = '⏹️'
    status.textContent = 'Recording... Speak now.'
    status.style.display = 'block'
  }

  // Function to stop recording
  function stopRecording() {
    if (mediaRecorder && isRecording) {
      mediaRecorder.stop()
      stream.getTracks().forEach((track) => track.stop())
      isRecording = false

      // Update UI
      recordButton.classList.remove('recording')
      recordText.textContent = 'Start Recording'
      recordIcon.textContent = '🎤'
      status.textContent = 'Processing audio...'
      progressContainer.style.display = 'block'
      updateProgress(10) // Initial progress
    }
  }

  // Function to send audio for transcription
  function sendAudioForTranscription() {
    // MediaRecorder usually produces WebM/Opus rather than WAV, so label the
    // blob with the recorder's actual MIME type; Whisper's ffmpeg backend
    // detects the real format from the bytes either way
    const mimeType = mediaRecorder.mimeType || 'audio/webm'
    const audioBlob = new Blob(audioChunks, { type: mimeType })
    const formData = new FormData()
    formData.append('audio', audioBlob, 'recording.webm')

    updateProgress(30) // Update progress

    // Send the audio to the server
    fetch('/transcribe', {
      method: 'POST',
      body: formData,
    })
      .then((response) => {
        updateProgress(70)
        if (!response.ok) {
          throw new Error('Server returned ' + response.status)
        }
        return response.json()
      })
      .then((data) => {
        updateProgress(100)

        if (data.error) {
          alert('Error: ' + data.error)
          status.textContent = 'Error: ' + data.error
        } else {
          // Display the transcription
          const currentText = transcription.value
          transcription.value = currentText
            ? currentText + '\n\n' + data.transcription
            : data.transcription

          // Enable buttons
          copyButton.disabled = false
          clearButton.disabled = false

          // Hide status and progress
          setTimeout(() => {
            status.style.display = 'none'
            progressContainer.style.display = 'none'
          }, 1000)
        }
      })
      .catch((error) => {
        console.error('Error:', error)
        status.textContent = 'Error: ' + error.message
        updateProgress(0)
      })
  }

  // Function to update progress bar
  function updateProgress(value) {
    progressBar.style.width = value + '%'
    progressBar.setAttribute('aria-valuenow', value)
  }

  // Function to copy transcription to clipboard
  function copyTranscription() {
    // Prefer the asynchronous Clipboard API; fall back to the deprecated
    // execCommand approach in older browsers
    if (navigator.clipboard && navigator.clipboard.writeText) {
      navigator.clipboard.writeText(transcription.value)
    } else {
      transcription.select()
      document.execCommand('copy')
    }

    // Visual feedback
    copyButton.textContent = 'Copied!'
    setTimeout(() => {
      copyButton.textContent = 'Copy to Clipboard'
    }, 2000)
  }

  // Function to clear transcription
  function clearTranscription() {
    transcription.value = ''
    copyButton.disabled = true
    clearButton.disabled = true
  }
})

This JavaScript code:

  • Initializes all UI elements
  • Sets up event listeners for button clicks
  • Manages audio recording using the MediaRecorder API
  • Handles sending audio to the server for transcription
  • Updates the UI based on the current state
  • Provides functionality to copy and clear the transcription

Creating a Recorder Helper Module (optional)

For more advanced audio processing, we can create a dedicated module:

# recorder.py
import pyaudio
import wave
import tempfile

class AudioRecorder:
    def __init__(self, channels=1, rate=16000, chunk=1024, format=pyaudio.paInt16):
        self.channels = channels
        self.rate = rate
        self.chunk = chunk
        self.format = format
        self.audio = pyaudio.PyAudio()
        self.stream = None
        self.frames = []

    def start_recording(self):
        self.frames = []
        self.stream = self.audio.open(
            format=self.format,
            channels=self.channels,
            rate=self.rate,
            input=True,
            frames_per_buffer=self.chunk
        )

    def stop_recording(self):
        # Stop the stream if it is still running, then always close it so the
        # device handle is released
        if self.stream:
            if not self.stream.is_stopped():
                self.stream.stop_stream()
            self.stream.close()
        self.stream = None

    def record_for_duration(self, duration_seconds):
        self.start_recording()
        for _ in range(0, int(self.rate / self.chunk * duration_seconds)):
            data = self.stream.read(self.chunk)
            self.frames.append(data)
        self.stop_recording()

    def save_to_file(self, filename):
        with wave.open(filename, 'wb') as wf:
            wf.setnchannels(self.channels)
            wf.setsampwidth(self.audio.get_sample_size(self.format))
            wf.setframerate(self.rate)
            wf.writeframes(b''.join(self.frames))

    def get_temp_file(self):
        # Close the handle before writing: on Windows, an open
        # NamedTemporaryFile cannot be reopened by the wave module
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.wav')
        temp_file.close()
        self.save_to_file(temp_file.name)
        return temp_file.name

    def cleanup(self):
        if self.stream:
            self.stop_recording()
        self.audio.terminate()

# Usage example:
# recorder = AudioRecorder()
# recorder.start_recording()
# # ... wait for user to finish speaking
# recorder.stop_recording()
# audio_file = recorder.get_temp_file()
# # Use audio_file with Whisper
# recorder.cleanup()

This module gives you finer-grained control over audio recording, such as fixed-duration capture, which is useful for more advanced applications.
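
To see the helper in action outside of Flask, here is a minimal end-to-end sketch: it records five seconds from the default microphone and transcribes the clip. It assumes recorder.py sits next to the script, that PyAudio can open your default input device, and it reuses the "base" Whisper model as in app.py:

# demo_recorder.py -- record five seconds, then transcribe the clip
import os
import whisper
from recorder import AudioRecorder

recorder = AudioRecorder()
model = whisper.load_model("base")

recorder.record_for_duration(5)        # blocks while recording
audio_path = recorder.get_temp_file()  # writes the frames to a temp .wav

try:
    result = model.transcribe(audio_path)
    print(result['text'])
finally:
    os.unlink(audio_path)   # remove the temporary file
    recorder.cleanup()      # release the PyAudio device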

Modify app.py to Use the Recorder Helper (optional)

If you use the optional recorder module, update the imports at the top of app.py:

# app.py
from flask import Flask, render_template, request, jsonify
import whisper
import os
from recorder import AudioRecorder

app = Flask(__name__)
model = whisper.load_model("base")

# Rest of the code...
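
One way to put the helper to work is a server-side recording endpoint. The sketch below is illustrative rather than part of the original app: the /record route, its duration parameter, and the 30-second cap are all assumptions, and server-side recording only makes sense when the Flask server runs on the machine that has the microphone:

# Hypothetical addition to app.py -- record on the server, then transcribe
@app.route('/record', methods=['POST'])
def record():
    recorder = AudioRecorder()
    try:
        # Clamp the requested duration to an assumed 30-second cap
        duration = min(float(request.form.get('duration', 5)), 30)
        recorder.record_for_duration(duration)
        audio_path = recorder.get_temp_file()
        result = model.transcribe(audio_path)
        os.unlink(audio_path)
        return jsonify({'transcription': result['text']})
    except Exception as e:
        return jsonify({'error': str(e)}), 500
    finally:
        recorder.cleanup()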

Running the Application

Now that all files are created, run the application:

python app.py

Open your web browser and navigate to http://127.0.0.1:5000/. You should see your speech-to-text application ready to use! Note that browsers only allow microphone access (getUserMedia) in a secure context; localhost is treated as secure, so local testing works without HTTPS.

Conclusion

In this guide, we've created all the necessary files for our AI-powered speech-to-text web application:

  1. app.py: The Flask backend that interfaces with Whisper
  2. index.html: The web interface
  3. style.css: Custom styling
  4. main.js: JavaScript functionality for audio recording and communication
  5. recorder.py (optional): Advanced audio recording functionality

In the next guide, we'll summarize the project, discuss potential improvements, and explore how this application can be valuable in educational technology settings.