Tag Archives: speech recognition

Get your speech recognized with Whisper

OpenAI published new speech recognition models that are very easy to use and work in many languages trained on 680,000 hours of multilingual and multitask supervised data collected from the web.

In my case all I had to do to recognize some German test:

# create a virtual environment
virtualenv venv
# activate it
. venv/bin/activate
# install whisper
pip install git+https://github.com/openai/whisper.git
# run the test
whisper test.wav --language German

And my file got recognized correctly, though it took a very long time: for the tiny model speed = x32, i.e. 32 times the time of the speech file duration, was announced

How to get my speech recognized with Google ASR and python

What you need to do this at first is to get yourselg a Google API key,

  • you need to register with Google speech APIs, i.e. get a Google cloud platform account
  • you need to share payment details, but (at the time of writing, i think) the first 60 minutes of processed speech per month are free.

I export my API key each time I want to use this like so:

export GOOGLE_APPLICATION_CREDENTIALS="/home/felix/data/research/Google/api_key.json"

This tutorial assumes you did that and you started a Jupyter notebook . If you don't know what this is, here's a tutorial on how to set one up (first part)

Bevor you can import the Google speech api make shure it's installed:

!pip  install google-cloud 
!pip install --upgrade google-cloud-speech

Then you would import the Google Cloud client library

from google.cloud import speech
import io

Instantiate a client

client = speech.SpeechClient()

And load yourself a recorded speech file, should be wav format 16kHz sample rate

speech_file = '/home/felix/tmp/google_speech_api_test.wav'

if you run into problems recording one: here is the code that worked for me:

import sounddevice as sd
import numpy as np
from scipy.io.wavfile import write
sr = 16000  # Sample rate
seconds = 3  # Duration of recording
data = sd.rec(int(seconds * fs), samplerate=sr, channels=1)
sd.wait()  # Wait until recording is finished
# Convert `data` to 16 bit integers:
y = (np.iinfo(np.int16).max * (data/np.abs(data).max())).astype(np.int16) 
wavfile.write(speech_file fs, y)

then get yourself an audio object

with io.open(speech_file, "rb") as audio_file:
    content = audio_file.read()
audio = speech.RecognitionAudio(content = content)

Configure the ASR

config = speech.RecognitionConfig(

Detects speech in the audio file

response = client.recognize(config=config, audio=audio)

and show what you got (with my trial only the first alternative was filled):

for result in response.results:
    for index, alternative in enumerate(result.alternatives):
        print("Transcript {}: {}".format(index, alternative.transcript))