Tag Archives: google

Recording and transcribing a speech sample on Google Colab

Set up the recording method using JavaScript:

# all imports
from IPython.display import Javascript, display
from google.colab import output
from base64 import b64decode

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

def record(fn, sec):
  # run the JavaScript recorder for `sec` seconds,
  # decode the returned base64 data URL and save it to `fn`
  display(Javascript(RECORD))
  s = output.eval_js('record(%d)' % (sec*1000))
  b = b64decode(s.split(',')[1])
  with open(fn, 'wb') as f:
    f.write(b)
  return fn

Record something:

filename = 'felixtest.wav'
record(filename, 5)

Play it back:

import IPython
IPython.display.Audio(filename)

Install SpeechBrain:

%%capture
!pip install speechbrain
import speechbrain as sb

Load the ASR model trained on LibriSpeech:

from speechbrain.pretrained import EncoderDecoderASR
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_model")

And get a transcript of your audio:

asr_model.transcribe_file(filename)
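
transcribe_file returns the recognized text as a string, so you can also keep it around for later use (a minimal sketch):

transcript = asr_model.transcribe_file(filename)
print(transcript)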

How to synthesize text to speech with the Google speech API

This tutorial assumes you have started a Jupyter notebook. If you don't know what this is, here's a tutorial on how to set one up (first part).

There is a library for this, based on the Google translation service, that still seems to work: gtts.
You would start by installing the packages used in this tutorial:

!pip install -U gtts pygame python-vlc

Then you can import the package:

from gtts import gTTS

Define a text and a configuration:

text = 'Das ist jetzt mal was ich so sage, ich finde das Klasse!'
tts = gTTS(text, lang='de')

and synthesize to a file on disk:

audio_file = './hello.mp3'
tts.save(audio_file)

which you could then play back with VLC:

import vlc
# play the synthesized file with the python-vlc bindings
p = vlc.MediaPlayer(audio_file)
p.play()
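
If VLC gives you trouble inside the notebook, you could also play the file back inline, the same way as in the Colab example above (a minimal sketch):

import IPython
# let the browser play the mp3 directly in the notebook
IPython.display.Audio(audio_file)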

How to get my speech recognized with Google ASR and Python

The first thing you need for this is a Google API key:

  • you need to register with the Google speech APIs, i.e. get a Google Cloud Platform account
  • you need to share payment details, but (at the time of writing, I think) the first 60 minutes of processed speech per month are free.

I export my API key each time I want to use this like so:

export GOOGLE_APPLICATION_CREDENTIALS="/home/felix/data/research/Google/api_key.json"
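
If you work from a notebook, you could also set the variable from Python before you create the client (a minimal sketch; the path is my key file, adapt it to yours):

import os
# point the Google client libraries at the service account key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/felix/data/research/Google/api_key.json'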

This tutorial assumes you did that and you have started a Jupyter notebook. If you don't know what this is, here's a tutorial on how to set one up (first part).

Before you can import the Google speech API, make sure it's installed:

!pip install google-cloud
!pip install --upgrade google-cloud-speech

Then you would import the Google Cloud client library:

from google.cloud import speech
import io

Instantiate a client:

client = speech.SpeechClient()

And load yourself a recorded speech file; it should be in WAV format with a 16 kHz sample rate:

speech_file = '/home/felix/tmp/google_speech_api_test.wav'

If you run into problems recording one, here is the code that worked for me:

import sounddevice as sd
import numpy as np
from scipy.io.wavfile import write
sr = 16000  # sample rate
seconds = 3  # duration of the recording
data = sd.rec(int(seconds * sr), samplerate=sr, channels=1)
sd.wait()  # wait until the recording is finished
# convert `data` to 16 bit integers:
y = (np.iinfo(np.int16).max * (data / np.abs(data).max())).astype(np.int16)
write(speech_file, sr, y)

Then get yourself an audio object:

with io.open(speech_file, "rb") as audio_file:
    content = audio_file.read()
audio = speech.RecognitionAudio(content = content)

Configure the ASR:

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="de-DE",
)
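
If you also want punctuation marks in the transcript, the config takes an additional flag (a sketch of an optional setting; check the current docs for language support):

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="de-DE",
    enable_automatic_punctuation=True,  # ask the API to insert punctuation
)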

Detect speech in the audio file:

response = client.recognize(config=config, audio=audio)

And show what you got (in my trial only the first alternative was filled):

for result in response.results:
    for index, alternative in enumerate(result.alternatives):
        print("Transcript {}: {}".format(index, alternative.transcript))