Speech to Text in Python using IBM Watson

So I had the task of developing a speech-to-text web application using both the streaming and non-streaming services (I’ll explore the difference later in this post).

Developing this project was something new to me, as I had never explored speech recognition before, and choosing from the many available services like Google Cloud and IBM Watson required a bit of research.

In the end I chose IBM Watson for a simple reason: Watson has better documentation for beginners to get started with, and signing up for IBM Watson is very easy, with no requirement to provide debit/credit card details, unlike Google.

Once you sign up for Watson Speech to Text, you will be given a username and password that you will use as authentication for every request. The service comes with 100 minutes of free monthly quota; beyond this, the account needs to be upgraded.

Let me come back to the difference between the streaming and non-streaming services. The non-streaming service takes an audio file and sends back the text transcript after processing the complete audio, whereas the streaming service works by listening to the audio stream from the user’s microphone, processing the speech and returning the results in real time.

Now that we have explored the task at hand and the different speech-to-text services, let’s see how I implemented each one.

  • Non-Streaming Service

Implementing the non-streaming service was a pretty straightforward task. IBM Watson provides a REST API: you just have to make a POST request with the audio file and let the server do the work and send you back the results. Making a valid POST request requires username and password authentication.

Code Example:

import requests

# The request URL, username and password are given when you sign up
# for the Speech to Text service
headers = {'Content-Type': 'audio/wav'}  # match the format of your audio file

# send the audio file in the request body; the response is a JSON transcript
with open('audio.wav', 'rb') as audio_file:
    response_data = requests.post(url, auth=(username, password),
                                  headers=headers, data=audio_file)
  • Streaming Service

Implementing the streaming service was a bit tricky, considering I know very little JavaScript, and it works by having a bi-directional WebSocket connection sending data back and forth between client and server. A WebSocket sends data packets over a single TCP connection, thereby avoiding the connection overhead of making a separate request each time.

In the streaming service, a unique token is sent along with the chunks of audio data, and the server responds with the results. The token is valid for only one hour, which means you need to generate a new one to keep making valid requests.
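To make the exchange concrete, here is a rough sketch of what travels over the WebSocket, assuming the token has already been generated (the endpoint URL and message format follow Watson’s WebSocket API as I understand it, and the watson-speech module introduced below takes care of all of this for you):

var socket = new WebSocket(
  'wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize' +
  '?watson-token=' + token);

socket.onopen = function() {
  // the first frame configures the recognition request
  socket.send(JSON.stringify({
    action: 'start',
    'content-type': 'audio/l16;rate=16000',
    interim_results: true
  }));
  // audio chunks are then sent as binary frames: socket.send(audioChunk);
};

socket.onmessage = function(event) {
  // the server pushes back JSON messages carrying interim and final transcripts
  console.log(JSON.parse(event.data));
};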

Fortunately, Watson provides a Python module, watson-developer-cloud, which can be installed via pip (pip install watson-developer-cloud) and can fetch the token for you with a GET request. I created a separate endpoint to which a request to retrieve the token is made before opening the WebSocket connection.

Code Example for generating the token:

from django.http import HttpResponse
from watson_developer_cloud import AuthorizationV1 as Authorization
from watson_developer_cloud import SpeechToTextV1 as SpeechToText

def token(request):
    # Django view that returns a fresh one-hour token as plain text
    authorization = Authorization(username=username, password=password)
    return HttpResponse(authorization.get_token(url=SpeechToText.default_url),
                        content_type='text/plain')

Now, IBM Watson has a watson-speech npm module to make requests and receive data in real time from client-side JavaScript. The problem is that without front-end dependency management, you have to download the complete library with all its dependencies and set its path in the src attribute of your HTML script tag, which is unwieldy and not the right approach. Alternatively, modules installed through the npm package manager can be included via the require function, the way they are in a Node.js application, which is an easier way to pull in third-party modules in JavaScript. I chose the latter for its simplicity and code readability.

To use npm modules you need to have Node and npm installed on your system, and to run the npm init command, which creates a package.json file in your project that houses all front-end package dependencies along with the project metadata.
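For example, from your project directory (watson-speech is the npm module mentioned above):

>> npm init
>> npm install --save watson-speech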

Next, since require is Node.js functionality, we need to convert our plain JavaScript into something the browser can run. To achieve this there is an npm package called browserify, which takes your JavaScript file with its require calls and bundles it with all its dependencies into an output file, bundle.js (named by convention).

>> browserify app.js -o bundle.js

We have solved most of the problem; the last thing left is to write an app.js file that makes a request to our endpoint to get the token using the JavaScript fetch API, then connects to the Watson WebSocket, sends the stream of audio data for processing, and receives interim results in real time that can be written from JavaScript into an HTML element.

Code to retrieve the token from the /api/speech-to-text/token endpoint:

function onListenClick() {
  fetch('/api/speech-to-text/token')
    .then(function(response) {
      return response.text();
    });
}
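The response text is the token itself. A second .then can then hand it to the watson-speech module, which opens the WebSocket and streams the microphone audio. Here is a minimal sketch of the full app.js flow; the #output element id is my own placeholder, not from the original code:

var recognizeMic = require('watson-speech/speech-to-text/recognize-microphone');

function onListenClick() {
  fetch('/api/speech-to-text/token')
    .then(function(response) {
      return response.text();
    })
    .then(function(token) {
      // opens the WebSocket and writes interim results into the page as they arrive
      var stream = recognizeMic({
        token: token,
        outputElement: '#output' // placeholder: any element on your page
      });
      stream.on('error', function(err) {
        console.log(err);
      });
    });
}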

To get the complete working code in Python/Django, you can refer to my repo here.