Implementing speech-to-text with Mozilla DeepSpeech pre-trained models and the Python API

Mar 27, 2021

What is Mozilla DeepSpeech?

DeepSpeech is an open-source speech-to-text engine built with the machine learning techniques described in Baidu’s Deep Speech research paper.

Why Mozilla DeepSpeech?

It is a great tool because it is open source and offers APIs in several languages, including Python, C, .NET, Java, and JavaScript. Its biggest advantage is that we can train the acoustic model on our own data, so we can build speech-to-text models for a specific domain or for a different language.

How to use Mozilla DeepSpeech?

This article focuses on the DeepSpeech Python API. The basic idea is to download the audio of a YouTube video and generate a transcript from it.

Step1: Download the required DeepSpeech models and libraries

The DeepSpeech Python library comes in three bindings. I have used the TFLite-compatible one in the code because it’s lightweight. You can find out about the other bindings here.
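
As a minimal sketch of this step, assuming release v0.9.3 (the version covered by the documentation linked at the end of this article) and the `deepspeech-tflite` package installed with `pip install deepspeech-tflite`, the model files can be fetched as shown below. The helper names `model_urls` and `download_models` are hypothetical, and the file names follow the naming pattern of the GitHub release assets as I understand it.

```python
import os
import urllib.request

# Assumption: v0.9.3, the version the linked documentation covers.
DEEPSPEECH_VERSION = "0.9.3"
RELEASE_BASE = "https://github.com/mozilla/DeepSpeech/releases/download"


def model_urls(version=DEEPSPEECH_VERSION):
    """Build the download URLs for the TFLite acoustic model and the scorer."""
    prefix = f"{RELEASE_BASE}/v{version}/deepspeech-{version}-models"
    return {"model": f"{prefix}.tflite", "scorer": f"{prefix}.scorer"}


def download_models(target_dir="models", version=DEEPSPEECH_VERSION):
    """Download both files into target_dir, skipping files already present."""
    os.makedirs(target_dir, exist_ok=True)
    paths = {}
    for name, url in model_urls(version).items():
        path = os.path.join(target_dir, os.path.basename(url))
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)
        paths[name] = path
    return paths
```

Calling `download_models()` returns a dictionary with the local paths of the model and the scorer, which the later steps can reuse.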

Step2: Download audio from a YouTube video

The YoutubeDL library lets us download the audio directly from a YouTube URL. We save the file in WAV format.
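
A minimal sketch of the download, assuming the `youtube_dl` package and an `ffmpeg` binary on the PATH (the audio extraction post-processor relies on it). The function names and the output template are hypothetical choices for illustration.

```python
def audio_download_options(output_template="audio.%(ext)s"):
    """Options asking youtube_dl for the best audio stream, converted to WAV."""
    return {
        "format": "bestaudio/best",
        "outtmpl": output_template,
        "postprocessors": [
            {
                "key": "FFmpegExtractAudio",  # needs ffmpeg on the PATH
                "preferredcodec": "wav",
            }
        ],
    }


def download_audio(url, output_template="audio.%(ext)s"):
    """Download the audio track of a YouTube video as a WAV file."""
    import youtube_dl  # third-party: pip install youtube_dl

    with youtube_dl.YoutubeDL(audio_download_options(output_template)) as ydl:
        ydl.download([url])
```

With the default template, `download_audio(video_url)` should leave a file named `audio.wav` in the working directory.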

Step3: Manipulate the audio file

DeepSpeech requires single-channel audio with a 16 kHz or 8 kHz sample rate. The “DeepSpeechAudio” class converts the audio to meet these requirements and also provides methods to manipulate the audio files.
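
The sketch below is not the “DeepSpeechAudio” class itself. It is a simplified stand-in, assuming `ffmpeg` is installed, that converts a file to mono at the target sample rate and then checks the result with the standard `wave` module. All function names are hypothetical.

```python
import subprocess
import wave


def ffmpeg_convert_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command that writes a mono WAV at the given sample rate."""
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst]


def convert_audio(src, dst, sample_rate=16000):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_convert_command(src, dst, sample_rate), check=True)


def is_deepspeech_ready(path, allowed_rates=(16000, 8000)):
    """Check that a WAV file is single-channel and uses an accepted sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() in allowed_rates
```

A typical use would be `convert_audio("audio.wav", "audio_16k.wav")` followed by `is_deepspeech_ready("audio_16k.wav")` as a sanity check before transcription.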

Step4: Run the DeepSpeech engine to generate a transcript for the given audio

There are two APIs to convert the speech to text.

Batch API

We can feed the audio into the DeepSpeech batch API and generate the text in a single shot, but this takes some time when the audio is long. It is therefore not suitable for applications such as voice assistants or live transcription. We can use the stream API for those.
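
A minimal sketch of a batch transcription, assuming the 0.9.x Python binding (`Model`, `enableExternalScorer`, `stt`), NumPy, and a WAV file that is already 16-bit and single-channel. The helper names `read_pcm16` and `transcribe_batch` are hypothetical.

```python
import wave


def read_pcm16(wav_path):
    """Return the raw 16-bit PCM bytes of a mono WAV file."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit, single-channel audio")
        return wav.readframes(wav.getnframes())


def transcribe_batch(model_path, scorer_path, wav_path):
    """Transcribe a whole file with a single stt() call."""
    import numpy as np  # third-party
    from deepspeech import Model  # third-party: pip install deepspeech-tflite

    model = Model(model_path)
    model.enableExternalScorer(scorer_path)  # optional language-model scorer
    audio = np.frombuffer(read_pcm16(wav_path), dtype=np.int16)
    return model.stt(audio)
```

The whole recording is decoded before any text comes back, which is why the waiting time grows with the length of the audio.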

Stream API
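
The stream API accepts the audio in small chunks and can return an intermediate transcript while more audio is still arriving. Here is a minimal sketch, assuming the 0.9.x Python binding (`createStream`, `feedAudioContent`, `intermediateDecode`, `finishStream`) and NumPy. The helper names and the chunk size of 4000 samples are arbitrary choices for illustration, and `model` is expected to be an already loaded `deepspeech.Model`.

```python
def iter_chunks(pcm_bytes, chunk_samples=4000):
    """Yield consecutive slices of 16-bit PCM audio, chunk_samples samples each."""
    step = chunk_samples * 2  # two bytes per 16-bit sample
    for start in range(0, len(pcm_bytes), step):
        yield pcm_bytes[start:start + step]


def transcribe_stream(model, pcm_bytes, chunk_samples=4000):
    """Feed audio chunk by chunk, printing the partial transcript as it grows."""
    import numpy as np  # third-party

    stream = model.createStream()
    for chunk in iter_chunks(pcm_bytes, chunk_samples):
        stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
        print(stream.intermediateDecode())  # text decoded so far
    return stream.finishStream()  # final transcript; the stream is closed here
```

In a live application the chunks would come from a microphone or a network socket instead of a byte string that is already in memory.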

Here is the whole code, which you can run on Google Colab.

Mozilla Deep Speech Github Repo: https://github.com/mozilla/DeepSpeech

Mozilla Deep Speech Documentation: https://deepspeech.readthedocs.io/en/v0.9.3/