I Hate Voice Notes (Transcribing speech using OpenAI Whisper)

07 Oct 2022


Listening to WhatsApp voice notes can be inconvenient or time-consuming. Some people love them, some people hate them.

Curiously, these tastes vary across cultures. In Argentina, for example, people prefer communicating using voice notes.

Be that as it may, being able to transcribe a voice note you received on WhatsApp with very low friction can be quite useful.

Enter OpenAI Whisper. OpenAI released it two weeks ago, and it made me wonder if I could scratch my own itch by building a transcriber bot.

There were two potential obstacles:

  1. How difficult is it to build a WhatsApp bot and retrieve the audio?
  2. How difficult is it to deploy Whisper?

WhatsApp API

This turned out to be fairly easy to test: you can set up a test account and start using the API right away. The documentation is OK. It probably took me less than an hour to be able to receive a voice note and extract the OGG audio using API requests.
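
As a rough illustration of the "receive a voice note" step, here is a minimal Python sketch that pulls the sender and media ID out of an incoming webhook payload. The payload shape (entry, changes, value, messages, with an audio object carrying an id) is my assumption based on the WhatsApp Cloud API webhook format, and the function name is hypothetical. The actual bot's webhook is written in NodeJS, so treat this as a sketch of the idea rather than the real code.

```python
def extract_audio_messages(payload):
    """Return (sender, media_id) pairs for each audio message in a webhook payload.

    Assumes the nested entry -> changes -> value -> messages layout;
    anything that is not an audio message is skipped.
    """
    found = []
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            for message in change.get("value", {}).get("messages", []):
                if message.get("type") == "audio":
                    found.append((message["from"], message["audio"]["id"]))
    return found
```

The media ID is what you would then hand to the API to download the OGG file itself.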

There’s a catch: encryption. When you send a voice note to the business number, it’s not end-to-end encrypted, so there are privacy implications.

OpenAI Whisper

Kudos to OpenAI for making it very easy to run. The GitHub repo with instructions is here: https://github.com/openai/whisper.

On my Mac it was:

$ brew install ffmpeg
$ pip install git+https://github.com/openai/whisper.git
$ whisper audio.ogg
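
The same thing can be done from Python, which is handy if you want to call Whisper from a worker process. Below is a minimal sketch using the package's `load_model` and `transcribe` calls. The helper names and the choice of the "base" model are my own assumptions for illustration, not necessarily what the bot uses.

```python
def transcribe_file(path, model):
    """Return the transcribed text for one audio file.

    `model` is anything with a Whisper-style transcribe(path) method
    that returns a dict containing a "text" key.
    """
    result = model.transcribe(path)
    return result["text"].strip()


def load_and_transcribe(path="audio.ogg", model_name="base"):
    """Load a Whisper model and transcribe a single file."""
    import whisper  # imported lazily so transcribe_file works without it

    model = whisper.load_model(model_name)
    return transcribe_file(path, model)
```

Calling `load_and_transcribe("audio.ogg")` downloads the model weights on first use, so the first run is slower than later ones.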

Here’s a voice note I received from a property agent:

The Whisper result:

“I’ve been the owner of the property just phoned me this morning to say that the property has been sold. She got a cash offer yesterday which is accepted. So yeah, I’m actually taking it off the market right now. But a very nice property. Okay, see you Monday. Bye.”

The only thing Whisper got wrong is the initial “Hi Ben”, which it transcribed as “I’ve been”. And with a South African accent. Impressive.


The bot has three components:

  1. The WhatsApp WebHook API (NodeJS + Express).
  2. A queue (Redis).
  3. A Python worker.

When a WhatsApp message is received, it is pushed onto the Redis queue. The Python worker then transcribes it and sends the transcription back to the originator via the WhatsApp API.
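
The worker side of that flow can be sketched roughly as follows. This is a simplified outline under several assumptions: the queue name, the JSON job fields (`sender`, `audio_path`), and the `transcribe` and `send_reply` callables are all hypothetical stand-ins, and the webhook is assumed to add jobs with RPUSH so that BLPOP reads them in arrival order.

```python
import json

QUEUE_KEY = "voice_notes"  # hypothetical queue name


def process_job(raw_job, transcribe, send_reply):
    """Handle one queued job: transcribe the audio and reply to the sender."""
    job = json.loads(raw_job)
    text = transcribe(job["audio_path"])
    send_reply(job["sender"], text)
    return text


def run_worker(redis_url, transcribe, send_reply):
    """Block on the Redis list forever, handling jobs as they arrive."""
    import redis  # redis-py, imported lazily so process_job has no dependency on it

    client = redis.Redis.from_url(redis_url)
    while True:
        # BLPOP blocks until an item is available and returns (key, value).
        _key, raw_job = client.blpop(QUEUE_KEY)
        process_job(raw_job, transcribe, send_reply)
```

Keeping `process_job` free of Redis and WhatsApp details means it can be exercised with plain functions before wiring up the real services.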

I deployed it on DigitalOcean as a Docker app with no issues. You will need 2 GB of memory, though.


Here it is integrated into WhatsApp. I just forward any voice note to the bot account and it will produce a transcription: