
For those in a hurry
- OpenAI's Whisper model has become a serious competitor in speech recognition, challenging established providers such as Dragon NaturallySpeaking or Wolters Kluwer DictNow.
- Speech recognition is the foundation for many AI-based automations, such as meeting summaries, medical doctor-patient conversation documentation, voice search in corporate data, or specialized translation services.
- Whisper achieves excellent results even in its base model (around 74 million parameters). This allows not only dictations to be summarized efficiently but also conversations to be logged and the action items agreed upon in them to be extracted automatically.
- With the language models now available, company-specific use cases can be implemented very well: away from tedious typing, toward targeted voice input and voice control of IT systems and machines.
Tip for trying it out
If you want to try out the use case of summarizing conversation content from MS Teams, Zoom, Google Meet, etc., analyzing online meetings by share of speaking time, or extracting next steps, you should take a look at providers such as fireflies.ai and read.ai. Both join a video conference as a silent participant, log every word precisely, and create a conversation analysis based on a predefined template.
The crux of the matter
In the automation of business processes, the ideation phase is always about finding the "crux of the matter" (Latin: punctum saliens). In other words: what proof must be provided for an automation technology to be classified as effective and purposeful for a specific use case?
In the case of AI-based automation, this is first and foremost the recognition rate (accuracy): Does the artificial intelligence deliver a correct result? This question is so important because AI is based on stochastic processes that provide the most probable result. This is not necessarily the correct one, depending on how (well) the AI model was trained.
In our linguistic use case, the crux of the matter is the correct transcription of the audio recording into continuous text. That is what we are focusing on.
Whisper - the whisper model
With its Whisper model, OpenAI provides an excellent speech recognition library, which we integrate via a Python library. The only current peculiarity is that the model processes audio in windows of 30 seconds. Consequently, to transcribe a longer recording, we must split it into 30-second snippets.
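The splitting step can be sketched in a few lines of plain Python. The 16 kHz sample rate is the input rate Whisper expects; the recording is modelled here as a bare list of samples rather than a real audio file:

```python
# Sketch: split a recording into the 30-second snippets that
# Whisper's processing window requires.
SAMPLE_RATE = 16_000          # Whisper works with 16 kHz mono audio
SNIPPET_SECONDS = 30          # length of one Whisper window
SNIPPET_SAMPLES = SAMPLE_RATE * SNIPPET_SECONDS

def split_into_snippets(samples):
    """Yield consecutive 30-second snippets; the last one may be shorter."""
    for start in range(0, len(samples), SNIPPET_SAMPLES):
        yield samples[start:start + SNIPPET_SAMPLES]

# A 70-second dummy recording yields three snippets: 30 s, 30 s, 10 s.
recording = [0.0] * (70 * SAMPLE_RATE)
snippets = list(split_into_snippets(recording))
print([len(s) // SAMPLE_RATE for s in snippets])  # → [30, 30, 10]
```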
Whisper reads these snippets one after the other and translates them into spectrograms. You can see an example in the title image above.
This is interesting! Whisper does not create a text file directly from the audio file but takes a detour via a graphical artifact - the spectrogram. From this spectrogram, Whisper then deduces not only the spoken language (e.g., German or English) through pattern recognition but also decodes, i.e., transcribes, the spoken text. This is where the GPT approach pursued by OpenAI becomes apparent: turn patterns into numbers, turn numbers into - in this case - text.
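To make the detour via the spectrogram tangible, here is a deliberately naive magnitude spectrogram in plain Python. Whisper actually computes a log-mel spectrogram with an FFT; the frame size, hop length, and test tone below are purely illustrative assumptions:

```python
import cmath
import math

def dft_magnitudes(frame):
    """Naive DFT of one frame; returns the magnitude of each
    frequency bin in the lower half of the spectrum."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def spectrogram(samples, frame_size=64, hop=32):
    """One column of frequency magnitudes per (overlapping) frame."""
    return [dft_magnitudes(samples[i:i + frame_size])
            for i in range(0, len(samples) - frame_size + 1, hop)]

# A pure tone at 4 cycles per frame shows up as a single bright bin -
# exactly the kind of visual pattern the model learns to decode.
tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(peak_bin)  # → 4
```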
This happens with every 30-second snippet. In the end, the transcribed text is concatenated: the transcription is complete.
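The whole loop - split, transcribe each snippet, concatenate - fits in a few lines. `transcribe_snippet` below is a hypothetical stand-in; a real implementation would replace it with the actual Whisper model call:

```python
SAMPLE_RATE = 16_000                 # Whisper works with 16 kHz mono audio
SNIPPET_SAMPLES = 30 * SAMPLE_RATE   # one 30-second window

def transcribe_snippet(snippet):
    """Stand-in for the model: a real version would feed the snippet
    to Whisper and return the decoded text."""
    return f"[{len(snippet) // SAMPLE_RATE}s transcribed]"

def transcribe_recording(samples):
    """Split into 30-second snippets, transcribe each, concatenate."""
    parts = [transcribe_snippet(samples[i:i + SNIPPET_SAMPLES])
             for i in range(0, len(samples), SNIPPET_SAMPLES)]
    return " ".join(parts)

# A 65-second recording becomes two full windows plus a 5-second rest.
print(transcribe_recording([0.0] * (65 * SAMPLE_RATE)))
# → [30s transcribed] [30s transcribed] [5s transcribed]
```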
Cui bono?
Where can Whisper be useful in a corporate environment? In addition to its high recognition accuracy, the simplicity of using the Whisper library is impressive. It thus forms a welcome foundation for applying artificial intelligence to specific problems in the company. Here are a few examples:
- Summaries of conversations and negotiations (meeting notes)
- Searching longer recordings for specific discussed content (semantic search)
- Voice control of downstream software (e.g., ERP system) and communication with machines (natural language control)
- Spoken interaction with AI agents for the automated creation of data analyses, evaluations, and dashboards (speak instead of type)
People who prefer to interact with computer systems using spoken language instead of typing will appreciate this capability. Whisper not only paves the way for the analysis and utilization of spoken words as an object of interest; it also opens up a new way of interacting with enterprise applications and corporate data, similar to what we know from consumer software like Apple's Siri.
The spoken word counts again.






