Intuitive user experiences are a key part of modern software products. At Unlikely AI, we recently undertook an ambitious project to integrate automatic speech recognition (ASR) capabilities into our platform. This technical blog post describes our journey, highlighting the challenges we faced with niche domain-specific terminology, and our pivot to develop a custom model built on state-of-the-art open-source solutions.

Our initial investigation of ASR technology was driven by the desire to allow users to interact with our platform through voice as well as text. To achieve this, we integrated Deepgram’s speech-to-text API into our platform. Deepgram stood out for its streaming functionality and seamless integration with front-end frameworks, particularly React, making it a promising initial choice for our requirements. Its performance seemed impressive, providing accurate transcriptions for the majority of inputs right out of the box.

Despite Deepgram’s overall effectiveness, we encountered a significant hurdle when dealing with specialised terminology and uncommon proper nouns. Transcription quality noticeably degraded in these scenarios, impacting the clarity and accuracy of some key pieces of functionality in our platform. Deepgram’s solution to this problem is keyword boosting, a feature designed to enhance the recognition of vocabulary the model has not previously encountered. While this approach helped to some extent, it fell short of our expectations for providing a reliable transcription experience.

As we scaled up our experiments, we discovered the limitations of keyword boosting. It did improve model performance by increasing the likelihood of recognising specific words or phrases, but its effectiveness diminished as the complexity and volume of specialised vocabulary increased. This limitation prompted us to reconsider our approach.

Realising that keyword boosting would be insufficient led us to explore alternative solutions. While most current ASR providers, including Deepgram, offer some degree of customisation, they often do not support the development of fully custom models without significant financial investment.

Faced with these challenges, and based on our understanding that fine-tuning a pre-trained ASR model on bespoke data would give us the best performance, we decided to take control of our destiny by training a custom model. This decision was fueled by the availability of state-of-the-art open-source models in the ASR domain. These open-source solutions offer the flexibility, power, and cost-effectiveness required to develop a model tailored to our specific needs, including the ability to accurately transcribe specialised terminology and uncommon proper nouns.

Fine-tuning Whisper for Custom ASR

Following our decision to develop a custom ASR solution, we chose OpenAI’s Whisper as the foundation model because of its robust performance across a variety of languages and accents. Our hope was that fine-tuning a powerful model like Whisper on our own dataset would lead to strong performance across our custom vocabulary.

Whisper’s architecture is designed to excel in both general and challenging audio environments, making it an ideal candidate for customisation. Its versatility and state-of-the-art performance stem from extensive training on a diverse dataset, encompassing a wide range of audio types and transcriptions. This pre-training means Whisper has learned a broad base of linguistic knowledge, which we aimed to build upon to recognise our domain-specific vocabulary more effectively.

Fine-tuning Process

Fine-tuning a model like Whisper involves adjusting the pre-trained model weights slightly to perform better on a specific task; in our case, improving accuracy for the specialised terminology used within our industry. This process requires a dataset annotated with the correct transcriptions, including the specialised terms that we want the model to learn.

To begin with, we collected a small dataset of audio recordings and their corresponding transcripts, which included paragraphs of text containing the specialised terminology and proper nouns relevant to our domain. To our surprise, the model performed well even in field testing with this limited dataset, showcasing the generalisation ability and data efficiency of fine-tuning a large pre-trained ASR model.

Example Code for Fine-tuning Whisper

Below is a simplified example of how one might approach fine-tuning the Whisper model with a custom dataset. This example assumes you have installed the necessary libraries and prepared your dataset in the required format.
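The sketch below uses the Hugging Face Transformers and Datasets libraries, which is one common route rather than necessarily the exact setup we used; the model size, dataset path, hyperparameters, and the assumption that the dataset metadata provides a "text" column are all illustrative.

```python
import torch
from datasets import Audio, load_dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

model_name = "openai/whisper-small"  # illustrative; pick a size that fits your hardware
processor = WhisperProcessor.from_pretrained(model_name, language="en", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Assumes an "audiofolder"-style dataset whose metadata supplies the reference transcript
# in a "text" column; resample everything to Whisper's expected 16 kHz.
dataset = load_dataset("audiofolder", data_dir="data/asr_corpus")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(example):
    audio = example["audio"]
    example["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

dataset = dataset.map(prepare, remove_columns=dataset["train"].column_names)

def collate(features):
    # Pad log-Mel features and label ids separately; mask label padding with -100
    # so it is ignored by the loss.
    batch = processor.feature_extractor.pad(
        [{"input_features": f["input_features"]} for f in features], return_tensors="pt"
    )
    labels_batch = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
    )
    labels = labels_batch["input_ids"].masked_fill(labels_batch["attention_mask"].ne(1), -100)
    # The model prepends the decoder start token itself, so drop it from the labels if present.
    if (labels[:, 0] == model.config.decoder_start_token_id).all():
        labels = labels[:, 1:]
    batch["labels"] = labels
    return batch

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-finetuned",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=3,
    fp16=torch.cuda.is_available(),
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    data_collator=collate,
)
trainer.train()
```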

This code snippet provides a basic framework for fine-tuning. In practice, you’ll need to manage data loading and batching more efficiently, possibly using tools like torch.utils.data.DataLoader. Additionally, handling longer audio files may require segmenting them into smaller chunks that fit the model’s maximum input size.
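As a concrete example of that last point, Whisper's feature extractor works on windows of up to 30 seconds, so one simple approach is to split longer recordings into fixed-length segments (aligning the transcript with each segment is a separate problem, not shown here):

```python
import numpy as np

def chunk_audio(samples: np.ndarray, sample_rate: int = 16_000, max_seconds: int = 30):
    """Split a mono waveform into consecutive segments of at most max_seconds."""
    chunk_len = max_seconds * sample_rate
    return [samples[i : i + chunk_len] for i in range(0, len(samples), chunk_len)]
```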

Challenges and Considerations

Fine-tuning a large model like Whisper requires significant computational resources, typically needing GPUs or TPUs to complete in a reasonable timeframe. Moreover, the quality and quantity of the fine-tuning dataset are vital; insufficient or poorly annotated data can lead to suboptimal outcomes or even exacerbate the model’s existing biases.

After training, it’s essential to evaluate the fine-tuned model thoroughly, comparing its performance on a separate test set to ensure that it has indeed learned to handle the specialised terminology better without losing its general transcription capabilities. The latter risk is a phenomenon known as catastrophic forgetting, where a model begins to perform poorly on data it was originally trained on but does not see during fine-tuning.
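A simple way to run this check is to compute word error rate (WER) on two held-out sets, one domain-specific and one general, for both the fine-tuned and the original model. The sketch below uses the jiwer package; transcribe here is a hypothetical stand-in for whatever inference helper you use.

```python
from jiwer import wer

def compare_wer(transcribe, domain_set, general_set):
    """Each *_set is a list of (audio, reference_text) pairs; transcribe maps audio -> text."""
    domain_wer = wer([ref for _, ref in domain_set], [transcribe(a) for a, _ in domain_set])
    general_wer = wer([ref for _, ref in general_set], [transcribe(a) for a, _ in general_set])
    return domain_wer, general_wer

# Hypothetical usage: a fine-tuned model whose general WER is much worse than the
# pre-trained baseline's has likely suffered catastrophic forgetting, even if its
# domain WER has improved.
# finetuned_domain, finetuned_general = compare_wer(transcribe_finetuned, domain_set, general_set)
# baseline_domain, baseline_general = compare_wer(transcribe_pretrained, domain_set, general_set)
```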

Algorithm for Sentence Separation

Now that we had a well-performing speech recognition model, we wanted to focus on user experience. Running the ASR in a streaming fashion on short segments of speech allows user commands to be processed with a shorter delay, leading to a smoother experience. The problem then becomes choosing where to split the audio: ASR accuracy can drop drastically if speech is cut mid-sentence, and the full meaning cannot be properly reconstructed from the transcribed segments. Our investigations in this domain led us through a series of experiments, ranging from basic signal processing techniques to advanced deep learning methodologies, in pursuit of a reliable solution.

Initial Approaches to Sentence Detection

We began by exploring simple techniques which leverage the natural pauses in speech as markers for sentence boundaries. Using the volume of the audio signal and low-pass filters, we attempted to identify these pauses, but had limited success: while the simplicity of these methods is appealing, we found that they fall short in live audio environments with background noise.
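For illustration, a rough sketch of this kind of energy-based pause detection is below; the frame length, smoothing window, and silence threshold are arbitrary values, and the moving average stands in for the low-pass filtering we experimented with.

```python
import numpy as np

def find_pauses(samples, sample_rate, frame_ms=30, threshold=0.01, min_pause_ms=300):
    """Return (start, end) sample indices of candidate pauses in a mono float signal."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Per-frame RMS energy, then a moving average acting as a crude low-pass filter.
    rms = np.sqrt((frames ** 2).mean(axis=1))
    smoothed = np.convolve(rms, np.ones(5) / 5, mode="same")

    silent = smoothed < threshold
    pauses, start = [], None
    for i, is_silent in enumerate(silent):
        if is_silent and start is None:
            start = i
        elif not is_silent and start is not None:
            if (i - start) * frame_ms >= min_pause_ms:
                pauses.append((start * frame_len, i * frame_len))
            start = None
    if start is not None and (n_frames - start) * frame_ms >= min_pause_ms:
        pauses.append((start * frame_len, n_frames * frame_len))
    return pauses
```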

Our next attempt was to use Voice Activity Detection (VAD), a deep learning approach which focuses on the presence of speech to improve sentence detection. Despite its advanced nature, we encountered inherent limitations with VAD, particularly its inability to facilitate word-by-word live transcription and its vulnerability to ambient noise. These obstacles made it unsuitable for our use case.
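For reference, this is roughly what a VAD-based segmenter looks like, sketched here with the webrtcvad package (an example implementation, not necessarily the one we evaluated); it expects 16-bit mono PCM and frame sizes of 10, 20, or 30 ms.

```python
import webrtcvad

def speech_segments(pcm_bytes, sample_rate=16000, frame_ms=30, aggressiveness=2):
    """Yield (start_ms, end_ms) spans that the VAD classifies as speech."""
    vad = webrtcvad.Vad(aggressiveness)  # 0 (least) to 3 (most aggressive filtering)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    start = None
    for i in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        t_ms = i // 2 * 1000 // sample_rate
        if vad.is_speech(pcm_bytes[i : i + frame_bytes], sample_rate):
            if start is None:
                start = t_ms
        elif start is not None:
            yield (start, t_ms)
            start = None
    if start is not None:
        yield (start, len(pcm_bytes) // 2 * 1000 // sample_rate)
```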

Leveraging Whisper for Sentence Separation

Our adoption of OpenAI’s Whisper model brought a new dimension to our efforts. Whisper’s capability to output transcriptions at the word level introduced a potential pathway to reliable sentence separation. The challenge then shifted to managing the continuous stream of audio bytes via WebSocket and ensuring the accuracy and coherence of transcriptions in real-time.
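As a rough illustration of that streaming loop, the sketch below uses a FastAPI WebSocket endpoint; transcribe_words is a hypothetical helper that runs the fine-tuned Whisper model on the audio buffered so far and returns a word-level hypothesis.

```python
import numpy as np
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/asr")
async def asr_stream(websocket: WebSocket):
    await websocket.accept()
    audio_buffer = bytearray()
    try:
        while True:
            # The client streams raw 16 kHz, 16-bit PCM chunks as it records them.
            chunk = await websocket.receive_bytes()
            audio_buffer.extend(chunk)

            # Re-run inference on everything received so far; transcribe_words() is a
            # hypothetical helper returning the word-level hypothesis for the buffer.
            samples = np.frombuffer(bytes(audio_buffer), dtype=np.int16).astype(np.float32) / 32768.0
            words = transcribe_words(samples)

            # Send the latest hypothesis back; the majority-vote step described below
            # decides which words are final.
            await websocket.send_json({"words": words})
    except WebSocketDisconnect:
        pass
```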

While the word- and segment-level timestamps that OpenAI derives from attention weights and forced alignment appeared to be a solution, they lacked the reliability needed for seamless sentence detection. These interpolated timestamps are not a training target for the model and proved prone to inconsistencies, leading to disruptions in the transcription process.

(For related work addressing this, see WhisperX from researchers at Oxford: a time-accurate speech recognition system with word-level timestamps.)

The Majority Vote Algorithm: A Novel Solution

Our breakthrough came with the conceptualisation and implementation of a majority vote algorithm, designed to handle the fact that the model may output different transcriptions for the same words as new audio bytes arrive. The algorithm follows these steps (a simplified sketch in code appears after the list):

  1. Candidate Word Tracking: for every position within a sentence, maintain a list of potential words predicted by the model, effectively capturing the dynamic nature of live transcription.
  2. Vote Aggregation: as new inferences emerge with the latest audio bytes appended, update the votes for candidate words, gradually building towards a consensus.
  3. Word Finalisation: establish a word at a specific index as finalised once its vote count surpasses a predetermined threshold; a threshold of 4 gave us the highest end-to-end accuracy in our implementation.
  4. Sentence Punctuation Detection: analyse the punctuation of each word to determine sentence boundaries, using punctuation as a reliable indicator for the end of a sentence.
  5. Sentence Finalisation: consider a sentence complete when all its constituent words have been finalised, thus ensuring a consistent and accurate transcription.
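A simplified Python sketch of this logic is below; it omits details of our production implementation (for example, how hypotheses are tokenised into words), but the voting and finalisation steps mirror the list above.

```python
FINALISE_THRESHOLD = 4  # the vote count that gave us the highest end-to-end accuracy

class MajorityVoteBuffer:
    def __init__(self, threshold=FINALISE_THRESHOLD):
        self.threshold = threshold
        self.votes = []      # one {word: vote_count} dict per word position
        self.finalised = []  # words whose position has been locked in

    def update(self, hypothesis_words):
        """Feed the latest word-level hypothesis; return the sentences completed so far."""
        # Steps 1-2: track candidate words per position and aggregate votes.
        for i, word in enumerate(hypothesis_words):
            if i < len(self.finalised):
                continue  # already finalised, ignore further changes at this position
            if i >= len(self.votes):
                self.votes.append({})
            self.votes[i][word] = self.votes[i].get(word, 0) + 1

        # Step 3: finalise positions left to right once a candidate reaches the threshold,
        # so the visible transcript never changes behind the user.
        while len(self.finalised) < len(self.votes):
            candidates = self.votes[len(self.finalised)]
            best_word, best_count = max(candidates.items(), key=lambda kv: kv[1])
            if best_count < self.threshold:
                break
            self.finalised.append(best_word)

        # Steps 4-5: use trailing punctuation on finalised words to close sentences.
        sentences, current = [], []
        for word in self.finalised:
            current.append(word)
            if word and word[-1] in ".?!":
                sentences.append(" ".join(current))
                current = []
        return sentences
```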

This algorithm, distinguished by its accuracy and low latency, led to a significant improvement in user experience. It not only enhanced the robustness of our transcription process against environmental noise but also aligned with our goal of delivering real-time, word-by-word transcription in live-streaming contexts.

Conclusion

The path to seamlessly incorporating Automatic Speech Recognition (ASR) technology into our platform has been both a formidable challenge and a source of valuable insights. We initially leaned on Deepgram’s speech-to-text API, drawn by its streaming functionality and ease of integration, especially with React. However, the limitations of keyword boosting prompted us to rethink our strategy. Pivoting to a custom model, in spite of the significant engineering challenges, underscored our dedication to delivering a solution that not only meets general transcription needs but excels in handling the nuances of domain-specific language.

The challenges of integrating ASR technology, particularly in live streaming environments, highlighted the necessity for a robust method to ensure sentence separation and transcription accuracy amidst varying audio conditions. Deploying our own method led to significant improvements over off-the-shelf solutions. This exploration and eventual pivot to a custom-developed ASR model is a testament to our team’s culture of innovation and resilience. We are proud of the ASR system we have deployed, and while we expect to find areas that need work down the line, we are excited to be on the quest to keep improving it.