Hey!
Have you noticed something crucial missing in the transcription we created with Whisper AI? At first glance, the transcription seems much more accurate compared to the one from the SpeechRecognition library and Google Web Speech API. However, if you look closer, there’s a key omission.

Not only was the name Ivan transcribed incorrectly, but an important word is also missing in one of the sentences. Check the notebook again:

“I’m a sound engineer turned a scientist, curious about machine learning and artificial intelligence.”

Did you spot it? The word “data” is missing from the phrase “data scientist,” leaving it as just “a scientist.”

But don’t think Whisper isn’t good enough! Remember, we used the “base” version of the model, one of the smaller checkpoints, which trades some accuracy for lower memory use and faster inference. Now it’s your turn to dig deeper into Whisper AI and make the transcription even more accurate.

Here’s your task:
Load the “medium” version of the Whisper model and see if the transcription improves. Will it correctly transcribe my name? What about the missing word “data” in “data scientist”?
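As a starting point, here is a minimal sketch of what switching model sizes might look like. It assumes the openai-whisper package is installed; the helper name `transcribe_with`, the `VALID_SIZES` tuple, and the file name in the comment are placeholders for illustration, not the lesson's actual code.

```python
# A subset of the published Whisper checkpoint names (assumption: the lesson
# only needs these; other variants exist).
VALID_SIZES = ("tiny", "base", "small", "medium", "large")


def transcribe_with(model_size, audio_path):
    """Transcribe audio_path with the requested Whisper checkpoint."""
    if model_size not in VALID_SIZES:
        raise ValueError(f"unknown model size: {model_size!r}")

    # Imported lazily so the size check above works even without the package.
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_size)  # downloads the weights on first use
    result = model.transcribe(audio_path)
    return result["text"].strip()


# Example call (hypothetical file name; use the audio file from the lesson):
# print(transcribe_with("medium", "speech_sample.wav"))
```

The only change from a “base” run is the string passed to `load_model`. Expect the medium checkpoint to take noticeably longer to download and run.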

Become a speech recognition expert by making a small modification to the code we used in the lesson. Then, evaluate your transcript using WER (Word Error Rate) and CER (Character Error Rate) to see if it truly performs better.
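If you want to see what these two metrics actually measure, the sketch below computes them from scratch with a plain edit distance. It is an illustration only: the lesson's own tooling may differ (libraries such as jiwer offer `wer` and `cer` functions), the function names here are made up for this example, and the only normalization applied is lowercasing, so punctuation and spaces count as written.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion from the reference
                current[j - 1] + 1,      # insertion into the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference.lower(), hypothesis.lower()) / len(reference)


# Illustrative strings based on the phrase discussed above.
reference = "a sound engineer turned data scientist"
hypothesis = "a sound engineer turned a scientist"
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

Here one word out of six is wrong, so a single slip costs far more in WER than in CER. That is why it helps to report both.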

Feel free to experiment with even larger versions of the model on your own audio. And if you discover something interesting, share it with the Community!

Good luck! 😊

A Note on Variability

It’s worth noting that machine learning models like Whisper can sometimes produce slightly different transcriptions even when processing the same audio file. This happens because decoding is not always deterministic: when a first pass looks unreliable, Whisper can retry with sampling at a higher temperature, and small numerical differences between hardware setups can also shift the result. As a result, it’s possible for the missing word “data” to appear correctly in one transcription and be omitted in another.

If you do obtain a correct transcript where “data” is included, you can still use the medium version of the model to compare results and see if it transcribes the name “Ivan” correctly this time.

This variability can also lead to WER and CER scores that differ from those in the lesson. Don’t worry if your results don’t exactly match the lesson examples. This is expected behavior and a normal part of working with speech recognition models.
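To check how much your own runs vary, you could transcribe the same file several times and collect the distinct results. The sketch below is one way to do that, under a few assumptions: the helper names are invented for this example, the openai-whisper package is installed, and `temperature=0.0` is passed to `transcribe` to request greedy decoding without the higher-temperature retries. Even then, identical output across machines is not guaranteed.

```python
def distinct_transcripts(transcribe_fn, audio_path, runs=3):
    """Call transcribe_fn repeatedly; return distinct texts in first-seen order."""
    seen = []
    for _ in range(runs):
        text = transcribe_fn(audio_path).strip()
        if text not in seen:
            seen.append(text)
    return seen


def make_whisper_transcriber(model_size="medium", temperature=0.0):
    """Build a transcribe function around a loaded Whisper model."""
    import whisper  # lazy import; requires the openai-whisper package

    model = whisper.load_model(model_size)

    def transcribe(audio_path):
        # A single temperature value disables the fallback to higher
        # temperatures that the default settings allow.
        return model.transcribe(audio_path, temperature=temperature)["text"]

    return transcribe


# Example call (hypothetical file name; use the audio file from the lesson):
# variants = distinct_transcripts(make_whisper_transcriber(), "speech_sample.wav")
# print(len(variants), "distinct transcript(s)")
```

If the list has more than one entry, score each variant separately rather than treating any single run as the definitive result.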