AI-Powered Transcription Services Showdown: AWS vs. Google vs. IBM Watson vs. Nuance
In one of my previous blog posts, I touched on the topic of AI-powered transcription services on the market. There, I introduced the idea that, at the current pace of multimedia production, traditional, human-powered transcription services are not the solution.
In the past 2 years, we’ve produced 90% of all the data our civilization has. At this pace, and with a 9:1 ratio for transcribing multimedia files, human-powered transcription simply cannot keep up. It’s too slow, too expensive, too prone to error, and too vulnerable to data leaks.
Just as hiring an army of workers is not the best way to dig a perfectly straight, 1,000-mile ditch, we need to start thinking about how machines can help.
In this blog post, I’d like to dig a bit deeper and cover the 4 major transcription services more thoroughly: Amazon, Google, IBM, and Nuance. They are all good players; however, only one can fully respond to all of your specific needs.
To help you choose the best transcription service provider, let’s make a little comparison between the four.
My Comparison Methodology
I’ll be covering the four providers from several different angles, so you can get a more comprehensive understanding of their value proposition for your specific needs. Here are the different angles I’ll be covering:
- Speed. The speed of a transcription platform is a crucial factor. Given enough time, anyone could transcribe multimedia content, but the whole point of platforms like these is to make that time as short as possible. In some cases, though, speed may not be the deciding factor. Some companies will be better off with a slower but more accurate solution.
- Accuracy. Accuracy is paramount to a transcription platform. Very often the worth of a transcription platform is measured by its accuracy. If the platform gives you a transcript that still needs edits to punctuation and speaker labels, then that platform, my friend, hasn’t done much of the job for you. But again, in some cases, companies with large volumes of transcripts will be better off with a slightly less accurate but much cheaper solution.
- Price. Whether you are a small company or a well-established vendor moving the market, everyone cares about costs. How much of a deciding factor this will be depends on how large your budget is and how important the other two metrics are.
Now that I’ve introduced the software packs and the methodology of comparing the 4 transcription services, let’s get started.
Amazon Transcribe Service
To keep pace with the evolution of language, the Amazon Transcribe platform is continually learning and improving. It is designed to provide fast and accurate automated transcripts for multimedia files of varying quality.
Currently, Amazon’s transcription service can process multimedia content within these limits:
- Duration: maximum 2 hours
- Custom Vocabulary: maximum 50 KB file size
- Sampling rate: from 8 kHz (telephony audio) to 48 kHz
- Languages: English and Spanish
- Formats: WAV, MP3, MP4, FLAC
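To make the limits above a bit more concrete, here is a small pre-flight check you could run before submitting a file. This is only a sketch: the `MediaFile` class and the `check_limits` function are names I made up for illustration, they are not part of any AWS SDK, and the numbers are simply the ones quoted in this post.

```python
from dataclasses import dataclass
from typing import List

# Limits as quoted in this post; treat them as assumptions, not an official spec.
MAX_DURATION_SECONDS = 2 * 60 * 60   # 2 hours
MIN_SAMPLE_RATE_HZ = 8_000           # telephony audio
MAX_SAMPLE_RATE_HZ = 48_000
SUPPORTED_FORMATS = {"wav", "mp3", "mp4", "flac"}


@dataclass
class MediaFile:
    """Hypothetical description of a file you plan to transcribe."""
    name: str
    duration_seconds: float
    sample_rate_hz: int


def check_limits(media: MediaFile) -> List[str]:
    """Return a list of problems; an empty list means the file fits the limits."""
    problems = []
    # Take the text after the last dot as the file extension, if there is one.
    extension = media.name.rsplit(".", 1)[-1].lower() if "." in media.name else ""
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if media.duration_seconds > MAX_DURATION_SECONDS:
        problems.append("longer than 2 hours")
    if not MIN_SAMPLE_RATE_HZ <= media.sample_rate_hz <= MAX_SAMPLE_RATE_HZ:
        problems.append("sample rate outside 8 kHz to 48 kHz")
    return problems


if __name__ == "__main__":
    print(check_limits(MediaFile("call.wav", 600, 8_000)))
    print(check_limits(MediaFile("talk.ogg", 3 * 3600, 96_000)))
```

A check like this does not replace whatever validation the service itself performs; it only saves you an upload when a file is obviously out of range.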
Thanks to AWS’s processing prowess, Amazon Transcribe produces transcripts at an astonishing speed.
The best thing about Amazon Transcribe is the accuracy of transcriptions. AWS has been the world’s most comprehensive and broadly adopted cloud platform for the last 12 years. This experience can be seen in the accuracy Amazon Transcribe shows in their results.
Namely, unlike other transcription services, the Amazon Transcribe platform produces text that is ready to use, without the need for further editing. To achieve this, AWS Transcribe pays special attention to:
- Punctuation. The Amazon Transcribe platform adds appropriate punctuation as it goes and formats the text automatically. The result is intelligible output that can be used without further editing.
- Confidence score. AWS Transcribe makes sure to provide a confidence score which shows how confident the platform is with the transcription.
This means you can always check the confidence score to see whether a particular line of the transcript needs alterations.
- Possible alternatives. The platform also gives you an opportunity to make some alterations in cases where you are not completely satisfied with the results.
- Timestamp Generation. Powered by deep learning technologies, AWS Transcribe automatically generates time-stamped text transcripts.
This feature provides timestamps for every word which makes locating the audio in the original recording very easy by searching for the text.
- Custom Vocabulary. AWS Transcribe allows you to create your own custom vocabulary. By creating and managing a custom vocabulary you expand and customize the speech recognition of AWS Transcribe.
Basically, custom vocabulary gives AWS Transcribe more information about how to process speech in the multimedia file.
This feature is very important for achieving high accuracy in specialized domains such as engineering, medicine, law enforcement, and law.
- Multiple Speakers. AWS Transcribe platform can identify different speakers in a multimedia file. The platform can recognize when the speaker changes and attribute the transcribed text accordingly. Recognition of multiple speakers is handy when transcribing multimedia content that involves multiple speakers (such as telephone calls, meetings, etc.).
AWS Transcribe platform also allows you to specify the number of speakers you want to be identified in the multimedia file. The platform allows identification of up to 10 speakers.
The best performance is achieved when the number of speakers you ask the platform to identify matches the number of speakers in the multimedia content.
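Here is a rough sketch of how per-word confidence scores and timestamps can be put to work once you have a transcript. The list of word records below is a simplified, hypothetical structure that I made up for this example; a real response from any of these services is shaped differently and would need to be mapped into this form first. The 0.8 threshold is also just an assumption.

```python
# Hypothetical, simplified word records (text, start/end in seconds, confidence).
# This is NOT the actual response format of any service.
words = [
    {"text": "the", "start": 0.00, "end": 0.21, "confidence": 0.99},
    {"text": "patient", "start": 0.25, "end": 0.80, "confidence": 0.97},
    {"text": "has", "start": 0.84, "end": 1.02, "confidence": 0.98},
    {"text": "tachycardia", "start": 1.10, "end": 1.95, "confidence": 0.62},
]


def low_confidence(word_records, threshold=0.8):
    """Return the records whose confidence is below the threshold."""
    return [w for w in word_records if w["confidence"] < threshold]


def find_word(word_records, query):
    """Return the start time of every occurrence of the query word."""
    return [w["start"] for w in word_records if w["text"].lower() == query.lower()]


if __name__ == "__main__":
    # Words a human reviewer may want to double-check.
    for w in low_confidence(words):
        print(f'{w["text"]} at {w["start"]:.2f}s (confidence {w["confidence"]:.2f})')
    # Jump to the spot in the recording where a word was spoken.
    print(find_word(words, "Patient"))
```

The same idea scales to a full transcript: review only the words that fall under your threshold, and use the timestamps to jump straight to the matching audio.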
The best part of Amazon Transcribe, unlike the other transcription services we discuss, is that you pay as you go, based on the seconds of audio transcribed per month.
Amazon Transcribe API is billed monthly at a rate of $0.00056 per second. Usage is billed in one-second increments, with a minimum per request charge of 15 seconds.
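The billing rule above is easy to work through in a few lines. This is a minimal sketch based only on the rate and the 15-second minimum quoted in this post; rounding a partial second up to the next full second is my assumption about how "one-second increments" is applied, and `amazon_cost` is a name I made up.

```python
import math

RATE_PER_SECOND = 0.00056    # USD, as quoted in this post
MIN_BILLED_SECONDS = 15      # minimum charge per request


def amazon_cost(duration_seconds: float) -> float:
    """Estimate the cost of a single request in USD."""
    # Assumption: a partial second is rounded up to the next whole second.
    billed_seconds = max(MIN_BILLED_SECONDS, math.ceil(duration_seconds))
    return round(billed_seconds * RATE_PER_SECOND, 5)


if __name__ == "__main__":
    for seconds in (5, 90, 3600):
        print(f"{seconds:>5} s -> ${amazon_cost(seconds):.4f}")
```

Under these assumptions, a 5-second clip is billed as 15 seconds, and a one-hour recording comes to roughly $2.02.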
Thanks to all of these features, Amazon Transcribe can be considered a highly accurate transcription service. With its speed, accuracy, and price, it is one of the best players in the game, if not the best.
Google Speech-to-Text
Google Speech-to-Text handles multimedia content of varying length and returns results immediately. Thanks to Google’s Machine Learning technology, the platform can process both real-time streaming and prerecorded audio content, in encodings including FLAC, AMR, PCMU, and Linear-16.
The platform recognizes 120 languages, which gives it far broader language coverage than the Amazon Transcribe platform.
Despite this, Google still falls short of Amazon Transcribe on accuracy and price.
Google Speech-to-Text accuracy improves over time as Google improves the internal speech recognition technology used by Google products. It includes:
- Automatic identification of the spoken language. Google employs this feature to automatically identify the language spoken in the multimedia content (out of 4 selected languages) without any additional alterations.
- Automatic recognition of proper nouns and context-specific formatting. Google Speech-to-Text works well with real-life speech. It can accurately transcribe proper nouns and appropriately format language (such as dates and phone numbers).
- Phrase hints. Almost identical to Amazon’s Custom Vocabulary, this feature lets you customize the context by providing a set of words and phrases that are likely to occur in the audio.
- Noise robustness. This feature of Google Speech-to-Text allows for noisy multimedia to be handled without additional noise cancellation.
- Inappropriate content filtering. Google Speech-to-Text is capable of filtering inappropriate content in text results for some
- Automatic punctuation. Like Amazon Transcribe, this platform also uses punctuation in transcriptions.
- Speaker recognition. This feature is similar to Amazon’s recognition of multiple speakers. It makes automatic predictions about which of the speakers in a conversation spoke which part of the text.
Google Speech-to-Text costs $0.006 per 15 seconds, while the video model costs twice as much, at $0.012 per 15 seconds.
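To see what these prices mean for a given recording, here is a small sketch. It uses only the two prices quoted above; rounding every request up to the next full 15-second block is my assumption, and the `google_cost` function and the `"standard"`/`"video"` labels are names I chose for illustration.

```python
import math

BLOCK_SECONDS = 15
# USD per 15-second block, as quoted in this post.
PRICE_PER_BLOCK = {"standard": 0.006, "video": 0.012}


def google_cost(duration_seconds: float, model: str = "standard") -> float:
    """Estimate the cost of a single request in USD."""
    # Assumption: audio is billed in whole 15-second blocks, rounded up.
    blocks = math.ceil(duration_seconds / BLOCK_SECONDS)
    return round(blocks * PRICE_PER_BLOCK[model], 3)


if __name__ == "__main__":
    print(google_cost(3600))            # one hour, standard model
    print(google_cost(3600, "video"))   # one hour, video model
    print(google_cost(20, "video"))     # 20 seconds is billed as two blocks
```

Under these assumptions, one hour of audio costs about $1.44 with the standard model and $2.88 with the video model.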
Considering the speed, price, and accuracy, Google Speech-to-Text is definitely among the best in the industry. However, its features are mostly based on language rather than on meaning and inference, which, for now, gives Amazon Transcribe the advantage.
But, let’s move on and take a look at the other two transcription services.
IBM Watson Speech-To-Text
IBM Watson Speech-to-Text can transcribe speech from 7 different languages. However, the service does not support every feature for all 7 languages. For most languages, it offers two models for two sampling rates: broadband and narrowband. It uses broadband for audio that is sampled at a minimum rate of 16 kHz and narrowband for audio that is sampled at a minimum rate of 8 kHz.
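The broadband/narrowband rule can be written down as a tiny helper. This is only a sketch of the rule as described in this post: `pick_model` is a hypothetical function of my own, and the strings it returns are plain labels, not actual IBM model identifiers.

```python
BROADBAND_MIN_HZ = 16_000
NARROWBAND_MIN_HZ = 8_000


def pick_model(sample_rate_hz: int) -> str:
    """Choose a model family from the audio sampling rate."""
    if sample_rate_hz >= BROADBAND_MIN_HZ:
        return "broadband"
    if sample_rate_hz >= NARROWBAND_MIN_HZ:
        return "narrowband"
    # Below 8 kHz neither model's minimum rate is met.
    raise ValueError(f"sample rate too low: {sample_rate_hz} Hz")


if __name__ == "__main__":
    print(pick_model(44_100))  # typical studio or podcast audio
    print(pick_model(8_000))   # typical telephone audio
```

In practice this means telephone recordings usually go to the narrowband model, while most microphone recordings qualify for broadband.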
In addition to basic transcription, IBM Watson Speech-to-Text supports voice control of embedded systems, transcription of meetings and conference calls, and dictation of email and notes in real time.
When it comes to accuracy, IBM Watson speech-to-text pays special attention to:
- Keyword spotting. This feature enables search by a specific keyword. It basically identifies spoken phrases that match specific keyword strings.
- Speaker recognition. This feature is available for audio content in US English, Spanish or Japanese.
- Word alternatives. This feature lets you request alternative words that are acoustically similar to the words in the transcript.
- Word confidence. IBM Watson speech-to-text provides confidence levels for each word of a transcript.
- Word timestamps. The service also provides timestamps for the start and end of each word of a transcript.
- Profanity filtering. This feature censors profanity from US English transcripts.
The IBM Watson Speech-to-Text is priced at $0.02 per minute. This price applies to the use of both broadband and narrowband models.
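Since the three cloud services quote their prices in different units (per second, per 15 seconds, per minute), it helps to convert them to a common one. The sketch below converts the list prices quoted in this post to a price per hour of audio. It looks at list prices only and ignores minimum charges, free tiers, and which features are actually included; `hourly_list_prices` is a name I made up. Nuance is left out because its price is quoted as a one-time purchase rather than per unit of audio.

```python
SECONDS_PER_HOUR = 3600
MINUTES_PER_HOUR = 60


def hourly_list_prices() -> dict:
    """Convert the list prices quoted in this post to USD per hour of audio."""
    return {
        # $0.00056 per second
        "Amazon Transcribe": round(0.00056 * SECONDS_PER_HOUR, 2),
        # $0.006 per 15 seconds
        "Google Speech-to-Text": round(0.006 * SECONDS_PER_HOUR / 15, 2),
        # $0.012 per 15 seconds
        "Google Speech-to-Text (video model)": round(0.012 * SECONDS_PER_HOUR / 15, 2),
        # $0.02 per minute
        "IBM Watson Speech-to-Text": round(0.02 * MINUTES_PER_HOUR, 2),
    }


if __name__ == "__main__":
    for service, price in hourly_list_prices().items():
        print(f"{service:<38} ${price:.2f} per hour")
```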
IBM Watson Speech-to-Text has a wide range of possibilities. When it comes to accuracy, the features above say it all. IBM Watson Speech-to-Text is one of the most accurate transcription services.
However, not all of these features apply to all languages and, more importantly, some of them are available only in beta. Once those gaps are taken into account, IBM Watson Speech-to-Text ends up much more expensive for what it delivers than the previous two transcription services.
Nuance Dragon Transcription
Nuance Transcription Engine can easily transcribe messages and conference calls in 43 different languages. Processing time depends on the length of the message and the traffic on the server.
The service pays special attention to accuracy and for that matter includes the following features:
- Multi-speaker identification. Nuance Transcription Engine can recognize and transcribe up to six individual speakers.
- Customizable language models. This feature is actually very similar to Amazon Transcribe custom vocabulary. It can identify various names using specialized vocabulary tools.
- Intelligent error correction. This transcribe service makes probability‑based suggestions for alternative words when the speech is too unclear to transcribe. This feature is very useful and significantly increases accuracy.
- Timestamps. Nuance Transcription Engine provides fully time‑coded and stamped lines, which make the transcript clearer and make it possible to know who said what, and when.
Pricing for Nuance Transcription Engine starts at $150, and it’s a lifetime deal.
Although this transcription service is one of the most accurate on the market, it differs considerably from the other transcription services included in this comparison.
The major difference is that Nuance Transcription Engine focuses on transcribing voice messages and industry-specific transcriptions.
To be more specific, the Nuance Transcription Engine is one of the best medical transcription software products in the world, if not the best. Unfortunately, this means that if you are not part of that industry, the accuracy of your transcriptions will not be as good as that of medical transcriptions.
Let’s Wrap Up
Research shows that the human brain can remember only 10% of what we read and 20% of what we hear. This underscores the need to derive value from multimedia content. And AI has proven to be the real deal when it comes to transcribing multimedia content.
Capturing and retrieving information from multimedia content using NLP and Speech Recognition has been the goal of Artificial Intelligence giants for the last decade. And they become more sophisticated every year.
In this comparison, I’ve decided to include only four transcription services which, by my research, are the best ones. I used three factors (speed, accuracy, and price) to guide the comparison. Based on these factors, I found that:
- All four transcription services included in the comparison have some distinctive qualities that give them an advantage over the other solutions on the market,
- They are all fast in processing and delivering results,
- They all show high accuracy of transcriptions,
- They all offer acceptable prices.
However, not all of them can equally respond to everyone’s needs. Take a good look at the comparison made above and decide which one will meet your needs best.
We at Armedia decided to rely on AWS and integrate Amazon Transcribe as part of our Armedia Legal Module for ArkCase.
Which choice you make depends on your organization’s requirements.
If you have any questions, do not hesitate to get in touch with us. Our team at Armedia is always at your service.