When it Comes to Transcription, Humans are Still Better than Voice Recognition APIs


In baseball, a batter who succeeds 33 percent of the time is an All-Star.  A United States president with a 70 percent approval rating is considered remarkably popular.  The average SAT score for Princeton University students is about 2250 out of 2400 (roughly 94 percent).  The top automated voice recognition programs, by contrast, produce an accuracy rating of around 92 percent.  While the All-Star, the popular president, and the Princeton students are all considered successful in most circles, a voice recognition program that gets it right about 9 times out of 10 won't cut it for businesses and organizations whose transcripts of record must include not only every spoken word, but even the slightest interjection or expressed emotion.
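For context, transcription accuracy is typically measured by word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the automated transcript into a human-verified reference, divided by the length of the reference.  Here is a minimal sketch in Python; the transcripts are invented examples.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate (WER): word-level edit distance between a
    reference transcript and an automated hypothesis, divided by
    the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Invented example: one misheard word and one dropped word.
reference = "the committee will now hear testimony from the first witness today"
hypothesis = "the committee will now hear testimony from the fist witness"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")  # WER: 18%
```

By this measure, the 92 percent figure above still corresponds to roughly one word in twelve being wrong, dropped, or invented, which is why a "mostly right" transcript can fail as a transcript of record.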

Tech giants like IBM, Google, and Microsoft, among several others, continue to develop application programming interfaces (APIs) for accurate voice transcription.  These systems have improved dramatically since their earliest incarnations in the 1970s and now handle fairly clear, highly audible conversational speech well, though not without their share of errors.  But when it comes to muffled or distorted speech, cross-talk among several individuals, or speakers who are far from microphones, these programs pale in comparison to human transcribers.  This is a primary reason why companies like NCC, which offer transcription services produced by experienced transcribers and editors, are still in high demand.

For NCC’s transcription projects, we are given audio or video files that are diverse in content, quality, and speaker setup: from a single, mic’d speaker to a room of a dozen unmic’d speakers frequently talking over each other.  Neither NCC nor our customers are content with a transcript that’s 90 percent accurate, so it’s up to our team of transcribers and editors to translate that less-than-perfect audio into the best verbatim transcript possible.

According to Roger Zimmerman, the Chief of Research and Development at 3Play Media, “Speech recognition technology is not anywhere near human capability and won’t be for many, many years, my guess is decades still.”  APIs run into trouble with disorganized, impulsive speech, which includes hesitations, false starts, and mumbles.  Living, breathing transcribers are much better equipped to deal with these obstacles, as NCC’s transcribers do on a daily basis.  We can adjust audio channels to isolate individual speakers, train our ears to block out excess noise, and slow down or replay particular sections of audio.  Add to that an intimate familiarity with terms unique to politics, health care, finance, law, technology, and other fields, and human transcribers remain a superior option to even the most accurate voice recognition API.
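As a rough illustration of the audio-handling side of that workflow, here is a sketch using Python's pydub library.  The file name and time range are hypothetical, and the _spawn resampling trick is just one common way to slow playback; real projects rely on dedicated transcription software rather than a script like this.

```python
from pydub import AudioSegment

# Hypothetical two-channel recording: one speaker per channel.
recording = AudioSegment.from_file("meeting.wav")

# Split the stereo file into separate mono tracks so each
# speaker can be reviewed in isolation.
left, right = recording.split_to_mono()
left.export("speaker_left.wav", format="wav")
right.export("speaker_right.wav", format="wav")

# Slow a hard-to-hear section (0:30 to 0:45) to 75 percent speed
# by resampling; note this lowers the pitch as well.
section = recording[30_000:45_000]  # pydub slices in milliseconds
slowed = section._spawn(
    section.raw_data,
    overrides={"frame_rate": int(section.frame_rate * 0.75)},
).set_frame_rate(section.frame_rate)
slowed.export("section_slow.wav", format="wav")
```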

Voice recognition technology is advancing rapidly, as applications like Siri, Google Now, Dragon Dictation, and countless others have become commonplace.  Such products work well for simple, everyday use because they are typically discerning commands from a single, clear voice.  However, until these cutting-edge voice recognition APIs can produce a near-perfect transcript of a meeting of 30 unmic’d speakers frequently talking over each other, companies like NCC will be called upon to produce verbatim transcription.
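For that single, clear voice case, off-the-shelf tooling is already quite accessible.  Below is a minimal sketch using the Python SpeechRecognition library; the audio file name is hypothetical.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

# Hypothetical file: one speaker, clean audio, no cross-talk,
# the easy case where automated recognition performs well.
with sr.AudioFile("single_speaker.wav") as source:
    audio = recognizer.record(source)

try:
    # Send the clip to Google's free Web Speech API for transcription.
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    # Muffled or overlapping speech often ends up here,
    # which is exactly where human transcribers take over.
    print("Speech was unintelligible.")
```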

Information from the Wired.com article “Why Our Crazy-Smart AI Still Sucks at Transcribing Speech” was included in this blog.