Microsoft Claims Human Parity Reached in Speech Recognition

Researchers at Microsoft Artificial Intelligence and Research announced yesterday that they have achieved a significant breakthrough: a speech recognition system that transcribes speech on par with humans.

The company’s speech recognition software has achieved “human parity,” claims Xuedong Huang, the company’s chief speech scientist. The software can transcribe human speech with the same number of errors as a human transcriber, or fewer.

As further detailed in their research paper, the researchers used a combination of neural network models (convolutional and LSTM neural networks), together with a novel spatial smoothing method and lattice-free MMI acoustic training. These models can learn not only the sounds of words but also how words relate to one another, which allows them to generalize efficiently. The researchers also used the Computational Network Toolkit (CNTK), open-source software available on GitHub that can run deep learning algorithms across multiple computers, so that models can be trained and language data processed faster.

The researchers reported that, measured against human transcribers, the software’s word error rate (WER) was 5.9%, down from the 6.3% reported last month. It is the first time the rate has fallen below 6%, and the first time a computer has achieved a word error rate lower than a human’s. The researchers also stated that this is the “lowest ever recorded” error rate on the industry-standard speech recognition test.
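For readers unfamiliar with the metric, word error rate is conventionally calculated as the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not Microsoft's evaluation code; the function name `word_error_rate` and the simple whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Illustrative sketch: words are split on whitespace and compared exactly,
    with no text normalization (an assumption, not the benchmark's procedure).
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level edit distance, computed one row at a time.
    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # reference word missing from hypothesis
                current[j - 1] + 1,                   # extra word in hypothesis
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current

    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

In the usage example, the hypothesis drops one of the six reference words, so the sketch reports one error in six words. A rate just under 6% therefore corresponds to roughly one wrong word in every seventeen.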

Speech recognition has always been a challenging area in the AI field. Although this marks a major achievement in speech recognition, the “human parity” claim is not quite accurate. The research shows that while the software’s error rate is similar to or lower than that of humans, it can still make significantly different mistakes. The researchers acknowledge that the software still struggles in adverse conditions, such as background noise, and with distinguishing between individual speakers in a group of people. Human listeners are able to make such distinctions, and addressing these errors will be part of Microsoft’s long-term efforts to improve its neural network algorithms.

While Microsoft has not officially announced plans to integrate this software, yesterday’s announcement suggests that improvements may be coming to some of the company’s speech recognition products, such as Cortana.
