Can captions be generated automatically using speech recognition?

Date Updated: 04/08/21

The audio content of multimedia presentations is inaccessible to people who are unable to hear. When content is presented aurally, the accessibility solution is captioning, which provides a synchronized text alternative to the audio track. For additional general information about captioning, see How do I make multimedia accessible?

Many educational entities produce large quantities of videos for their distance learning programs, outreach, marketing, and other functions. Also, a growing number of institutions are turning to multimedia as a means of enhancing their Web-based curricula. The cost of captioning all this video and multimedia content has many institutions concerned and exploring their options. Many institutions outsource captioning on an as-needed basis, but they must be careful to ensure they can receive the accessible media in a timely fashion; prompt turnaround often costs extra. Other institutions are developing the expertise to provide captioning in-house.

Researchers continue to explore options for automating portions of the captioning process. Some educational entities and other organizations are using products or services that utilize some degree of automated captioning.

The best-case scenario would be fully automated captioning using speech recognition technology. Unfortunately, current technology is not accurate enough to fully support this approach. However, research and development toward this goal have been fueled by a rapidly growing market for video search and archival systems. In order to archive and index digital multimedia so that users can search its content, at least a portion of that content needs to be text-based. The first company to utilize speech recognition in this market was Virage®, whose VideoLogger™ application used speech recognition to capture text from a video, which it then used to build a structured searchable index. However, because of the accuracy limitations of speech recognition, this tool could not be used to generate entire caption tracks; it was used instead to extract sets of keywords, including only those words that the software could interpret with a high level of confidence.
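The idea of keeping only high-confidence words for a searchable index can be illustrated with a minimal sketch. This is not how VideoLogger was implemented; the RecognizedWord record, the 0.9 threshold, and the small stopword list are all assumptions made for illustration, standing in for whatever per-word output a speech recognition engine might provide.

```python
from typing import Dict, Iterable, List, NamedTuple


class RecognizedWord(NamedTuple):
    """Hypothetical per-word output of a speech recognizer."""
    text: str
    start: float       # time the word was spoken, in seconds
    confidence: float  # recognizer confidence, 0.0 to 1.0


# Assumed list of common words that are not useful as search keywords.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is"})


def build_keyword_index(
    words: Iterable[RecognizedWord], min_confidence: float = 0.9
) -> Dict[str, List[float]]:
    """Map each high-confidence keyword to the times it was spoken."""
    index: Dict[str, List[float]] = {}
    for word in words:
        term = word.text.lower()
        # Discard low-confidence guesses and common filler words.
        if word.confidence < min_confidence or term in STOPWORDS:
            continue
        index.setdefault(term, []).append(word.start)
    return index
```

A search for a keyword in such an index would return the points in the video where that word was recognized. Because uncertain words are dropped, the result is useful for search but far too sparse to serve as a caption track.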

The first step in captioning multimedia is creating a transcript of the audio content. Speech recognition technology has become a widely used tool for transcriptionists. In a process called shadow speaking, the transcriptionist (who has trained the speech recognition software to understand his or her speech) simply speaks along with the audio, repeating what the speaker is saying. Transcriptionists who are creating transcripts to be converted into captions will typically use an off-the-shelf speech recognition product such as Dragon NaturallySpeaking.

If a transcript already exists, products or services like CaptionSync™ by Automatic Sync Technologies can effectively use speech recognition to create captions from the existing transcript. This is possible, whereas fully automated captioning is not, because the speech recognition engine only needs to identify when a known word or phrase was spoken, which is a much easier task than identifying what was spoken. CaptionSync is provided as a web-based service, where customers upload a video file and transcript, and within minutes receive a caption file via email.
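Once each transcript word has a known time, turning the transcript into a caption file is largely a formatting task. The sketch below assumes the alignment step has already produced a start and end time for every word; it does not perform speech recognition and is not CaptionSync's implementation. The AlignedWord record, the function names, and the 32-character cue limit are assumptions chosen for illustration. The output uses the WebVTT caption format.

```python
from typing import Iterable, List, NamedTuple


class AlignedWord(NamedTuple):
    """Hypothetical result of aligning one transcript word to the audio."""
    text: str
    start: float  # seconds
    end: float    # seconds


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def words_to_webvtt(words: Iterable[AlignedWord], max_chars: int = 32) -> str:
    """Group aligned words into short cues and render them as WebVTT."""
    cues: List[List[AlignedWord]] = []
    current: List[AlignedWord] = []
    for word in words:
        candidate = " ".join(w.text for w in current + [word])
        # Start a new cue when adding this word would exceed the limit.
        if current and len(candidate) > max_chars:
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    lines = ["WEBVTT", ""]
    for cue in cues:
        start = format_timestamp(cue[0].start)
        end = format_timestamp(cue[-1].end)
        lines.append(f"{start} --> {end}")
        lines.append(" ".join(w.text for w in cue))
        lines.append("")  # blank line separates cues
    return "\n".join(lines)
```

Each cue takes its start time from its first word and its end time from its last word, so the caption text stays synchronized with the audio track. A production tool would also break cues at sentence and phrase boundaries rather than on length alone.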

For more information about what to consider when making a video that is accessible to all viewers, view the video Making Videos Accessible.