Caption Everything

Using HTML5 to create a real-time closed captioning system.

November 13, 2013

A few months ago I attended an event where a young woman sat in the back row, frustratedly signing with a friend. She looked bored, idling on her phone, unable to engage or maintain interest because there was no professional interpreter. That struck me.

I thought about it quite a bit and began wondering if it’s possible to use open source apps for real-time speech-to-text captioning. Initial research for relevant libraries turned up empty, until…

The Speech API

This week I learned that Chrome (25+) features an experimental webkitSpeechRecognition API designed to create voice-driven apps like Siri or Glass. It seems optimized for short commands, but I immediately wondered if this could be a solution to create on-the-fly captioning.
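For the curious, the basic shape of it is short. This is a minimal sketch, not production code: the `captions` element id is made up for this example, and the pure `finalTranscript()` helper is my own way of slicing the API's results list.

```javascript
// Pure helper: concatenate the transcript text of all finalized results.
// Works on the array-like SpeechRecognitionResultList (or a plain array).
function finalTranscript(results) {
  var text = '';
  for (var i = 0; i < results.length; i++) {
    if (results[i].isFinal) {
      text += results[i][0].transcript;
    }
  }
  return text;
}

// Browser-only wiring (Chrome 25+, prefixed API).
// "captions" is a hypothetical element id for this sketch.
if (typeof window !== 'undefined' && 'webkitSpeechRecognition' in window) {
  var recognition = new window.webkitSpeechRecognition();
  recognition.continuous = true;      // keep listening across phrases
  recognition.interimResults = true;  // fire as words are recognized

  recognition.onresult = function (event) {
    document.getElementById('captions').textContent =
      finalTranscript(event.results);
  };

  recognition.start();
}
```

Setting `continuous` and `interimResults` is what turns a short-command API into something caption-shaped: you get a stream of guesses instead of one final answer.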

See the Pen Closed Captioning with HTML5 by Dave Rupert (@davatron5000) on CodePen

Chrome does a decent job; I assume this is the same tech behind YouTube captions. Obviously, auto-captioning is not perfect, but it’s a start! And with the current drive from major companies toward Star Trek-like invisible UIs, I can only imagine that dictation features will get better over time. In fact, Google is aware of the imperfections and claims its speech recognition technology gets better every day.

Potential applications

With a direct audio feed from a soundboard1 or a laptop armed with a portable shotgun mic, I sort of feel the possibilities are endless2. Here are a few ideas I could see using the Speech API for:

  • Soundboard-to-web captioning for:
    • High School or College classes
    • Conferences (web or otherwise)
    • Local meetups
    • Religious services
    • Stadiums
    • More
  • In-person transcription over mobile, laptop, or Glass™3
  • IRC-style speaker logs
  • Captioning for WebRTC video chat
  • mp3-to-text transcription services for podcasters
  • Soundboard-to-Audio induction loop system over WebRTC
  • Real-time translation with the Google Translate API

Conferences, meetups, and the hard of hearing

I put together a quick survey to gauge the state of hearing impairment at conferences. Thanks to everyone who participated. Looking at the data, it does seem that our deaf and hard of hearing subcommunities are unable to take full advantage of typical conference/meetup settings.

Do you feel a lack of captioning or sign interpretation keeps you from regularly attending events (both tech and non-web related)?

  • Nope: 4 (14%)
  • Hardly: 3 (11%)
  • Somewhat: 2 (7%)
  • A little bit: 7 (25%)
  • Absolutely: 12 (43%)
Source: docs.google.com

Personally, I can only recall a handful of events in my entire life that have offered captioning or ASL interpreters.

Let’s not be quick to slight or shame conference organizers for “not doing enough”. Every organizer I know would break the bank to try to hire an interpreter if requested.

Interpreters and captioning services, however, carry very real and significant costs for events and organizations, especially those with conference videos4, all-day workshops, multiple tracks, or intentionally low margins meant to make the conference more publicly accessible. Over-demanding could set back the very conferences we seek to build up.

Maybe a low-budget HTML5 Speech API solution of decent quality could open 99% of the doors that would be closed otherwise.

Wordcast: An Experiment

I started a repository called WordCast under the Accessibility Project with the goal of converting that CodePen into a Node/Express/Socket.IO app that can broadcast subtitles. I’ve never built anything with any of these buzzwords I just typed, so I could definitely use some help.

I don’t want to over promise and under deliver, but technically it seems possible to develop an open source app that broadcasts real-time captioning at a low cost of zero dollars. That app could then be used by offices, meetups, churches, conferences, or wherever you see fit.
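The broadcast part might look something like this. To be clear, this is a guess at the wiring, not what WordCast actually does: the `caption` event name, the `makeCaption()` helper, the `startServer()` wrapper, and the `public/` static directory are all invented for this sketch; only the package names (express, socket.io) come from the stack mentioned above.

```javascript
// Pure helper: shape a caption payload with trimmed text and a timestamp.
function makeCaption(text, now) {
  return { text: String(text).trim(), at: now || Date.now() };
}

// Server wiring: one "speaker" page emits recognized phrases, and the
// server rebroadcasts them to every connected audience page.
// (Nothing runs until startServer() is called.)
function startServer(port) {
  var express = require('express');
  var app = express();
  var http = require('http').createServer(app);
  var io = require('socket.io')(http);

  // Hypothetical static dir holding speaker.html and audience.html.
  app.use(express.static('public'));

  io.on('connection', function (socket) {
    // The speaker page pushes phrases from webkitSpeechRecognition...
    socket.on('caption', function (text) {
      // ...and the server fans them out to everyone, speaker included.
      io.emit('caption', makeCaption(text));
    });
  });

  http.listen(port);
  return http;
}
```

The audience page would then just listen for `caption` events and append each payload's `text` to the page, which keeps the client dead simple.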

This could also fail hard. Bad captioning will make or break it. 37% of survey respondents felt poor-quality captioning is essentially worthless, but (and this is where I assume and may be wrong) doing something seems better than doing nothing. I think we could quickly build this out, beta test it at a few small events, get feedback, and see where the idea stands.

I’m not so naive as to believe this couldn’t be done with system-level dictation services, but I don’t know how that stuff works. I do know the shit outta HTML5, and the web platform is the only platform shipping on all devices. That seems like an open pathway toward solving such a universal issue.

If you’re interested in helping, let me know.

Updates

  • I just found out about Conf.io from last week’s NodeKnockout. It’s basically the same concept, tailored for presenting online seminars.
  • Michael Heuberger (@binarykitchen), creator of Videomail.io, a video e-mail app for sign language speakers, connected with me and is interested in connecting with other designers or developers who are deaf and/or hard of hearing: http://binarykitchen.com/contact/
  1. Or some kind of web-enabled Arduino lapel mic?

  2. Even more endless if this ever becomes available on mobile browsers.

  3. For the first time in my life I think I’m excited about Glass. Google, if you want to send me a pair, I’ll accept.

  4. Conference Video #protip: If you upload your conference videos to YouTube, they caption it for you.
