How GPT-4 and Text-to-Speech Make Visual Content More Accessible

When OpenAI publicly announced GPT-4 in March this year, it had already been made available to a small yet diverse list of collaborators. The group was hand-picked to showcase the large multimodal model’s capabilities, spanning use cases from education to fraud detection, and language preservation to accessibility.

Danish startup Be My Eyes was among these early adopters. It first launched its eponymous app in 2015, connecting users in need of visual assistance with sighted volunteers through a live video call. Instead of relying on a small group of family and friends to assist with everyday tasks such as checking an expiration date or locating a power socket in a hotel room, people who are blind or have low vision can use Be My Eyes to access a veritable army of helpers.

Now, thanks to GPT-4’s ability to analyze and interpret images as well as hold a conversation, Be My Eyes has built a digital visual assistant it hopes to release later this year. This Virtual Volunteer, currently in beta testing, further promotes autonomy among the blind and low-vision community by providing users with an effective alternative to human help.

While there are already impressive visual assistance tools such as Seeing AI, Lookout, or TapTapSee, they do not seek to mimic human interaction. GPT-4 is able to extrapolate from context to identify and communicate relevant information as well as engage helpfully with user responses. This allows the Virtual Volunteer to, for example, analyze a photo of the inside of your fridge and suggest some meal options, even providing recipes and instructions if desired.
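To make that concrete, the sketch below shows roughly how a photo and a question might be sent to a vision-capable model through OpenAI’s API. The model name, prompt, and image URL are illustrative assumptions; Be My Eyes has not published the Virtual Volunteer’s actual implementation.

```python
# A minimal sketch of asking a multimodal model to describe a photo and suggest
# meals, in the spirit of the Virtual Volunteer. Purely illustrative; the model
# name, prompt, and image URL are assumptions, not Be My Eyes' code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model; use whichever is available
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What could I cook with what is in this fridge? "
                    "Suggest a simple recipe with instructions.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/fridge.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```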

‘Seeing’ with Speech

As exciting as that sounds, making this text-based information accessible relies on technology that falls outside GPT-4’s wheelhouse. Many people who are blind or have low vision use specialized software called screen readers, which render on-screen text as synthesized speech via text-to-speech (TTS) or as braille output. This assistive technology is designed to convey the visual elements of a display through non-visual means, allowing users to access content through hearing or touch.

Most devices come equipped with built-in screen readers and TTS voices for accessibility, but standalone products such as JAWS and NVDA are also popular among PC users. For mobile devices, native screen readers such as Apple’s VoiceOver and Google’s TalkBack predominate.

Interestingly, the increasingly human-like TTS voices that are so desirable for AI dubbing and voice assistants are not necessarily the best choice for accessibility. The very qualities that make these voices pleasant to listen to at a normal speech rate can make them difficult to parse at the high speeds typically employed by people who are blind or have low vision for the sake of efficiency (often faster than an auctioneer!). Even ‘clever’ features such as correctly pronouncing misspelled words can be annoying for users relying on TTS to identify typos while proofreading.
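As a rough illustration of that speed trade-off, the snippet below uses the open-source pyttsx3 library (an assumption chosen for demonstration; commercial screen readers ship their own TTS engines) to read the same sentence at a conversational rate and then at a much faster, screen-reader-style rate.

```python
# A small sketch of the speech-rate trade-off using the offline pyttsx3 library.
import pyttsx3

engine = pyttsx3.init()
default_rate = engine.getProperty("rate")  # typically around 200 words per minute

sample = "The expiration date on this carton is the fourteenth of July."

# Conversational rate.
engine.setProperty("rate", default_rate)
engine.say(sample)

# Roughly auctioneer territory; experienced screen-reader users often go faster.
engine.setProperty("rate", int(default_rate * 2.5))
engine.say(sample)

engine.runAndWait()
```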

Essential for Some, Useful for All

Apart from access to a speech synthesizer, screen readers rely on accessible design, such as alternative text for images and semantic markup of structures like headings and menus, to accurately convey what is displayed on a screen. The World Wide Web Consortium’s Web Accessibility Initiative provides guidance on designing for compatibility with screen readers, which also benefits automatic translation and search engine indexing.
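For a sense of what those hooks look like in practice, the hypothetical check below uses the requests and BeautifulSoup libraries to flag images without alternative text and to list a page’s semantic headings, two of the elements a screen reader depends on. The URL is a placeholder.

```python
# A rough sketch of auditing two accessibility hooks screen readers rely on:
# alt text on images and a semantic heading structure.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

# Images without alternative text are effectively invisible to a screen reader.
missing_alt = [img for img in soup.find_all("img") if not img.get("alt")]
print(f"{len(missing_alt)} image(s) missing alt text")

# Semantic headings let screen-reader users jump through a page's structure.
for heading in soup.find_all(["h1", "h2", "h3"]):
    print(heading.name, heading.get_text(strip=True))
```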

Making content simple to render as speech improves accessibility not only for the blind and low-vision community, but also for people with dyslexia and other cognitive disabilities, those who cannot read the written language, and even those who prefer to listen so they can multitask. As the Consortium repeatedly states, web accessibility is “essential for some, useful for all.”

Although it is no excuse for poor web and app design, multimodal models may offer another way to make such content accessible. Be My Eyes experimented with training GPT-4 to understand the structure of web pages and found it was able to learn which parts to read or summarize, a feature that could help people who are blind or have low vision navigate even cluttered or poorly designed websites.
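A hypothetical sketch of that idea, assuming API access to a capable model, might simply hand over a page’s extracted text and ask which parts are worth reading aloud. The prompt wording and model name below are placeholders, not Be My Eyes’ method.

```python
# An illustrative sketch: extract a page's text and ask a model what a
# screen-reader user would actually want read aloud. Placeholders throughout.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

page_text = BeautifulSoup(
    requests.get("https://example.com", timeout=10).text, "html.parser"
).get_text(separator="\n", strip=True)

summary = client.chat.completions.create(
    model="gpt-4o",  # assumed model
    messages=[
        {
            "role": "user",
            "content": "Summarize the main content of this web page and list its "
            "navigation options, ignoring ads and boilerplate:\n\n"
            + page_text[:8000],  # keep the prompt to a manageable length
        }
    ],
)
print(summary.choices[0].message.content)
```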

The Text-to-Speech Bottleneck

While the combination of image processing, conversational AI, and screen readers can vastly improve visual accessibility, language support is ultimately limited by the availability of TTS voices (braille codes do not exist for all languages). At the time of writing, Apple’s VoiceOver and Google’s TalkBack each supported around 60 languages and dialects, and Microsoft’s Narrator around 50.

The elegance of Be My Eyes’ original micro-volunteering concept is that by relying on human ability, it could quickly expand to over 150 countries and 180 languages. Perhaps its Virtual Volunteer will encourage big tech to expand TTS support as well.