UK startup Papercup, creator of what its founders call a human-in-the-loop video translation tool — in short, partially-automated dubbing — has raised GBP 8m (USD 10.87m). The round was led by two VC funds, London-based LocalGlobe and Virginia-based Sands Capital Ventures, according to a December 10, 2020 press statement.
Papercup CEO Jesse Shemen told Slator that their main existing investors also participated in the latest round “and we added BDMI to help navigate the media world.” BDMI is part of media company Bertelsmann. “In total we have raised about USD 14m in funding,” he said.
Shemen and CTO Jiameng Gao founded Papercup in 2017, building a tool that combines speech synthesis systems with a video translation pipeline. “Our first major achievements were in 2018 when we were able to perform cross-lingual speaker adaptation: voices that sound like the original speaker,” according to Shemen. The company now has patents pending in several regions and is currently “improving our voices to reach human-level on the naturalness scale.”
While the founders homed in on automated dubbing as a use case, they “intentionally use the human-in-the-loop quality checking feature,” Shemen said. During Papercup’s early days, the duo received lots of demand for a range of applications — translating live video conferences, UN speeches, audiobooks, and even music.
According to Shemen, “What became clear, after just a few conversations, was an unsolved problem in video that had unusually large potential. We discovered that there is virtually an uncapped amount of video content that is shackled in a single language.”
He pointed to the vast ocean of YouTube videos, Netflix content, podcasts, online courses, and so on, saying, “The idea that we could unlock all of this content for anyone in the world was too compelling a path to turn down.”
How Papercup Works
A user selects a video to upload to the Papercup platform, or the team does it for the client. The video is then processed via an end-to-end pipeline, which includes generating synthetic voices using Papercup’s proprietary speech synthesis systems. To show what the tool can do, Papercup released a Sky News clip to the press and posted an interview with actress Kristen Stewart “speaking” Spanish on their LinkedIn page.
Shemen explained: “We employ a human-in-the-loop process to make corrections and adjustments to the translated audio track. This includes correcting for any speech recognition or machine translation errors that come up, making adjustments to the timings of the audio, as well as enforcing emotions (e.g., happy, sad, angry) and changing the speed of the generated voice.”
The Papercup team is now focused on “better retainment and transfer of the original emotion and expressiveness” across many languages, alongside figuring out what exactly makes for quality dubbing.
The next step, Shemen said, is speaker adaptation, that is, “capturing the uniqueness of someone’s voice. This is the last layer of adaptation, but it was also one of the first breakthroughs in our research. While we have models that can accomplish this, we’re focusing more of our time on emotion and expressiveness.”
The Papercup CEO added, “In the future, we imagine that this will also work in parallel with lip-syncing, so that the audio would match the lip-movements and vice-versa.” (As reported by Slator, lip-sync tech got a small boost when Synthesia closed a USD 3m round in 2019.)
From Synthetic Dubbing to Multilingual Human Conversation
Founders Jesse Shemen and Jiameng Gao met at startup accelerator Entrepreneur First (EF) in 2017. Members have to pair up to be eligible for the final weeks of the EF program, which culminates with a pitch to an investment committee. Shemen recalled how during the early weeks at EF, Gao could not find a co-founder. The reason: “He was incredibly stubborn about working on voice translation.”
Shemen added, “I quickly figured out how incredibly smart and oddly obsessed with speech processing he is. So much so that he completed two Masters at Cambridge in machine learning and speech language technology, and wrote a thesis on speaker-adaptive speech processing.”
Two examples of Gao’s research are published on arXiv: Real-time text-to-speech (“hopefully paving the way for real-time simultaneous voice translation down the line,” Shemen said) and breaking down phoneme units into phonological features (“which we’ve shown could be helpful to synthesize speech in a new language” even with no prior knowledge of that language).
At EF, the duo quickly buckled down to build a prototype that would be applicable in a production environment. By September 2018, they had closed a GBP 2m round led by LocalGlobe, with participation from British media outfit Sky, Guardian Media Ventures, EF, and angel investors.
Among the latter: William Tunstall-Pedoe, who founded Evi, which Amazon acquired to create Alexa; and Zoubin Ghahramani, former Chief Scientist and VP of AI at Uber, who is now on the Google Brain leadership team.
Papercup landed two Innovate UK grants as well, awarded in 2018 and 2020, worth a total of GBP 600,000. They also brought in two professors as science advisers: Simon King, Professor of Speech Processing at the University of Edinburgh, and Mark Gales, of the Cambridge Machine Intelligence Lab.
Papercup now comprises a team of 20 and “we certainly hope to add a handful more people across machine learning and engineering,” Shemen said, adding that theirs is “a growing team and is effectively very similar to the structure that Rev, Unbabel, and the like use.” (Co-founder and CTO Jiameng Gao referenced Unbabel in a recent Papercup blog post.)
“The ultimate aim with our technology is simple,” Shemen told Slator. “We want to make all voice-based content, whether a TED lecture or live news coverage of the Olympics, consumable in any language. But we also want to take it a step further. Down the line, we think we’ll be able to extend our technology to human conversation — allowing any two people to speak with one another regardless of what language they happen to speak. In other words, your voice in another language.”