The two talk about the underlying components of the Papercup workflow and outline the roles that technology — speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) — and humans play in the creation of multilingual videos.
Simon, a Professor of Speech Processing at the University of Edinburgh, discusses the evolution of text-to-speech technology, the main technical hurdles in producing highly natural, emotional voices, as well as the adoption and acceptance curve for synthetic voices.
Jesse shares some of Papercup’s company milestones, which include raising a total of ca. USD 14m in seed and series A rounds. He also explains why there is room for many different startups in the multilingual speech and video translation space.
While Papercup has an ambitious goal of making videos accessible in any language, Jesse says startups will likely expand the market rather than replace traditional dubbing, particularly for high-end production environments.
First up, Florian and Esther discuss the latest language industry news — with a multilingual speech technology slant this week. The duo touch on NVIDIA’s real-time MT offering, a mouse that transcribes and translates your voice at the press of a button, and Microsoft’s USD 19.7bn acquisition of AI speech technology firm Nuance.
In language industry-adjacent funding, the two discuss data-for-AI leader and Appen rival Scale, which doubled its valuation (to a whopping USD 7bn) after announcing they had raised a further USD 325m in funding.
Returning to the core of translation and localization, they talk about signs of a boom in the language industry, pointing to Super Agencies reporting strong results, anecdotal evidence from busier-than-ever LSP staff, a soaring Language Industry Job Index, and RWS shares (SlatorPro) that serve as a bellwether given the company’s broad sector exposure.
Florian: Jesse, tell us about Papercup.
Jesse: What we want to do is make the world’s video and audio consumable in any language, and we are doing that by creating synthetic voices in target languages that reflect the type of person speaking in the original video. Think about the most basic form, like a Sky News clip on how London is cloudy today, a clip that is uploaded every day. If that is in English, uploaded and distributed on YouTube, we can take that same clip and translate it with a synthetic voice into Spanish so that the Spanish-speaking population can watch that content. The idea is we want to apply that across all languages and across any form of audio and video content. That is the dream.
Florian: Tell us a bit more about the team, where are you based? What has been the historical trajectory of the company so far?
Jesse: People are always surprised to hear that we do not make physical paper cups. The original logo is two tin cans, the most elementary form of communication. We want to facilitate dialogue and conversation and allow consumption of content and media, no matter what language somebody speaks, so that is always the premise behind what we are doing. We saw that there were billions of hours of content stuck in a single language. When any form of video or audio is produced, recreating it in another language is usually so expensive, laborious, and complex that it is hard to justify. We realized that cannot be solved by humans alone and that technology would be a fundamental player, and so what we wanted to do is use synthetic speech to allow, again, any form of content to be consumed in any language.
That is where we started out and so we started the company a little over three years ago with my partner, Jiameng, who studied Speech at Cambridge and did a master’s in Speech Processing. We are now a team of around 20 people, primarily based in London filled with people that studied machine learning, as well as software engineers, and people on the customer and commercial side.
Florian: Simon, tell us a bit more about your background on the speech processing side, the University of Edinburgh. It is a very renowned center for AI in language?
Simon: Edinburgh has one of the largest groups in that broad area of language processing, so that includes machine translation, speech recognition, transcription, and then what I do, which is speech synthesis, generating speech from text, as well as all the other areas of natural language processing around that. We have over 100 people researching those areas in Edinburgh, plus lots of PhD students. We run specialist masters programs and we have been developing this speech synthesis technology since 1984. I have been there since 1993 and have stuck it out long enough to become the director of the group.
Florian: Tell us a bit more about how the underlying technology is built. What do you buy, what do you build?
Jesse: Let us take a sports commentary on the NBA, my favorite topic. If you have a video that is originally in English, we take that file, pump it through our pipeline, and that basically has three fundamental technologies. The first is what is called speech recognition, which takes the audio from the actual video file and converts it into text, so in this case English audio to English text. Then we use machine translation systems to translate that original text from English into, say, Spanish or German, and then we use our own speech synthesis system to generate the voice based on that translated text.
We have not invented any of these steps, and the first two we typically outsource because transcription and translation are big problems that the tech giants and other companies are trying to solve. Then we use our own text-to-speech system to generate the synthetic voices. A big part of the Papercup product is piecing together that end-to-end pipeline and handling the complexity involved in it, as well as, more specifically, the text-to-speech system, the last leg of the pipeline, which is bespoke to us.
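The three-stage pipeline Jesse describes can be sketched as a simple chain of functions. This is a minimal illustration, not Papercup’s actual code: every function name and return value below is a hypothetical placeholder, with the first two stages standing in for outsourced ASR and MT providers and the third for an in-house TTS engine.

```python
# Illustrative sketch of the three-stage dubbing pipeline (ASR -> MT -> TTS).
# All names and return values are hypothetical placeholders.

def transcribe(audio: bytes) -> str:
    """Stage 1: speech recognition — source-language audio to source-language text."""
    return "London is cloudy today"  # placeholder transcript

def translate(text: str, target_lang: str) -> str:
    """Stage 2: machine translation — source text to target-language text."""
    translations = {"es": "Londres está nublado hoy"}  # placeholder lookup
    return translations[target_lang]

def synthesize(text: str, voice: str) -> bytes:
    """Stage 3: text-to-speech — target-language text to synthetic audio."""
    return f"<audio:{voice}:{text}>".encode()  # placeholder for a waveform

def dub(audio: bytes, target_lang: str, voice: str) -> bytes:
    # Errors propagate: a bad transcript yields a bad translation, which is
    # why a human-in-the-loop check sits between these stages in practice.
    source_text = transcribe(audio)
    target_text = translate(source_text, target_lang)
    return synthesize(target_text, voice)
```

In a real system each stage would also carry timing and confidence information, and a human reviewer would check the transcript and translation before synthesis, since an early error corrupts every later stage.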
In my mind and obviously, I am biased, but I find the emotional component most interesting because text, not that it is rudimentary, only has so many dimensions to it in terms of how it can be interpreted whereas speech and voice can have incredible dynamism to it. There are so many different contours and manifestations of how you can vocalize even a single word or sentence. It is so interesting to think about how to create that appropriate voice in any language and there are no bulletproof answers. It is an incredibly challenging problem.
Simon: Another way to think about it is that the synthesis needs to be bespoke to this particular use case, whereas the transcription just needs to work. It just needs to transcribe the words correctly, and once that is done, there is not much left to improve. That is what is good to buy in, but the synthesis needs to be tailored, needs to be controllable, and needs to have a speaker identity appropriate to the picture, and all of those things. That is why you would want to build the synthetic voice part yourself before you tried to build the other parts, which can be very general-purpose.
Florian: Is machine translation a major issue for you at this point? Or is it more transcription level where at some point it is just going to be a somewhat accurate translation?
Jesse: The errors do propagate because we are dealing with three systems here, so if you have a poor transcription then your machine translation might talk about a completely different topic, which I do not think your customer would be thrilled with. This is also why we have built what is called a human-in-the-loop system that allows us to quality check the translation and the videos before they are generated, because you cannot rely on the equivalent of a Google Translate. Maybe for getting directions to a coffee shop around the corner, but for topics like coronavirus, it is obviously not reliable. You are right to point out that we are also tied to the progress and the quality that the transcription and translation providers can generate. That being said, they have obviously made progress over the years, and they are certainly in a far better position to solve the final percentage points of optimization and accuracy than we are, so we do not even try.
Esther: Once you have got this human in the loop, the process and the tech down, how do you go about patenting it and protecting it from outsiders? What is it that you can begin to patent once you have come up with that?
Jesse: From my perspective, ASR and machine translation have matured to a degree where there have been a lot of commercial applications that have exploited it to a reasonable degree. For example, take ASR, you have Otter and some other tools that take transcription notes on Zoom calls and you have providers like Rev.com. These models make document, website or customer support translation simple. When it comes to text-to-speech, it is still relatively early to compare it to those technologies in terms of their commercial exploitation. Not because there is not a market to pursue, but because the quality of text-to-speech only recently hit this naturalness level, such that it could be more widely consumed across different forms of use cases, whether that is an article reader or for audiobooks. There is still a large amount of technical progress and change that needs to be made in the world of text-to-speech and as a result of that, the commercial applications are still incredibly early in terms of how speech can be exploited. Simon, before I answer the patent question, do you agree with that?
Simon: Yes, you certainly timed the founding of the company perfectly in terms of that quality jump in synthesis. If we look back over my career in synthesis, you pass various milestones. There was one a while ago now where we could say synthesis was as intelligible as human speech. That was a major event when it happened. Before that it was less intelligible, so that got solved, and then quite a lot of time later, we can almost say that naturalness got solved, in the sense that sometimes people have an equal preference for natural speech and synthetic speech under some circumstances.
Now we have moved on from solving the basics into what do you really want to do with synthetic speech? Do you want to make it expressive? Do you want to make it sound like a particular person? Leading into that, I am sure Jesse can try and patent and protect all sorts of things, but there is actually quite a lot of craft in making it, and the technology is available in the open literature and the code is indeed available. Pretty much anyone can have a go at it, but it is quite a lot of craft to get the most out of it, having the right data behind it, which is proprietary to the company, having people who really care about detail and spend a lot of time listening to it and so on. That is not patentable stuff, that is know-how.
Jesse: That is very true. I also think unless you are a tech giant patents are often overstated or assigned more value than they probably should be at such an early stage. I am not suggesting that we should not file or pursue them, but it feels more like a hygiene factor that people are waving around rather than something that has practical value. Unless you have an army of lawyers that can defend and litigate, which is just not the interest in the startup, because there are a few things that you can do and focus on and one of them is usually not going to court. You want if anything to stay out and just build a product that people want.
Esther: Simon, you talked about some of the developments in text-to-speech but how have you seen it evolve?
Simon: Yeah, in a way the problem has always been the same. The goals have always been the same, which is nice. It is a question of crossing various milestones. What is the first thing you want? You want to be able to understand what this machine is saying, and so for a long time intelligibility was a problem, and then that got solved. Ever since then we have been in a machine learning paradigm. Almost since the seventies, let us say, speech synthesis has involved taking data, taking recordings, usually of one person, lots of them.
Esther, if you want a synthetic voice we are going to harvest all of these podcasts. You have got beautiful audio quality so we will just go and chop out all that audio and get the transcriptions. Probably hundreds of hours there across the history of this podcast. That is enough to make really good quality voices. It is always based on good data in the studio. What has changed has been the use of deep learning neural networks. That is the latest innovation, but internally there was not an enormous paradigm shift because before we were just using other sorts of machine learning, and before that some other sorts of machine learning. For us, it is just the latest sort of machine learning and let us do this again in 10 years and laugh at those neural networks we were using and say, what an old model that was, now we are using something else.
Jesse: Would you say that there has been progress over the past two to five years relative to what it was five to 10 years ago? Is there more attention paid to text-to-speech?
Simon: It has been steadily getting better. Compare it to something like speech transcription, that crossed various thresholds. There was a time when it was so bad, you would not use it for anything and then it crossed the threshold. It is now good enough to dictate a letter. Then it crosses the ‘I am willing to have subtitles made on YouTube for it’. Then it crosses the usability threshold and that is not because suddenly the error rate halves, the error rate has been steadily going down and it happens to cross some threshold where it meets a certain product, viable or not viable. I think synthesis is the same as you are steadily getting better and it suddenly hits this acceptance. At the same time, maybe people’s acceptance changes as well because we are so used to listening to quite cruddy audio on Zoom calls and Skype and all the rest of it. We are probably quite tolerant of some of the nastiness in speech synthesis.
Esther: How do you go about removing some of those hurdles for acceptance? Is it a technical thing or is it an education thing?
Simon: In the case of people who are visually impaired like blind computer programmers, if you have ever met one you will be unbelievably impressed that they can write computer code without seeing anything on the screen. For something like that, because they are very highly motivated and they probably do not care about naturalness, they care an awful lot about speaking rate because that is the bottleneck. Speaking is slower than reading and they have always been willing to learn so they are willing to adapt to quite bad speech synthesis.
Their acceptance level is in a completely different place to someone who is listening to an audiobook for pleasure so synthesis crossed that threshold a long time ago. The moment it was at all understandable people said, I will use that. Nobody was listening to audiobooks with this 1980s speech synthesis. That would be fatiguing, to say the least. I think people might move their thresholds around over time as well and people’s quality threshold might lower because they realize if I am just willing to listen to slightly less good audio now I can do all these things.
Jesse: I think it is a big education piece that we have to embark on because the substitute and alternative is the human voice and so that is where the expectation lies. Synthetic voice, especially in its current form today will consistently underperform, whatever the expressivity and the type of voice you use with natural humans. I think part of what we have to do is explain to people, yes, there is a differential, but the question is not whether or not there is a difference. The question is whether or not that difference will be accepted. Oftentimes you are surprised that people will accept a lower quality voice. What you have to do is just trial and test it and see what the engagement retention looks like.
Florian: There are zero hours of me speaking my native language, German, on YouTube, so imagine you took my accented English, ran it through your models, and then had that voice speaking German. Has that been tried?
Simon: Absolutely. That is one of the USPs of Papercup. At the moment we will get someone that sounds appropriate for what you look like, because the person watching the dubbed video does not know what the real you sounds like, so it is you that cares most that it is appropriate. It might not be a perfect match, but this idea of personalizing synthesis has been a very long-standing interest for my group.
The most recent spin-out from my group is making voices for people who cannot speak or are going to lose their voice, for example because they have motor neurone disease. In that case, you are personalizing it to them. That is within languages, but the same technique works across languages. I hate the word voiceprint, but it captures it for most people. We can distill your voice down into some numbers, and these numbers, when fed into the synthesizer, will make it sound like you. The system, the one model, can generate many voices.
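Simon’s “voiceprint” idea — distilling a voice into a small vector of numbers that conditions a single multi-speaker model — can be illustrated with a deliberately trivial sketch. Real systems learn the embedding with a neural speaker encoder; the averaging and the `MultiSpeakerTTS` class below are made-up stand-ins.

```python
# Toy illustration of the "voiceprint" concept: a speaker's recordings are
# distilled into one fixed-size vector, and a single multi-speaker
# synthesizer is conditioned on that vector. The maths here is deliberately
# trivial; real systems use a learned neural speaker encoder.

from statistics import mean

def speaker_embedding(frames: list[list[float]]) -> list[float]:
    """Collapse per-frame acoustic features into one fixed-size vector."""
    dims = len(frames[0])
    return [mean(frame[d] for frame in frames) for d in range(dims)]

class MultiSpeakerTTS:
    """One model, many voices: the output depends on the conditioning vector."""
    def synthesize(self, text: str, embedding: list[float]) -> str:
        voice_id = round(sum(embedding), 2)  # stand-in for voice identity
        return f"<audio voice={voice_id}: {text}>"

frames = [[1.0, 3.0], [3.0, 5.0]]   # pretend acoustic features from recordings
emb = speaker_embedding(frames)     # -> [2.0, 4.0]
tts = MultiSpeakerTTS()
tts.synthesize("Hallo Welt", emb)   # same model, different vector = different voice
```

Because the text and the embedding are independent inputs, the same vector can condition synthesis in another language, which is the cross-language transfer Simon describes.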
Jesse: Something that is interesting is that this is still uncharted territory, and what you do see is still a lot of experimentation and testing, just seeing what people respond to. There is no right answer to what you should sound like in another language. I do not think that has necessarily been solved. What is interesting is how in the dubbing industry, where we have spoken to a ton of agencies and voiceover artists, they will try to mimic certain aspects of the original actor in the original film or TV show. You cannot just assume that it should be that distinct voice in the target language; instead, at least at this point, it is about getting something comparable and similar to it.
Esther: In terms of the potential use cases you have got YouTubers, corporate education videos, some unscripted content. Then when we are thinking about audiovisual content, it goes right the way up to these premium Hollywood productions. How do you view that continuum of customers or of video use cases? How do you see that adoption curve?
Jesse: First of all, it is exciting because it is all untapped and that is why I think there are going to be a bunch of successful startups and companies in the field of speech because it is still so under-exploited today. You are right that there is a continuum for video content, but equally for all other forms of text-to-speech applications, even article readers, voiceovers, audiobooks or call center automation systems. There are so many different applications to pursue, which is why I think the companies that will be successful are the ones that figure out the combination of product as well as the text-to-speech engine that underpins it. You need that as a bare minimum, as a prerequisite for whatever you are building. There also needs to be a functional use case that you are actually solving for and the product needs to appeal to that.
There is a spectrum of content that is really interesting and it starts with probably your semi-professional, so a YouTuber who uploads tech reviews of smartphones, just as a side hobby, all the way to studios. Each of them will have a different set of requirements, not only in terms of voice quality and the level of expression that you can add but equally in terms of the number of types of languages, accents and dialects. The way in which you actually look at that spectrum is what do people need? What do they want? What are they asking for? That is how you can figure out which use case you can target depending upon what you are able to offer at that moment in time.
Esther: Have you defined that for Papercup yet? Is there a segment where you are seeing quite a bit of traction?
Jesse: That is part of the difficulty, to be honest. You see varying demand from a ton of different types of use cases and it is our job to make sure that we are focused on being more exclusive to one or the other. Part of the exercises that we are undertaking now is which ones do you want to be solely focused on because we are getting pulled in a lot of different directions. Certainly, we want to stay within the world of video content and I would say focusing on the studio side of things is probably still too ambitious relative to what top-performing voice actors do. I think the top-performing titles that require a genuine performance will still be exclusive to humans and I do not think we are trying to replace that in any sense. We are probably on the earlier side of the spectrum, for lack of a better way of describing that.
To me what is most enticing is not what is dubbed today, but what is not translated or dubbed or localized because of the traditional infrastructure and process. To me, that is what is most exciting. We are not in this game to try and replace the dubbing industry. I am fine with it existing by all means but there are literally billions of hours of content that are untouched because they cannot necessarily afford the traditional method of localizing.
Florian: What is hard about extending your product into more and more languages? Is this easy or is this a major project with milestones?
Simon: This is a journey that every text-to-speech company has taken. Probably not one of them has set out to do one language only. It is a question of how you add another language. It takes data and it takes some language expertise. That does not mean you need to be a native speaker. You need some linguistic knowledge of how this language behaves and how it is different from or similar to other ones. All the writing systems have their own idiosyncrasies, so you need this linguistic expertise. Those are the sort of people that do the masters programs in Cambridge or Edinburgh. You need someone with that skill. To scale across languages often involves hiring a person with some knowledge of that language, or having a team that has it between them, and then going off and getting the data.
The primary data is recordings of a native speaker of that language in a studio, at sufficient quality, in the tens or hundreds of hours that you need. Then unfortunately that essentially repeats for each language. There is a sort of fixed cost there. You cannot get away with not having that recorded speech, and you have to have the linguistic process that takes masses of text and turns it into pronunciations, and that is repeated for each language. The actual software code, the methods, and the models are the same everywhere. There is a substantial fixed cost for each language.
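The per-language split Simon describes — shared synthesis code plus a language-specific front end that turns text into pronunciations — might look like this in miniature. The tiny lexicons are illustrative stand-ins for the real grapheme-to-phoneme resources and linguistic rules that constitute each language’s fixed cost.

```python
# Sketch of the per-language split: the front-end code is shared across
# languages, but each language needs its own pronunciation resource built
# with language-specific expertise. These tiny lexicons are illustrative
# stand-ins for real grapheme-to-phoneme systems.

LEXICONS = {
    # Each new language repeats this fixed cost: a pronunciation lexicon
    # (plus recorded studio speech, not shown here).
    "en": {"hello": "h ə l oʊ", "world": "w ɜː l d"},
    "es": {"hola": "o l a", "mundo": "m u n d o"},
}

def to_phonemes(text: str, lang: str) -> list[str]:
    """Shared code path: only the per-language lexicon differs."""
    lexicon = LEXICONS[lang]
    return [lexicon[word] for word in text.lower().split()]
```

Adding a language here means adding a lexicon entry, which mirrors Simon’s point: the code and models are the same everywhere, while the data and linguistic knowledge must be rebuilt per language.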
Jesse: Are there any languages that have historically been particularly painful for researchers in TTS?
Simon: This is a classic linguistics question, which language is the hardest in the world and which language is the easiest in the world? They are all equally easy for babies to learn. All languages are trivial to native speakers. We are all evidence of that. It is more that each language has something that gets you, something that is easy, something that is hard. Japanese has three different alphabets, complicated stuff going on with the writing system but the pronunciation is trivial. It is the same phonemes, almost the same sounds of languages as Spanish. Every language has got something easy, something hard, but it is the knowledge. That is what you need expert knowledge for.
Esther: When you are bringing on board the people with the right skills and expertise, it seems like it is a super competitive market. How do you go about finding and bringing them on board?
Jesse: It is not easy. What is also interesting is that if you rewind just a few years, there were not as many people interested in the broader domain as there are now. That has helped, because it means people are now setting this out as a career trajectory, which historically was not the case.
Simon: Speaking as somebody who runs a master’s program and does admissions for a number of applicants, we have five to 10 times as many people wanting to do this master’s program as we can handle. It is quite hard to start a new whole degree or educational program to train people for this field but it is growing in general.
Florian: What is driving this? Is it just general AI?
Simon: I think it is partially that. There are a lot of people passionate about language so they did a linguistics degree. They just love language, not just learning them, but taking it to pieces and putting it back together again. This is one place they get to do linguistics in an engineering sense of doing something both creative and technical with linguistic knowledge, but they have got to get the technical skills. They do not get those in the linguistics degree. To give them that and then to let them move forward, that has been a long-standing trend. Equally, all those AI degrees and computer science degrees out there, these people do not all want to be software engineers. That is not very interesting. You want to do something a bit cool or be in a company with interesting people doing interesting things and they are also coming into language from the outside.
Florian: You started out with a grant, but then you raised a seed round and now a series A. How did you connect with those investors? Why did you choose to work with them? Why was bootstrapping never an option in something as fast-paced as your space?
Jesse: Bootstrapping is always enticing. It also probably becomes more enticing for people after you have raised money and you realize that it would probably be nice not to have all of the pressures and noise that comes along with fundraising. Fundraising to me is a means to an end and oftentimes there is too much fanfare and excitement around it. I think the reality of working within deep learning and text-to-speech is that it is not simple. It is not something that you code overnight and start selling tomorrow. It requires that maturation period and it needs to be nurtured and you almost need to have that capacity of time to allow even researchers to just dwell in it. That is one of the fundamental reasons why we raised capital because we knew we needed it.
Then what you need to do is think about who cares about this sort of space. Text-to-speech was popularized commercially through things like Alexa and Google Home, and more investors started thinking about different applications of text-to-speech. That is why we have one of the top seed funds, LocalGlobe, as one of our backers, as well as a few venture and media companies, including Sky, the Guardian, and Bertelsmann. We also have a US investor by the name of Sense Capital, which has been fantastic, and then a bunch of angel investors. Angel investors are mainly people that want to stay close to the company and are interested in its future. They can be quite helpful and constructive over its history, and so that is why we have tried to create a pretty diverse range of people that are close to the company.
Esther: Do you have any plans to do any future raises this year or is that it for now?
Jesse: Thankfully we do not need to raise for a little while. Once you have announced a fundraising round, that is when you also see a flood of more inbound requests. Thankfully we are okay for now. We also have Innovate UK projects which are some of the government grants in the UK that fund compelling research, but it is primarily standard venture capital as well as the grants.
Florian: Let us close on the product vision for the next two to four years. What is in the pipeline? What are the plans? What are the ambitions?
Jesse: To me, it comes down to the quality of voices, breadth of languages and just applications that you can go after. Again it is such a wide territory for us to try and tackle and I want us pushing on all those verticals. They are all tough in their own right because voice quality is different from languages, it is different from the product, it is different from commercial application. Focus will always be the name of the game, but I am just excited to try and push into different territories as we go.