Voiseed CEO on How Game Localization is Trailblazing Speech Synthesis

SlatorPod #158 - Voiseed CEO Andrea Ballista

Voiseed CEO and Co-founder Andrea Ballista joins SlatorPod to discuss the machine dubbing startup's approach to operating and developing its AI-based virtual voice engine, Revoiceit.

Andrea talks about how his childhood passion for music led him to found audio localization studio Binari Sonori, which he sold to Keywords Studios in 2014, and why he is now launching Voiseed.

He shares his impressions from this year’s Game Developers Conference where there was a lot of interest in new technologies, voice cloning solutions, and the development of emotional synthetic voices.

Andrea unpacks Revoiceit’s ability to understand the voice and emotional profile of the user and transfer both profiles into the target language. Voiseed has been profiling vocal delivery and creating a data set, so the system can have a wider knowledge of human emotion in terms of voice and language.

Subscribe on YouTube, Apple Podcasts, Spotify, Google Podcasts, and elsewhere

On the topic of large language models (LLMs), Andrea is not worried about the implications of LLMs like ChatGPT as they have had two years to build a dataset on vast amounts of voice content. He talks about Voiseed’s early financial backers and shares the story behind applying for blended finance (grants and equity) from the European Innovation Council Fund and LIFTT.

Andrea shares his experience hiring during the lockdown and their approach to investing in ‘cool’ people and helping them grow through their incentive plan. The pod wraps up with Voiseed’s product roadmap, where they aim to improve features with more emotion control and work on how to create new voices that can be used in multiple projects.

Transcript

Florian: For people who don’t know Voiseed, give us the 60-second elevator pitch.

Andrea: Basically, we are a startup building an engine technology, a tech engine, that is able to produce emotional virtual voices in multiple languages. The idea is to produce any voice with any style in any language. This is the vision and we are building it step by step.

Florian: Perfect, and you were also on our 50 under 50 list. We published a list of the 50 most exciting language AI startups under 50 months old, so I learned about the company from that as well. So tell us about your professional background and your route to co-founding Voiseed, because you've been in the audio and dubbing business for quite some time.

Andrea: Let's say that everything started from a passion for music and singing when I was a boy. During my teenage years I had a band and played music, and then we started to say, okay, what about graduation? Oh, maybe we can do computer music. This is something that is not here yet but is going to come. Of course, computer music in the 80s was something very strange; it was nothing like what you can see now. That was my path into computer music, and then came the multimedia age: multimedia, so starting to be curious about video, voice, timing and singing. Being also a singer, singing in advertising, the usual jingles on TV, I really got to know the studios and what happens in a studio. In that context I started my first company in 1994. It was called Binari Sonori and we started to do game localization, and it kept growing until 2014, when the company was around 60 people with an office in Tokyo and an office in Los Angeles. Then we found Keywords, and the idea was pretty clear: Keywords was doing translation and LQA, and we would love to take the offering another step, so joining the company was pretty natural. Technically it's been sold; Keywords had done its IPO in London in 2013. Then I stayed there for four years and something as an Audio Director, taking care of the vertical integration and growing the studios. It was an exceptional ride, but at a certain moment we said, yeah, maybe we need something more. We need more than just humans, since demand from games is growing so much, deadlines are always crunching, and the timing was complicated to keep up with all the retakes and everything, so we were really thinking about doing something else. So I left Keywords at the end of 2019 and started Voiseed in 2020 with the vision to do something more than the Google Tacotron: expressive voices, expressive synthetic voices, yeah.

Florian: Interesting, so Binari Sonori was one of the first acquisitions of Keywords after their IPO, right?

Andrea: You are right. It was really at the beginning, since we had known Andrew and Giorgio, the CEO and the Founder, well before. In the game space there were not so many players, just the usual suspects, maybe 20 or 25 companies, so we all knew each other, of course.

Florian: Since then they've bought 50 or so companies and they're at around 3 billion in market cap. Quite the ride there. You mentioned you just came back from San Francisco, so you were at the Game Developers Conference a couple of weeks ago, the big one, right? The kind that books out almost the whole of downtown. So tell us your impressions, also regarding localization, dubbing, machine dubbing. What were some of the takeaways?

Andrea: Absolutely. First of all, GDC was finally back after GDC 2019. I was there last year, but it was not so crowded, so it was definitely super welcome to be back on the floor and see the usual GDC with a lot of people, bumping into each other and saying hi after such a long time, after the COVID years. Having said that, there was of course a lot of interest in development and new technologies, and expressive voices are something that is coming after machine translation. In our industry we saw machine translation arrive maybe 10 or 15 years ago; now it's basically normal. Some synthetic voices already came through the Google Tacotron, and there are a number of voice cloning solutions out there, but making them more emotional, with a great variety of voices in multiple languages, that was definitely super interesting. I talked to a lot of folks who are really trying to understand. They are curious, they know that it's coming, so it was really a pleasure to be there.

Florian: Very interesting. So who do you look for primarily when you walk the floor there? What type of role, what type of company would be the best fit at your current stage?

Andrea: Absolutely. We are starting from what is called game dubbing, which is also my own background. The ultimate clients are, say, the publishers, but they need language service providers that are experienced in this area to do the translation and, of course, the voice recording. We are a tech company, not a service provider, so our main focus is to create a technology that can be used by people who are experienced in the industry, adding one new service to the existing service portfolio that is already out there. The idea is that if you can't use an actor for any reason, because actors remain the top-level solution, but you also can't use text-to-speech, since it's not expressive enough, it takes long, and it has a number of other issues, then you can try a different way: let's try Voiseed, and Revoiceit specifically. So language service providers for games have definitely been our first target there.

Florian: You mentioned Revoiceit, so is it a product? Is it the core, is it your platform, or is it one product and you're going to move on to another one? How do you think about it?

Andrea: The focus of the company is to create a new technology. This is the core that we have already been patenting, and it's a new thing; it's not the Tacotron. The other thing is that you definitely need to make the technology available and usable, and this is why we have built Revoiceit. It's the first, say, implementation of this new technology, devoted to the game dubbing world, in which you can take advantage of a new technology that is able to understand the voice profile of whoever is prompting the system, understand the emotional profile of the line, and transfer both of these profiles into the target language. So Revoiceit is the first application in which you can use this new technology.

Florian: You mentioned prompting, so what's the input? Is it just that I speak and then I click and it becomes different emotions, or is it textual input? How does it work? Both?

Andrea: Let's say it's multimodal. The Revoiceit platform is designed to ingest a source text and a source voice file, to also ingest the translation, and to generate the output voices automatically. There is a special feature called autocasting that is able to understand the voice profile and generate what we call a sound-alike. So it's not exactly the source voice, but that doesn't really matter if you are doing dubbing: if you watch dubbing in other languages, the voices are always different, so it doesn't have to be the same. And of course, the emotional delivery of the line is remapped into the other language, since it's not just taking it as it is and moving it to another language. Every language has its own way to express emotions and to put the words one after the other. So the technology behind it makes the kind of magic where you can hear that, yeah, it's a similar take, but it's told in another language. That's the thing.
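Conceptually, the flow Andrea describes maps onto a speech-to-speech dubbing pipeline like the minimal Python sketch below. Everything here is hypothetical and illustrative: the class and function names are not Voiseed's actual API, and the stub bodies stand in for real speaker, emotion, and synthesis models.

```python
from dataclasses import dataclass
import math

@dataclass
class VoiceProfile:
    # Timbre/identity features of a speaker (toy 2-D embedding here).
    embedding: tuple

@dataclass
class EmotionProfile:
    # Emotional delivery of a line, as label intensities.
    labels: dict

def analyze_source(audio_path: str, text: str):
    # Stub: a real system would run speaker and emotion encoders on the audio.
    return VoiceProfile((0.2, 0.8)), EmotionProfile({"joy": 0.7, "anger": 0.1})

def autocast(source: VoiceProfile, bank: list) -> VoiceProfile:
    # "Autocasting": pick the sound-alike voice nearest to the source profile.
    return min(bank, key=lambda v: math.dist(v.embedding, source.embedding))

def render_take(text: str, lang: str, voice: VoiceProfile,
                emotion: EmotionProfile) -> bytes:
    # Stub: a real system would synthesize audio here, remapping the emotional
    # delivery into the target language's own expressive conventions.
    return b""

def dub_line(audio_path, source_text, translation, target_lang, bank):
    voice, emotion = analyze_source(audio_path, source_text)
    target_voice = autocast(voice, bank)  # a sound-alike, not a clone
    return render_take(translation, target_lang, target_voice, emotion)

# Usage: dub one line into Italian against a tiny toy voice bank.
bank = [VoiceProfile((0.1, 0.9)), VoiceProfile((0.9, 0.1))]
audio = dub_line("take_001.wav", "Run! Now!", "Corri! Subito!", "it-IT", bank)
```

The design point the sketch tries to capture is that the source take is reduced to two separate profiles, identity and delivery, so the target language can re-express the delivery in its own conventions rather than copying the source prosody verbatim.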

Florian: What a challenge. Languages express emotions in different ways, of course, in terms of the words, but also just in how you express emotions in your voice, right? I mean, just Milano and Zurich, very different.

Andrea: The fact is that this comes from quite a lot of research we have done on the psychology of emotions. All humans have similar emotions, but the way they are conveyed outwardly, in terms of culture and vocal apparatus, is a little bit different. So we have been profiling a very specific set of, let's say, vocal deliveries, and we are creating our own dataset, mapping all of this so the system can have a wider knowledge of what human emotion is in terms of voice, in terms of which kinds of voices you can have and, of course, in terms of languages.

Florian: This must be an incredibly competitive and very resource-intensive kind of baseline research you need to do. There's got to be a lot of academic research going into this elsewhere, and maybe some of the Big Tech companies are also looking at this.

Andrea: Yes, everybody is getting there. The idea is to have what we call a universal TTS, but it's not a TTS, it's a speech-to-speech system in which you can basically emulate multiple voice profiles and multiple types of emotional expression in multiple languages. So if you start with a voice prompt and its target text, you can render that voice prompt into another language very easily. The system has also been designed to be a multi-user system, so multiple users can attach to it: you can have multiple projects, you can create project teams. When you have to approve a line in another language, of course, you need native speakers, and they don't just need linguistic skills; they need to be able to say, oh, I love how this line has been translated and delivered. So we are also inviting people to operate on the platform. It's a platform with a human-in-the-loop, exactly like machine translation: the machine translation suggests a translation and a post-editor approves it, eventually changing it a little bit, whatever is needed. In the same way, the system suggests a take in another language, and the linguist, or tester, or subtitler, or whatever you call them, is smart enough to recognize, oh, this is a good delivery. And if it doesn't sound exactly right, you can change the pitch, you can change the duration of each word, you can change a number of things from the platform itself. So it's definitely a different system, and it can be very useful specifically for NPCs, or multiple characters, or one-liners, or a few characters, and it can integrate with existing technology. The idea is that there's so much content out there, so many projects that are just translated and not voiced. The basic idea is to voice the unvoiced projects and give voice and emotion to billions of stories that at the moment cannot be told in the native language of the users. That's the thing.
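The per-word pitch and duration controls he mentions suggest that a suggested take is, in effect, an editable data structure. Here is a small, purely hypothetical Python sketch of what that post-editing surface might look like; Revoiceit's real data model is not public, so all names and fields are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class WordEdit:
    # Per-word overrides a reviewer can apply to a suggested take.
    pitch_semitones: float = 0.0  # shift the word's pitch up or down
    duration_scale: float = 1.0   # stretch (>1) or compress (<1) the word

@dataclass
class Take:
    words: list
    edits: dict = field(default_factory=dict)  # word index -> WordEdit

    def tweak(self, index: int, pitch: float = 0.0, stretch: float = 1.0):
        # Record a human-in-the-loop correction before re-rendering the line.
        self.edits[index] = WordEdit(pitch_semitones=pitch,
                                     duration_scale=stretch)

# Usage: the reviewer stresses the name a little more, then the line
# would be re-rendered with the recorded edits applied.
take = Take(words=["hi", "Florian", "how", "are", "you"])
take.tweak(1, pitch=+2.0, stretch=1.2)
```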

Florian: It's incredible. I loved how you had to think about what that human-in-the-loop role is actually called. Do you think there's going to be some kind of job title for it? It's such a complex task: on the one hand, that person needs to be a native speaker, probably with a bit of passion for games, who understands what the game is about, and they also need to be attentive to the pitch of the voice and be able to tweak it. That's a pretty interesting profile.

Andrea: There's a lot of chatter about prompt engineers nowadays. The large language model, which in this case you can see as a large voice model, needs different profiles of people who are able to use it. You need to prompt the system so that it gives you feedback; if you like the feedback, fine, and if you don't, you can tune it. This will be a kind of new work profile, and then we need, of course, to train our customers and partners so that they can get the best acceptable result in the least time possible. So you really need to re-leverage some basic skills and integrate them with additional ones. Everybody is learning something when chatting with ChatGPT. I had never chatted with ChatGPT before; now I'm getting used to it, thinking, I want to stimulate the system so that it responds with what I like. I'm driving the system. That's the kind of thing we are assessing there.

Florian: You’re a tech company, you provide the platform, but you’re educating your users now on how to best use the tool.

Andrea: Yes. Training, how to use it, how to create new workflows, since they all have different needs. For example, projects are getting bigger, Florian, and a lot of the time some languages cannot be recorded with voices at all. But if you have a large project with a few major actors and a bunch of NPCs, you can say, hey, wait a minute, maybe you can do the voices by mixing virtual and real actors for the NPCs, so you can ideally have more recordings. Of course, you need to be smart in understanding what type of tool you are now using, and that comes from knowing your job and knowing how to integrate this new service into your service portfolio. And of course actors remain the top solution, but sometimes projects are dropped. Or think about delivering a little trailer in different languages in 24 hours: it's not possible, you can't plug humans into the web, there's no way. You have seen this happen with machine translation, of course; no human can really translate that amount. And then you have games as a service. How can you deal with that? It's a really different kind of thing and it requires some attention, but definitely it's a big thing.

Florian: You mentioned a trailer now, but what's your ideal user scenario right now? Would it be the trailer, more the promotional material around the game? Or if it's in-game, is it a AAA game but only certain parts of it, or maybe more mobile? Where do you see the most traction or fit at the moment?

Andrea: Let's start with the game space. In the game space you have a bunch of mobile games that you can't even consider, since they are free to play and there's no budget; you can't invest in having voice in these free-to-play mobile games. But with a more affordable solution, you can start to trigger this. Maybe the game becomes popular, and then maybe you can bring in some big names to record it in the future. Or some languages are dropped entirely from mid-sized or large games. Even Italian: sometimes it simply costs too much, you can't do it, and you know that Italy or Spain, for example, are very much dubbed-content territories, so if you don't do it, the game most probably won't sell very well. But I'm not just talking about the major territories. Major for me... major is China and the US, of course. The European territories definitely have smaller populations, but there's a bunch of Eastern European territories and a lot of Asian languages that are, of course, exposed to localization from one territory to another. And there are billions of stories that can be told; there are more than 12 to 15 thousand games per year. How many of them are voiced? The pyramid of content puts a really small top tier at the very top, and those are the ones that can be voiced by humans, also because you won't find billions of actors in all the world; it's impossible, just as it's not possible for translators. So we try to identify exactly which solutions we can offer to users who would love to hear the projects and the IP in their own languages. This, voicing the unvoiced, remains our main focus.

Florian: What about outside the game space, maybe media and entertainment, or specifically the big YouTube creators? YouTube just opened up access to its audio tracks feature to a bunch more creators. Would that be a logical next step, or are there other areas, other low-hanging fruit?

Andrea: The fact is that our focus is voicing emotions, so we need to stay where emotions live. Media and entertainment is a natural evolution, in which you don't just have single lines, but multiple takes on a timeline with the video, plus the possibility for the system to adapt to multiple source voice profiles. Doing autocasting there will be a really interesting further solution, so media and entertainment is definitely one of these. There's also game development, but development is a little bit different, because when you are developing a game you need a lot of iteration: you need placeholders to get the game engine working properly, and you integrate and tune until you have something that is okay, finished. Then you can decide to go into the studio and record it with top-level actors, or, if you don't have the budget for that, keep some of these voices and use them for publishing purposes. So in that respect we are talking about media and entertainment and game development, alongside game dubbing. On top of this, advertising and eLearning especially are becoming more and more interesting, since there's much more emotion there. We like to stay away from the simple narration of things. It can be done, of course, but there are a bunch of solutions and a lot of good text-to-speech systems coming out, and when the source voice is not important, you can use any other voice; we are less relevant there. And whoever is using a text-to-speech system will need to craft the emotional score of each segment, since one of the things you don't get from plain text-to-speech is the generative emotional delivery that you can use on the target. So if you want to move beyond plain narration, you need to craft it, like scoring for an instrument. When you look at a score for an instrument, it says staccato, accelerando, and all these kinds of things. You need to put in that additional information, and this slows down the process.
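To make the musical-score analogy concrete, hand-crafting an "emotional score" for a plain TTS system might look something like the small Python sketch below. The schema, keys, and values are purely illustrative and not any real TTS markup standard; the point is the per-segment authoring effort Andrea describes.

```python
# A hand-crafted "emotional score" for a short scene, one entry per segment,
# analogous to tempo and articulation marks on sheet music.
score = [
    {"text": "I can't believe you're here.", "emotion": "sadness",
     "intensity": 0.6, "tempo": "slow"},    # like 'adagio'
    {"text": "Run! Now!", "emotion": "fear",
     "intensity": 0.9, "tempo": "fast"},    # like 'accelerando'
]

# Each segment would be fed to the synthesizer with its annotations; writing
# them line by line is the work that slows the plain TTS route down.
for segment in score:
    print(f'{segment["emotion"]:>8} ({segment["intensity"]:.1f}): {segment["text"]}')
```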

Florian: How does that labeling work? Is there a set of, like, 100 accepted emotional terms?

Andrea: We have been trying to make it usable, so as not to make things too complicated: reducing the key set of emotions and letting the engine apply multiple, say, nuances. Also, it's not just emotion, it's emission. It's very different if you say something... Or, hi Florian, how are you? The frequency range of the voice changes completely between when you are whispering and when you are shouting. We perceive it as the same voice, but from a frequency-analysis point of view it's a totally different thing. Our ear is trained to interpolate and to recognize one entity that is talking. It's rather like the pianoforte, for example: the frequencies of a piano are totally different when you touch the keys very quietly versus hitting them very, very heavily. But for us it's a piano, and this is a miracle of our ears and of how they are trained to communicate emotions. One last example, if you let me: when you phone someone close to you and this person answers, within half a second you say, what's going on? You understand immediately that the tone and expression of the voice is not the usual one. This means that this sense of ours is extremely developed, and of course we narrate stories, so this is a great way to convey emotion in voice. I'm, of course, on that side; I'm in love with all this emotion and voice and singing and music, but it's definitely something relevant for humans.

Florian: What an incredibly complex challenge from a technology point of view as well. And, in a sense, you wouldn't have a ground truth or gold standard like you may have in translation, for example, because you can always debate, right? When that human post-editor/sound prompt engineer tweaks it one way, maybe to them it sounds great, but to somebody else, well, maybe not. So it's also kind of subjective, I guess, in some sense.

Andrea: You're right, it's very subjective. And of course, there are a thousand ways to say, hi Florian, how are you? How many ways? And of course, someone has to approve it, like a voice director in the studio. You simply need to drive the actor to the best acceptable performance for the target in the minimum time possible, since studio time is very precious. Audio people are trained to get the best within the forecasted time, and the same goes for linguistic people; that's what it means to be professional: the best acceptable result in the shortest time possible. That's the thing. But yeah, it's definitely not easy to test either, since this system can be hit by billions of inputs and create billions of outputs. It's not three inputs in, four outputs out, easy to test. This is why we need smart people who know the language, who know how a line's delivery can be done, who can operate it, and usually these people live in the language service providers rather than in the publishers. The publisher has a different need; of course, there are different colors to this in the outsourcing line of the publisher with respect to the LSPs. But there are multiple solutions for multiple needs.

Florian: I think in six to nine months we'll have a great job title for what you're describing, something that people... Fascinating. Hey, another thing we didn't touch on yet: obviously there's the whole large language model boom, ChatGPT, GPT-4, whatever term we want to use. Hype... I guess hype is the wrong word, because it is a hype, but everybody also understands it's a groundbreaking thing that's happening. What are the implications of this when you're at your stage of the startup life cycle? You're about two years in, you have your roadmap, you have invested in certain capabilities, and then something like this happens. What do you do? How does it feel? Do you pivot? Do you not? Is it relevant or not?

Andrea: The strange thing, Florian, is that we started to build this two years ago, and it was already a kind of generative system, well before ChatGPT. So basically now we have more ammo to define what we're doing: we are building a large voice model, structured as a generative system, that can create emotion in different languages, having seen prompts in source languages. This helps us describe it in a better way, and of course the general hype is getting bigger, so there's much more attention, but we started before. And we started not from the technology but from the use case. The question was: I have a lot of content, voice content, expressive, built with multiple voices, that cannot be moved, translated and voiced into other languages. How can we solve that? So we created the technology and built a dataset to solve the problem. Not, oh, Google has done the Tacotron, let's use it. That's the difference: it's not technology integration, we have been trying to raise the bar. And this is why we needed funding from the beginning, since if you want to take on this challenge, it's pretty intense. It's not something where you say, okay, I go out, go to the movies and build a website. No, it goes much deeper.

Florian: Let's talk about the funding. You initially raised, I think, a seed round. When was that? Just recently, right? And then you got some funding from the European Innovation Council, but are there also some VCs involved? Tell me a bit more about that process.

Andrea: The thing is that we started in 2020. We rented the office on March 1st, and then the lockdown in Italy was very, very hard. Basically, it was a very complicated start. We applied anyway to Berkeley's SkyDeck program, the Berkeley startup accelerator, and we did it remotely, so we learned a lot. But we said, okay, we need some more funding, and we were browsing around to see what Europe was doing. Horizon Europe was just starting and we said, this looks interesting: you can have a grant, and if you choose blended finance, you can also have equity support. So we were kind of brave and said, let's do that, let's make the application. And we won, in the first round of Horizon Europe; the winning companies were notified in October 2021, and we had applied for the blended finance. This means you have the grant for the project and you can also have equity support. So we started to develop the technology with the grant, and to unlock the equity support we found a venture capital firm, the Italian VC LIFTT. We made the press release, they were interested in what we were doing, and we said, okay, we have found good common ground. The European Innovation Council is a follower investor: since they have to evaluate hundreds of startups, you need a lead investor doing the due diligence and making sure everything is done properly. So the first equity round was closed at the end of January, and this comes on top of the grant.

Florian: Now, with the funds, I'm assuming you're hiring a bit as well. How do you find hiring currently? Because there are two different forces at play, right? Some of the Big Tech players are cutting staff, and then there's the AI boom that's probably making every machine learning person in the world so much more valuable. So how do you see the hiring environment?

Andrea: We started hiring during the lockdown, of course, and it was complicated. We have a nice school in Milano called the Politecnico, but what we do is so specific, Florian, very specific, that you basically can't go around and find people who already know it. So we have made a big investment in selecting cool people first of all, and in making them grow with us through our incentive plan. That's the kind of thing every startup does, but we definitely need to be very keen operating in this area. And what we do is becoming more and more specific, so we need to train the people who come in. You can't just say, okay, I want someone who does something like this; there are very few people experienced in this, and what we do, we do in a different way, so we know the investment in people is a key point. We are trying to find smart people who want to embrace a challenge, since we are still a startup. Everything is not written yet, we are writing it as we go, so it's a kind of risky thing. Of course, it's very promising, since there's a lot of interest, and after ChatGPT the hype is getting bigger. But okay, we have a vision.

Florian: Are you mostly hiring in Milano, or would you also be open to going fully remote, or do you want to have people in the office?

Andrea: We have been lucky, since there were a lot of people coming here to study, so we have a very international team, but predominantly all of them now live in the area. We are, of course, becoming more and more interested in hiring remotely, but we also need to learn how to work remotely, since a startup needs to stay close. The team feeling is extremely important, and if we are able to succeed, it will be a team effort. Of course, I bring my experience from the industry in finding use cases, but the team is super cool, since they are really challenging things that have not been solved before, so there's a lot of admiration and support. A team is a team, but the heart of the team is the key point.

Florian: Anything about the product, any kind of new features coming up, anything you can share for 2023?

Andrea: Revoiceit is out there, so you can ask for a demo or whatever. First of all, it's dedicated; it's B2B, not B2C. And we are definitely improving it with new features, with more emotion control and emission control, and also working on how to create new voices that you can use in multiple projects. And of course, we are proceeding with media and entertainment, including video, and with game development, so these are the two things. Game localization is our sweet spot at the beginning, but definitely game development together with media and entertainment, including video, while preserving the highly emotional content that we are targeting, since that is our, say, sweet spot, and proceeding to the other places where our core technology can be deployed, inventing new workflows. Yes, absolutely.

Florian: It's kind of a custom onboarding, though, right? I don't see a SaaS option where I can go in and get the, I don't know, pro account for $500 a month.

Andrea: The thing is that we know we have something cool now, but it's going to be much better in the future. ChatGPT too: the very first version was, okay, we can do this. So at the moment, since adoption requires some attention and you need to define new workflows, we have set a very low entry barrier for, say, testing the technology. All our partners and customers have the chance to test the system, and we support them with training. It's basically pay-per-use, so we win if we deliver, and the clients need to be happy, because what's the meaning of having an expensive subscription if you don't use it? So we need to listen to the customers' feedback and keep challenging ourselves to create something better and improve it.