Synthesia CEO Victor Riparbelli on Personalized Video in 40-Plus Languages

Synthesia CEO Victor Riparbelli joins SlatorPod to discuss the company’s approach to personalizing video content

Synthesia CEO and Co-founder Victor Riparbelli joins SlatorPod to discuss the company’s approach to operating and developing the world’s first and largest platform for video AI generation.

Victor talks about Synthesia’s journey and the rapid progression of media technology, where video content demand now outpaces the speed of production. He unpacks the role of academia in the company, where PhDs and professors make up nearly 50% of staff.

The CEO goes over the evolution of text-to-speech in the last decade, from the appearance of deep learning and voice-cloning to multi-dimensional speech with emotions, pitch, and style. He also discusses the difficulty of extending voice into multiple languages when there is no data to support the neural network.

Victor reviews the success of video content over text and how this ties into working with global companies, which not only want to train and communicate with their employees but also improve the customer journey. He shares how he sees the localization and translation industries as partners and an integral part of creating multilingual content.

Victor talks about Synthesia’s funding rounds and shares the story behind connecting with their first investor, Mark Cuban, after struggling to find funding. He gives advice on finding the right type of investor and what to expect from venture capitalists unfamiliar with the tech space.

The podcast wraps up with Victor’s view on deepfakes and the impact on their approach toward harmful content, and the company’s vision of creating more storytelling rather than informative content.


First up, Florian and Esther go through the poll results from May 21, where respondents weighed in on Translation as a Subscription, with only 12% thinking of it as “the future.” The duo discuss the pros and cons of the subscription model and reference the Pricing and Procurement report, which highlights its simplicity and predictability.

Florian talks about KUDO’s latest public relations win as billionaire investor Bill Ackman tweeted about using the multilingual conferencing platform for an investor presentation. Ackman used KUDO to run the presentation in 11 languages, where he announced buying 10% of Universal Music Group from Vivendi SE.

For the third week in a row, RWS pops up in language industry news as it partners with speech recognition system CEDAT85 to launch a live subtitling and captioning solution for online meetings and events. Esther touches on possible competitors in the space, such as Ai-Media, Redbee, and Verbit.

Florian discusses SwissText’s 2021 conference competition, where Microsoft’s winning approach recognized the Swiss German dialect and translated it into standard German text.

Subscribe to SlatorPod on YouTube, Apple Podcasts, Spotify, and Google Podcasts.

Stream Slator webinars, workshops, and conferences on the Slator Video-on-Demand channel.


Florian: First, give us the elevator pitch of Synthesia. In a nutshell, what do you do? 

Victor: At Synthesia we are operating and developing the world’s first and largest platform for AI video generation, which essentially means that you can take text and assets and turn them into video. Our vision is to make the world’s information visual so if you think of all of the world’s information today, 99.99% of it is only stored as text. It lives inside of textbooks, articles, newspapers. Most of what we know as the human species is recorded in text. That is not because most people prefer to consume the information as text, it is because for many years it has been the only scalable way that we have had to communicate with each other. Typing on a keyboard is easy, it is fast, and you can share with anyone immediately.

What we are seeing is that video, audio, and interactivity are taking over. Young consumers especially do not want static and textual interactions. No one is longing for more textbooks and more traditional New York Times. People want podcasts, YouTube, Instagram, TikTok, et cetera and something we have discovered is that this is also very much true in the enterprise. People who work in companies and consumers at home default to using video: Zoom with our parents or friends, YouTube or Udemy for learning stuff, or Netflix. We are moving to this video-first world and if you take something like TikTok, which is the world’s largest and fastest-growing social network, it is almost purely video. It is a clear progression from Twitter, which was text. Facebook, which was text and images. Instagram was images and now TikTok is purely video. That is all great for consumers, but it is not so great for companies and for businesses because producing video content at scale is really difficult. It is a physical process. You need cameras, studios, film crews, post-production, and demand has outpaced the speed of production.

We built a platform that uses artificial intelligence to make video production really fast, affordable, and accessible. It really is as easy as creating a slide deck and all you need is a laptop and internet connection. The way it works is that you log in to our platform, you select an AI avatar, this could be one of our stock actors that are built into the platform or it could be yourself with a three, four-minute recording, and then you just type your script. We support 50 different languages. You click enter and your video will be ready in a couple of minutes. It is very simple and if you go on our website, you can get a visual explanation of it. It is always a little bit hard to explain just by talking. 

This is valuable because the world is moving to video so our customers use Synthesia to enhance their existing content. One of the learnings we have had is that we are not a replacement for video production. We are a replacement for text. This is not about replacing all the kinds of video shoots you would otherwise do as a company. It is about taking all the materials you have as text and turning that into video. That is a very simple pitch from a technical perspective. We are a deep tech company, mostly PhDs and professors solving some really difficult problems and taking what you could think of as Hollywood visual effects and democratizing it at scale and making it available for anyone anywhere for a starting price of $30 a month. 

We have taken video production and abstracted it purely as software and as code. Not only do we do things today much faster and cheaper, but we can also do a lot of new things that were not possible before. For example, things like personalization. Instead of making one video that you show to all of your customers, your employees, you can make a customized video for every single one of your employees or your customers by taking in various data points and serving it to them. Just like you get personalized emails from your bank: we do not all get the same email because there is different data about us and we have different bank accounts. It is a new medium that is going to be defined by a new generation of creators.

For us, the eventual goal is you can make a Hollywood film on a laptop. We are pretty far from that still, but in 10, 15 years, I definitely think that is going to be possible. It sounds a little bit crazy to a lot of people saying we can make a Hollywood film on a laptop and it may be, but what is happening here is that media production is moving from cameras and microphones to something that we can do with computers. That has happened to a lot of other types of media before. You can open up your laptop and generate an entire chart-topping hit without needing anything other than your keyboard. That is because we can synthesize pianos, guitars, amplifiers and effects to a very high fidelity. The same thing with images: open up Photoshop, and you can create more or less anything you can imagine. With text, we take this for granted today, but there was actually a time when you had to sit down with pen and paper and write something. Now I can just type it on a keyboard and that obviously enables an entirely new information ecosystem, with the internet and social media. I see this as a natural progression of media technology.

Florian: Tell us a bit more about your personal background. How did you get into this? How did you get started with this idea? 

Victor: This all started in my childhood. I am a massive sci-fi fan. I have always loved computers as a kid and in my late teen years, I figured out that my interest in computers and software could be turned into a career. I started building webshops for local businesses here in Denmark. I slowly moved into more startup-type companies in Copenhagen and after I had been doing that for six years, I went to Stanford for the last semester of my bachelor’s degree. This was Silicon Valley; it was going to be big and people were going to be thinking much more ambitiously. That was to a large degree fulfilled. I met a lot of interesting people and when I got back from there, I realized that I loved building things and products, which is what I had been doing before, but I was not so passionate about building accounting programs and traditional business software. I needed to make this sci-fi interest my actual career so I moved to London because Copenhagen is not really the place to build a deep tech company. I met one of the professors who was behind one of the original deepfake research papers and when I saw that piece of technology, I instantly saw that this is a glimpse into the future of content creation. There is obviously a lot of talk about the negatives of this type of technology. I saw that this is a step change in how we do media and then I decided to pursue that idea and we got a strong founding team together of half academics and half business and product people to go on this mission of making it easier to create video content for everyone. That is almost five years ago now.

Esther: Can you tell us a bit more about who the co-founders are and what the current leadership team structure is? 

Victor: The co-founders of the company include Professor Matthias Niessner, who used to be at Stanford. He is now at TUM in Munich, and he has done most of the seminal research in this space over the last 10 years or so, essentially using deep learning methods to generate various media content. He is a world leader in this space, the same for Professor Lourdes Agapito. She is based in London at UCL and has also been one of the pioneers in combining computer vision with deep learning, and then the last founder is Steffen Tjerrild, who is the business and finance guy of the equation. A noteworthy mention is our CTO, Jonathan Starck, who joined very early before we had raised any funding. Jonathan comes from a very interesting background of building what we think of as the Photoshop for visual effects artists. Most of what you are seeing in Hollywood films has been created with technology he built. We combined the academic world with high-end visual effects from Hollywood and some traditional SaaS and startup building.

Florian: Let us talk a bit more about the underlying technology. What is built, what is bought, what is customized?

Victor: Basically our tech stack is more or less proprietary and we are a deep tech company. Today around 45% of our staff are PhDs and professors solving blue skies research problems. We are investing heavily in driving research in the field and I have to say we are ahead of even the academic space in a lot of what we do. When you write a research paper, you only need it to work once out of a thousand times: you can show it to the world and it is now technically possible. To actually create a product is a lot of work. Especially a product that can scale. Today we are creating thousands and thousands of videos for our clients every single day and that requires a lot of proprietary technology, which is not just about making the video, the images, but also how do you reduce costs to a point where it is actually something you can sell and make a profit on? How quickly can you do it? If it takes an hour to make a video, it is not going to be a great user experience. There are lots of other things that pop up once you start creating a product and so most of our stack is built to support this particular technology, which is very new.

Florian: Is it super compute-intensive? What are some of the challenges there?

Victor: One of the big challenges is that to scale this, you need to have a version of it running that is not super compute-intensive. The starting plan is $30 a month, which allows you to generate 10 minutes of video so we have got it down to a point where it is pretty manageable and it is only going down from here, but it is in general a very compute-intensive process. There are two types of compute-intensive processes when you are doing the kind of technology we are doing. The first is generating an avatar, so if we were to take a video of you and create you as an AI avatar, you would send us 10, 15 minutes of video content. We would take that and train the model. This process is quite expensive, but once we have trained the model, running it is affordable at the scale we need to be at to make a product.

Florian: What are multilingual avatars? Tell me a bit more about the concept of avatars, natural versus artificial.

Victor: From a high-level perspective there is very little doubt that we are all going to have various digital representations of ourselves in five to 10 years’ time. What we have today, for example, is a LinkedIn profile, which is static text. You have an email address. You have all these things. I think we are going to move into a world where video is going to be the dominant medium. We are not all going to want to be on camera 24-7 so we are going to have these kinds of avatars of ourselves that we can use to create content with. Just like we use an email address or a social media profile to communicate with our friends.

The basic concept there is you have two types of avatars. You have one which is of a real person. The way this works is that you film yourself, we have a process you can follow, it is pretty easy. You do not need any special hardware, technically you can do this with your iPhone. As with most other forms of deep learning, the quality of the data determines the quality of the output so we generally suggest going to a green screen studio, having some nice lighting, and getting a good camera. You record these lines to the camera, the recording process takes roughly five minutes, you send us that data and then our machines look at all the data. They analyze it, then they basically learn to simulate new videos of you saying whatever you type in. That is the core component of our technology.

Then there is another concept, which is fully synthetic humans and synthetic voices: characters that are not based on a real person but are generated. I usually explain this as when you play computer games. In a lot of games, you get a screen where you can choose your hairstyle, eye color, facial shape, all these kinds of parameters and that is something that we are working towards. For still images, which you can also use in our platform, there is already quite good technology out there. You might have seen something called This Person Does Not Exist. That was in the media a lot and essentially there is going to be a version of this for video as well. The way we work with our presenters on the platform now is that these are real people and they earn money for every video that is generated with them so there is a royalty fee going back to them, but this is like being a stock photo actor. You put your stock photos up on Shutterstock or Getty, and every time someone downloads and uses them, the actor gets paid.

Florian: Why would you need to have real people? Is it better than just coming up with an artificial avatar? Is there a component for turning the human into an avatar that makes it more realistic? 

Victor: Creating synthetic humans in video is incredibly difficult and I would say it is still years out from being actually solved in a way that it will work and be as good. There is also a lot of appeal for our clients in having real people. For example, if you are one of the world’s biggest fast-food chains, you want to have the character that you are using to be someone who is in all your swag with a cap on and a branded t-shirt. We have a lot of clients also where it is someone from the leadership team whose avatar is being used to educate so then you have some kind of relation with that person beforehand. I think that will continue to be the case. For some use cases, especially for very small companies that need to create learning material, maybe they will choose to use someone who is not a known face but for bigger companies, we are definitely seeing a pattern that most of them want to have a real person that you can relate to.

Esther: When you are thinking about the technology behind text-to-speech specifically, how have you seen that evolve over the last five, 10 years, or even the last few months?

Victor: If we take the 10-year timeline, what has happened to both voices, but also the video part, is that deep learning appeared and deep learning started working, so this is the AI machine learning part of what we do. What that meant was that we went from having these generally bad, very robotic-sounding voices to having what are referred to as neural voices, which just sound a lot better. It is still not perfect but the big acceleration here has been that deep learning can look at and listen to lots of different voices and learn how to replicate that in a more natural manner. Whereas beforehand what you would do is record a bunch of words and then concatenate them together and you would get something that sounded like a sentence, but intonations and inflections were completely off. Now the systems can actually understand sentence structure and have become quite good at that.

In the last five years, we are starting to see what I would call voice cloning, which means we get a lot more voices. With deep learning technologies, the community has managed to reduce the amount of data required to create a voice by quite a lot. Five, 10 years ago if you went to Google or Amazon, they would have maybe two voices because it is so much work to create just one voice. Whereas now you can actually get really good results even with just 15 minutes of data. The quality goes up the more data you put in, and if you want to have really good voiceover quality, you need to do more than that but it is at a point now where we can actually scale the creation of voices. 

In the last six months, what we see is very much the same as we are seeing on the video side: these technologies are maturing from being very one-dimensional to now there are lots of different things you can do with them. We have seen emotions in TTS voices that are actually starting to sound really good. This means you can do things like questions, for example, or have a voiceover that is sad or that is happy, and of course it is going to be reflected in the video as well. We are also seeing some companies starting to work across languages, which means taking your voice in English and then creating the Mandarin version of it, still with your voice. That is a very difficult endeavor but this is what we are seeing.

Then I would say the last part of it actually relates a bit to synthetic humans, which is what I call humans that do not exist. We have started to see the proliferation of entirely synthetic voices, and pretty soon you are going to have some TTS providers where you can go in and say, I need a voice and I want the pitch to be a little bit lower, I want it to speak a little bit faster or slower, and then you can actually create a voice which, again, is unique for your particular brand or use case. Then the last thing is style, which is very important and is one of the things that is still a challenge when you are working with AI-generated content. The style of speaking is very important: if you are doing a sales pitch, for example, but it sounds like an audiobook reading, that is odd. I do not think human voices will ever be replaced. I think it is going to live side by side but also slowly start to progress. You can actually start to control the style a little bit more.

Esther: You were talking a bit about extending the product to multiple languages. What makes that so challenging?

Victor: It is hard because you are asking a neural network to do something that is not possible in the real world. If you are teaching a system how to replicate videos of me or you speaking, that is something that actually exists in the real world. I can go and get a lot of voice clips of me and the system can learn from that. If you are asking it to create a version of my voice that speaks Mandarin, there is no data for that. I do not speak Mandarin and even if you were to force me to speak Mandarin it would sound completely wrong. It gets into a dimension of reality that actually does not exist. You are not just replicating reality, you are creating a new version of it. It is hard getting the data, and it is about disentangling all the different signals that make up a voice, and I am simplifying a lot here. Being able to disentangle all these different parts is very hard and requires lots of high-quality data, so from a technical perspective, it is just incredibly difficult.

The last part, which might be non-obvious for people who are not actively building in this space, is around synthetic media and how we actually use it. We can have all these functionalities. For example, there is a company that is doing amazing work in voice and they have now generated a system that has emotions. The way it works is that it analyzes the whole text and figures out if a sentence is supposed to sound happy or sad or surprised or anxious. You are basically saying the AI will also figure out when it should be sad and when it should be happy. Another way of thinking about this is you highlight the parts of a sentence that you want to be happy or sad. This is a very simple example, but how do we control all of these different parameters?

The reason this is interesting is I see what we are doing as a democratization play. Our app is very easy to use. I tested it with my mom who is 60. Can she use this? Can she make a video if I just tell her to make a video of this? That is important because there is going to be a big market for professionals who are used to video production. It is going to be a big market for them and they are going to be using these technologies a lot, but the interesting part here is expanding the market, making people create more videos than they did before and making everyone into a video creator.

We all create or write text every single day. It is an integral part of all of our jobs. We probably would not call ourselves writers. It is a function that you do while you are on your job but if you go back 30 or 40 years in time, that was not most people’s jobs. You would not sit and write stuff all day. There were some people in the company who would write and distribute things and I think that is going to be the same thing with video. We are all going to be creating videos more and more, but we are not going to call ourselves video producers. It is going to be a new way of communicating that is clearly superior to writing emails, for example, and we are all going to slowly switch to that. Building a UX that works, not just for people who understand everything about video and sound signals, is important. It is still to a large extent an open-ended question, especially as these technologies become way more advanced.

Florian: Let us talk a bit about the addressable market here so you are expanding the market into the B2B side. How do you see that adoption curve scaling up? 

Victor: We are seeing rapid growth right now. It has been an absolutely crazy year since we released our self-service platform. I said this before, but we are not replacing video production. We are not competing with someone saying, I am going to go out and pay 15k to do a nice, smooth video of our CEO addressing something. That is not the market. The technology is not nearly good enough and it has to be more authentic if it is a real video. We are working with big companies that usually have a global footprint and several headquarters around the world. They need to train and communicate with the organization in many different ways. They all have a massive problem, which is that they know that video is more effective than anything else, but they cannot scale the production of it. 

Take an L&D program in a Fortune 1000 company for example. Usually, you would have some video content, so let us say that out of 100 courses that you need to take over the course of a year as an employee, maybe two or three of them, the most important ones, are video. The rest are going to be text or slides that you click through. We are going in and we are helping these companies make all of that into video content. Let us say a big part of your workforce might be more blue-collar by nature. They might not be as literate as a white-collar office worker. It is just so simple. If you want to train people and you want them to actually remember what you are training them in, take someone whose first language might not be English, for example: do they want to read a five-page PDF or slide deck, or do they want to watch a two-minute video in their native language? It is a very simple equation. Watching video is a lot nicer than trying to click around on a slide deck and you can sit people down, make them watch it and have them ask questions. What we are seeing right now is a lot of internal use, and it is very much tied to how good the quality of the technology is.

What we are starting to see is that people also use this externally for things in the customer journey. If you have a help desk part of your page, if you have a physical product, for example, there are probably a lot of things you want to tell your clients or a lot of things you want to teach and educate your clients about and text does not cut it. With this technology, you can create 500 videos in the span of a couple of months and pop them into your website. When something changes, which it does all the time, you just go into the video, quickly edit it and publish it again. 

We are releasing our API platform which is going to enable anyone to make personalized content by simply connecting it to your email marketing program or your CRM database. All this can be done without writing a single line of code. It is super easy to use. I would say right now, it is very clear that we hit something here. It is experimental and maybe the quality is not good enough, but people who are consuming content care most about the experience, not so much about how it was made. What has been proven is that it is much more effective to use AI video, AI voiceovers than text. Again, very important here to underline that this is a market expansion where we are taking the market from text. We are not taking the market from actual video. 
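The workflow Victor describes, pulling data points from a CRM and generating one personalized video per recipient, can be sketched roughly as follows. Note that the endpoint URL, payload fields, and function names below are illustrative assumptions for a generic text-to-video API, not Synthesia's actual interface:

```python
# Hypothetical sketch: one personalized video request per CRM record.
# The API shape here is invented for illustration only.
from string import Template

API_URL = "https://api.example-video-platform.com/v1/videos"  # placeholder endpoint

def build_video_requests(script_template, crm_records,
                         avatar="stock_presenter_1", language="en"):
    """Fill the script template with each record's data points and
    return one request payload per customer."""
    template = Template(script_template)
    payloads = []
    for record in crm_records:
        payloads.append({
            "avatar": avatar,          # which AI presenter to use
            "language": language,      # target narration language
            "script": template.substitute(record),  # personalized text
        })
    return payloads

if __name__ == "__main__":
    crm = [
        {"name": "Ada", "balance": "1,200"},
        {"name": "Grace", "balance": "340"},
    ]
    for payload in build_video_requests(
        "Hi $name, your account balance this month is $$$balance.", crm
    ):
        print(payload["script"])
        # In production, each payload would be POSTed to the platform, e.g.:
        # requests.post(API_URL, json=payload,
        #               headers={"Authorization": f"Bearer {token}"})
```

The point of the sketch is the one-to-many fan-out: a single script template plus a row of customer data yields a distinct video request per recipient, the same pattern as personalized email.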

Florian: What is your interaction so far with the dubbing studios, media localization companies? Are they knocking at your door, are they trying to actively ignore you, do they feel threatened? What is the interaction there?

Victor: It feels like we have been through 10 lifetimes already. When we set out to create this technology, initially we knew there were going to be several stages of it. There would be a very manual stage where it would take a lot of time but then eventually we could hopefully get it to the scale that we are at today. The first iteration of the technology was the one we showcased to the world with our David Beckham video, where he speaks nine different languages, and this technology was all about video editing. What does that mean? That means you take an existing piece of video content from a film or an advertisement and you translate it. What we are doing today is video generation, which means you are generating video entirely from scratch. You are not working with something someone else has shot at some point in time. This video editing is the first thing we actually took to market and had quite a bit of success with. Big advertisements, for example, for continental Europe, where you just shoot one in English and then use AI to translate it into the 15 other languages.

When you are working with existing video, a film or ad or something that was originally shot with a camera, the AI systems that we have today still have a hard time coping with all of it. It is difficult to scale. It was a quite manual process and if you are starting to do things like Hollywood films, then most of the time, frontal shots of someone speaking to the camera work best for the tech. You can probably handle a bit of head movement, but if someone is running through a forest with blood on their face and there are trees flying across their face, it is very difficult for this technology to work. Also, most cuts and edits in Hollywood films are very short. It is an average of five seconds per cut in a film, so getting the training data, and you need to train a model more or less for each scene, is quite difficult.

We spoke to a lot of dubbing companies and everyone is very interested in this and it is definitely going to happen because other companies are chasing this now, but we actually decided to not pursue this particular market outside of a few PR projects that we do because that ends up becoming a cool visual effects tool, not a company that can fundamentally change how we create content. The reason for this is if you take the Hollywood scene, for example, here you are competing with video and visual effects. Artists care a lot about their films. It is very high stakes, everything needs to be pixel perfect. It is very difficult to sell technology and then to work with this type of client when you are trying to build a rapidly scaling technology company. This is going to be adopted, but I do not think Hollywood is going to be adopting this technology the fastest. 

When I say Hollywood films on laptops, I do not think it is going to be Hollywood films as we know them today. It is going to be some weird thing that we do not really know yet and that is going to be built by 20-year-olds who are sitting at home with their friends and spending months creating something cool and it is not going to look like a Hollywood film. It is going to be something completely different. The way that I explain this is, once synthesizers were invented you did not need a drummer or a piano player or something like that. You just play it on the keyboard. Obviously, the quality was not as good back then. Now it is very difficult to hear the difference between a real and a synthetic instrument, but that did not just mean that everyone started creating rock music with synthesizers. It gave us new genres. Techno, for example. Electronic music is something that exists because we created the synthesizer and I think it is going to be the same thing here. It is going to be something different. It is not going to be a replacement for it and it is not going to be the industry veterans who pick this up the fastest. 

In terms of localization, translation industries, we see them very much as partners in this. We already work with a few of the big translation companies where they offer our service to their clients because translation is an integral piece of making multilingual content and it is not something we have the technology for and I do not think we will for a very long time. If you want to translate your video into 30 different languages, which is something we do a lot, first of all, nobody is going to trust an automatically translated piece of content, not a big company. Secondly, even if it is something that is okay, it is never going to hit the tone of voice that the brand wants. There is a big art to translating or dubbing things so we see them as part of the supply chain in delivering awesome video content at scale. Since we can work in text, it just makes everything easier. You just send them a word file or upload it to a platform. They translate it and send it back to us. We fundamentally are doing different things. We are focused all on content creation and translation is a different beast in some way. 

Esther: How is it that you approach or get connected to these translation partners? What gave you the idea of essentially partnering to provide solutions? 

Victor: Most of them have come to us after they have seen some of our work. It is quite interesting for translation companies and language companies because what we are doing is actually massively expanding the market for translation in a big way. It is interesting for them to be able to offer this type of service as well and to work with our clients. There are lots of synergies, and they have a different skill set than we have, so it has been pretty organic and a natural match. Also, a lot of the clients we work with already have a translation agency, so we just become part of the value chain of someone they are already working with.

Esther: What has been your experience of hiring talent in machine learning? It is a super competitive space so what roles are you currently hiring for? 

Victor: We are always looking for amazing people and we are growing really fast. It is across the whole organization: machine learning is part of it, but also web engineers, customer success, sales. We are expanding everywhere, so if anyone finds what we do interesting, definitely feel free to ping us. It is very true that the market for machine learning and deep learning talent is incredibly competitive. There are a lot of factors here. Obviously, the technology is working, and it is developing a lot faster than the education system can produce candidates who know how to use this stuff. That is one part of it, and the other part is that you have a lot of big tech companies, Microsoft, Google, Facebook. They obviously care a lot about machine learning, but they would rather have a really good machine learning researcher sit and play Tetris all day on their bankroll than have them work for a competitor.

There are some pretty crazy salaries being offered to the top-tier talent, and basically, they can pick and choose. If you are someone who has done a PhD and is interested in this field, you do not want to go into a role where you just have to sit and produce software. You want to keep exploring the frontiers of what is possible. I think that we are a great fit because what we do is literally the bleeding edge. We are actually doing blue-sky research, solving problems that no one else on earth has solved before, and that makes it a lot easier for us to attract this type of talent. The academic world is different from the business world. There are different ways that you measure progress and how happy you are at work, and you need to build an organization that can encompass all these things. Otherwise, you are not going to get the top tier of talent.

Esther: How did you go about connecting with the investors? FirstMark Capital was the lead in the latest one. Why did you choose to work with these investors in particular?

Victor: We actually raised our first round in 2017 from Mark Cuban, a million dollars. Raising money is probably not the thing that most founders enjoy the most, but it is obviously a very important part of building a business. You need the right partners, and especially when building technology like ours, the R&D cycles are very long. They are very expensive. It is difficult to just say, we will hack something together and see if we can sell it. It is a pretty traditional founder story from when we raised our first round back in 2017 with Mark Cuban. We were probably close to giving up. We had spoken to every single venture investor that we could find in London and in the US. It was not working. No one understood what we were trying to do. People just could not see how this could become a billion-dollar company, which is what investors are looking for. Steffen and I sold all of our Bitcoins. We kept paying the PhD. We were still trying to hustle, and then eventually, Steffen found Mark Cuban's email. Sony had been hacked and all the emails were leaked, and he found it, sent an email to Mark Cuban, and Mark Cuban replied within five minutes. I did not believe him, of course. We had just published the website, and Mark Cuban was the first one to sign up with his Gmail account. We emailed with him for 14 hours and basically got a million dollars after that.

Then we did our second round in 2019, and most recently this round with FirstMark. I would say that the most important thing is to get the kind of investors that are right for your startup, and for us, that is someone who likes big visionary things and bold bets. Especially a couple of years ago, this could either become really big or become nothing. If you do something that is a little bit more, dare I say, vanilla, like an accounting system, it is easier to understand, and you can compare it to a lot of other companies that you know. It does not require a lot of creativity to see how it can work. It is a lot easier for most investors to say, I know what the total market size is for accounting programs. I know roughly what the other accounting programs are doing right and how this could potentially become something even better. With something like this, you are basically saying the camera is going to be replaced by code in 10 or 15 years. I do not have much to show for it yet, but you should believe me because I have a big vision and some initial technology I can show you.

The difference between raising the first round and the one we just raised is that now we have proved that we are a very high-growth and profitable business, and that obviously makes it easier. Matt Turck from FirstMark, who we worked with in this round, was always super intrigued, but his firm invests in later-stage companies. When it came to this round, we got talking again. It is a lot about, can you work together for five or 10 years? Do you like the person? Do you believe in the same things? That is probably the most important part when you are picking your investors: that you are aligned. If you get an investor who wants you to sell after two years or wants you to build something different, that could seem fine in the initial process, but it is going to play out negatively over the years because you are going to have a different vision of what should happen.

Florian: If you talk to 100 VCs that are in the SaaS, ML, AI space, do 90% of them understand the components you are putting together here? How educated are these VCs in 2021?

Victor: It is better than it was in 2019, but there is still a long way to go. There are fundamentally two ways of building a startup. One is that you find a problem that is out there in the world, you build a solution for it, and you go from there. That is how most SaaS companies are built to some degree. You find a problem, you build a solution, and you can usually prototype something that at least demonstrates that it works quite fast. Then there is the other way of starting a company, which is saying, I know something that no one else does, I have this vision of how I think the world should change, and we have got to do that. This is a harder sell because there is a lot less data. There is a lot more betting on the person and the team behind it, and what we were doing was a very abstract idea, probably still is to some extent, and it is by nature a niche area.

It is very difficult to understand how hard what we are building is. We have been doing this for a long time. We look at some of our competitors, who have been doing this for two years, and their quality is still half of what ours is. Do not fall for the media headlines that all this tech is everywhere and you can just dabble in it. Yes, you can, but there is a lot more that goes into it, and this understanding is just not intuitive. That is not so strange: synthetic media is still a very small market, with a total market cap of probably less than $300 million. There is an element of what we do that is more obvious now, but back when I started Synthesia, a lot of the people with the skill set that is on our team were building self-driving car companies, for example. Self-driving car companies are a good idea. If you can make a self-driving car that actually works, there is clearly a massive market. You do not have to convince anyone of that. You have to convince them you are the right person to build that idea.

There is no one who is going to challenge you that much on that. What we had here was a good idea that looked like a bad idea. The pushback on our pitch back then was, how is this ever going to scale? Right now it takes one PhD two days to create one video. How are people going to feel about watching synthetic videos? Is this ever going to work? Is there going to be regulation? I do not understand or believe this. I think a lot of that is also because VCs are not content producers. If you have not actually spent a lot of time making video content, you do not know how complicated and time-intensive it is, even to make very simple corporate content of someone speaking to the camera. Now it is a lot more obvious and a lot more interesting. It is very clear that there is a lot of traction and growth in the market.

Florian: Where is deepfake going? What are some of the guidelines or not, or lack thereof? 

Victor: Deepfakes, that word has taken on so many meanings now. I think to most people, when you say the word deepfake, you are thinking of something negative. You are thinking of the negative uses of AI-generated content. The media narrative has basically been: there is this new technology, it is going to be very bad, you are going to have fake news everywhere, you can make speeches of politicians saying things they did not say. Those risks are real, and this technology is definitely going to be used by criminals or people with bad intent, just like they use smartphones and cars and telephones and every other technology out there. I think what has been missed is that it has been equated with this one negative use case, when the reality is that 99.99% of the use of this technology is for very vanilla business use cases that are not particularly salacious but create a lot of value.

It is important that we do everything we can to reduce harmful use. There are two questions to unpack here. One is, how do we ensure that our technology at Synthesia is not used for bad? That one is relatively easy. We have an on-rails experience. We do not provide access to the underlying technology, we verify everyone that gets on the platform, and we have a strict rule to never synthesize content for anyone who has not given consent. That sounds obvious, but there are a lot of companies who will do an Obama or a Trump, not to spread misinformation, but as a PSA. The second, more important question is, in the information ecosystem, how do we make sure that harmful use of these technologies is reduced in general, outside of what we do? There are a few things to do there. Technology-wise, people are creating deepfake detectors, for example, using AI to detect if a video has been manipulated or not. I do not believe that these are going to work on their own, but they might be part of a broader solution. The problem with these is that even in five years' time, most content you consume online is going to be synthetic in some way or form.

Even today, TikTok and Instagram filters are AI-generated, so where do we set the boundary? If I could give you a program today that could detect if an image was Photoshopped, that would be great, but 99.9% of the images on the internet are Photoshopped. The more interesting part is media provenance, which is creating a chain of accountability. Where does this piece of video or image or sound come from? This does not just address deepfakes; it is a general tool to ensure we know where our content comes from, so we work with Adobe on this through the Content Authenticity Initiative. The way I usually explain it is, you want to have a Shazam for video. Shazam is an app on your phone: you can listen to a song with it and it will tell you what song it is. We are going to have a similar system for video that can say, this video you are watching now was originally uploaded by the BBC six months ago, and what you are watching now is 97% the same, but something is a little bit off here. Click here to watch the original video. This is what YouTube already does with music. If I make a video of myself dancing to Michael Jackson in my living room, YouTube will detect that and know that there are copyright holders who will want some money for it to play on their site, so they will put an ad on it, give some money to the rights holders, take some money themselves, and note in the description that this video contains a soundtrack by Michael Jackson owned by these rights holders.
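The "Shazam for video" matching Victor describes is often built on perceptual fingerprinting: reduce each frame to a compact hash, then score how many bits two fingerprints share. The sketch below is purely illustrative, it is not Synthesia's, YouTube's, or Shazam's actual method, and the tiny 2x2 "frames" and the average-hash scheme are assumptions for demonstration; real provenance systems such as the Content Authenticity Initiative additionally rely on signed metadata rather than fingerprints alone.

```python
# Illustrative sketch of perceptual-hash matching (NOT a real provenance system).

def average_hash(frame):
    """Fingerprint a grayscale frame (2D list of 0-255 ints) as a bit string:
    each pixel maps to '1' if it is brighter than the frame's mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return ''.join('1' if p > mean else '0' for p in pixels)

def similarity(hash_a, hash_b):
    """Fraction of matching bits between two equal-length fingerprints."""
    matches = sum(a == b for a, b in zip(hash_a, hash_b))
    return matches / len(hash_a)

original = [[10, 200], [220, 30]]        # hypothetical "BBC original" frame
slightly_edited = [[12, 198], [225, 30]]  # minor re-encode, same structure
different = [[200, 10], [30, 220]]        # unrelated content

h0 = average_hash(original)
print(similarity(h0, average_hash(slightly_edited)))  # high: likely a copy
print(similarity(h0, average_hash(different)))        # low: different content
```

Because small edits barely change which pixels sit above the mean, a re-encoded or lightly edited copy scores near 1.0 while unrelated footage scores much lower, which is how a system could flag "97% the same as the BBC original" without storing the video itself.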

The last part is education. We have been able to forge and do bad things with images and text for years. This is now also going to be possible for audio and video to some extent, and we need to make sure that people know that. One part is educating people the traditional way. I spend a lot of my time doing that, but the most important one is exposure. The more synthetic content you consume, the more you know it is not reality, and I think it will feel very natural. We just did a campaign with Lionel Messi, for example, where you can create a personalized message for yourself and a friend. Four million people have now used it. They obviously know that it is not a real video of Lionel Messi. That is just going to give everyone a sense that this is now possible, so do not believe everything you see in a video.

Esther: Tell us a bit more about what we can expect from Synthesia in the next two to three years. What is your product vision in particular?

Victor: The next couple of years is all about making all of our AI and machine learning algorithms better, so that with the avatars you can add emotion, you can add gestures, you can add several people on the screen. That is probably the product roadmap you would expect. Another part of it is simply our web platform, which is essentially a video editor that lives in the cloud, and you can do a lot of things with that. We are building this for synthetic media, which means that things like personalization of videos are ingrained into the platform, and what we want to have in a couple of years is a simpler version of Adobe Premiere that sits in the cloud. It is made for everyone to use, and it is made for creating personalized content and videos that are generated on the fly. That is where we are going with the product, and for the next couple of years it is definitely going to be business communications that we focus on most: video content and video chatbots are going to be a massive thing. Cool stuff is coming out soon, and then in a couple of years' time, we will maybe start looking more at the entertainment side of things and creating more storytelling rather than informative content.

Florian: How do you manage all those competing ideas? How do you keep the company focused without getting dragged into all kinds of directions? 

Victor: In general, you need to have a long-term plan of where you want to go, because even if it is not something that you think about every day, it determines all the micro-decisions that you make in all your teams. Having a vision is extremely important, but that said, when we are thinking about what features people want in our creation app, for example, that is of course driven by demand. I would say there are two types of these product decisions. One is what people clearly want. What are people asking for? There are a lot of features that people are asking for because we are still an early platform. A lot of them are not very exciting. People want a way to download the videos or publish them on YouTube. Not super exciting, but it makes your workflow a lot easier.

Then there is the second part, and that comes back a little bit to all the things that people cannot imagine because they do not know what the tech can do, and so they cannot imagine what it could potentially do. That is very natural. I sit and think about this stuff 24/7, and I have been doing so for five years, so our mental model at Synthesia is very different. It is balancing those two things, but I think that the high-level plan of what we are going to be doing in three, five or 10 years feels relatively set. It actually has been since the beginning of the company. We knew that there were a couple of hurdles and milestones we needed to get to, and we are actually following that plan.