Monsters, Aliens, and Lip-Synced Dubbing with MARZ CEO Jonathan Bronfman

SlatorPod #182 - Lip-Synced Dubbing with MARZ CEO Jonathan Bronfman

“The face is the uncanny valley. The nuance and detail in the face and the way it moves is a very difficult thing to animate,” Jonathan Bronfman, CEO of MARZ, explains on SlatorPod.

Canada-based MARZ, which stands for Monsters Aliens Robots Zombies, was founded in 2018 to leverage AI and bring quantifiable metrics like time and cost to the business of visual effects (VFX).

Bronfman recounts how their research for Vanity AI, a solution that creates effects such as de-aging, cosmetic, wig, and prosthetic fixes, led to them becoming an “expert at the face” and launching LipDub AI, a solution to the “mismatch between spoken word and visuals for viewers”.

According to Bronfman, users can take an audio stem (the dialogue track), the original piece of footage, and push both through LipDub and have a lip-synced visual in 30 seconds. Bronfman also explains that their USP is that LipDub is Hollywood-grade and works on 4K and ProRes files.

Subscribe on YouTube, Apple Podcasts, Spotify, Google Podcasts, and elsewhere.

The two also discuss how actors feel about the emergence of lip-syncing visual effects and why Bronfman does not think an expansion into the creator market is on the cards for MARZ.

Florian shares that, as a life-long consumer of dubbed content, he sees massive potential in a solution that improves the viewer experience and makes it more natural.

Transcript

Florian: Jonathan is the CEO of Monsters Aliens Robots Zombies, also known as MARZ. And Jonathan, let me just try to introduce MARZ a little bit for our audience: AI-enabled visual effects company. Not something we have on the podcast every day, and since launching in 2018, you’ve worked on 100-plus high-profile TV projects like Umbrella Academy on Netflix, Moon Knight and WandaVision from Marvel, or Invasion that people probably know from Apple TV+. Then MARZ is also the creator of Vanity AI and of course now LipDub AI, which we’ll talk about today, so thanks Jonathan for joining today. Tell us more about the origin of MARZ in 2018 and kind of maybe position it more broadly for our audience, kind of in the visual effects business.

Jonathan: Like you said, MARZ was founded August 2018 as a visual effects company. For those who aren’t familiar with visual effects, visual effects is effectively manipulating pixels on an image to create something that you couldn’t capture in camera. So you need an explosion, you need extra blood, you need a flying hippopotamus in outer space, visual effects can execute that for you. But when we started in the visual effects market, the one thing that was quite apparent to us was it’s quite homogeneous across vendors, so to say, most visual effects companies in the world are really the same. They go into Hollywood or whatever part of the world. They say, we do film, we do television, we do commercials, here’s the reel, here’s the images that we can show you, we’re nice, you should come use us. We felt like differentiation was required in the space, and so the first thing we did was we said we want to go after premium episodic for a number of reasons, but namely, we want to compete on quantifiable metrics. Quality is table stakes in visual effects, and it’s nebulous. You don’t want to compete on quality, it’s too subjective. But time and cost are quantifiable metrics to compete on, and time and cost in television in particular are quite constrained. Their budgets often aren’t as large as their film counterparts, and the time that you have to execute often isn’t as long as the film counterparts, so that was the first thing. So our North Star is time and cost, and when we were looking at that in the context of differentiation, we wanted to make sure that we weren’t just putting lipstick on our company or a soft wrapper of differentiation around our endeavors. We want to be able to say, when you actually look under the hood of this company, we are truly different from the other companies that exist on the planet, and that’s when we said AI. Now, when we said AI, we were like, well, we don’t know what that means or what the use case is, but AI seemed to be a form of technology that we could excel at in the context of speed and cost. And then it’s finding those use cases, and the way we look at that is we want to look at where deep learning is getting good and broad use cases for visual effects in Hollywood, and when you can find those intersections, you can attack a product. So the first product that we went after was a product called Vanity, and Vanity is effectively digital makeup. It’s a 2D cosmetic solution, so it’s quite prevalent in Hollywood. So you want your forehead wrinkles, your crow’s feet, your laugh lines, those types of features on your face, softened, de-aged, made to look better in 4K, 6K, 8K. We can do that, and it exists in Hollywood and it’s quite broad. But typically, for someone to beautify their face, it would take a visual effects artist anywhere from half a day to two days to execute that shot. Whereas in our software, we could do it in 20 minutes, so by doing it so quickly and by using less labor, we’re able to offer it at a price point that is much more affordable and also a timeline that’s substantially faster. And let me back up for just a moment to put this in layman’s terms. Every second of a movie is 24 pictures strung together, so we call them frames. So the traditional art of visual effects is painting over 24 images to get 1 second of content.
Now, the power of Vanity is you paint over a select few images in our software, and the software interface is much simpler than the tech stack that exists in visual effects today. You paint over one, two images in our software, which, again, takes 20 minutes, where painting in traditional software takes days, and the technology extrapolates the image you’ve created across the rest of the images in the shot to create seconds of content from 20 minutes of work. Moving forward as it relates to this conversation, the next product we went after is Vanity, or rather LipDub. As we were pursuing Vanity, we started getting good at the face and the nuance of the face. And LipDub became a product and a use case for us that seemed applicable in that machine learning was getting good and the use case was quite broad, and we announced it about 120 days ago. We’ve gone to market, the market’s responded in a very positive way, and that’s a long-winded introduction into what MARZ is.
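
To make the frame arithmetic concrete, here is a minimal sketch of the workload difference Bronfman describes; the numbers and function names are illustrative assumptions, not MARZ’s actual figures or tooling.

```python
FPS = 24  # one second of film is 24 frames

def manual_hours(shot_seconds: float, hours_per_frame: float = 0.5) -> float:
    """Traditional VFX: an artist paints over every frame in the shot."""
    return shot_seconds * FPS * hours_per_frame

def keyframe_minutes(num_keyframes: int = 2, minutes_per_keyframe: float = 10.0) -> float:
    """Keyframe workflow: paint one or two frames and let the model
    extrapolate the edit across the remaining frames of the shot."""
    return num_keyframes * minutes_per_keyframe

# A 3-second beauty shot is 72 frames: roughly 36 hours of painting
# frame by frame, versus about 20 minutes of keyframe work.
print(manual_hours(3.0))    # 36.0 hours
print(keyframe_minutes())   # 20.0 minutes
```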

Florian: Great, so you guys raised some funding back in 2021, but since then, at least, our research didn’t turn up anything, so it seems like you guys are getting a lot of traction without additional funding. Tell us a bit more about the company, current size, I don’t know, locations, team, et cetera. Who’s on the team?

Jonathan: The company is different today than it was before the writers and actors went on strike, so we have to add that caveat at this moment in time. This is not the normalized ecosystem. There are two components of our business: the traditional visual effects business and the machine learning research and development. When we raised in 2021, it was a raise for the entire business. But the reason we haven’t had to raise capital since then is because we’re capital efficient, in that our traditional business helps to finance the research and development, so we like to picture it as a chart: today the traditional business is supporting the research and development side. In the future, the hope is that the research and development side is actually supporting the visual effects side. But no, we haven’t raised, but obviously the squeeze that’s being put on the entire industry right now is creating strain across the board. At our biggest, we were about 300 people, and we’ve reduced substantially on the traditional side because the reality is there’s not a lot of work out there today. But whenever the actors, writers, and studios come to an agreement, I hope it’s tomorrow, I don’t think it’s going to be tomorrow, we look forward to firing the engine back up and continuing our growth journey.

Florian: Let’s go back to LipDub AI. I want to talk a bit about the writers strike, maybe in a second, so you said that you got really good at the face, and that kind of led to LipDub AI. But was it obvious that dubbing, or kind of the visual component of dubbing, was something you were going to go after? Or did it just become apparent after you realized, well, we’re good at the face, and so this is kind of the logical next step?

Jonathan: It seemed like the right piece of technology to go after. It’s hard to say exactly why we went after it, but I’ll take it a different way, in that to do lip manipulation traditionally in visual effects is impossible. It’s impossible for a few reasons. The first is the face is the uncanny valley, so the nuance and detail in the face and the way it moves is a very difficult thing to animate. So if you saw someone trying to animate lips into a language, I don’t know that it would look photorealistic, but to exacerbate that consideration, financially, it would be insane. You’d be spending so much money to get to that solution. So what LipDub is able to do is take an audio stem, the dialogue track, and the original piece of footage, and you push those two pieces through our software, and in 30 seconds you have your result. And then one of the things that separates us from other sort of consumer-type research teams is we’re going for Hollywood, so we’re working on 4K footage on ProRes files and need to pass quality standards that are superior to any other visual quality standard on planet Earth.
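
As a rough illustration of the two-input workflow described above, here is a sketch of what such an interface could look like; all names and the file handling are hypothetical and not MARZ’s actual API.

```python
from dataclasses import dataclass

@dataclass
class DubJob:
    footage: str        # original plate, e.g. a 4K ProRes file
    dialogue_stem: str  # isolated dialogue track in the target language

def lip_sync(job: DubJob) -> str:
    """Hypothetical fully automated pass: track the face across every frame,
    predict mouth shapes from the dialogue stem, then render and composite
    the new lower-face region back into the plate."""
    return job.footage.replace(".mov", ".dubbed.mov")

print(lip_sync(DubJob("episode_101.mov", "episode_101_es.wav")))
# -> episode_101.dubbed.mov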

Florian: Yeah, I was going to ask you about that. You’re saying you’re meeting Hollywood standards, but you just answered that question with 4K, like super high res, et cetera, so that’s what would differentiate you from kind of these consumer avatar generation things?

Jonathan: Yeah, that’s right. What works for an influencer on YouTube is not what works for Marvel and it’s not what works for Netflix, and there are other people going after different parts of this market. But we believe we have a serious research advantage in the high quality market because our solution is fully automated. The parameters we set out were that we don’t want to create any friction in the traditional production process. You guys keep doing everything that you’re doing the same way that you’re doing it, and because we’re familiar with the industry and we have years and years of experience here, we understand what assets we can pull out from productions. It would have been a lot easier to concede and say, well, if we have the production record the voice performer and track their lips, would that make our solution a lot easier to operate? Yeah, and we probably would have brought it to market a lot sooner, but we didn’t feel like that was a viable request or something that we could push onto the production. They need to do things the way that they’ve been doing it for decades, and we need to be able to conform to their workflow, and so we’ve done that. We’ve conformed to their workflow. The audio stems exist, the original footage exists, obviously, and there’s no human-in-the-loop on our side. It is 100% automated and that’s the power, because any labor or human-in-the-loop that you’re going to introduce to a solution like this is going to result in a price point that’s unaffordable for any user. There needs to be willingness to pay on the consumer’s behalf.

Florian: Your key clients, I mean, they are the studios, or is there somebody else in the distribution chain that you would work with?

Jonathan: Yeah, I don’t know if I want to push too far into that because it’s a meaningful company strategy. What I’ll say is we’re creating this solution for Hollywood, so Hollywood is our North Star. We don’t want to just be creating footage on a go forward basis for Hollywood, but we also want to reach back into their catalogs, into history and see if we can remaster catalogs for studios.

Florian: Maybe I’ll try another tough question. Any kind of frequent points of pushback that you’re getting from studios?

Jonathan: It’s been received really well. Certainly there are some edge cases that we still need to solve. Without going into the details, there’s a million different edge cases: if the face has already been manipulated, if there’s smoke or mirrors or an extreme pose or an extreme camera move. We’ve solved pretty much all of them, but there are still a few things that we’re dialing in, and we look forward to doing that in the next few months. But broadly speaking, it’s ready to rock, and the studios and other potential clients have been excited. We’ve been testing, we’ve been working, and it’s been happening really quickly.

Florian: Now, how do you see studios maybe ranking content for LipDub’s suitability, as it were, right? I mean, is there a certain part of the market that you’re seeing that would be very easy to adopt a technology like that right now? And maybe studios are a bit more reluctant to use it for something like Oppenheimer or something like that, like a major show or major feature. Can you just walk us through the different types of content where you see the adoption curve going through?

Jonathan: The intention is to have a product that can do Oppenheimer or Barbie or Barbenheimer or whatever they’re calling it these days. I’d say right now there would be shots in Oppenheimer that aren’t viable in the product, but I think in six months to a year, we’ll have resolved all of those edge cases. I don’t know if I’m going to answer it exactly how you’ve asked it, but I think one of the interesting pieces here is how much is a studio willing to pay to remaster or visually dub a piece of content? And that’s something that we’ve been thinking about for a long, long time. You can’t just offer a product to Hollywood where it only makes sense for The Avengers to dub into Mandarin, where spending hundreds of thousands of dollars is justified because the lift that you’d get in China would be massive. You have to consider the entire catalog and all the films, so the product needs to be offered at a price point that is broadly appealing to all segments and all types of productions.

Florian: Have you looked into any of the kind of linguistic challenges, or has that been a topic at all? I don’t even know from which angle I would want to ask this question, but just kind of the language aspect, the linguistic aspect in all of this.

Jonathan: The technology is language agnostic: in every language on planet Earth, there’s only so many sounds, t, ch, k, p, and there’s only so many different shapes your lips move through when you’re speaking. So it creates a box that is, again, language agnostic. Now having said that, this is something we’ve come across and has been solved, but if I were to dub Avengers into Mandarin and look at our results, I’d be like, oh yeah, those results look great. But if you ask someone who actually speaks Mandarin if the results look great, they would say you have to be language-specific. Further to that, people who speak Mandarin tend to be less expressive with their lip movements, so their lips seem more muted, whereas people who speak Spanish seem to be more expressive. There’s more tongue, there’s more mouth movement, and so you need to sprinkle in a little bit of language-specific data so that the output of the software resonates and feels native to that speaker, if that makes sense. If we dub something out of Mandarin into English, it would look great. You’d think it looks great. We speak English. I only speak English, so I know the English results feel photoreal and natural. And when I looked at the Mandarin results, I thought, that feels photoreal and natural, but we have people who speak Mandarin at our company and on our research team, and they’re like, we need to dial that in specifically so it’s native to the tongue.
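
The shared-sounds idea maps naturally onto visemes, the small set of mouth shapes that phonemes collapse into. The sketch below shows how a phoneme-to-viseme table plus a per-language expressiveness factor could capture both points; the mapping and the factors are invented for illustration, not MARZ data.

```python
# A handful of phonemes collapse into a shared, language-agnostic set of
# mouth shapes (visemes). All values below are illustrative assumptions.
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_to_teeth", "v": "lip_to_teeth",
    "t": "tongue_to_teeth", "d": "tongue_to_teeth",
    "ch": "rounded_open", "k": "back_open",
    "a": "wide_open", "o": "rounded", "u": "rounded_tight",
}

# Hypothetical per-language articulation scaling, echoing the observation
# that Mandarin speakers' lips read as more muted than Spanish speakers'.
EXPRESSIVENESS = {"mandarin": 0.7, "english": 1.0, "spanish": 1.2}

def mouth_openness(phoneme: str, language: str, base: float = 1.0) -> float:
    """Scale how pronounced a mouth shape is for a given language."""
    viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
    if viseme == "lips_closed":
        return 0.0
    return base * EXPRESSIVENESS.get(language, 1.0)

print(mouth_openness("a", "mandarin"))  # 0.7 -> more muted articulation
print(mouth_openness("a", "spanish"))   # 1.2 -> more expressive articulation
```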

Florian: The showcase you have is Squid Game, right, on YouTube, and I mean, that was probably Korean into English and that looks fantastic when you switch.

Jonathan: We had a little bit of fun with that. There’s another one, I’m not sure if it’s in the public domain, but we went Korean to English. But then we also rewound one more time and then we went into like Tagalog and Portuguese and Cantonese. We were having fun with it, right? It’s like showing the power and we have fun at the office, the research guys. I don’t know if you saw the CNN piece, but it was fun. We put Liam Neeson’s voice into Donie O’Sullivan’s.

Florian: He was blown away.

Jonathan: Yeah, it’s just fun. Just having fun with it and using it for good. I will actually say, to digress for a moment, people ask us about the ethical use of AI, and we are here for good. Honestly, there was one example, and I’ll keep them nameless, where they wanted us to put words into a real person’s mouth and misrepresent what that person was actually saying, and we said no. We have platform guidelines and we plan on using it for good. We’re not going to allow people to misrepresent other people or use it for nefarious reasons. We want it to be fun. We want this to be something that the world can enjoy and localize and make foreign content more engaging in international markets.

Florian: What do the actors think about it? I mean, especially kind of the top actors, if you’ve gotten any feedback, like are they generally open to this because it kind of broadens the reach and makes it more natural? Or some of them, no, this is me, I speak English, don’t change my mouth.

Jonathan: I hope the agencies aren’t watching this one, but the actors just want to make more money, so that’s what the conversation looks like with the actors. I say that a little tongue in cheek, but also a little serious. The way I look at it is there’s a big difference between creation and augmentation. Creation would be like a full digital avatar, right? Like, I’m creating you, Florian, and your whole face is redone, and I could do whatever I want with that face. That’s not what our software does. Our software augments the face, specifically the area around the mouth, so my question to you, somewhat rhetorical, is: is that different from what we did for Vision in WandaVision, where we took Paul Bettany’s face and augmented it with the digital assets that were required to make him look like Vision? This is my big pitch, and I hope it sells to your viewers, to the agencies, and to the actors: this is a win, win, win, win for everyone. I’ll just cherry-pick the Avengers, I love it, so if we’re able to take Robert Downey Jr. and turn his lips to Mandarin and put that in the Chinese market, the thesis is that engagement will increase in that market, therefore revenues will increase for the studios in that market, therefore the residual compensation that’s owed to actors will increase. So that’s why I think it’s good for everyone, so long as everyone’s aware that this is happening. And obviously, we don’t want to misrepresent, we don’t want to slander, we don’t want to use people’s likeness that we don’t have contractual rights to. We want to be clear and clean about it, but I think it’s a winning proposition for everyone. There’s other solutions out there, and you can see it’s like a viper’s pit in the context of the negotiation between the WGA, SAG, and the studios. I don’t think we’re in there. I’m sure we’re part of it, but there’s other pieces of technology in play: around the writers, ChatGPT and draft scripts, and around the actors, deepfakes and full face replacements. We’re just augmenting. We’re trying to create value for projects and increase engagement, and it’s a localization play so that everyone can do better.

Florian: Yeah, and I mean, as somebody like me who’s grown up on dubbed content, it would be a massive improvement if the lips actually were in sync.

Jonathan: Here’s a really cool one. This is super interesting, I found. So Netflix obviously have their backend analytics, and take this with a grain of salt, so just fact-check me, but this is as we understand it. They did a survey and they asked, do people prefer subtitles or the dubbed audio? And the survey was about, I don’t know, call it even, call it 50/50. But when you look at Netflix’s backend analytics, it’s like 90% audio dub. So people like to say, oh yeah, I like reading the subtitles. It’s like, no, everyone wants to hear it, and everyone wants to see the lips match the audio dub.

Florian: For me, it got less natural the better my English got, right? As a child, I didn’t speak English at all, so I wouldn’t really notice. But then as your English level increases, you notice he just said I’m sorry, he didn’t say whatever the German equivalent is, right? So it kind of shines through, so yeah, massive improvements. You said only the lips, but what about other parts of the face? Is there a certain area where you stop the, I don’t know what you call it, the manipulation?

Jonathan: The idea, going back to what we spoke about at the beginning, is that the face is the uncanny valley and performances need to be preserved. Those performances are directed, the actors are professionals, they take pride in their performance. Why would we change a part of the face that’s already performed naturally and in its best state? So yes, we are taking the bottom half of the face, and the idea is, let’s say it’s a scene and the person’s angry. Well, the dubbed actor is going to sound angry, and so the lips, when they’re speaking, will also look angry and will match what the top part of the face looks like, which is angry. Said differently, if we had that angry footage and you had someone laughing on the dub track, it wouldn’t look that good. It would be a mismatch in the performance. But to ramble for one more second, and you’d know this, Florian, and I think this relates to your audience: dubbing is an art form, revising scripts so that the sentiment is understood in that specific language. But also, what they do now, and Squid Game actually did a really good job dubbing, all things considered, is the audio performer will try to speak at the same time as the on-screen performer. So there’s little tricks of the trade here that already exist, and again, we’re just conforming to that standard that’s been in place for decades.
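
To illustrate the “bottom half of the face” boundary, here is a minimal sketch of a lower-face mask built from tracked landmarks; the landmark layout and geometry are assumptions for illustration, not how LipDub actually segments the face.

```python
import numpy as np

def lower_face_mask(landmarks: np.ndarray, frame_shape: tuple,
                    nose_idx: int = 0) -> np.ndarray:
    """landmarks: (N, 2) array of (x, y) face points for one frame; nose_idx
    picks the landmark treated as the nose baseline. Returns a boolean (H, W)
    mask that is True only below that line, i.e. the region eligible for
    resynthesis. Eyes, brows, and the directed performance stay untouched."""
    h, w = frame_shape[:2]
    nose_y = landmarks[nose_idx, 1]
    rows = np.arange(h)[:, None] >= nose_y  # (H, 1) boolean column
    return np.broadcast_to(rows, (h, w))

# Example: a 1080p frame with the nose baseline at y=600. Only the
# 480 rows below it (the mouth and jaw region) would be regenerated.
mask = lower_face_mask(np.array([[960.0, 600.0]]), (1080, 1920))
print(mask.sum())  # 921600 pixels, i.e. 480 rows x 1920 columns
```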

Florian: Do you intersect at all with the translation component, translation provider, or not at all?

Jonathan: No. You guys go, you get your voice performer, you want to get a synthetic voice, you want to get an actor to come into a recording studio. We don’t care. You just give us that track and we’ll make it sync.

Florian: Just one other, cartoons. I mean, is this on your radar at all or not at all?

Jonathan: I think someone came to us a couple of months ago and they were like, can you do an M&M? The answer was no, we can’t, and the answer is no, we can’t do cartoons. Maybe in the future, but that’s not on our roadmap right now. It’s real live action, and there are technical reasons as to why that is. There are landmarks on the face that we have to anchor to, and a Smartie, sorry, Smarties, I’m Canadian, an M&M doesn’t have a nose, so it doesn’t work.

Florian: Even in really high production value cartoons and animated shows, you can see the dub through the lip movements. Now, on the audio side, you just said you get the voice file, so you’re not involved in the audio production at all? You just get the file.

Jonathan: No, there are great synthetic voice companies out there. I won’t name any of them, because I don’t need to cherry-pick and they can do that for themselves, but there are really, really good synthetic voice companies out there where they could make me sound like you, or you can type it in and they can make it sound like anything. It’s really, really cool stuff. It’s a very different research challenge than what we’re attacking. Our research team is very uniquely positioned in academia to attack this challenge, which is localization on the face as well as graphics and imaging. Our Chief Scientist, a guy named Danny Cohen-Or, who’s based out of Tel Aviv University, this is exactly what his lab specializes in. If we were to go and try to attack synthetic voices, we’d need a completely different research team. It’s not applicable to what our guys know.

Florian: Now, of course, on the synthetic voice side and language side, there’s been this massive breakthrough, large language models, et cetera. Has your field also had something similar, just like kind of quantum leap style progression over the last 12, 18 months? Or were you not really impacted by this whole AI? Was it more linear, I guess, in your field than in the textual field?

Jonathan: Let me see if I could catch up to that question. The first part is generative AI is very sexy right now. Everyone’s talking about it, everyone’s wanting to explore it. We started this in 2019, so it wasn’t like we’re just a hype company riding the hype wave.

Florian: Did you see a step change between 2019 and in the last 12 to 18 months? Did something happen that totally accelerated your research efforts?

Jonathan: I’m Canadian, so we use the hockey stick analogy: if I showed you our results from a year ago, you’d be like, oh no, that was orders of magnitude worse than what it is today. Something happened in our research, I have a couple of ideas, we brought in some really good people, but you can see that around March, something clicked in a super meaningful way. We’ve been working on it for two years, and there’s so many different aspects, from preprocessing to tracking to integration. It’s such a difficult challenge. Something happened in March where a light just went on, and we went so far down the field in a short amount of time, obviously culminating in the release of the product. But we will continue to research the solution and we will continue to improve it. This is an endeavor that will be pursued for years and years and years to get it to the absolute highest fidelity, highest standard, et cetera.

Florian: Yeah. I mean, you mentioned before that you’re very much focused on that Hollywood-grade content, so I guess there are no plans to go after the creator side of things and YouTube? That’s not really on the roadmap?

Jonathan: We’re looking everywhere. In our conversations with creators, it just seems like the willingness to pay isn’t there. There are creators, MrBeast stands out to me. I don’t know MrBeast, but I’m a fan of his, and he’s been trying this localization thing for a long time. You probably know it better than me, and he’s doing a great job. Obviously, I don’t have to say that, it’s obvious. But it doesn’t seem like there’s a willingness to pay on a per-video basis, so I don’t think that’s going to be for us. I know YouTube’s trying to create something for their creators, so let YouTube do it. You want to watch it on your phone, you’re a creator, and so the quality bar can be lower because you’re just going out on TikTok or YouTube or whatever you’re viewing on your phone. The quality standard is much different, the use case is much different, and the willingness to pay is much different. It’s not what we’re looking at, and part of going after Hollywood, the highest bar in the world, is defensibility, creating a moat, a technical moat, around what we’re trying to do.