Microsoft’s Christian Federmann on the Translation Quality of Large Language Models

SlatorPod #161 - Christian Federmann on Microsoft Translator

In this week’s SlatorPod, we are joined by Christian Federmann, Principal Research Manager at Microsoft, where he works on machine translation (MT) evaluation and language expansion.

Christian recounts his journey from working at the German Research Center for Artificial Intelligence under the guidance of AI pioneer Hans Uszkoreit to joining Microsoft and building out Microsoft Translator.

He shares how Microsoft Translator evolved from using statistical MT to neural MT and why they opted for the Marian framework.

Christian expands on Microsoft’s push into large language models (LLMs) and how his team is now experimenting with NMT and LLM machine translation systems. He then explores how LLMs translate and the role of various prompts in the process.

Christian discusses the key metrics historically and currently used to evaluate machine translation. He also unpacks the findings from a recent research paper he co-authored investigating the applicability of LLMs for automated assessment of translation quality.

Christian describes how Microsoft’s Custom Translator fine-tunes the user’s MT model on customer-specific data, which improves in-domain quality at the cost of some general-domain performance.

He shares Microsoft’s approach to expanding its language support, including the recent addition of 13 African languages. Collaboration with language communities is an integral step in improving the quality of the translation models.

To round off, Christian believes that the hype around LLMs may hit a wall within the next six months, as people realize the limitations of what they can achieve. However, in a year or two, there will be better solutions available, including LLM-enhanced machine translation.


Florian: First, tell us a bit more about your background, kind of, of all things NLP. What led you to focus on machine translation and what have been some of the kind of milestones in your career?

Christian: I ended up studying in Saarbrücken, and Saarbrücken University has a strong computational linguistics and a strong computer science department. So at some point, for me, that combined languages and computer science, two topics I’m very excited about, so that was the logical choice to continue my educational path on. And at some point, you mentioned it, I worked for German Research Center AI under the guidance of Hans Uszkoreit. The man, the myth, the legend. He was my first computational linguistics professor and so he had a typically heavy job, research assistant job when I just started my studies and there were two jobs available. One was for helping him beef up his PowerPoint slides. The main thing to be gained from that was a Microsoft Office license and/or do some programming on computational linguistics. I chose the latter, and that basically ultimately led to me doing my PhD with Hans. Yes and then shortly after finishing that, or before officially finishing that, I joined Microsoft in Vancouver first, and then in Seattle and Redmond, and now for a brief stint, I’m back in Munich. Major milestones, I think when I did my interview trip, there was some information about what we would be working on, but then I joined in earnest and had signed my name under that contract. And then they told me, oh, by the way, we are doing something different, which is speech translation, which then later on got published as Skype Translator. So that was the first big milestone because it was really bizarre to witness how these things evolve within the inner workings of a large machine such as Microsoft. So that was really before I used a Mac, I was at a university, I used my Linux and Unix, and then I joined Microsoft, and suddenly the whole thing was much bigger, much more different, but very exciting. Then internally in our team in Microsoft Research, we underwent this whole migration from statistical MT to neural MT, so that was big. 
In 2018 we published this human parity paper for Chinese-English, and then in the aftermath of that got butchered all across academia. So that was an interesting experience in the sense that from a human eval standpoint, the setup for that human parity paper was done in maybe a little bit of a naive way, but I followed what I also did for WMT at the time. I’m one of the co-organizers of WMT since 2013, so of course I had a feeling this was done well. But then of course everybody was like, no, you can’t really say machines achieve parity. No, this is not good. It’s been fun. Five years later, this whole thing has become an internal joke and we always refer back to that paper. It certainly moved the scene. An interesting thing is that one of the first criticisms we received for the human parity paper was that, oh, you can’t evaluate human translation or machine translation like that because obviously a human translator would have context and use context. The… paper and also the… paper they hammered on us for not using context enough. And now here, 2023, we are still talking about a lot of metrics who completely disregard context. So I think we were onto something then and now everybody as a field decided, okay, context is king and then we sort of collectively fail to really push for context. So that is one of the nice side effects of the large language models as a whole, right, because they have much wider context windows. So I think the time is ripe now to jump, to go away from that segment level to a much larger input context to have better evaluation.

Florian: First tell us a bit more about Microsoft Translator. There’s Google Translate, there’s like DeepL and Microsoft Translator, obviously huge product, huge distribution. So tell us more about language coverage, kind of connectivity to other apps maybe within the Microsoft ecosystem and the framework that it’s built on during research. I wasn’t aware that it’s on Marian, the Marian framework, so just tell us a bit more about that.

Christian: Microsoft Translator is an offspring of Microsoft Research’s translation tool which originates from, I don’t know, 2006/7. Initially a pure research project working on, of course, state-of-the-art statistical machine translation, and now at this point in time, we are bleeding-edge neural MT. You mentioned Marian. Yes, that’s our framework of choice. Marcin, who is the maintainer and main author of Marian, also Roman, who collaborates with him closely. They’re both on our team, so that means we of course use Marian for our models and make this accessible to people within Microsoft. We operated our own translation API as the small research team within MSR from the first models we shipped and these days anything within Microsoft which uses machine translation is basically using our models. So that means if there’s a discrepancy between what your favorite or non-favorite Microsoft product offers in terms of translation and what we offer on or, then this is based on some decision to not ship it just yet. In principle, all the internal first-party customers have access to our models and could use them. Everything is API-based so you can just integrate our models into first and third-party products. We have a quote-unquote generic vanilla general domain translation API. We also have something called Custom Translator which is similar to Google’s AutoML and other toolkits where you can upload your own data to customize the model towards your own content, your own domain. We also have recently worked on going away from the segment level to documents. So there’s a tool called Document Translator which allows you to put in or upload your document in one of the Microsoft 365 formats and then you get a translated version of that document and all the formatting, et cetera, et cetera would be maintained. Which is typically one of the first things you learn when you come from academia: people don’t care about plain text as much as researchers do.
Actual human beings want formatting to be preserved and documents to be functioning. Yes, so everything… A last bit, we also have something called disconnected containers. Obviously there’s all of the online world and then there’s customers who would prefer if you just sold them access to the models and they can use them offline for various purposes. And now the latest and greatest is obviously the switch from specialized models towards large language models. In our case GPT because that’s what Microsoft heavily invested in recently.

Florian: When you say that, are you transitioning to that or how does that work, like in production?

Christian: I want to say like six, seven months back when the internal GPT wave started rolling and picking up speed, we started looking into the quality of these models. And it’s obvious from research, from literature that large language models do have their share of multilingual prowess and they can translate. Hence the whole premise of the GEMBA paper in the sense that if it can translate, can it also maybe assess the quality of translations, right? That was the GEMBA paper in a nutshell and so we started looking at these models and figuring out, okay, how could they help? To be perfectly, bluntly honest here, the initial assumption was that we can showcase and prove that these models are just not there yet, right? I mean, they are good contenders, but not as good as we would like them to be to actually consider switching. But throughout the course of our internal investigation, we figured, okay, for some language pairs, for some domains, these models are already really strong. Which means if we are considering ourselves sitting in this pot of boiling water, for some language pairs, the water has become really boiling hot and we want to jump out, maybe. So for now, there is a universe in which maybe in a few years, maybe sooner than that, for certain language pairs, what we call general domain translation quality is actually served by large language models. So in principle, if somebody just scrapes enough data and puts that into training and makes sure to train very large models on top of that, they will have a certain level of generic general domain language translation quality for the top end languages. And then I would reckon over the course of a few years that end will grow, so initially it’s the typical English-Spanish, English-Chinese, English-French, English-German. Maybe at some point it might be the top 20, top 50 languages. And yes, we are internally exploring how can we use that to sort of create hybrid translation models which benefit from that.
One main issue right now is, of course, cost to serve. Our specialized models are highly optimized and really efficient for us to serve so that we can actually sell them at a good price point. If you now jump at GPT-4, yes, you can use it for translation, but it’s so enormously expensive that it may not be your best call to deploy your limited access to GPT-4 on large-scale translation.

Florian: When you look at an LLM, as in a machine translation system, this kind of latency, of course, throughput restrictions like with GPT, if I copy paste 10 pages, it says, well, you exceeded your whatever quota and I got to go reduce it to like eight pages and then it’s also super slow, right? It’s like literally typing in front of my eyes. So obviously you guys have privileged access to GPT-4, but is that a major technical issue? How hard would it be to use this at massive scale with millions of words instantly, like you obviously can do now with Microsoft Translator?

Christian: It still is a hard problem, maybe not infinitely hard. It’s not infeasible, but what we experienced internally is that at least for some ramp up period, the internal access was also really limited. So it’s not that we have much better access than anybody else because this is customer facing. Everybody now is really intrigued and excited to use GPT access via Azure’s OpenAI services, for instance, which means at that point, if I ask for internal access and there’s a paying customer and the decision is who gets it, it will be the paying customer, right? So that means for the GEMBA paper, we had something very limited as our daily quota of requests to be sent to the model, which is why this took a couple of weeks to collect enough data points. It’s still much better than anybody externally who doesn’t really have a big account with Microsoft, I want to say. But there have been access issues and that was hard. So scaling it to everybody who wants to use it will require time and more GPUs than we currently have available. I mean, this will be an interesting question for the next couple of months to see how much that can be reduced. I would say it was similar in some sense when we switched to full on neural models a few years back. And then over the last years what we’ve seen is that various frameworks have improved inference time or decoding speed and so forth. So for now, this is an issue. Like, I don’t know, two years down the pike or whatnot, access will improve, and latency, et cetera, will also be much less of a problem, I want to say. The main problem would likely be how do you decide when a certain input is complex enough that it actually warrants being decoded by one of the large language models as opposed to it being sent to a normal cheap to serve specialized model? And that is interesting, so when I started my old MT journey in whatever 2006-ish, when we talked about hybrid, it was rule-based and SMT combined in some way, shape or form, right?
And these days it’s not some SMT and neural, it’s more the decision, okay, do I go large multilingual model or do I go for specialized model, and figuring out a good decision criterion which optimizes cost while also optimizing quality? And that will be an interesting battleground for the next couple of months to explore.
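
Editor's note: the routing decision Christian describes (cheap specialized model versus large language model) can be pictured with a toy example. This is a minimal sketch, not Microsoft's method. The complexity heuristic, the 40-word length cap, the 0.5 threshold, and all function names are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Route:
    engine: str        # "specialized" or "llm"
    complexity: float  # heuristic score between 0 and 1


def estimate_complexity(text, common_words):
    """Hypothetical heuristic: average of rare-word share and a length factor."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    rare_share = sum(w not in common_words for w in words) / len(words)
    length_factor = min(len(words) / 40.0, 1.0)  # assumed cap at 40 words
    return 0.5 * rare_share + 0.5 * length_factor


def choose_engine(text, common_words, threshold=0.5):
    """Send inputs at or above the (assumed) threshold to the costly LLM."""
    score = estimate_complexity(text, common_words)
    return Route("llm" if score >= threshold else "specialized", score)
```

A real criterion would have to be tuned against measured quality and cost to serve; the sketch only shows where such a decision would sit in the pipeline.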

Florian: Now let’s just take a step back and generally how do LLMs translate? I recently listened to a podcast with Steven and he broke it down super simply like it’s just kind of guesses the next word, yada, yada. Obviously it’s infinitely simplified, but if I prompt ChatGPT for like, okay, translate and then copy paste a couple of paragraphs or maybe a page, is the thing I’m trying to translate, is that kind of part of the prompt in kind of almost abstract terms? Or would the large language model say, well, okay, this is the text and here’s the prompt and now I go about translating it? Or is the actual source part of the giant prompt I’m inputting? I’m not sure if that question even makes sense, but learning in public here.

Christian: I guess working in public here. In a sense, the prompt is sort of an instruction to the large language model to do a certain encoder-decoder step. These are basically big transformers. You pipe data in, they go through all the layers, and then eventually on the target side, you get something out. And in a sense, the model will have learned that if the instruction is translate from X to Y, that some data is more X than other data, and then it will morph it into this target language Y, and then spill something out. So it’s an interesting question if you want to consider that part of the prompt or part of the… In a sense this is the templatic part or the variable part of the prompt template, right? And this is also why we now see this whole prompt hacking, prompt engineering frenzy happening in front of our eyes because at this point a lot of that seems to be, I want to say, good-natured, positive attitude hack work and we desperately need to move this from hacking to maybe some more principled approach. There’s specific templates for translation where people say these work generally very well and then you read this and okay, it looks like an obvious thing. Okay, you are the machine translation system. You translate from A to B. Here’s the source, what is the output? It makes sense. It’s not inherently clear why that should be the optimal translation prompt. There’s also research where people find that if you run experiments using different prompts, sometimes the prompts that win, or are most optimal, are not even human understandable, which gets really, really funky. I mean, if at some point there’s a random repetition of something which resembles a non-word, and somehow that’s supposed to be the best translation prompt, I mean, that makes my hair go any direction because it’s like, oh, that doesn’t make much sense to me as a human, and maybe it doesn’t have to.
But at least it would be good to move away from people typing stuff and hoping for the best to becoming a little more principled in how we explore these prompts. And then at that point, once we go there, at least you have something of a principled method to determine what is your best prompt. And then I guess it doesn’t really matter if the translation payload is part of the prompt or not, as long as the output is stable.
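
Editor's note: one way to picture a "more principled" approach to prompt choice is to score every candidate template on the same held-out data and keep the best one. The sketch below assumes a `translate` callable that sends a prompt to some model and a `score` callable that compares the output with a reference. Both are hypothetical placeholders, as are the function and parameter names.

```python
def select_best_prompt(templates, dev_set, translate, score):
    """Pick the template with the highest average score on a dev set.

    templates: dict mapping a name to a template string containing {source}
    dev_set:   list of (source, reference) pairs
    translate: hypothetical callable, prompt -> model output
    score:     hypothetical callable, (output, reference) -> float
    """
    if not dev_set:
        raise ValueError("dev_set must not be empty")
    averages = {}
    for name, template in templates.items():
        scores = [
            score(translate(template.format(source=source)), reference)
            for source, reference in dev_set
        ]
        averages[name] = sum(scores) / len(scores)
    best = max(averages, key=averages.get)
    return best, averages
```

The source text is just the variable slot of the template here, which matches Christian's point that it matters little whether the payload counts as part of the prompt, as long as the comparison is run the same way for every candidate.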

Florian: One thing I wanted to run by you, I always get the feeling, or not always, but sometimes get the feeling that the LLM just goes off and uses some kind of normal general purpose machine translation portal. Can we exclude that or not? The output sometimes is quite close. I mean, or maybe there’s some cached DeepL or Microsoft Translator or Google Translate, something.

Christian: By definition, these large language models have been trained on very large quantities of data. This is how they come into existence, right? And so the logical choice of getting all that data is most likely scraping it off the web as much as possible, as massively as possible. Which means there’s a really good chance that a lot of the standard test sets from, say, older WMTs or from old NIST competitions or from old speech workshops somehow make it into the training data. Similarly, any of the openly accessible translation APIs might be somehow part of the training data, in which case, yes. And then the typical examples people use for testing any machine translation service are very simplistic, mostly, right? I mean, a lot of people type, hey, how are you? Hello world. All this stuff, which typically we would not even send to the engine, but sort of hard code, because these are common easy sentences. And that means very similar to how we experience this problem when building specialized machine models, machine translation models, where we actually see that the more data we scrape… Bing has a lot of data it gets off the web. And more and more, detection of what is machine translation and what is human content becomes the key differentiator between being able to ship a better quality model or sort of not achieving that, and that fully applies to any large language model. So if you’re using web-based sources, or if you consider the internet as your main source of input, then in some sense you have a cutoff date and that’s what GPT has. At some point it tells you, okay, as of my cutoff date of blah, I can tell you this. So the best way of addressing that is whenever you run competitive evaluations, I want to say, you would go and choose a test set which the model has not had any chance of seeing. One of the things… Last year’s GPT-3/3.5, ChatGPT models, they all end before the WMT22 test sets were released publicly.
Which means right now, if you want to compare GPT translation quality to anything you have, using WMT22 test sets makes a lot more sense than WMT14 because those are more or less guaranteed to be in there somehow. This gets very different when we talk about large language models being combined with the plugin architecture as it’s now happening, which is able to query in actual current data. At that point, you have a completely hard problem because unless somebody tells you here’s some way of encoding what training data we have used and what data we have looked at, if all of that is closed and you can’t look into that, then you have a really hard problem because at that point you cannot tell. Which will become a problem in research because at some point you have a hard time arguing that the model is great versus the model has seen everything you asked it to do because you simply don’t know. So the plugin stuff will make this even more messy, I would want to say.
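
Editor's note: the rule of thumb above (only evaluate on test sets published after the model's training cutoff) is easy to encode. A minimal sketch follows; the function name is invented and any dates passed in are the caller's responsibility. It can only rule out contamination through the training data, not through plugins that fetch live data, which is exactly the harder problem Christian raises.

```python
from datetime import date


def unseen_test_sets(release_dates, model_cutoff):
    """Return names of test sets released strictly after the model cutoff.

    release_dates: dict mapping test set name -> public release date
    model_cutoff:  training data cutoff date of the model under test
    """
    return sorted(
        name for name, released in release_dates.items()
        if released > model_cutoff
    )
```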

Florian: The plugin is not really available widely yet, right? I’m seeing a lot of people that are saying they’re getting access to it, but it’s not like I couldn’t get it, or I’d have to put myself on some waiting list or something.

Christian: Conceivably, similar to the app stores of choice for various platforms, at some point, the logical thing to do with large language models is to enable them to get current information in some principled way. And once you add that, it gets even more bizarre because then you can actually expect that model to… There could be a plugin which just sort of gets DeepL translations and uses that whenever something akin to a translation request comes. And then you’re really in a problematic space because then you don’t know, is this based on the model? Is it based on the plugin? Is it based on what? So that will become an even more interesting setting.

Florian: There’s so many layers to this, so let’s assume it goes through the plugin and it goes to DeepL, and then maybe there’s some kind of additional quality layer on top of it. Let’s talk about these quality layers and metrics, so just maybe a quick refresher for the listeners. Like, what are some of the key metrics to evaluate machine translation historically and now? And also I want to touch on your paper, “Large Language Models Are State-of-the-Art Evaluators of Translation Quality”, that you guys published in February 2023 and where you had this GEMBA metric, right? GPT Estimation Metric Based Assessment.

Christian: As any good MT researcher knows, the only and sole metric of quality is BLEU, the infamous bilingual evaluation understudy from back in the day. So WMT always has a metrics shared task, or mostly always had one. And I think from the very first edition of that metrics shared task, results showed that BLEU was actually never the strongest metric. So early on, once ChrF approached the scene, I think results showed that ChrF itself was a more stable and higher correlating metric. But still, everybody, I mean, everybody continued to use BLEU and then everybody used BLEU using different tokenizations. So all these results were not really fully comparable because it was never clear what people actually had as their units of comparison. So then when Matt Post released SacreBLEU as a tool, at least that helped the field forward because that made these scores comparable. So you had a version string for your SacreBLEU scoring, and then you at least had a guarantee that if this system scores X and this scores Y, which is higher than X on the same version, then this meant something. Also quite nicely, Matt added ChrF and translation error rate to SacreBLEU, so now it became easy to get three metrics during the same call. So I mean, at some point we adopted that and we switched to having these three metrics internally to have a look at how quality evolved. Then I want to say two, three years back, everybody jumped over to looking at embedding based metrics, COMET being the most notorious, but also BERT and others. We had a paper called “To Ship or Not to Ship”, where we summarize our internal choice of metrics. So we typically prefer COMET scores, followed by COMET-QE scores without a reference, followed by ChrF, followed by BLEU scores, which we mostly have internally because again, there’s various pockets of resistance within Microsoft localization where people still use BLEU because that’s the only metric they have really used over time.
We have historical numbers, so we want to be able to compare to those. And then with the advent of large language models, what happened is one of the big downsides of COMET was suddenly, when you ran the training, you also had to run COMET, which is a GPU-hungry model. So decoding test sets took a while and required a specific or a more specific machine type. And then of course, API access to GPT models came about and became accessible to us. And then we were like, okay, maybe we should just prompt these models to figure out what do they think in terms of translation quality. That has very nice properties in the sense that you literally just send an API call somewhere on a machine which is connected to the internet anyways, or to our internal network, and then that can be a very slow machine because all it takes is a little latency and a little bit of waiting time. You aggregate results and this is how we then ended up with the GEMBA metric, which basically uses any GPT model, or any large language model really, and prompts it: given this input, given this output for this language pair, tell us what you think about the translation quality. And of course, similar to our evaluation of generic GPT based translation quality, the assumption initially was, okay, this might work a little bit, and then oftentimes it will fall flat on its face. What we didn’t expect is what we found looking at the WMT22 metrics shared task data: that this GEMBA metric achieved state-of-the-art performance. So, I mean, we tried four different prompt types which emulated different human evaluation tasks, direct assessment, a typical one to five star ranking, and two more. We played with a total of seven different GPT versions as published in the paper. Meanwhile, there’s more because GPT-4 has been released and is still excruciatingly slow to access for us, so we are working on getting enough data points.
But that showed that GPT-2 doesn’t really understand what you want from it. Ada, Babbage are sort of confused and don’t really… We have some examples from Ada output in the paper and the appendix. So these models clearly have an idea of what you ask about, ask them to do, but they just cannot perform the task. And then starting with Curie, things get better. And it seems Davinci models are indeed the sweet spot and then this is similar to a paper a colleague of mine released, our team released, in terms of assessing translation quality, where also Davinci proved to be the sweet spot. So it appears that anything… Starting with Davinci GPT models, the models are powerful enough to understand what you want, and their multilingual abilities are refined enough, say, to support this. That was an unexpected but, of course, welcome finding. And now what we are doing internally is we published the prompt templates. We published our proof of concept code so that people can replicate by sending these prompts to GPT/other LLM endpoints. And now, of course, we are trying to figure out how can we use that to improve our quality metrics. So at some point in the near future, we hope to have some of that in our internal production. And one of the interesting, intriguing points is what happens when we now use this to evaluate GPT output, translations, right? I mean, it becomes circular at that point, and the model sort of assesses quality on its own model generated output, so it’ll be interesting.
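
Editor's note: the workflow Christian outlines (prompt a model for a quality judgment per segment, then aggregate) can be sketched in a few lines. The prompt wording below is an assumption loosely modeled on a direct-assessment style question, not the published GEMBA template, and `ask_llm` is a hypothetical callable standing in for whatever LLM endpoint is used. For the actual prompts, refer to the templates the authors released.

```python
import re
from statistics import mean

# Assumed prompt wording, for illustration only.
PROMPT = (
    "Score the following translation from {src_lang} to {tgt_lang} "
    "on a scale from 0 to 100.\n"
    '{src_lang} source: "{source}"\n'
    '{tgt_lang} translation: "{hypothesis}"\n'
    "Score:"
)


def parse_score(reply):
    """Take the first number in the reply; reject values outside 0..100."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 100.0 else None


def system_score(pairs, src_lang, tgt_lang, ask_llm):
    """Average the parseable segment scores into one system-level score."""
    scores = []
    for source, hypothesis in pairs:
        reply = ask_llm(PROMPT.format(
            src_lang=src_lang, tgt_lang=tgt_lang,
            source=source, hypothesis=hypothesis,
        ))
        score = parse_score(reply)
        if score is not None:  # skip replies the parser cannot use
            scores.append(score)
    return mean(scores) if scores else None
```

Dropping unparseable replies is a design choice of this sketch. Weaker models may answer with free text instead of a number, so a real setup would want to track how often that happens rather than silently ignore it.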

Florian: I think everybody right now is like starting to process the circularity of some of these things. And I think we’re in the more benign corner of the LLM spectrum here with LLMs assessing LLMs’ MT. There’s some more scary, I guess, areas there that will come up. Now, you historically looked at large-scale human evaluation. So when we’re saying state-of-the-art evaluators, state-of-the-art in comparison to human or just to other automated metrics?

Christian: The nature of WMT’s metrics shared task is that metrics, different various metrics people propose for that shared task are compared against human performance, human references, human evaluation of, basically. And the higher the correlation, the better a metric scores. And the best performing metric is deemed to be most similar to what humans would have decided.
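
Editor's note: "correlation with human judgment" can be made concrete with a rank correlation between metric scores and human scores for the same systems. The sketch below uses Kendall's tau (the simple tau-a variant, with ties counted as neither concordant nor discordant) as one common choice. The statistic actually used by the shared task may differ, and the function name is an assumption.

```python
from itertools import combinations


def kendall_tau(metric_scores, human_scores):
    """Kendall's tau-a between two equally long lists of scores."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two systems to compare")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if product > 0:
            concordant += 1   # both rankings agree on this pair
        elif product < 0:
            discordant += 1   # rankings disagree on this pair
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value of 1.0 means the metric ranks all systems exactly as the human annotators did, and -1.0 means the ranking is fully reversed.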

Florian: Okay, so you’re comparing against human and being state-of-the-art, meaning you’re closer to human. 

Christian: There’s lessons learned from the human parity paper here, right? And we tried to make this very clear in the GEMBA paper. So this is a first glimpse of what GPT and large language models might achieve in terms of translation quality assessment. This in no way, shape or form denotes any supremacy. It certainly doesn’t mean these things are better than human annotators, because that’s possibly a pipe dream and hard to achieve. But it was interesting enough from our perspective to release this to show, okay, there’s a lot of hype going on. I expect a lot of that hype to vanish into thin air in a couple of months. Right now, everybody’s so oversold on anything GPT or large language model solving any problem people might have. This is bound to fail. I mean, we will hit a brick wall and then, I don’t know, hype perception goes down a little bit, and then we fix the issues that have been unearthed and then we do the next step in terms of quality very likely. This was interesting enough from our perspective to showcase that these models can perform super well, so that’s why we published, right? And the next thing now is to figure out how can we ensure that this actually works predictably well. One of the asterisks in our results is that this doesn’t work well enough on a segment level just yet. So one of the big caveats is you can send a single translation input-output pair to GEMBA and get something back from the large language model and it will tell you something, but it’s not yet really robust and reliable on that segment level. If you compile this for a couple of hundred sentences so it becomes system level, then what you get is actually state of the art and works well. But yes, one of the questions for us now is how can we figure out what content might be most well suited to be sent to GEMBA so that we get reliable outcome even on smaller sets. The ultimate choice is always a human, but then humans also have a strong variance, right?
And one of the issues right now is at some point you also don’t have enough human annotators, so, very much similar to MT, I see the AI part here as a complementary solution which combines human and machine at some point, hopefully. So I’m, in that sense, say, a techno-optimist, so it will not replace all these annotators. It just very likely, from my perspective, will mean we can focus on the more interesting part where we invest our human annotation budget and for some of the more easy stuff, or which we feel we have a good handle on using GPT models, then we apply it there.

Florian: Do you see this as primarily right now an evaluation technology or could it also become like in production, kind of a self improving kind of layer, right? So it says, okay, we think it’s 90%, but let’s get it to 98 and then it kind of automatically improves it, so layer upon layer.

Christian: Again, it gives you this whole circularity problem. Obviously at some point when you optimize your engine towards using some of these GPT models and then you use the same models to tell you if you’re improving or not, then there’s a way where you might actually evolve into something which the machine really likes and which becomes less and less what a human would like. But that is one of the potential use cases. Trying to get to a point where what these models allow is to scale more or to get to a larger scale of having unseen data for which we don’t have a reference and get a verdict on that. Which at the very least, should allow us to become more robust against the model going completely off rails and maybe not based on the model output. The model output might allow us to just go from a few thousand samples we send in for quality control to millions maybe. And then for a certain subset of those flagged by the GEMBA metric, we would then ask humans to confirm the findings. I think that is actually a very likely application.
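
Editor's note: the triage idea in this answer (score a large volume automatically, then have humans confirm only the flagged subset) could look roughly like this. It is a sketch under assumptions: scores are on a 0 to 100 scale, and the threshold of 50 and the optional review budget are invented parameters.

```python
def flag_for_review(scored_segments, threshold=50.0, budget=None):
    """Select low-scoring segments for human confirmation.

    scored_segments: iterable of (segment_id, automatic_score) pairs
    threshold:       assumed cut-off; scores below it get flagged
    budget:          optional cap on how many segments humans can check
    """
    flagged = [(seg_id, s) for seg_id, s in scored_segments if s < threshold]
    flagged.sort(key=lambda item: item[1])  # worst scores first
    return flagged if budget is None else flagged[:budget]
```

Sorting worst-first means a limited annotation budget goes to the segments where the automatic verdict is most alarming.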

Florian: Let’s briefly go back to the prompting. I think you touched on it, but we also recently covered a paper where they claimed that you can prompt the LLM by saying you’re now a machine translation system and that somehow influenced the quality. What are your thoughts on this?

Christian: Yes, good old black magic. It’s hard. Some of that… It’s obviously credible because different prompts may trigger different output. For a lot of tasks the question now becomes what is the best prompt for that task so you can get highest output quality? There’s also a little bit of a deterministic computer scientist within most of us, and we would like these things to behave in a somewhat well-behaved, principled manner. And having random, I mean, this is not even one of the random permutations of a prompt. But yes, some of that becomes black magic, really. And we need to come to a principled way of dealing with prompt design and measuring its quality. From the GEMBA experiments, I can say what we had is for the four different prompt types, for some of them we told the GPT model that this was actually a machine translation output, and for some we didn’t. And then we figured out that was sort of inconsistent and we changed that to become consistent and we didn’t really see much of a difference. So there it was, I mean, the important bit was this is translation from X to Y, rate its quality, but telling it it was a machine translation didn’t really change much. But yeah, this is also one of the interesting voodoo points where we should see more.

Florian: It’s been only a few months, right? Another one is like training LLMs to do machine translation. Is there any way currently users could train this? Is there any second layer you’re seeing? Anything in production already?

Christian: The obvious way right now would be anything akin to a knowledge distillation approach. There’s many, many different publicly available translation prompts out there. You can craft your own with a little bit of playtime, and then you have something which more or less reliably transforms input language X into target language Y. And if you run that for your domain data, for some amount of data, you can collect enough samples to train your own student on that. It’s just super expensive. It’s straightforward, it’s not necessarily a budget-friendly way. The next interesting step would be few-shot learning. Our GEMBA experiments notably are all zero-shot, so we didn’t even apply any few-shot example guidance to the model just yet. The next step we are working on, but there is very encouraging research ongoing where people use few-shot examples and quality goes up quite nicely. So maybe that is actually a better way if you want to build a more domain optimized translation engine for your own good. It becomes really fascinating or very complicated depending on how you want to see it when you talk about new languages. It will be very interesting to see how do you build some language which has nothing in common with say the 50 or 100 or 150 languages GPT knows? Can you just ingest some amount of data and fine-tune the model and then it sort of extrapolates from that to a nice enough degree so that you get a translation model? Or is that something where you should just whip up your own specialized transformer model and you will fare better, right? So the whole language expansion dimension is currently very frontier like in the sense that there’s lots of fun to be had.
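
Editor's illustration: the data-collection half of the knowledge distillation approach mentioned above can be sketched as follows. The teacher here is a toy lookup table standing in for a prompted large language model; `toy_teacher` and `collect_distillation_pairs` are hypothetical names, and no real translation API is called. Training the student model on the collected pairs is out of scope for this sketch.

```python
from typing import Callable, Iterable, List, Tuple


def collect_distillation_pairs(
    sources: Iterable[str],
    teacher: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Run the teacher over in-domain source text and keep usable pairs.

    Empty sources and empty teacher outputs are dropped, so the student
    is only trained on pairs the teacher actually translated.
    """
    pairs = []
    for src in sources:
        hypothesis = teacher(src).strip()
        if src.strip() and hypothesis:
            pairs.append((src, hypothesis))
    return pairs


def toy_teacher(text: str) -> str:
    """Stand-in for an expensive LLM call (German to English lookup)."""
    lookup = {"Guten Morgen": "Good morning", "Danke": "Thank you"}
    return lookup.get(text, "")


if __name__ == "__main__":
    domain_data = ["Guten Morgen", "Unbekannt", "Danke"]
    for src, tgt in collect_distillation_pairs(domain_data, toy_teacher):
        print(src, "->", tgt)
```

Every source sentence costs one teacher call, which is where the expense Christian mentions comes from when the teacher is a large hosted model.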

Florian: A little less frontier is Custom Translator, which is very straightforward, online, you go to the link, you can fine-tune, enterprise individuals. How do people currently or users, and there’s going to be millions of them, kind of fine-tune their own MT? Who’s choosing to do that? What’s kind of the key user profiles? Enterprise, SME, individuals?

Christian: I want to say more enterprise, SME than individuals, right, but there’s also some individuals who like to customize their stuff. The generic approach is very simplistic in the sense that it’s a neural architecture. So people or end users supply their own data which is supposed to be from their own specific domain of interest. If I’m a big automotive customer, then I have my manuals for my cars and they have specific term bases or terms inside and I have specific controlled language maybe even of how I write my manuals and then they have that, we don’t. And then what happens is customer uploads their data into our Custom Translator. We run a couple of epochs of fine-tuning on top and a few bells and whistles and then you get a model which on customer-specific data improves in quality and maybe degrades on more general domain performance. So that’s why a customer will always see quality gains on their own test sets if they supply them or on randomly held out data of a small sample of their own data. So we can see in domain how performance of our model versus their own fine-tuned model behaves. Yes, and that’s more something enterprises and SMEs do than individuals.

Florian: Is there usually like an initial obviously kind of data training cycle and then they come back a year later with maybe some updated documents? Or is there a way to continuously update it as people, I guess, human translate more content?

Christian: We would love them to keep updating their models often because then that way we actually make a little bit of profit but realistically, what happens is there typically is an exploration phase where people learn what data to upload and how long it takes to run these trainings. And then they get to a point where they either fail, in which case they don’t use it, or they come to a point where they showcase enough internal value that they train all the models they need and get them deployed. So the cycle is basically you supply your data, you train fine-tuned custom translation models and iterate until these are good enough. And then the models you like, you deploy so they become available via the same API we have for our normal customers. It’s just you point to your own specific models, which only you can access and then you can use them in your own products. And then as companies create more data by sending more and more input documents to their own internal human translation processes, they typically come back, I don’t know, once a quarter, once a year, somewhere in between. I mean, anything more often is a little wasteful in terms of budget resources. We also use that same model for new languages where we try to get new languages into custom translation very quickly. So in the case of Inuktitut and Inuinnaqtun languages from Nunavut province in Canada, we enabled these in Custom Translator so that now the respective local government who had helped us build the models by supplying their translation data, they have a human translation process ongoing continuously and they just collect more data every quarter and then once a year they train an updated model in custom. So they can use that model before it even becomes publicly available and we get a good feedback signal from their trainings. At some point the scores improve so much that we say, okay, great, now is a good point to upgrade the publicly available model, so it’s a win-win for both.

Florian: You mentioned the government, but like on the enterprise side, who would be a typical key point of contact key user here? Would it be, I don’t know, the head of language service within the enterprise? Or who do you see engaging most with the Custom Translator?

Christian: Oftentimes it would be localization/globalization teams. That is at the very core of their business in the sense they are aware there’s different MT suppliers, they have a certain amount of content they need to globalize and localize and they find that normal general domain translation quality doesn’t cut it. So there’s a feeling that, okay, adding some of our data, which is of known high quality because we produced it on top of an existing large enough supplier may be exactly what we need and that’s how it typically starts. And then this is also interesting because, of course, there’s a chance that something within the properties of the specific domain or of the data is not fully compatible with what we have run in terms of our infrastructure. And then, I mean, for some select customers, if there’s enough push or enough interest or something goes up, then we also try to figure out how can we resolve that, how can we improve quality? I mean, obviously our main idea here is to improve, make the customer have a great experience.

Florian: Low-resource languages. Now you recently added 13 new African languages. So tell us what’s driving the additions here? How do you select and then what’s the technical process like? And then finally, do you pivot or there’s no pivot language, like is it going direct or no or is that a secret?

Christian: Let me start with the actual process. So a few months before Facebook released the No Language Left Behind hashtag, we internally use No Language Lost. I mean, everything is the same, right? The big guys, the big companies, the big corps, we all have this notion of AI for good in some way, shape or form. We keep maintaining that despite the recent budget ups and downs and there is a very tangible… It’s very easy to go to some language community and tell them, look, this is what we do, we do translation and then oftentimes the first question you get when you go somewhere is, okay, how can we add our language? What can we do? The typical answer is yes, you need some data and then some machine time to train models and see what works, what doesn’t. So we are quite aggressively scaling our languages. Over my last 10 years, we went from 42 to now 125. There’s more coming. We are working on a lot more. This is more and more becoming an engagement with language communities, as opposed to us doing all of the driving. This is also a nice paradigm shift. I want to say 10 years back, we would have all just taken any small amount of data, trained a model, dropped it somewhere, told the respective community we saved your language, yay. These days it’s becoming much more of a collaboration, which makes more sense. We have had engagements where people didn’t really want to push much for their language going out there. We respect that. I mean, this is also a change in attitude, I want to say. And then what typically happens is there is a master list of or a main spreadsheet containing many languages which I’ve received requests about, where we found some data here and there, and we just keep that list and see when do we have enough data to try a simple transformer-based baseline model and then we measure how quality looks like. At some point when it looks promising enough, we ask a couple of human annotators to tell us how bad it really is.
Over time, these languages move from being super rough to being something which is maybe okayish. And then what we do is we more or less make them accessible as an experimental language for respective members of the language community so that they can use the very rough models and help us supply feedback and maybe find some more data. And then for some languages, things go quickly. So the initial Inuktitut engagement, I think, had us ship models which were very close in quality to the final public models in a matter of like three, four months, solely based on the fact that the respective local government had like 10 years worth of data made available to us of very high quality, so we had enough data, and it was reasonably easy. And then there’s other languages where we are literally after 10 years, still not in a shippable form just because data production is very, very slow. There’s only very limited speakers of the language, say, for Inuktitut has, I don’t know, I want to say 40, 50,000, in the 40,000 speakers range. The Inuinnaqtun dialect has 700 people speaking it. The only reason we could build that one is because the local Nunavut government created data which was Canadian-French, English, Inuktitut, and Inuinnaqtun because this is one of their main dialects they operate in. So we had the same amount of data for that dialect as we had for Canadian-French and for English and for Inuktitut. Otherwise, typically, when you have a language community of 700 people, you don’t have enough data, and you don’t find enough people to help you create enough data. And yes, in terms of priorities, I mean, if there’s a specific ask… For the African languages, say, at some point, our Microsoft President, Brad Smith, talked to the Nigerian President when we opened development center over there. And as part of those discussions, the question was brought up, could we maybe also roll out support for Nigerian languages, Hausa, Igbo and Yoruba? 
And then our President committed us to ship those languages at some point. And then you have an official mandate, and then, of course, you figure out, okay, where can we find data? How can we make this fly? For other languages, it’s opportunistic in the sense that we look at what data exists, how active the language communities are. Recently we shipped Upper Sorbian, that’s a German minority language, and the local community has been extremely supportive and very active in providing us help and support and looking at the output and providing bug fixes and bug reports. And of course, that helps the language to be shipped sooner than later. And for some languages, for some of them, even asking human annotators to give you feedback may take weeks or months, and of course, that delays everything. So we’re trying to ship many languages and add more and more, so we have not converged to the final amount or final number just yet, but of course it gets harder.

Florian: 700 speakers. Oh my God. Young people as well? 

Christian: Yes, but they mostly don’t learn the language anymore, right? Obviously, there’s not a lot of business reason for a lot of these very small languages. Not only long-tail, now we are like ultra long-tail, ultra low-resource. So clearly the main motivation for us is language preservation and making sure these languages are somehow… They might grow extinct at some point, but now there is some digitized form of it. And everybody across the industry, the larger corps, we all seem to be fighting this fight, trying to make sure that we don’t lose the next 5,000 or 7,000 languages.

Florian: It’s a struggle, right? I mean, I must have mentioned this a dozen times already on the pod, but like, I visit the Romansh speaking part of Switzerland quite often and last weekend I was there as well and I saw a bunch of kids on the playground speaking Romansh, which was like heartwarming, right? So it’s kids speaking the language live. It’s a small community, it’s 50,000 people broken down in five dialects. So you guys aren’t yet offering Romansh? Or is it on the roadmap?

Christian: I can’t name any details. It is on that big list, yes.

Florian: It’s on the big list. Good, I’ll pass that on. Pivot or no pivot, maybe just whatever you can say about that. Does it go via English or not? Or if it does, how?

Christian: Microsoft being an American company, so everything is English-centric. So if you go, I don’t know, French-Italian? It would be French-English, English-Italian? So we pivot via that. Otherwise you have this quadratic explosion of language pairs and that becomes not really nice. For specific markets, say, Chinese market is big enough, so there we have increasingly rolled out Chinese-centric models. So if you go from Japanese to Chinese, there’s a direct model. If you want to go Japanese-Korean, then that model may pivot via Chinese. Or it could also be direct one. I don’t remember right now. So this depends on what… I mean, this is all customer-driven in a sense that if there’s enough of an ask from a market to build direct models, and if there’s enough of data, of good enough quality training data available, then we are open to the idea of building direct models. But obviously, we are not trying to build 125 squared direct models because oftentimes the quality isn’t worth it and there’s not enough data.
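
Editor's illustration: the "quadratic explosion" and the pivoting described above can be made concrete with a short sketch. The counting follows directly from the numbers in the answer (125 languages); the model table, the language codes, and the function `translate_via_pivot` are toy assumptions and do not reflect how Microsoft Translator routes requests internally.

```python
def direct_model_count(n_languages):
    """One model per ordered language pair: grows quadratically."""
    return n_languages * (n_languages - 1)


def pivot_model_count(n_languages):
    """One model into and one out of the pivot for every other language."""
    return 2 * (n_languages - 1)


def translate_via_pivot(text, src, tgt, models, pivot="en"):
    """Use a direct model if one exists, otherwise chain src->pivot->tgt."""
    if (src, tgt) in models:
        return models[(src, tgt)](text)
    intermediate = models[(src, pivot)](text)
    return models[(pivot, tgt)](intermediate)


# Toy "models": lookup tables standing in for real translation engines.
TOY_MODELS = {
    ("fr", "en"): {"bonjour": "good morning"}.get,
    ("en", "it"): {"good morning": "buongiorno"}.get,
    ("ja", "zh"): {"こんにちは": "你好"}.get,  # example of a direct, non-English pair
}

if __name__ == "__main__":
    print(direct_model_count(125))  # ordered pairs among 125 languages
    print(pivot_model_count(125))   # models needed with a single pivot
    print(translate_via_pivot("bonjour", "fr", "it", TOY_MODELS))
    print(translate_via_pivot("こんにちは", "ja", "zh", TOY_MODELS))
```

With 125 languages, a single pivot needs 248 models instead of 15,500 ordered direct pairs, at the cost of two translation steps (and two chances for error) per request.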

Florian: Now, let me put that question out to you. Are we kind of reaching the point where incremental gains in machine translation quality are becoming super, super hard? Like assuming you have the proper setup, you have the proper data, you have a specialized model, like what’s your take here? The quality is amazing already, right, and there’s still these kind of odd problems here and there, but how hard is it in next ten years?

Christian: Even in the old SMT days we had this old idea of doubling your training data or your language model data gets you another BLEU point or something, right? Everybody has like a rule of thumb of how quality will go up over time and that typically requires more and more and more and more training data and that typically grows exponentially. So that means yes for specific, say, news domain translation model for high-resource language pair, say, English-Chinese, we already have enormous amounts of data which means to reach the next big point, quality gain on top of the current quality becomes harder and harder per time unit. So at that point, it might become much more interesting to figure out, okay, maybe instead of improving news quality for English-Chinese, which is good enough or already on a high level, maybe I should invest the same amount of effort and time into creating high-quality training data for a different domain. Automotive, eCommerce, I don’t know what and rather get that because for a new domain you will typically have much higher gains more quickly and then you have something which might actually address your problem better. In a sense, what we currently observe is these large language models being trained on super vast resources. That has, of course, proven a very viable strategy. Quality is good or great depending on what you do with those models, but on the flip side of that coin is more and more data needed and more and more GPUs needed to hopefully get to a higher quality. We also now have to figure out how can we do quality gains on less, right? I mean then because for Romansh I will not have the same amount of training data ever that I have for an English-Chinese system right now. I mean, this is just, this won’t come to pass, right? So at some point I still want to translate that language pair or that language, so I need to find ways how to leverage and learn from fewer examples, right?
I’m also not aware that a human being would learn from billions of examples, so somehow something in our brains is wired differently and ideally at some point… Now everybody goes large scale and then floods in more data on an architecture we have been using for a couple of years now, so it also will be interesting to see how we can maybe reach the next learning architecture step which allows us to extract more knowledge out of smaller samples.
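
Editor's illustration: the rule of thumb quoted above ("doubling your training data gets you another BLEU point") implies a logarithmic relationship between data size and quality. The sketch below only restates that rule arithmetically; the default of one point per doubling is the speaker's informal heuristic, not a measured law, and the function names and corpus sizes are hypothetical.

```python
import math


def bleu_gain_from_data(old_size, new_size, points_per_doubling=1.0):
    """BLEU points gained when growing the corpus, under the doubling rule."""
    return points_per_doubling * math.log2(new_size / old_size)


def data_needed_for_gain(old_size, bleu_points, points_per_doubling=1.0):
    """Corpus size required for a target gain, under the same rule."""
    return old_size * 2 ** (bleu_points / points_per_doubling)


if __name__ == "__main__":
    # Hypothetical starting corpus of one million sentence pairs.
    print(bleu_gain_from_data(1_000_000, 8_000_000))
    print(data_needed_for_gain(1_000_000, 5))
```

Under this heuristic, each additional point costs as much new data as everything collected so far, which is why the answer suggests that effort may be better spent on a new domain or on methods that learn from fewer examples.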

Florian: You kind of already preempted one of my questions I was going to ask like LLMs and MT predictions, six months, twelve months?

Christian: Six months, maybe even earlier than that. The hype is now very high and I think we will hit a brick wall at some point in the next half-year where people see that a lot of the or some of the sometimes insane assumptions people have in what these large language models can achieve will hit a brick wall and will fail, right? So then maybe at some point the hype cycle has to break and then people will become a little more realistic, what’s possible, what’s not. That will then also coincide with more people figuring out here’s a couple of problems these models can’t really solve and where things are really hard and that will lead to researchers figuring out new ways of measuring these things. And then in a year, in two years, we will have a better handle of how we do these things. In terms of more specific predictions for MT, I would say in a year, definitely there will be large language model-enhanced MT solutions available to anybody who is willing to pay for using those. Cost to serve will be a big enabler or disabler because it’s still at this point I want to say oftentimes if you want a GPT name tag on your translation service, you can do that and it will work somehow well, but it will also be very expensive. I’m not sure it’s worth the investment just yet, but we will go to the point where there will be a few very large language models available publicly and they will be translation savvy and people will use them, right? And the interesting bit in terms of MT will be how, as a specialized model provider, do you argue and showcase to customers that your specialization actually makes sense and/or even entitles you to maybe asking for more money than the other guys ask, right? And we’ve already seen this started by OpenAI in the sense that the recent pricing, GPT3.5-turbo pricing was highly competitive. So clearly they’re pushing for their models to become more accessible in the sense that they are not super ludicrous expensive.
Which puts all of us on the specialized model front under the gun a little bit to figure out how do we tell people why they should pay more for our service? Or how do we make our service cheaper/better?

Florian: It’s a bit of a land grab scenario right now from the OpenAI people, but you also can see how they scale it back, right? Initially, I had 100 per hour, or something, I don’t know what it was. Now, then it went down to 50. Now I’m down to like 25 and then they give the thing like, hey, expect fewer going forward. Like, all right, and sometimes it kicks me out.

Christian: One of the things there is obviously OpenAI has a very different perspective on why they build these models and what they want to do. For OpenAI, say, you send something in, you don’t know what happens with your input data. It will likely end up in their training data at some point if you send enough similar stuff over time. This is different for people who integrate that on top of OpenAI’s models, right? Microsoft obviously has enterprise customers and they have some level of enterprise-grade expectations in terms of how data is not being looked at and how data privacy is respected. Those are the factors which allow a company to say, yeah, what we basically sell is some GPT-enhanced translation service, but on top of that, we don’t only offer the translation, but also here’s a couple of guarantees and blah, blah, blah, blah, right? But if you don’t have that, if you’re quote-unquote some integration shop who just sort of slaps together different APIs, then at some point you will have to maybe figure out a better way of value proposition for your customers because OpenAI has made it very, very simple. I mean, at this point you just code an API, get back something, it doesn’t get much simpler. So expectation or estimate for the future is it will be a very active next twelve months at the very least and based on the current timing, it seems everything is happening much faster than expected. Three months ago, I would have said, yeah, okay, we’re doing some preliminary experiments on QE and now GEMBA is basically done and we are really, really focused and pushing this forward.