October 29, 2021
Blazing-Fast Machine Translation With Kenneth Heafield
In this week’s SlatorPod, we’re joined by Kenneth Heafield, Reader in Machine Translation (MT) at the University of Edinburgh, a leading MT researcher. We originally connected with Kenneth on Twitter in a discussion about Slator’s coverage of a research paper on Carbon Emissions in MT.
Kenneth talks about his experience going back and forth between academia and industry, reflecting on the pros and cons of working for Big Tech. He discusses his recent research on efficient translation and language modelling as well as other MT topics that are undervalued by the industry.
Kenneth shares his thoughts on the popular preprint platform arXiv and how news outlets should cover research that hasn’t been peer-reviewed. He gives tips for those interested in attending natural language processing (NLP) conferences, particularly on how to navigate the complex system.
Kenneth concludes the podcast with an exciting demo of Translate Locally, an MT app that runs locally on a desktop or laptop CPU, allowing cloud-like translation speed without sacrificing privacy and browsing habits.
First up, Florian and Esther discuss the language industry news of the week, with Argos Multilingual acquiring rival Venga Global and roughly doubling its size to become one of the top 25 LSPs in the world. One LSP that missed the top spot by a hair is RWS, whose full-year revenue is exceeding expectations, with consensus placing the figure around USD 965m.
Meanwhile, Slator’s very own Anna Wyndham joins the Pod to talk about our highly popular article, “10 Areas Where Translators Are (and Will Remain) Essential Experts in the Loop,” published last week. She highlights a selection of mission-critical scenarios in which translators are the true experts in the loop, despite the advancement of tech.
Florian: This conversation came about a bit differently, we got into a Twitter discussion over the coverage of a carbon emissions paper. First, tell us a bit more about what got you into this space, your background, and what drew you into the machine translation and broader NLP research space?
Kenneth: I started in NLP by outsourcing myself to India to work in Bangalore at Infosys in 2006 as an intern and the project was applying topic modeling to source code. They would get some large, messy ball of source code and want to rearrange it semi-automatically and to something that is more sensible so people could get an understanding of the whole picture of the code. Then I began proper natural language processing that snowballed into working at Google where I learned how to run C++ well but was not doing any machine translation. I knew of some people there, but I was not properly on the team. Then at Carnegie Mellon, I started doing proper machine translation building on the background of knowing C++ and engineering. Machine translation has a reputation for being on the more engineering heavy side of academic research and with a lot of industry involvement as well, so it was more of a natural fit with the skills I had. Some natural language processing experience, but also knowing how to code more than the average PhD student.
Esther: You worked at Google and then Bloomberg. What drew you back into academia from the corporate or big tech side?
Kenneth: I have always sort of bounced between the two a little bit. A lot of it has to do with freedom and at Google I was a software engineer, pre-PhD, working on the Google Books team, actually, and recognizing the language of titles of books and that sort of thing. What academia provides is, I can write the proposal, subject to what funding is available and pick and choose which things to work on and that is also possible at a more arms-length relationship with a lot of industry people. I do not want to be carrying a pager. I do not want to suddenly have a product requirement change or there is a reorganization and I find myself working on a new team. There is some freedom that academia provides in choosing what to work on.
Florian: Big Tech and fast emerging tech is hiring students aggressively, so how do you keep some talent within your team while also not standing in the way if people want to join some of these big, fast-growing tech companies?
Kenneth: Two PhD students that graduated have both gone to Amazon. There are two more that did not graduate and more or less did their masters, and then went off to industry and left early in the PhD. Industry has a big draw, depending on where you are in industry so sometimes there is the production machine translation team and that has a different culture from, say the Google Brain or the Facebook AI research, where the objective is more doing papers. Sometimes the game is just to get to the publication level that one can get to and that is the student’s objective, to be in the research lab instead of the production team where they are building another language or using the same formula that has been developed by the researchers. I have also had a master’s student that quit Google, came and did a masters with us in natural language processing, then went back to Google in a better NLP team that he liked. It is much harder to get a postdoc in natural language processing and MT especially because those people are drawn to the Big Tech companies. There are obvious exceptions to this, Jacob Devlin and Chris Quirk, for instance, do not have PhDs but are at the top of their fields. Though the more normal path is to complete the PhD and then go to the industry research lab if that is the goal.
Florian: You do not see a lot of competition from the bigger language service providers or some of the other startups. Is it Big Tech that is drawing 90%?
Kenneth: Big Tech has the salaries and it is always natural that the largest organizations are the most well known, so a particular startup is not necessarily going to be all that well known, but there are definitely people working for startups. I hired a post-doc out of an LSP actually, and now she is going to work for Apple so, if the salary is competitive then, yes. There is also the question of the extent to which a smaller company can afford to have a research department that is leading and publishing in EMNLP if that is what people want to do versus delivering on the product. That said, there are also companies that participate in research grants, especially European Union grants and there is much more free-flowing movement of staff between a research group and a company that might be working in the same consortium. We had a partner on a grant that did not get accepted, left the company that was involved and now works for the university that was involved in a different grant that did get accepted.
Esther: To what extent do you think the research setting, whether it is Big Tech or academia, is shaping the work that you are doing? What factors are contributing to how the work gets done? What is similar and different about the two environments?
Kenneth: Big Tech has a lot of GPUs. If you work for Google Brain, Facebook, Amazon or Microsoft, you will have access to more GPUs than I can afford to give my staff, at least on a per-person basis. Part of that is actually more of a history thing than anything else because physics groups have been using high-performance computing centers for years now, and they have computing power that is comparable to what you see in large tech companies for their machine learning teams. Culturally, natural language processing has not needed to compute at that level and the skills are somewhat missing on the academic side, whereas industry has brought in the engineers to build the cluster and keep the researchers happy, specifically for their NLP product teams. There are also the HPC centers. When I apply for time, the form asks, is the number of simulations appropriate for the given timeline? It implicitly expects that physicists and chemists will be submitting, though in practice I have very high success rates in grabbing HPC time and they are happy to have us.
Florian: You have published a wide range of papers. Tell us a bit more about the core themes in your research and what do you hope to achieve through that? What is the division?
Kenneth: I want to do things that are useful and that is where there are also itches that come from problems I have had. I have a whole shtick about efficient translation and, earlier, efficient language modeling: it takes two weeks to train a system and that makes us not very productive. The thing is, it took two weeks to train a neural translation system 10 years ago when the very early neural MT papers came out; it is just that the hardware got better and we started doing more with it, and now our systems are more complicated. One thing I try to do is be at the top of efficiency, and with that comes scale as well and a large volume of data. Other shticks are actually on the opposite side of that. There is a bit of low resource translation going on there. Low resource is arguably the least solved area of machine translation in terms of quality, and there are open research questions about how we deal with smaller amounts of data compared to what you would have for German or French or something. The last thread we have been pursuing recently is large parallel corpora, so we have the ParaCrawl project, paracrawl.eu, which produces free parallel corpora by mining the web, extracting the text and putting it online. They are available for over 30 languages.
Florian: For low resource, what are the top languages that are low resource, but are very important in an academic context, but also then later in an industry context?
Kenneth: Indonesian has a very high translation volume on Google Translate, but the amount of parallel data is quite small. Indian languages, such as Hindi, have huge populations of people, but not much in the way of corpora are available for them. Some African languages, Xhosa and Zulu for instance, have millions of speakers, but machine translation is quite poor for these languages. It does exist, but it is poor because there is not much in the way of parallel corpora.
Florian: What about synthetic texts? Is that something that is also going into low resource MT now? We have been talking about it for some other applications.
Kenneth: That is one of the main methods, back translation being the most obvious form of synthetic texts, or trying to use a dictionary or terminology database and substituting words in there. People are using that to get more mileage out of their parallel corpora and thereby improve the quality. Another thing is using related languages, which are not quite synthetic texts but have a similar flavor to them.
Esther: How is success defined in academia? Is it important to look at certain metrics, like the number of papers published, citations? Are they valued among the community? What is valued within the community when it comes to achieving some level of success?
Kenneth: More important than the number of papers is having a few significant and influential papers. To give an example, the Transformer paper was not an academic paper but was very influential and people value that. You could have Ashish Vaswani put out a few more papers, but people are going to notice the Transformer above those unless he has got something else cooking. Certainly, the volume of papers does matter: if you do not publish something for a year in academia, you are in trouble, and if you have got tenure, it may mean that you are simply not promoted. Impact does matter and it is mattering more in academia because it is often tied to funding, so if I want to apply for an EU grant, there is an entire section that goes into what are my plans for impact and if I do not have a company involved in the consortium for the EU grant proposal, it is probably not going to get accepted. There is a similar thing for UK funding. It is a bit lighter and it takes into account whether you are doing basic research or applied research, but you have to write an impact section nonetheless explaining how this is going to improve the industry or ultimately help the taxpayers. Fortunately, in machine translation, we have quite good pathways to impact these days: our ParaCrawl data is helping the EU translate swear words better in the online dispute resolution platform.
Florian: Help us understand more about NLP conferences. What are they? How are they important? It is interesting for us as a conference organizer.
Kenneth: There is ACL, EMNLP, NAACL, and the regional versions of these; those are the main conferences in natural language processing. It is very easy to get lost there in the sense that the conferences have gotten much larger these days and have been exponentially growing in the number of attendees. Now it is even stranger because you go into a Gather Town that is online instead of actually physically bumping into people and it happens to be whoever did not have other work responsibilities that are attending an online conference these days. In terms of when a conference happens, like EMNLP, there is almost always a machine translation session going on and we feel a bit lost when there is not one. That said, there is usually a related field, like language modeling, that has talks. They have started adopting industry sessions that focus on papers that are evaluated more on how useful they were in an industry context and less on whether this is new work that has not been done before, so implementing an existing paper, deploying it in a product and the story of how that was done can go into the industry track as well. There are also more industry-focused conferences, so AMTA actually coordinates with the American Translators Association and tries to have the conferences in the same place right after each other, and has a much bigger industry focus to it.
Esther: One question I have is about the questions and the topics around MT and NLP that academics are looking into. Are there certain topics that are generally undervalued or unobserved by the industry that we should be more aware of?
Kenneth: I think speech translation integrated end-to-end is growing. It has problems with working for everything right now but I suspect it is going to replace those cascaded pipelines that are currently deployed in industry and we have seen some of this adoption. Facebook is working on it. There is obviously the Zoom acquisition of Kites that you have covered and then there are longer field things, some people are working on textual entailment and summarization and those tasks are undergoing some revolutions as a result of realizing that they can take a large language model and then adjust it to work for their tasks. Machine translation outside of the low resource languages has a lot of training data compared to other tasks and that is part of why it is so applicable immediately in industry. Whereas there is not much training data for, does this sentence imply the content of this other sentence? It is ultimately manually annotated and what is happening there is growth of large language modeling is moving into those tasks and making them more doable because it can generalize over the sentences and abstract a little bit and once it has abstracted, then you can start using the abstracted form to do tasks that have less training data.
Florian: You mentioned speech-to-speech, are we taking an audio data set in one language and then directly converting it to an audio data set in another language, or are we completely skipping all the intermediate steps? If yes, how different is this from a research and deployment perspective?
Kenneth: Yes, we are talking about things growing end-to-end and there are some systems out there that are starting to do this. It is going to get bigger and the trick is how to use the training data that is out there so dubbed movies are not great for training data because there is all this background music and honestly, it is not how people sound. It is a very practiced speech rather than a bunch of ‘ums’ and ‘ers’ inserted into it that we see in reality. The trick there is making a system that can exploit all this parallel text that exists, all this transcribed speech, which is not as large but much larger than transcribed and translated speech and turn it into a multi-task system that can exploit all of these sources of data at once while still performing in it. One of the reasons you might want end-to-end is a lot of content in what we say, intonation, emotion, and even the sound of someone’s voice is lost in the cascaded system.
Florian: In terms of the data, is it not huge? Does that not exponentially add to the compute problem when you have these big datasets?
Kenneth: The size of the data definitely impacts how long it takes to train, and then inference is a task we are working on. For machine translation, we can get a translation down to 17 milliseconds for one sentence on a CPU core. Speech recognition is typically more processor-intensive than translation is, though a lot of the latency is driven by not knowing what the person is going to say, rather than by the processing. If you look at interpretation lag time, it is mainly driven by waiting for the verb in German, that sort of problem, rather than by the probably cloud-based processing of actually doing the speech translation.
Esther: Do you also have the same issue of compound errors that you find in the cascade systems? If there is a problem with the ASR is it going to carry through to the MT, et cetera, does that exist also in the end-to-end?
Kenneth: This is sort of the sales pitch for the integrated systems, as some of those errors can be rectified. There will always be problems in misunderstanding the word that someone said and you cannot just wave a magic end-to-end wand and get rid of that, but it does mean that there is a wider pipeline, so the language model that is effectively part of the generation process on the output language is helping you disambiguate what was said in the source language, which you do not get when you are doing a cascaded system.
Florian: We did come across some of that and we covered it. We got it from arXiv and we liberally cover research taken from arXiv, which is pre-print, i.e. it has not been peer-reviewed. The platform is a live feed of what is going on. It is quite cutting edge. We go there, we see what the research community wants, but we obviously cannot cover all of them because there is so much research coming out so we selectively pick and we pick for certain perceived interests. If it is too geeky and it has a two-sentence title, we would probably want to stay away from it. If it is from Big Tech, we look closer and if it has a great title, we want to look even closer. Long story short, how do we best cover this kind of research, not being able to peer review ourselves? How would you suggest that we go about that as we try to bring this closer to industry people?
Kenneth: Big Tech has much better PR departments that actually help advertise papers, compared to, say, academia where, yes, we have a PR department for the Informatics School in Edinburgh, but it is one or two people as opposed to a team that is promoting things. Therefore we see that Big Tech does much better branding and advertising, and that has fed into your implied trust. arXiv is bizarre and it has everything in there. There is some good stuff. There is some bad stuff. There are a few kinds of papers on there. One of them is the corporate press release disguised as an academic paper, so examples of that would be Google’s neural machine translation, where they claimed human parity, but actually, they made their claim with respect to a few short simple sentences and it has been rejected from multiple academic conferences and publication venues. I know that because I was reviewing it. Another example is one of Microsoft’s papers, where they also claimed human parity. That was more of a press release than anything else and defined human parity to mean, we ran a significance test and did not find any significance, but if you have taken statistics you know that failure to find significance does not actually mean anything. It could just mean that there was enough random noise in their annotations that they could not conclude anything from it. It does not mean that it is human parity, so both of these have actually resulted in published peer-reviewed papers debunking some of their claims.
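The statistical point can be illustrated with a small arithmetic sketch. The numbers below are invented for illustration and are not taken from any of the papers mentioned; the function name `sign_test_p_value` is hypothetical, and the exact sign test is just one simple choice of significance test. Suppose judges compare a human translation with a machine translation and prefer the human one in 70% of the comparisons. With only 10 comparisons the test finds nothing, yet with 100 comparisons at the same 70% rate the difference is clear.

```python
from math import comb


def sign_test_p_value(wins: int, n: int) -> float:
    """Two-sided exact sign test.

    Null hypothesis: each of the n paired comparisons is equally likely
    to favour either side. Returns the probability of a split at least
    as lopsided as the one observed.
    """
    k = max(wins, n - wins)  # size of the larger side of the split
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical data: humans preferred in 7 of 10 comparisons.
p_small = sign_test_p_value(7, 10)
# Same 70% preference rate, but with 100 comparisons.
p_large = sign_test_p_value(70, 100)

print(f"7 of 10:   p = {p_small:.4f}")
print(f"70 of 100: p = {p_large:.6f}")
```

In this sketch the small sample gives p = 0.34375, so the test fails to reach significance even though humans were preferred most of the time. The non-significant result reflects too little data, not equal quality.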
Another thing is there are smells that come out of papers sometimes. Using an old WMT test set is a good sign that their experimental practices are not so great, especially if they want to claim state of the art. Sometime after 2014, Google famously put out that paper and then they claimed state of the art in WMT14, but the Conference on Machine Translation still runs WMT. They have renewed the test set every year, partly to prevent overfitting by the community in the sense that if everyone runs on WMT14 all the time, all we are going to get is systems that are good at WMT14 and nothing else. That means each year everyone submits their best systems against the newest WMT, and anyone claiming to be state of the art on an old one is probably lying to you in the sense that, by way of using an old test set, they have excluded all of the latest state of the art systems from the comparison.
Another thing is cheating at BLEU scores. It happens in a lot of papers. BLEU looks for matches of 1, 2, 3, and 4 words at a time and the question is, what is a word? BLEU has an internal tokenizer it uses to split off a comma from a word, for instance, but some people run their own tokenizer, and if you run your own tokenizer and it splits more aggressively and breaks the hyphen in the middle, then all of a sudden that hyphenated word you had, which might have counted as one towards regular BLEU, now counts as three words that matched. So if a paper does not tell you, we ran SacreBLEU on de-tokenized references, then they are probably cheating on their BLEU scores and you cannot compare BLEU scores across papers as a result.
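A minimal sketch of the tokenization effect described above, counting only the clipped n-gram matches that BLEU's precisions are built on. Everything here is an assumption for illustration: the two regex tokenizers (`standard_tokenize`, `aggressive_tokenize`) are hypothetical stand-ins and are not BLEU's internal tokenizer or SacreBLEU's, and the sentence pair is made up.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def matched_ngrams(hyp_tokens, ref_tokens, max_n=4):
    """Clipped n-gram matches between hypothesis and reference, n = 1..max_n."""
    matches = {}
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hyp_tokens, n), ngrams(ref_tokens, n)
        matches[n] = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matches


def standard_tokenize(text):
    # Hypothetical tokenizer: splits off punctuation, keeps hyphenated words whole.
    return re.findall(r"[\w-]+|[^\w\s]", text)


def aggressive_tokenize(text):
    # Hypothetical tokenizer: also breaks words at the hyphen.
    return re.findall(r"\w+|[^\w\s]", text)


hypothesis = "a well-known author"
reference = "the well-known author"

for tokenize in (standard_tokenize, aggressive_tokenize):
    hyp, ref = tokenize(hypothesis), tokenize(reference)
    print(tokenize.__name__, len(hyp), "tokens:", matched_ngrams(hyp, ref))
```

With the hyphen kept, the hypothesis has three tokens and no matching 3-grams or 4-grams. With the hyphen split, the same sentence pair has five tokens and matches at every n-gram order, so the score rises although the translation did not change.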
Florian: Do you think some of this is done intentionally hoping no one finds out? Or is it just sloppiness or ignorance?
Kenneth: It is a combination of sloppiness and wanting to compare. WMT14 English to German turned into an industry and academia meme about comparing test sets and reporting BLEU scores. The nice thing about that is if you ignore all the problems where you cannot actually compare across papers because people use different tokenizers and no one in the industry can agree what the tokenizer is, you can make a table that says, haha, my BLEU score is higher than the following papers, without having to replicate all of their work. Replicating papers is also a problem that everyone in academia and industry has and some studies show you cannot replicate a lot of papers. Around 25% of groups could not replicate their own papers. It was meant partly to save on replication costs and partly it is easier if you can just download the examples that the toolkit has, do your thing that modifies it and then run the score that was provided by the toolkit. The problem is the score that was provided by the toolkit was probably created to replicate the Transformer paper and get exactly the same BLEU score, which means they had to put all of the same cheats that the Transformer paper had and now we are all stuck with the poor evaluation practices of one paper.
Florian: When we cover this, should we be more cautious and not put a mildly intriguing headline on top that claims something that the paper then does not fulfill?
Kenneth: What triggers the researcher is, I have been doing all this work on putting papers out on efficient machine translation and now this paper that is clearly bad, by having the amount of training data off by a factor of 10,000, is getting press. That is partly a failure on our part that we do not deal with the press as much as we should and there is also an incentive created to make exaggerated claims in a paper. I do not like to talk about the climate impact of an efficiency paper because Jevons paradox is this thing that says, if you make it more efficient to use coal, people will find more ways to use coal, and then in the UK, the coal supply will actually run down faster as a result of efficiency rather than last longer. In this case, if I make machine translation go faster, hopefully, more people will use it and ultimately the environmental cost of machine translation may well go up as a result of having made it more efficient and less costly to use. I am very suspicious of papers that jump from speed to climate, especially this paper that had no citation or methodology for converting GPU hours or power consumption into grams and yet somehow had a graph that involves grams.
Esther: Wrapping things up, I believe there is a product announcement and demo. Tell us a little bit about it?
Florian: This is the first demo we have and we are screen-sharing so watch the live demo on YouTube at 58:33.
Kenneth: I have been working on this Horizon 2020 project. I am the coordinator and the idea is we want machine translation to run efficiently client side. We also have quality estimation but the idea is to preserve your privacy and we know privacy is very important in a corporate context as well. Users do not want to send their data to Google Translate or Microsoft Translator, or what have you, even if they will sign NDAs. We thought of bringing it directly to the desktop’s CPU. It does not have to be any fancy GPU. I am using Intel graphics that are integrated into the CPU. It is available at translatelocally.com. You can download it yourself on Linux, Mac, and Windows, and it lets you translate so fast that it will go as you type. You can see how the German below is generating and updating itself as I type, and this is not just translating one sentence at a time. You can go in and make edits and fix things and then the German below will actually change as well. There is no caching or anything.
Florian: This is running on your local CPU?
Kenneth: Yeah, it is a Skylake. It is not even the latest processor. We have done it on my five-year-old laptop. We are integrating this with Firefox. I can also show you a German to English example. I have got some texts here from Wikipedia and all of that text was just translated in the flash you saw while it was waiting. You can see this English output here.
Florian: You have got seven language combinations now.
Esther: Yes, some are bi-directional.
Florian: It is interactive. It is live as you go. When did this become available?
Kenneth: We have been working on the core technology for a couple of years now, but putting a UI and having it go as you type came out in March this year.
Florian: What do you plan to do with this? How open-source is it if anybody wants to put another UI on top of it or build on top of it? How would that work?
Kenneth: Behind this, we have got a wrapper around Marian and there is the whole API. If you want to go in C++ we are working on a command-line program where you just tell it, here is some text and here is the language pair that I want to do the translation for and then output it. There is nothing, in principle, hard about it. It is just getting a pretty interface. With Mozilla, we are directly integrating this into the browser, so you have an extension experience, so when it detects something is not in the language you speak, according to the UI settings at least, then it will offer you to translate the page for you and work it into the HTML.
Florian: I also hope maybe at some point it will go to Brave if you continue to work with the browsers.
Kenneth: We have been talking to Brave as well and Brendan Eich actually appeared in the Slack channel once and we discussed some of the limitations of WebAssembly. It runs slower when you do it on top of WebAssembly so we are talking to Brave about integration directly into the browser as well.