Nico Herbig and Josef van Genabith join SlatorPod to talk all about their project, called MMPE, a multi-modal interface for post-editing machine translation.
Nico shares the motivation behind starting MMPE and how they aim to improve the collaboration between human translators and language technology. Josef explains how they geared the project toward a human-in-the-loop model, where they offer a combination of pen, touch, gesture, speech, keyboard, and mouse.
The duo reflect on the early research behind the project, where they gathered input from professional translators to narrow down the design and handwriting interaction. They discuss the hardware technology and their approach when it comes to measuring cognitive load.
The segment rounds off with the next steps for MMPE now that the project is complete — from the potential commercialization of certain aspects to further research in eye-tracking in tandem with automatic speech recognition.
First up, Florian and Esther discuss the language industry news of the week, with RWS unveiling a new market strategy and announcing the acquisition of Liones, and analysts projecting USD 1 billion in FY 2022 revenue.
In the US, Boostlingo has bought two tech-focused providers, Voiceboxer and Interpreter Intelligence, expanding their offering to VRI, OPI, RSI, and on-demand interpreting. Meanwhile, Google gives YouTubers early access to free machine dubbing with Aloud, the Google-grown dubbing solution that debuted in March 2022.
Florian talks about the StartSummit conference, which kicked off with Jaroslaw Kutylowski, Founder and CEO of DeepL, highlighting the company’s strong growth despite zero marketing.
In other news, France-based Weglot, a website localization provider powered by machine translation, secured a EUR 45m investment from tech-focused investor Partech Growth. And two of the largest French LSPs have come together as Super Agency Acolad acquired top-10 Leader Ubiqus.
Esther: Can you tell our listeners a bit about your background, how you got into the language industry?
Josef: That is a long story. When I was young, there was no such thing as natural language processing. At least you could not do courses in natural language processing, so my first degree is in electronic engineering and English. I tried to put the two things together and that is how it all started. That was in Germany and I was very, very lucky. I got a scholarship to go to the UK for a couple of years where I was able to do my PhD. I came back to Germany, worked as a postdoc in Stuttgart. Then I spent lots of time in Ireland. I spent 17 years at Dublin City University and now I have been back in Germany for 8 years.
Nico: My story is not as long as Josef’s. I studied for my bachelor’s and master’s in computer science, so I do not come directly from the language industry. Then I started my PhD at the German Research Center for Artificial Intelligence. I was at the Chair for human-computer interaction but was co-supervised by Josef, so it is a combination of HCI and NLP, and that is also what the project was about.
Josef: We really needed that for this project, because it is about the collaboration between humans and computers. Our area of expertise in multilingual language technologies is the language technology bit. Nico’s area of expertise, and that of Antonio Krüger, who was the other supervisor and the boss of DFKI, is human-computer interfaces, so we are very lucky to be able to put those two things together.
Florian: Going back to Ireland, is Dublin University very active generally in machine translation, NLP?
Josef: Very much so. Andy Way and Qun Liu are colleagues from Dublin City University. Qun is now the Head of NLP Research in Huawei in Hong Kong. You have probably come across the ADAPT Center. The first version of the ADAPT Center was called CNGL, Center for Next Generation Localisation and I was the founding director, so I put that together with colleagues, and then I left to come back to Germany. CNGL turned into ADAPT and it is now led very successfully by Vincent Wade at Trinity College Dublin and by Andy. Andy is the Deputy Director of ADAPT and they are doing very well. It is amazing. The DFKI is one of the oldest research institutes in artificial intelligence and I knew the former director quite well for this particular group. They had lots of labs and it is not just in Saarbrücken, it is in a couple of sites across Germany and it was a good possibility at the time.
Esther: Let us dive into the project, MMPE. Can you tell us what that is in a nutshell and what the reason was in particular for starting the project?
Nico: The reason was that if you look at research on NLP or language technology, great tools are developed every year and presented at conferences, but what is often missing is the integration with humans, so studies on how humans use those technologies and how we can improve the collaboration between the human translator and language technology. Often you see technical evaluations and all these tools are put somewhere in the submenu of an existing CAT tool. We thought to take the human-computer interaction perspective and the MT perspective and try to put together something great that enhances the collaboration. The idea was that when you switch to post-editing from normal translation, your interaction patterns change quite a lot. The mouse and keyboard were developed for producing text, so you start with an empty text box and you type your text and for that mouse and keyboard are great tools. We have no concerns about that, but in post-editing, the machine does the first draft and you edit it. The question is, are mouse and keyboard still good tools for that? That was the core question we were asking and therefore we explored other interaction modalities. People are used to this stuff from their smartphones, with touch. From tablets, with a pen. Speech interfaces with Alexa. We have all of these tools in our daily lives and therefore we thought, let us try that out for post-editing and also hand gestures or eye-tracking, so we also explored these things that most people do not know. The combination of all those different things means that overall we had to not just change things on the user interface side, but also on the hardware side because otherwise, you cannot explore all of this.
Esther: You mentioned the interaction between humans and language technology has not been a major focus in the past, but what impact do you think that has had?
Nico: It starts from the beginning. MT is produced with the goal of getting as close as possible to a reference translation, so the models are already optimized to produce text that matches a reference. That is not necessarily the same as something that is easy to post-edit, so this is one angle you could take: can you create an MT output that is easier to post-edit? That matters as long as you still need post-editing, and I guess that will be for a very long time if you look at all the complex texts and the language pairs where there is just not enough data, and so on.
Josef: One of the reasons why we did this was that machine translation is getting pretty good, at least between some language pairs and where you have enough training data. It still makes mistakes like humans make mistakes and if there is a lot of value in the translation then you need a professional human-in-the-loop because the mistakes that the machine translation makes are getting more and more subtle. In the old days, there were lots of agreement errors or reordering but sometimes with neural machine translation it looks perfect, but it is not. We need post-editing and we need to support that in such a way that it is geared towards the human-in-the-loop or the human expert. As humans, if we communicate with each other, we do not always use keyboard and mouse. We speak, we gesture, we use vision, eye contact, we do lots of things. Part of the project was how can we best support all these multimodal ways of interacting with humans or with a human and a machine or human and content? How can we make that available? There was good research in the past on eye-tracking and other research on using electronic pens to post-edit, for example, but it was not brought together in one place. Often we use different modalities at the same time. We speak, we use gaze, we might point at something or touch something and part of the project was to look at the individual interaction modalities on top of this. That is why it is called MMPE, multimodal post-editing, in addition, and on top of the mouse and the keyboard. Mouse and keyboard are incredibly useful for many things but they might no longer be optimal for the modern task of post-editing high-quality machine translation output and so we looked at the individual interaction modalities. We looked at some of the combinations and one of the things that was important that came from the human-computer interaction side was that from the very beginning, we had human translators in the loop.
Initially, we had no idea what these interaction modalities would be like or what human professional translators would think of them. They have years and years of experience with the mouse and keyboard and it is very hard to wean them away from this but we started with questionnaires and with experiments. Nico conducted all of this to look at a couple of tasks that you are supposed to be doing in post-editing. If you imagine you have all these modalities, how would you do it? That was one of the first steps.
Nico: We started with elicitation studies and tried to elicit ideas from the professionals because we know what interaction modalities could work, but we do not know the translation task as well as a professional translator. We went to professional translators, talked to them, and discussed all the ideas, ranging from eye-tracking through gestures through to speech. They could imagine speech quite well because most of them have used it in some CAT tool, but we got them more familiar with the other ideas. They started to love the pen, for example. They said it would be a great option and also the combination of pen and speech, by marking something and telling the system to move that to the end of the sentence. You do not need to do everything in one modality and you can distribute across modalities how the interaction should work.
Josef: Those early interviews narrowed down the design space. They gave us strong signals of what the professional translators thought were good interaction modalities, combinations, etcetera. We did not start off with a blank slate. We did not start from scratch, so it was very much informed by these experiments, by these studies and the questionnaires and evaluating them and seeing, what do they want? That then led to the first implementation.
Nico: Also the choice of hardware. They were very eager to try out handwriting. You cannot handwrite in the air all day, or you would get the so-called gorilla arm syndrome. No one can do that the whole day, so it was important to have a screen that you can put flat on the table and also move up and so on. That is where our hardware came in and all the little design decisions: which pen, which display, and so on. We tried to implement it very generically so that it also works on other devices. You do not need that hardware. For our studies, we chose a big screen that you can put down and a headset so that the audio quality and speech recognition are good. With the screen, you can directly move things around with your finger. We used a Leap Motion camera to detect finger movements and an eye tracker that detected where you are looking and so on. We plugged all of this together and created the different interaction modalities.
Josef: The technology is very generic in both hardware, but also the software. There was a lot of software to integrate this behind the scenes. There is the front end, really nice visualization, which also has a couple of novel aspects and quite an amazing backend that is running behind the scenes that puts everything together. There were many dimensions to this work, so once the prototype was up and running, we could test it and evaluate it, then go back to the design board and change things and adapt things. One important aspect was that we tried to measure cognitive load and it is very hard. One of the guiding principles was to come up with designs and solutions that help to ease the cognitive load.
Nico: On the one hand, we have this input and interaction side that we just talked about and on the other side, we also tried passively monitoring the user to understand what is going on. Quality and cognitive load are also not the same things. A sentence can be very badly translated by the MT, but if it is a very easy sentence, fixing the errors is very easy. On the other hand, it can be something that is already very close to the reference but finding that error and finding the spot where it goes wrong can be hard. Therefore we also tried out many physiological sensors, such as eye-tracking for pupil diameter and also the number of blinks, then lots of skin conductance measures. We have a smartwatch that measures conductivity on the skin, so whether you are sweating. Then different heart sensors that also measure the heart rate variability and normal heart rate and all of that. We tried to get better approximations of cognitive load by combining and fusing those different modalities.
Josef: There was even face recognition in some of the experiments. The idea is not that you are wired up to all of these things while you are interacting with MMPE; this is rather to guide the development and see to what extent it is possible. It is hard to measure cognitive load and the idea was to get as many signals as possible, such as the way you sit, so one of the signals was posture detection.
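To make the idea of fusing several physiological signals concrete, here is a minimal Python sketch. It is not the method used in the MMPE project, which is not described in this conversation. It simply normalizes each signal against a per-person resting baseline and averages the results. The signal names, the assumed direction of each signal under load, and the functions `z_score` and `load_index` are all hypothetical.

```python
from statistics import mean, stdev

# Assumption for illustration: +1 if the signal tends to rise with cognitive
# load, -1 if it tends to fall. These names and signs are hypothetical.
DIRECTION = {
    "pupil_diameter_mm": 1,
    "skin_conductance_us": 1,
    "heart_rate_bpm": 1,
    "hrv_rmssd_ms": -1,
}


def z_score(value, baseline):
    """Express `value` in standard deviations from the baseline mean.

    `baseline` needs at least two readings taken at rest.
    """
    spread = stdev(baseline)
    if spread == 0:
        return 0.0
    return (value - mean(baseline)) / spread


def load_index(sample, baselines):
    """Average the direction-corrected z-scores of all signals in `sample`."""
    scores = [
        DIRECTION[name] * z_score(sample[name], baselines[name])
        for name in DIRECTION
        if name in sample
    ]
    if not scores:
        raise ValueError("sample contains none of the known signals")
    return mean(scores)


baselines = {"pupil_diameter_mm": [3, 4, 5], "hrv_rmssd_ms": [40, 50, 60]}
at_rest = {"pupil_diameter_mm": 4, "hrv_rmssd_ms": 50}
under_load = {"pupil_diameter_mm": 6, "hrv_rmssd_ms": 30}
print(load_index(at_rest, baselines), load_index(under_load, baselines))
```

Normalizing per person matters because resting heart rate or pupil size differs between people, so raw values are hard to compare. A real system would more likely learn the weighting from labelled data than use a plain average.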
Florian: Is there any chance you will commercialize this? Do you have any plans to commercialize this? Why do you think nobody has done this before because it is literally a billion-dollar market?
Nico: This was a publicly funded research project, so one of the aspects is to also try out things that might be a long shot, so where you do not see the money coming in directly. Rather, things that are cool and you want to try out and you want to see if they work or do not work and that is why we also tried more fancy things like gesture interaction. We are not commercializing this but we hope that ideas from it get commercialized. Here, you need to distinguish things like eye-tracking. Not many people have eye-trackers at home. This is probably not a good market to commercialize right now. Same with gesture detection. With pen interaction, so many people have iPads with a pen at home and you can review other translators’ work, but also machine translation in the form of post-editing. The better MT gets the fewer changes you need to do and the more important those interaction modalities will become. What participants liked a lot was reordering just by grabbing the words and moving them somewhere else. This is very complex to achieve with a mouse and keyboard. You need to mark things, cut them, paste them, and so on, compared to dragging it somewhere. There were things that could be implemented and commercialized and speech is currently being commercialized with many CAT tools. It is gaining more and more popularity.
Josef: This is basic research and we are looking to the future of post-editing or some of it is the future of post-editing. Some of the things are quite close to commercialization, for example, speech in combination with other modalities and there are good reasons for this. If you type and use the mouse the whole day, then carpal tunnel syndrome is one of the injuries in the profession. If you have a chance to switch modalities then that is pretty cool. It might help you to move a bit and use gestures, et cetera, so there are plenty of possibilities. This was a basic research project and because the funding is public money, all the results have to go back to the public and you can download the software and the configurations, et cetera, on GitHub. This is an invitation to other people to continue the research and play with it and also get inspiration for commercial undertakings.
Esther: When you were workshopping the ideas and generally creating MMPE, was it restricted to certain language combinations and certain text types? What have you observed how well it works for different settings?
Nico: We focused on English to German text from the news domain, so that it is pretty generic. Why did we do that? It was hard to find the participants and if you limit that down to some narrow domain that not many people are experts in or some language pair where there are not so many translators, it gets harder to find participants. That is why we tried to stay pretty generic there and use something for which we could easily find translators in Germany.
Josef: For most of the publications we were working closely with professional human translators as evaluators, guides, et cetera, and it is easier to find German-English, English-German translators here in Germany. In a couple of cases, we were not able to involve professional translators, but we involved master’s translation students who are in their final year. They are professionals in training and are pretty close to being professionals but we only did that when we could not get the human professional translators involved and that was in the final part of the project because of COVID-19.
Nico: This depended on the hardware requirements. Where hardware was required, we did that with students. If no hardware was required, we were exploring other things that were purely on the interface side, and we did that online with freelance professional translators.
Florian: Did you look at how modality may influence creativity? Or different approaches to translation solutions?
Nico: We did in many regards, but not regarding creativity. We were mostly focusing on comparing the different modalities or modality combinations. We were measuring, how long does it take? How is it with touch? How is it with pen and speech combined? How is it only with speech? How is it only with gestures? How is it with a mouse and keyboard? Those were the things we compared, so a very narrowed down comparison because we wanted to measure times, we wanted to measure subjective feedback on each of those interaction modalities but it is a very good question. Is there an impact on creativity? Could be, I hope, but I cannot say.
Josef: The research was very much focused on specific questions, so one of the things that we wanted to know is if the interaction involves producing text versus identifying something, marking it, and then moving it by touching the screen or by gesture or by a command. Part of the research was to also see how the different modalities and their interactions distribute over those tasks because some modalities are more suited for text generation. For example, you might dictate text and if you reorder it, it might be easier to just touch it, identify a span, and move it on the screen with your finger and it re-slots back into a new environment. There were some interesting findings, like which modalities map well onto which interactions and which modality combinations map well onto which interactions. This is really projecting into the future and speculating because we did not measure creativity. It is an extremely good point, but one hunch would be that if you are just constrained to the keyboard and the mouse, then you are likely to be conservative with respect to the effort that you put in. You do not want to retype the whole sentence if you get a reasonable translation and only need to change two things. It is also economical. Often human professional translators are under enormous pressure, especially with post-editing because they have to fulfill a quota. Maybe with speech recognition and this is just a guess, you are more likely to dictate a new version of the sentence because it is easier than typing the whole thing. That is an interesting point that you raised, and this might be the future and it should be investigated.
Florian: Have you gotten any feedback so far from any commercial organizations and fellow researchers when you presented the paper? What was the general feedback here?
Josef: If you are in front of the screen with your hands, you grab the screen, you move it around, you release it somewhere else. There are modality combinations. There is traditional copyediting that people are trained with and there is speech, there is handwriting recognition and so it creates the spaces automatically, rearranges stuff. They are quite stunning.
Nico: What I also found interesting for handwriting was that professional translators initially all thought, if I can do handwriting, I am going to handwrite almost everything, and it turned out to be very useful if you do small changes, if you delete a word, if you exchange a word for another word, if you reorder. For these changes, it was very good, but as soon as you needed to write multiple words, then speech and the keyboard are definitely much better and faster. This also is where the different advantages come in. For reordering, pen and touch are great but speech is not so good and mouse and keyboard are also not so good for reordering, but when you want to add something, those things are great.
Josef: It is the freedom that you have. People can avail of different modalities and whatever works best for them. A lot of the experiments showed that depending on the task, the new modalities are competitive. Often we measure time by doing it the old way, doing it in different ways, the new ways. The research opens up many different ways of doing the same thing. In our experimental results outcomes were achieved with our subjects having sometimes 20, 30 years of classical training with keyboard and mouse and it is not easy to undo this when you are competing with muscle memory that has been trained for decades in some cases. Also with a younger translator, the standard is a keyboard and mouse. If your experimental results are good and you are competitive, sometimes even faster for certain things, that is a good achievement. It is a glimpse into the future, we are hoping at least.
Esther: Are there any more steps to come with MMPE? Is this as far as the project goes? Do you expect anyone to pick it up and run with it and develop it further?
Josef: The project has ended, unfortunately, but also fortunately as Nico is now part of a new exciting company that also does lots and lots of natural language processing. That is an extremely successful startup here in Saarbrücken. For MMPE, we are hoping to continue it so one dream is, you have probably heard about these brain readers where you wear a cap. It is hard but maybe there is a place for it in post-editing and in addition to other modalities, it is something that we have not explored. In the original MMPE, one of the things that we wanted to do is eye-tracking. Eye-tracking shows you where on the screen you are looking and in tandem with automatic speech recognition, the eye-tracking might prime the speech recognizer. It is because you are likely to talk about something that you are looking at and usually the gaze is a few hundred milliseconds before you say something and that is enough time to prime or tune the speech recognizer to get better recognition rates and so there are plenty of possibilities. I think we opened up a small Pandora’s box and hopefully, more good stuff is going to come out.
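One simple way to picture gaze priming a speech recognizer is to re-rank the recognizer's candidate transcriptions, giving a small bonus to candidates that mention words the user has just looked at. The Python sketch below shows only that re-ranking step, under stated assumptions. It does not describe anything implemented in MMPE. The function `pick_hypothesis`, the score scale, and the `bonus` value are hypothetical.

```python
def pick_hypothesis(hypotheses, gazed_words, bonus=0.5):
    """Return the best transcription after rewarding gazed-at words.

    hypotheses: list of (text, score) pairs, higher score is better
                (for example log-probabilities from a recognizer).
    gazed_words: words the eye tracker reported as recently fixated.
    bonus: hypothetical score added per word that matches the gaze.
    """
    gazed = {word.lower() for word in gazed_words}
    best_text, best_score = None, float("-inf")
    for text, score in hypotheses:
        hits = sum(1 for word in text.lower().split() if word in gazed)
        adjusted = score + bonus * hits
        if adjusted > best_score:
            best_text, best_score = text, adjusted
    return best_text


candidates = [
    ("replace house with home", -1.0),
    ("replace mouse with home", -0.8),
]
# Without gaze information the acoustically better "mouse" wins;
# with the user looking at "house", the first candidate is preferred.
print(pick_hypothesis(candidates, []))
print(pick_hypothesis(candidates, ["house"]))
```

The bonus is applied once per matching word, so its size would need tuning against the recognizer's score range. Too large a value would let gaze override clear speech.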
Nico: I finished my PhD thesis with lots of open research questions because we tried out so many things and you could now dig into each of these in much more depth. We also tried to combine it with interactive post-editing so the MT reacts to your changes because you do one small change somewhere and then the whole thing readapts. This will also become very useful because then you never need to do lots of writing but just small changes here. This combination of modalities with such interactive post-editing is also something that I would have loved to look into. What I personally like about DeepL is that you can click on a word and get a list of the best alternatives. What I do not like about it is, for example, you do not know the impact of clicking on one of these suggestions. Sometimes you click on something and it just exchanges the word for a synonym, sometimes the whole remainder of the sentence is changed. We also tried different visualizations for that, so some options will offer a synonym change, others will be a minor change that changes three or four words, and others will be big changes. If you combine that with a pen or just finger touch, you would become super-efficient and also produce high-quality output.
Josef: The idea is that the system gives you an indication of the potential impact of a change, so it tries to predict that, not just give you the different options, but say, if you click here, all hell is going to break loose. If you click on that one, it is just going to change the word. That is the idea, there was a master’s thesis on this.
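As a rough illustration of labelling a suggestion by its impact, the Python sketch below compares the current sentence with the sentence that would result from accepting a suggestion, and counts how many words differ. It is a minimal sketch that assumes the resulting sentence is already available. It is not the approach from the thesis mentioned above, and the labels, the `minor_limit` threshold, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def changed_word_count(current, alternative):
    """Count the words touched when turning `current` into `alternative`."""
    old_words, new_words = current.split(), alternative.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced span counts as the longer of its two sides.
            changed += max(i2 - i1, j2 - j1)
    return changed


def impact_label(current, alternative, minor_limit=4):
    """Map the size of a change to a coarse label for display."""
    count = changed_word_count(current, alternative)
    if count == 0:
        return "none"
    if count == 1:
        return "single_word"
    if count <= minor_limit:
        return "minor"
    return "major"


sentence = "The cat sat on the mat today"
for option in (
    "The cat rested on the mat today",
    "The cat was sitting on the mat today",
    "Today a dog was lying under the table",
):
    print(impact_label(sentence, option), "->", option)
```

An interface could use such a label to color or group the suggestions, so the translator sees before clicking whether only one word or most of the sentence would change.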