January 15, 2021
Oxford Languages President Casper Grathwohl on Language Data as a Service
Casper Grathwohl, President of Oxford Languages and Academic Product Director at Oxford University Press (OUP), joins the Pod.
Casper talks about his role at Oxford Languages, working with big tech customers, key drivers behind data annotation’s fast growing industry, and the world’s under-served and low resource languages.
He also discusses Oxford English Dictionary’s business in lexical data, and developing new annotation and visualization tools at their lab.
Florian and Esther talk about Slator’s newly-launched 2020 M&A and Funding report.
Florian talks about transcription and audio/video editing tool Descript — which raised USD 30m to expand its capabilities for enterprise use. He also shares news from Japan-based MT and human translation provider Rozetta.
Florian: You are the president of Oxford Languages and the Academic Product Director at Oxford University Press. Tell us a bit more about your background, those two roles, and how long have you been with the company?
Casper: I am what they used to call a lifer at Oxford University Press. I have been there for over 20 years and I have had a variety of roles within the business. Right now, I run the Oxford Languages division and this part of the press is where our dictionary publishing is. It is where the Oxford English Dictionary, that iconic brand and program, sits, as well as our language data services and language engineering work.
I also serve as the Academic Product Director, and that is about our academic publishing at Oxford University Press, which includes our journals, our academic monographs, the work that we do for the scholarly community. The market there is mainly the institutional market around libraries at universities, schools, governments. In the Oxford Languages program, we are focused squarely on the technology sector and the sub-sector that is interested in language data services.
Esther: You contributed recently to our Data-for-AI report and we had a case study of OUP which was great. Can you give us a short overview of the types of data that you work with and the types of lexical data that you license and maintain for different companies?
Casper: The keyword that you said there was lexical. There are lots of different types of language data, and we focus on lexical data and lexicographical data. That is data that at its core had originally come out of dictionaries. It is dictionary definitions, thesaurus information, synonyms, antonyms, and sample sentences. It is language material that originally had been in what we would consider a traditional dictionary. As we shifted from thinking about the dictionary content as content to it actually being data, that mindset shift expanded what we offer in that the lexical data that we had been building for dictionaries is an uncommon kind of data that is valuable for a variety of uses within the technology sector, whether it be localization or machine translation or AI training.
Highly structured lexical data is a very specific subset, but a very valuable one, and so we spend time building lexical data across a variety of languages. We build it down to the sense level in terms of the identification of words and the disambiguation there. We enriched that traditional data with a lot more intelligent information and additional material that ended up making it much more useful for specific use cases that the technology sector has. That includes common things such as frequency or sentiment, or how you link at the sense level in a multilingual way to help with translation. We are constantly getting requests to enrich our existing data in different ways to make it more valuable for those customers, and it has become a big business for us across about 40 or 50 languages.
Esther: Has it shifted in terms of what you are doing when it was purely the print and dictionary-based to what you are doing now with the language data, and what was the rationale for that transition?
Casper: I thought our rationale was basically survival. As we all know, the idea of a dictionary has been so disrupted and exploded in the digital age that most traditional dictionary publishers either are no longer involved in that business or are struggling, in a way that we started to anticipate and see even 20 years ago.
What I think was the most helpful part of the lifeline we had was that we started getting involved in Japan with handheld bilingual Japanese-English dictionaries. There was a big demand in that market for handheld electronic dictionaries, and this was in the late nineties, early two-thousands. We supplied a lot of that. Our mindset shifted from content to data. That is when we started to think about what we do as a language data service, and it has transformed our business over the years.
We still publish dictionaries in a sense, and we work on the same kind of data, but for such radically different uses and to such a different market, that it does not even feel accurate to describe us as a dictionary publisher anymore. That was a gradual transition, but I think a lot of people are not aware that what we do is enhanced data services around lexical content now, and we still have lexicographers and language specialists on staff. We have got over a hundred people who work within the program who have that kind of specialized training and expertise. We now have a large group of people who are language engineers, data specialists and those who really are working with language specialists to make sure that we can provide the enhanced and robust data that meets the demands that are coming out of the tech sector.
Florian: What were some of the key drivers and key innovations that happened in the broader digital space that took this from a very niche space to companies like Lionbridge selling it for a billion dollars or Appen being valued at $3 billion? What was the journey there?
Casper: There are others who probably understand this from a slightly different perspective and mine comes from our journey at OUP. Some of it comes from the way in which the technology sector, in particular, driven by big tech, moved into how they have wanted to localize. How they have thought about what originally were secondary and tertiary markets that had language barriers. Now that they have yoked the world, they are going deeper and deeper into local markets and need language data that we are a component of. What we do is a small piece, but a very important piece so it is valuable.
Another part of this is not just the localization, but also the sense of how AI has evolved and needs training data, and how machine translation and neural networks have been changing things. At first, they needed a lot of data, but what we quickly realized as an industry is that they need really clean data. The more sophisticated the data, the more effective the tools and applications that you build out of it. That is again where this structured data is really valuable, because it is one of the aspects that is often highly curated. Even if it started out in a big data sense through a community gathering project, there is scrubbing and other work done on it that makes it much more valuable for a lot of purposes. It is those kinds of developments that have driven the language data field, and it has been encouraging to see this kind of growth.
I was worried for a period of time, particularly with major world languages, that the networks, the MT, all of the different applications and software would quickly leapfrog and bypass the need for this kind of structured data as we exploited it. That we would be very valuable for a time and then the industry would leave us behind unless we pivoted to do something else. What I found interesting is that this is not the case. There continues to be a need, as industry use cases evolve, to continually modify, enhance and evolve that structured clean data. With localization, as companies move into markets with low resource language needs and requirements, more and more languages are being enfranchised in that digital and technology iteration. That has proven to be a much longer process, and one that is really resilient when it comes to the need for the kind of data that we can produce.
Florian: In terms of the team composition, who structures the data? What kind of professionals do you have? What are their roles? What kind of positions now are you hiring for?
Casper: We started out with our core strength being lexicographical expertise, with editors who can check, validate, verify, and ensure that structured data is clean and robust in the ways that it consistently needs to be. We have now moved towards a competency around language engineering and data architecture, and we have got semantic engineers who work in our program. That is an area of our business that has grown over the last several years. We have layered in more people, particularly in our data teams. That is probably where you would notice the transformation of our program if you looked at it four or five years ago and then you looked at it now.
One of the things I think is interesting is the makeup of the program. If you had asked me five years ago what it would look like, I would have thought it would have changed more radically than it has. Some of that is probably down to the fact that things do not change as quickly as you think they are going to, but in other ways they change much more quickly. I think we are at a particularly pivotal point in our evolution as a language technology business in that we have got a couple of fantastic competencies. Our lexicographical experience and expertise are unrivaled in English, and we have got a power brand.
What we do not have is the bench strength and the traditional power in software development, the building of the tools and applications, and sophisticated enrichment of our data that requires a lot of semantic engineering. We are building up that capability, but we do not have it yet. That is one of the reasons why, when I have been talking to a lot of people, I have been interested in industry partnerships, because we have a very specific and uncommon competency. A complementary competency could be interesting as a way for us to build, as opposed to bringing that all in-house. We could partner with a group, an organization or an institution that has that software capability in the language engineering space around tools and the next point on the value chain. I have been searching, looking and trying to talk to a lot of businesses to find the right fit for us and for them, so that we can expand the range of materials and the role we play in the industry. There is a real opportunity there.
Esther: There is a range of customers you work with as a business. Can you tell us about some of the similarities and differences you have observed from the traditional dictionary publisher side, compared with big tech clients that you are working with as well?
Casper: As we transitioned from a dictionary publisher to a language data provider, the use case was one that at first was fairly straightforward and expected. It was one where it was more about surfacing our dictionary related content or data in the digital realm. Think about when you scroll over and want to see a definition of something or the kind of ways in which an online dictionary is embedded into some operating system or a piece of software. That is really how it started for us. It made sense because that bridge was the easiest to cross. It is what we were doing in our print dictionary realm, and we transferred it to that same experience in the digital world.
Our language data is now more deeply embedded and is not necessarily about surfacing in a display sense. That still happens, and you can see Oxford’s data around the world with big tech players as they surface it, sometimes for purely lexical queries and such. But it now also plays a heavy role behind the scenes, like a lot of other language data. It is foundational and it is used as stepping stones to build some of these more sophisticated experiences. That is where it requires a lot of enhancement, standardization and richness.
Over the last 15 years or so, as we started to work with big tech companies that were looking for more traditional dictionary data but to be exploited in their environments, we started not just licensing our bilingual material and our English monolingual material, but we started going out to the traditional dictionary publishers in a variety of local language markets. We identified the most trusted brand and the most trusted dictionary in that market. A lot of times the business that supported those lexical resources in print did not have the scale or the capability to be making that digital transition themselves.
We have now licensed, in at least 30 or 40 languages, the best traditional lexical material in those languages. As we bring them into Oxford, we standardize them as best we can. Standardization across languages is a little trickier than some people would like to think it is. We standardize it into formats, we clean it, we prepare it so that we have got a clearinghouse of lexical data, the best in the world, that then becomes a one-stop-shop for a lot of those technology companies. That is most effective for big tech because they have got the need for the widest number of languages and the widest number of use cases within a single business. Subsets of them in particular language pairings are valuable for a variety of tech businesses for different use cases, and we have seen those use cases continue to evolve in a way that has been very heartening when I think about the future of our business.
Esther: There is the idea of low resource languages and the importance of, as you described it, tier two languages. You have got a content building hub in the local market in India, so can you tell us a bit about what they are doing and the logic for having it in India?
Casper: It was about 2012 that we started a program called Oxford Global Languages and we identified at that point that there was a real challenge around low resource languages in the digital space. Oxford University Press is a department of Oxford University, we are not-for-profit, we are mission-based and it was surprising how few languages, really the world’s global big languages, were the ones that had enough gravity of language data that they were naturally iterating as the versions of technology continued to evolve. They were carried with it, but after 10 to 12 languages, which at that point might have even been a little generous, it just fell off a cliff in terms of the local language development that was happening. We focused on the idea that the global language tapestry is going to be a lot richer and communication is going to be a lot more effective if we can help give a boost to languages that might have millions of speakers, but you are not seeing them in the applications and software and digital lives that you are leading. We are not the only business that has done this.
Now it is nice to see a network, almost a community, of language data people and others in the technology space that are focused on this and trying to work on it. We started with this hub in India, as you said. The Indian business is one where we realized we needed a market group that was able to provide some of that lexical expertise dynamically on the ground there, and that was also close to the needs of that market in particular so that they could help us. Whether or not a company that is interested in Indian languages is a local Indian business, they needed to understand what those requirements were on the ground and how we were enhancing that data and for what purposes. We have got a unit there now that is a mix of freelance, contract and on-staff people, and it has also been spearheading our building of a broader network, the way that a lot of other language data businesses have.
When I first was starting to get familiar with the LSPs, I thought, God, these companies are so big and they are so sophisticated. As I looked behind the curtain, I thought that their core competency, more than anything, was that they are project management businesses. They are hardly even language businesses. What they specialize in is project management. That is a hard thing to do when you are talking about 8,000 people that you are pulling in to contribute to a project. That is not our specialty. Ours is around the expertise that comes when you are validating a lot of what big project management can yield. Our India hub is really about us trying to find a hybrid there, to make sure that we are working bottom-up in terms of bringing in and building new data and then having a team that has the expertise to validate that in a way that it can be highly effective for the companies that we are working with.
So far it has been a great success story. We have moved into and we have built linked datasets in about six or eight of the 12 major Indian languages. The demand has been growing rapidly for that work. It started with big tech but now I am pleased to see that there is a wider variety of small and medium-sized enterprise businesses that are interested in exploiting this data. We will be thinking about that model as we move beyond.
What comes after this India hub? Are we going to be expanding to other areas? I think it is not replicable exactly, but Southeast Asia has a lot of need around some of those local languages. What I am interested in and watching closely is Africa. African languages are really interesting to me, partly because I feel like the work is valuable and needed. That market is so slow because of the low economic value that a lot of businesses see in moving into those markets. The more we can be one of a community of companies that help create the building blocks for that localization, the more the bar for companies wanting to localize into African languages will continue to go down, and the easier and less expensive it will be to do. That will speed up the rate at which we enfranchise these African languages, which are really important. There are still some places around the world where I think the model we have set up in India could be replicated over time.
Florian: When you are going to these languages, are you doing very basic codifying work? There is no Oxford dictionary for some of these languages and you are putting in a base layer of grammar, syntax, codifying rules? Even today, in 2021?
Casper: Yes. If it has been there, it has been there in a much more ad hoc way or has not been done in a way that then becomes machine-readable and usable from a technology company or a software developer point of view. Some of that is needed. Sometimes it has been just the core data. There might be a dictionary and a thesaurus, but what you need are a lot of the grammars, some idiomatic data, and good morphologisers.
One of the things that we are working on with Indian languages is not just bilingual with English but is a multilingual net linked at the sense level. That is valuable whether it be for AI training or neural machine translation, where at a sense level you have a dictionary entry, all of the synonyms and antonyms, a large bank of example sentences that have been qualified and are tagged to that, all at the sense level, linked to that exact same sense in another language with all of those other things, including audio. You then put those together and I think it creates one of the building blocks that allow you to do a lot more sophisticated software development and building of applications and tools.
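As an editorial illustration of what "linked at the sense level" can mean in practice, here is a minimal Python sketch. It is not Oxford's actual schema: the class names, field names, sense IDs and the sample English and Hindi entries are all hypothetical and invented for this example. The point it shows is that links connect individual senses rather than whole words, so the two senses of English "bank" map to different Hindi words.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sense:
    """One sense of one word in one language (hypothetical field names)."""
    sense_id: str
    language: str
    lemma: str
    definition: str
    synonyms: list = field(default_factory=list)
    antonyms: list = field(default_factory=list)
    examples: list = field(default_factory=list)
    audio_file: Optional[str] = None  # path to a pronunciation recording, if any


class SenseNet:
    """A multilingual net whose edges connect equivalent senses, not words."""

    def __init__(self):
        self.senses = {}  # sense_id -> Sense
        self.links = {}   # sense_id -> set of linked sense_ids

    def add(self, sense):
        self.senses[sense.sense_id] = sense
        self.links.setdefault(sense.sense_id, set())

    def link(self, id_a, id_b):
        # Links are symmetric: each sense points at the other.
        self.links[id_a].add(id_b)
        self.links[id_b].add(id_a)

    def equivalents(self, sense_id, language):
        """Return the senses in `language` linked to the given sense."""
        found = [self.senses[i] for i in self.links[sense_id]
                 if self.senses[i].language == language]
        return sorted(found, key=lambda s: s.sense_id)


# Made-up sample data, for illustration only.
net = SenseNet()
net.add(Sense("en.bank.1", "en", "bank",
              "an institution that holds and lends money",
              examples=["She opened an account at the bank."]))
net.add(Sense("en.bank.2", "en", "bank",
              "the land alongside a river",
              examples=["They sat on the bank of the river."]))
net.add(Sense("hi.bank.1", "hi", "बैंक", "financial institution"))
net.add(Sense("hi.kinara.1", "hi", "किनारा", "edge or shore of a river"))
net.link("en.bank.1", "hi.bank.1")
net.link("en.bank.2", "hi.kinara.1")

for sid in ("en.bank.1", "en.bank.2"):
    print(sid, "->", [s.lemma for s in net.equivalents(sid, "hi")])
```

A word-level bilingual list would map "bank" to both Hindi words with no way to choose between them. Attaching examples, synonyms and audio to each sense, and linking sense to sense, is what lets a translation or training pipeline pick the right equivalent.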
Esther: What areas of data provision do you think are poorly served right now and where is there an opportunity for LSPs and other providers to build those capabilities?
Casper: I only see a certain sector of the market from the perspective that we are at, which is pretty specialized. One opportunity that I do see, which we have been talking about for years, is Chinese big tech. That is a real opportunity right now. We work with a lot of them, and a lot of others do. China has such a strong relationship in the developing world. China’s been such a big market and it has been evolving so quickly that there has been no need for those Chinese big tech companies to focus too much outside of China itself. It has been growing and it has been dynamically building, but I see that starting to change. Ironically, the pandemic will change things. It will make things move even more quickly.
Chinese big tech companies, as they start to look outward, are going to be interesting. There are things with the tension between India and China. What happens, how welcome are they from a political point of view to be moving into some of these places? That is a different question, but I do think that interaction and integration of the Chinese big tech companies and a much wider variety of localization is an interesting area. From a data point of view, we have been noting that there is a real need for currency in the structured language data that we provide. The currency of the lexical data is important, and what it comes down to is that a lot of technology companies have to be reflecting language, not just using language data as a basis for some of their building. When that language surfaces, it has to be in a way that their users feel reflects their values and where the culture and society are seeing things at the moment. That is difficult to do.
The big tech companies, in particular, we see them get in trouble all the time where they have got some outdated thing or views have shifted. That is also something that AI is going to take a long time before it gets sophisticated enough for the value judgements around these things because where AI is strong is where it can look at precedent and past realities. The truth is that we live in language somewhere between who we really are and aspirationally who we want to be and are trying to become. The AI has a very good handle on the former and a very poor handle on the latter.
There is a responsibility of a lot of the companies that we work with that they are reflecting a language in a way that is used. For example, with all of the political protesting and things that happened last summer around the world, people were looking at how language was being used and reflected and taking people to task for when it was not used in a way that they felt like was showing who we want to be as people. The pandemic highlighted the fact that all of a sudden massive new vocabularies around the conditions of our lives had emerged because we needed new words to describe this unprecedented experience we are having on all fronts, whether it was science or business or culture or medicine. If these companies are not able to incorporate those new words, those new senses, the new way we are using language, they are missing out.
I am wondering how the language data providers can add real value there. I think about the number of times over the last 15 years that we have updated and reviewed the entry for the word marriage, which is a large entry with various senses. It is sensitive and it is not easy to do, because it is not as if you just need to update it because now it is X. How we experience and use and want to see language reflecting who we are is different in different parts of the world. Now that a global presence is so strong in seeing how it is used in other places, you have got all kinds of politics that come into play there. I find that to be an interesting aspect that I have not seen people focus on as much as they could.
Florian: On the technology side, you launched some new annotation visualization tools at the OED lab. Are those purely for your internal use, to make your internal people more productive, or are you licensing those as well to third parties?
Casper: We are definitely licensing those. In the labs, our experimental team are trying to play around with and build some interesting visualization tools, some ways that illuminate some of the richness and insights in the language data that we have got. We are interested in commercially exploiting those tools and it is true that they are very helpful for our editors as well in giving them more efficiency around the kind of validation work that they are often doing. I see potential applications for some of these visualization tools within the industry sector and whether we bring them to market or we partner with someone else to help bring them to market. We are just about at a point where we have got two or three of these tools that now have enough kind of maturity that we are looking at what that next step is in exploiting them.
Florian: To close off on COVID, did you see any impact in terms of the nature of the work, low resource languages or the client composition from COVID, or was it just steady?
Casper: We have seen an increase in demand, and low resource languages have picked up more than anything. I have seen incremental increases in certain areas that I can attribute to the pandemic and the conditions. What I think is interesting is the gap between where we say the language data industry is and where it actually is. It always feels like there has been about a five-year lag between those two. We are actually doing things in a much more traditional way than we are talking about and saying the industry is at, and one of the things that the pandemic has done is that it has flattened that delta. Now it is almost like we are six months behind where we say we are, which actually is a paradigm shift. The difference between that many years and that many months is one that I think is going to have a long-term effect on our industry. We are going to be stumbling to try to understand how to exploit that and how to take advantage of the fact and turn it from a liability into an opportunity. I think that the closer those are, the more there are going to be new business models, uses, ways of working and processes that come along, where we had a lot of time before.
I think it is going to take us the next couple years probably to really understand how to exploit the fact that the pace of digital communication change is now as rapid as what we thought it might be five or six years from now. All of a sudden it is happening now, but we did not have five or six years to do all the evolutionary things that would have put us in the place where we serve it well. We are going to start seeing new gaps and then having to scramble to figure out how to fill those gaps. In a way, the year of this pandemic has only really been coming out of the shock of it and dealing with the incremental growth in the business of what we already do, but I think next we will start to see some interesting innovation. I can feel we are on the brink of something. It is going to suit us and it is going to suit a post-pandemic world much better than we are doing now and I just do not know if we see it yet.