An Inside View of Language Technologies at Google

This feature is republished on Slator with permission from Breakthrough Analysis. The original article, written by Seth Grimes, appears here: An Inside View of Language Technologies at Google.

Natural language processing, or NLP, is the machine handling of written and spoken human communications. Methods draw on linguistics and statistics, coupled with machine learning, to model language in the service of automation.

OK, that was a dry definition.

Fact is, NLP is at, or near, the core of just about every information-intensive process out there. NLP powers search, virtual assistants, recommendations, modern biomedical research, intelligence and investigations, and consumer insights. (I discuss ways it’s applied in my 2013 article, All About Natural Language Processing.)

No organization is more heavily invested in NLP, or investing more heavily, than Google. That’s why a keynote on “Language Technologies at Google,” presented by Google Research’s Enrique Alfonseca, was a natural for the upcoming LT-Accelerate conference, which I co-organize. (LT-Accelerate takes place 23-24 November in Brussels. Join us!)

I invited Enrique to respond to questions about his work. First, a short bio: Enrique Alfonseca manages the Natural Language Understanding (NLU) team at Google Research Zurich, working on information extraction and applications of text summarization. Overall, the Google Research NLU team “guides, builds, and innovates methodologies around semantic analysis and representation, syntactic parsing and realization, morphology and lexicon development. Our work directly impacts Conversational Search in Google Now, the Knowledge Graph, and Google Translate, as well as other Machine Intelligence research.”

Before joining the NLU team, Enrique held different positions in the ads quality and search quality teams working on ads relevance and web search ranking. He launched changes in ads quality (sponsored search) targeting and query expansion leading to significant ads revenue increases. He is also an instructor at the Swiss Federal Institute of Technology (ETH) at Zurich. Here, then, is: An Inside View of Language Technologies at Google.

Seth Grimes> Your work has included a diverse set of NLP topics. To start, what’s your current research agenda?

Enrique Alfonseca> At the moment my team is working on question answering in Google Search, which allows me and my colleagues to innovate in various different areas where we have experience. In my case, I have worked over the years on information extraction, event extraction, text summarization and information retrieval, and all of these come together for question answering — information retrieval to rank and find relevant passages on the web, information extraction to identify concrete, factual answers for queries, and text summarization to present it to the user in a concise way.

Seth> And topics that colleagues at Google Research in Zurich are working on?

Enrique> The teams at Zurich work very closely with the teams at other Google offices and with the products that we collaborate with, so it is hard to define a boundary between “Google Research in Zurich” and the rest of the company. That said, there are very exciting efforts in which people in Zurich are involved, in various areas of language processing (text analysis, generation, dialogue, etc.), video processing, handwriting recognition and many others.

Do you do only “pure” research or has your agenda, to some extent, been influenced by Google’s product roadmap?

A 2012 paper from Alfred Spector, Peter Norvig and Slav Petrov nicely summarizes our philosophy of research. On the one hand, we believe that research needs to happen, and actually happens, in the product teams. A large proportion of our software engineers have a master's degree or a Ph.D. and previous experience working on research topics, and they bring this expertise into product development in areas as varied as search quality, ads quality, spam detection, and many others. At the same time, we have a number of longer-term projects working on answers to the problems that Google, as a company, should have solved a few years from now. In most of these, we take complex challenges and subdivide them into smaller problems that we can handle and make quick progress on, with the aim of having impact on Google products along the way, in a way that moves us closer to the longer-term goals.

To give an example, when we started working on event models from text, we did not have a concrete product in mind yet, although we expected that understanding the meaning of what is reported in news should have concrete applications. After some time working on it, we realised that it was useful for making sure that the information from the Knowledge Graph that is shown in web search is always up to date with the latest news. While we do not yet have models for high-precision, wide-coverage deep understanding of news, the technologies built along the way have already proven to be useful for our users.

Do you get involved in productizing research innovations? Is there a typical path from research into products, at Google?

Yes, we are responsible for bringing into production all the technologies that we develop. If research and production are handled separately, there are at least two common causes of failure.

By having the research team not so close to the production needs, it is possible that their evals and datasets are not fully representative of the exact needs of the product. This is particularly problematic if a research team is to work on a product that is being constantly improved. Unless working directly on the product itself, it is likely that the settings under which the research team is working will quickly become obsolete and positive results will not translate into product improvements.

At the same time, if the people bringing research innovations to product are not the researchers themselves, it is likely that they will not know enough about the new technologies to be able to make the right decisions, for example, if product needs require you to trade off some accuracy to reduce computation cost.

Your LT-Accelerate presentation, Language Technologies at Google, could occupy both conference days all by itself. But you’re planning to focus on information extraction and a couple of other topics. You have written that information extraction has proved to be very hard. You cite challenges that include entity resolution and consistency problems of knowledge bases. Actually, first, what are definitions of “entity resolution” and “knowledge base”?

We call “entity resolution” the problem of finding, for a given mention of a topic in a text, the entry in the knowledge base that represents that topic. For example, if your knowledge base is Wikipedia, one may refer to the entry for Barack Obama in English text as “Barack Obama”, “Barack”, “Obama”, “the president of the US”, etc. At the same time, “Obama” may refer to any other person with the same surname, so there is an ambiguity problem. In the literature people also refer to this problem by other names, like entity linking or entity disambiguation. Two years ago, some colleagues at Google released a set of entity resolution annotations over a large web corpus. It includes 11 billion references to Freebase topics and has already been exploited by researchers worldwide working on information extraction.

When we talk about knowledge bases, we refer to structured information about the real world (or imaginary worlds) on which one can ground language analysis of texts, amongst many other applications. These typically contain topics (concepts and entities), attributes, relations, type hierarchies, inference rules… There have been decades of work on knowledge representation and on manual and automatic acquisition of knowledge, but these are far from solved problems.
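
To make the two definitions concrete, here is a minimal Python sketch of a toy knowledge base and a naive resolver. It is an illustration only, not a description of Google's systems: the `Topic` fields, the `kb:` identifiers, the context-word lists and the popularity scores are all invented for the example, and a real resolver would use far richer signals than word overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One knowledge-base entry (hypothetical schema for illustration)."""
    topic_id: str
    name: str
    aliases: set = field(default_factory=set)        # lowercased surface forms
    context_words: set = field(default_factory=set)  # words that hint at this topic
    popularity: float = 0.0                          # made-up prior, used to break ties


# A two-entry toy knowledge base. "obama" is an alias of both entries,
# which is exactly the ambiguity problem described above.
KB = [
    Topic("kb:barack_obama", "Barack Obama",
          {"barack obama", "barack", "obama", "the president of the us"},
          {"president", "white", "house", "senate"}, 0.9),
    Topic("kb:michelle_obama", "Michelle Obama",
          {"michelle obama", "obama"},
          {"first", "lady", "book"}, 0.6),
]


def candidates(mention):
    """Return every entry whose alias list contains the mention."""
    return [t for t in KB if mention.lower() in t.aliases]


def resolve(mention, context):
    """Pick the candidate sharing the most words with the context.

    Ties fall back to the popularity prior. Returns None when the
    mention matches no entry in the knowledge base.
    """
    words = set(context.lower().split())
    cands = candidates(mention)
    if not cands:
        return None
    return max(cands, key=lambda t: (len(words & t.context_words), t.popularity))
```

With this toy data, `resolve("Obama", "Obama signed the bill at the White House")` selects the `kb:barack_obama` entry, while `resolve("Obama", "Obama promoted her new book")` selects `kb:michelle_obama`. A mention with no matching alias returns `None`, which is the toy counterpart of a topic missing from the knowledge base.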

So ambiguity, name matching, and pronouns and other anaphora are part of the challenge, all sorts of coreference. Overall, what’s the entity-resolution state of the art?

Coreference is indeed a related problem and I think it should be solved jointly with entity resolution.

Depending on the knowledge base and test set used, results vary, but mention-level annotation currently has an accuracy between 80% and 90%. Most of the knowledge bases, such as Wikipedia and Freebase, have been constructed in large part manually, without a concrete application in mind, and issues commonly turn up when one tries to use them for entity disambiguation.

Where do the knowledge-base consistency issues arise? In representation differences, incompatible definitions, capture of temporality, or simply facts that disagree? (It seems to me that human knowledge, in the wild, is inconsistent for all these reasons and more.) And how do inconsistencies affect Google’s performance, from the user’s point of view?

Different degrees of coverage of topics, and different levels of detail in different domains, are common problems. Depending on the application, one may want to tune the resolution system to be more biased to resolve mentions as head entities or tail entities, and some entities may be artificially boosted simply because they are in a denser, more detailed portion of the network in the knowledge base. On top of this, schemas are thought out to be ontologically correct but exceptions happen commonly; many knowledge bases have been constructed by merging datasets with different levels of granularity, giving rise to reconciliation problems; and Wikipedia contains many “orphan nodes” that are not explicitly related to other topics even though they are clearly related to them.

Is “curation” part of the answer — along the lines of the approaches applied for IBM Watson and Wolfram Alpha, for instance — or can the challenges be met algorithmically? Who’s doing interesting work on these topics, outside Google, in academia and industry?

There is no doubt that manual curation is part of the answer. At the same time, if we want to cover the very long tail of facts, it would be impractical to try to enter all that information manually and to keep it permanently up to date. Automatically reconciling existing structured sources, like product databases, books, sports results, etc., is part of the solution as well. I believe it will eventually be possible to apply information extraction techniques over structured and unstructured sources, but that is not without challenges. I mentioned before that the accuracy of entity resolution systems is between 80% and 90%. That means that for any set of automatically extracted facts, at least 10% of them are going to be associated with the wrong entity, an error that will accumulate on top of any errors from the fact extraction models. Aggregation can be helpful in reducing the error rate, but will not be so useful for the long tail.

On the bright side, the area is thriving — it is enough to review the latest proceedings of ACL, EMNLP and related conferences to realize that there is fast progress in the area. Semantic parsing of queries to answer factoid questions from Freebase, how to integrate deep learning models in KB representation and reasoning tasks, better combinations of global and local models for entity resolution… are all problems in which important breakthroughs have happened in the last couple of years.
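
The error arithmetic in the answer above can be sketched in a few lines of Python. This is a back-of-the-envelope model, not anything the interview describes: it assumes that resolution errors and extraction errors are independent, it models "aggregation" as a simple majority vote over independent extractions of the same fact, and the function names and the extraction accuracy used in the example are invented.

```python
from math import comb


def end_to_end_accuracy(resolution_accuracy, extraction_accuracy):
    """Chance that a fact is both extracted correctly and attached to the
    right entity, assuming the two error sources are independent."""
    return resolution_accuracy * extraction_accuracy


def majority_vote_accuracy(p, n):
    """Chance that more than half of n independent extractions of the same
    fact are correct, when each is correct with probability p."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Entity resolution at the lower end of the quoted range (80%), combined
# with a hypothetical 90%-accurate fact extractor.
single = end_to_end_accuracy(0.80, 0.90)

# A fact seen in five independent sources benefits from voting;
# a long-tail fact seen once (n=1) gets no benefit at all.
head_fact = majority_vote_accuracy(single, 5)
tail_fact = majority_vote_accuracy(single, 1)
```

Under these assumptions the combined accuracy of a single extraction is the product of the two stages, so it is always lower than either one. Voting raises the figure for facts that appear in several sources, while `tail_fact` stays equal to `single`, which is the long-tail limitation mentioned above.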

Finally, what’s new and exciting on the NLP horizon?

On the one hand, the industry as a whole is quickly innovating in the personal assistant space: a tool that can interact with humans through natural dialogue, understands their world, their interests and needs, answers their information needs, helps in planning and remembering tasks, and can help control their appliances to make their lives more comfortable. There are still many improvements in NLP and other areas that need to happen to make this long-term vision a reality, but we are already starting to see how it can change our lives.

On the other hand, the relation between language and embodiment will see further progress as development happens in the field of robotics, and we will not just be able to ground our language analyses on virtual knowledge bases, but on physical experiences.

Thanks, Enrique!

Google Research’s Enrique Alfonseca will be speaking at the LT-Accelerate conference, 23-24 November in Brussels. The program features brand, agency, researcher and solution provider speakers on the application of language technologies — in particular, text, sentiment and social analytics — to a range of business and governmental challenges. Join us there!

