Creating a system able to translate almost all the world’s news as it happens is a daunting task. But that is the ambitious goal of the GDELT Project (Global Data on Events, Location and Tone). Created with the support of organizations like Google and Yahoo!, the project is the brainchild of Kalev Leetaru, whom the German weekly Der Spiegel once called “one of the superstars of the new discipline”; the new discipline being big data, specifically its use in the analysis of society.
GDELT claims it will deploy the largest real-time streaming news machine translation system in history. It intends to monitor all global news in 65 languages (which it says represent 98.4% of its daily non-English monitoring volume), translate it into English in real time, and process it through the entire GDELT Event and Global Knowledge Graph or Global Content Analysis Measures pipelines.
GDELT will “catalog the world’s events, narratives, and emotions in real time,” said founder Leetaru, who was among Foreign Policy Magazine’s Top 100 Global Thinkers of 2013. The Economist named his work one of the five most notable science discoveries of 2011.
Slator reached out to Leetaru to learn more about the project.
Slator: GDELT’s machine translation technology has been explained extensively on the GDELT website. Can you briefly break it down in layman’s terms?
Kalev Leetaru: The goal of the GDELT Project is to monitor the world’s news media and other open information streams, peering deeply into local coverage and languages to catalog the world’s events, narratives, and emotions in real time. Given that the majority of the world’s news coverage is published in a language other than English, penetrating local news around the world requires the ability to live machine translate all of that coverage.
GDELT’s machine translation system came out of the need for massively scalable translation that combines very high throughput with the ability to rapidly evolve the system over time as new tools and datasets become available.
Unlike traditional machine translation, in which the goal is maximal accuracy at all costs and translations can take as long as they need, GDELT must translate an infinite continuous stream of content in near-real-time.
To do this, GDELT operates as a streaming machine translation system, in which both the translation and language models are adjusted on a per-document basis based on time pressure (how much time is left before the next batch of documents arrives) and relevance (how relevant the document is to GDELT’s focus on societal stability).
Every 15 minutes, a batch of documents arrives and the system translates them in a series of successive passes—the first pass does a very simplistic “general gist” translation to see whether the article is reporting the score of a local sporting match (which is not relevant to GDELT) or whether it is detailing a massive political protest that is rapidly turning violent (which is relevant).
Each document is assigned a score estimating how relevant and linguistically complex it is. After this first pass all documents are ordered by score and processed again, with the amount of time and complexity of the translation and language models used dependent on the estimated relevance and complexity scores of the document.
This is repeated, with each pass increasing the quality of the translation until the 15 minutes are up, at which point the translations are boxed up and sent off to the rest of the system and the process begins anew for the latest batch of documents. In this way a brief update report on a protest that has few details or other information will receive a basic translation, while an intricately detailed chronology of the protest that includes quotes from various officials will be translated to the best of the system’s ability, optimizing the use of translation resources.
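The multi-pass scheme described above can be illustrated with a minimal Python sketch. It is not GDELT's code: the `Document` fields, the keyword-based relevance score, the length-based complexity score, the `max_quality` cap, and the placeholder "translation" strings are all assumptions made for illustration. It shows only the control flow: a cheap gist pass over every document, ranking by score, then repeated refinement passes until the time budget runs out.

```python
import time
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str
    relevance: float = 0.0   # hypothetical score in [0, 1]
    complexity: float = 0.0  # hypothetical score in [0, 1]
    quality: int = 0         # number of passes applied so far
    translation: str = ""


def gist_pass(doc):
    """Cheap first pass: rough output plus stand-in relevance/complexity scores."""
    words = doc.text.lower().split()
    doc.translation = f"[gist] {doc.text}"
    doc.quality = 1
    doc.relevance = 1.0 if "protest" in words else 0.1  # toy keyword heuristic
    doc.complexity = min(1.0, len(words) / 50)          # toy length heuristic


def refine(doc):
    """Stand-in for re-translating with a larger, slower model."""
    doc.quality += 1
    doc.translation = f"[pass {doc.quality}] {doc.text}"


def target_quality(doc, max_quality):
    """Map the combined score to the highest pass this document deserves."""
    score = (doc.relevance + doc.complexity) / 2
    return 1 + round(score * (max_quality - 1))


def translate_batch(docs, budget_seconds, max_quality=4, clock=time.monotonic):
    deadline = clock() + budget_seconds
    for doc in docs:
        gist_pass(doc)
    # Highest-scoring documents are refined first on every pass.
    ranked = sorted(docs, key=lambda d: d.relevance + d.complexity, reverse=True)
    while clock() < deadline:
        improved = False
        for doc in ranked:
            if clock() >= deadline:
                break
            if doc.quality < target_quality(doc, max_quality):
                refine(doc)
                improved = True
        if not improved:  # every document has reached its target
            break
    return docs


if __name__ == "__main__":
    batch = [
        Document("a", "local team wins match 2-1"),
        Document("b", "massive protest turns violent in capital"),
    ]
    for doc in translate_batch(batch, budget_seconds=1.0):
        print(doc.doc_id, doc.quality, doc.translation)
```

In this sketch the sports report stops at the gist pass while the protest report receives further passes, mirroring the trade-off described in the interview. A real system would replace the toy heuristics with actual models and the 1-second budget with the 15-minute window.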
The system is also completely modular, allowing any component for any language to be swapped out or upgraded over time. For example, the system uses the Stanford Chinese Word Segmenter to break Chinese news coverage into discrete words. If a new version of the segmenter becomes available or someone else produces a faster or more accurate word segmenter for Chinese content, it can simply be swapped in seamlessly while the system is running without interrupting anything. This also allows the system to constantly learn new linguistic constructs and terminology over time and update the underlying dictionaries and models, allowing its translations to evolve with colloquial language use.
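One common way to get this kind of hot-swappable modularity is a registry that maps each language to its current component and lets it be replaced at runtime. The Python sketch below assumes that design; it is not taken from GDELT. `ComponentRegistry`, `char_segmenter`, and `make_dictionary_segmenter` are hypothetical names, and the toy segmenters stand in for a real tool such as the Stanford Chinese Word Segmenter.

```python
import threading
from typing import Callable, Dict, List, Set

Segmenter = Callable[[str], List[str]]


class ComponentRegistry:
    """Holds the active segmenter per language; replacements take effect immediately."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._segmenters: Dict[str, Segmenter] = {}

    def register(self, language: str, segmenter: Segmenter) -> None:
        # Swapping is a single guarded assignment, so callers never see a half-updated state.
        with self._lock:
            self._segmenters[language] = segmenter

    def segment(self, language: str, text: str) -> List[str]:
        with self._lock:
            segmenter = self._segmenters[language]
        return segmenter(text)  # run outside the lock so slow segmenters do not block swaps


def char_segmenter(text: str) -> List[str]:
    """Baseline stand-in: every non-space character is its own token."""
    return [ch for ch in text if not ch.isspace()]


def make_dictionary_segmenter(vocabulary: Set[str]) -> Segmenter:
    """Toy forward maximum-matching segmenter over a fixed vocabulary."""
    longest = max(map(len, vocabulary), default=1)

    def segment(text: str) -> List[str]:
        tokens, i = [], 0
        while i < len(text):
            for size in range(min(longest, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocabulary:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

    return segment


if __name__ == "__main__":
    registry = ComponentRegistry()
    registry.register("zh", char_segmenter)
    print(registry.segment("zh", "北京抗议"))
    # Swap in the "better" segmenter without restarting anything.
    registry.register("zh", make_dictionary_segmenter({"北京", "抗议"}))
    print(registry.segment("zh", "北京抗议"))
```

Because callers always go through `registry.segment`, documents already in flight finish with the old component and later ones pick up the new one, with no restart in between.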
Slator: How do you ensure the machine translations are accurate? Do you have some rules-based translation blended in or do you implement quality checks at certain intervals?
KL: Translation accuracy is assessed via regular spot checks and feedback from users as they identify errors or changing language use that is not being processed correctly.
Slator: Have you seen improvements in the quality of statistical machine translations as your reference texts increased in volume?
KL: Quality is constantly improving as the total volume of translated material continues to grow over time, providing real-time insight into evolving language use such as new words and linguistic formations.
Slator: Will GDELT make the reference text libraries public in the future for other engines to use? Could there be possible business uses to such a large reference text database?
KL: GDELT does not republish any of the text of the articles it monitors, but in the future it may compile n-gram tables and other macro-level linguistic resources to assist in the construction of improved language models for the broader translation community and to identify areas needing further focus.
Slator: How is GDELT funded? Who, which organizations, or institutions financially support GDELT, and will there be future rounds of funding?
KL: GDELT is an open data project designed to provide a completely free and open platform for understanding global society from conflict and wars to the narratives and emotions that undergird global behavior. GDELT gratefully acknowledges the support of Google Ideas and many personnel across Google.
Slator: Can you tell us about particular successful use cases of GDELT Translingual by people or companies who use GDELT?
KL: As an open data project, GDELT is used by organizations across the world. Most recently, GDELT was named as one of the Prize Winners of the Wildlife Crime Tech Challenge and will be applying its ability to peer deeply into local material in local languages throughout the world to track global wildlife crime.