Entity Extraction and Geoparsing for News Articles
CLIFF-CLAVIN parses news articles and pulls out people, organizations and places mentioned. A number of tools do this, so why did we create CLIFF-CLAVIN? We've built on those tools to add disambiguation tailored to the ways news articles are written, and a concept of "focus" that tries to get at what place an article is really about (as opposed to all the places it mentions). We wrote CLIFF-CLAVIN to help drive our Media Cloud suite of tools, but are sharing it in hopes that others find it useful.
CLIFF-CLAVIN is a Java-based web service that receives raw English text and returns JSON. CLIFF-CLAVIN is open source and hosted on GitHub. We've built on top of a number of other tools:
- We started off by extending Berico Technologies' CLAVIN geoparsing tool. In fact, this is why we called our tool CLIFF! (get the joke?)
- We rely on Stanford's Named Entity Recognizer to extract strings that might be people or places from articles.
- We pull places from the GeoNames gazetteer.
D’Ignazio, C., Bhargava, R., Zuckerman, E., & Beck, L. (2014). CLIFF-CLAVIN: Determining geographic focus for news. In NewsKDD: Data Science for News Publishing, at KDD 2014. New York, NY, USA.
Geoparsing accuracy is hard to measure. We started by building on top of CLAVIN because it performed best in our testing. We rely on Stanford's NER for the precision and recall pieces of the geoparsing puzzle. For geographic disambiguation, we wrote our own set of heuristics tuned to countries and cities. On top of that we added a simple definition of "focus" to determine which countries and cities an article is actually about (as opposed to all of them that are mentioned).
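To make the idea of "focus" concrete, here is a minimal sketch in Python. It is not CLIFF-CLAVIN's actual implementation (which is in Java), and this page does not spell out the exact rule. The sketch assumes a frequency-based definition: the focus is whichever country is mentioned most often, plus any country mentioned nearly as often. The function name `focus_countries` and the `tie_ratio` parameter are hypothetical, chosen only for illustration.

```python
from collections import Counter


def focus_countries(mentions, tie_ratio=1.0):
    """Pick the countries an article is "about" from its resolved mentions.

    mentions:  list of country codes, one entry per place mention that
               has already been disambiguated to a country.
    tie_ratio: hypothetical knob; a country is kept if its mention count
               is at least tie_ratio times the top count.
    """
    counts = Counter(mentions)
    if not counts:
        return []  # no places mentioned, so no focus
    top = max(counts.values())
    return sorted(c for c, n in counts.items() if n >= tie_ratio * top)


# An article that mentions France three times, the US twice, Germany once.
mentions = ["FR", "FR", "US", "FR", "DE", "US"]
print(focus_countries(mentions))                 # ['FR']
print(focus_countries(mentions, tie_ratio=0.5))  # ['FR', 'US']
```

The point of the sketch is the distinction drawn above: all three countries are mentioned, but only one or two are returned as the focus.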
Encoding these very human concepts was difficult, and so is measuring how well we are doing. Here are a few ways we check our results:
- We hand-coded a set of 25 articles each from the BBC, Huffington Post, and New York Times to determine what countries they were about. CLIFF-CLAVIN's focus metric matches the hand-coded results 95% of the time.
- We pulled thousands of articles from the New York Times Annotated corpus and tested against the "locations" tag. At the basic level of places mentioned, the list of countries CLIFF-CLAVIN finds includes all the countries on their "locations" list 85% of the time. Looking at our concept of "focus", the countries CLIFF-CLAVIN thinks the article is about are on their list of locations 90% of the time.
- We pulled thousands of articles from the Reuters RCV1 corpus and tested against the "codes['bip:countries:1.0']" tag. For places mentioned, the list of countries CLIFF-CLAVIN finds includes all the countries Reuters coded 94% of the time. For "focus", the countries CLIFF-CLAVIN thinks the article is about are on the Reuters list 91% of the time.
- We created a set of special cases we wanted to make sure we processed correctly. For instance, we want to choose the city "Paris" over the administrative district that is also called "Paris".
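The two corpus checks above measure different things, and a small Python sketch may help separate them. It assumes each article is reduced to lists of country codes, and it reads "are on their list" as "every focus country appears on the corpus's list"; that reading is an assumption. The function names and the sample data are made up for illustration and are not results from either corpus.

```python
def mention_coverage(mentioned, gold):
    """Share of articles whose mentioned countries include every gold country."""
    hits = sum(1 for m, g in zip(mentioned, gold) if set(g) <= set(m))
    return hits / len(gold)


def focus_agreement(focus, gold):
    """Share of articles whose focus countries all appear on the gold list."""
    hits = sum(1 for f, g in zip(focus, gold) if set(f) <= set(g))
    return hits / len(gold)


# Four made-up articles: corpus tags, then what a geoparser might return.
gold      = [["FR"],       ["US", "MX"], ["KE"],       ["IN", "PK"]]
mentioned = [["FR", "DE"], ["US"],       ["KE", "UG"], ["IN", "PK", "CN"]]
focus     = [["FR"],       ["US"],       ["KE"],       ["IN"]]

print(mention_coverage(mentioned, gold))  # 0.75 (article 2 misses "MX")
print(focus_agreement(focus, gold))       # 1.0
```

Note the direction of the subset test flips between the two: the mentions check asks whether nothing on the corpus list was missed, while the focus check asks whether nothing outside the corpus list was claimed.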