Friday, October 31, 2008

The New York Times Annotated Corpus

Last week, I was invited to give a talk at a conference at the New York Public Library, about the preservation of news. I talked about our research in the Economining project, where we are trying to find the "economic value" of textual content on the Internet.

As part of the presentation, I discussed some problems that I had in the past with obtaining well-organized news corpora that are both comprehensive and easily accessible using standard tools. Factiva has an excellent database of articles, exported in a richly annotated XML format, but unfortunately Factiva prohibits data mining of the content of its archives.

The librarians at the conference were very helpful in offering suggestions and in acknowledging that providing content for data mining purposes should be one of the goals of any preservation effort.

So, yesterday I received an email from Dorothy Carner informing me about the availability of The New York Times Corpus, a corpus of 1.8 million articles from The New York Times, dating from 1987 until 2007. The details are available from http://corpus.nytimes.com but let me repeat some of the interesting facts here (the emphasis below is mine):

The New York Times Annotated Corpus is a collection of over 1.8 million articles annotated with rich metadata published by The New York Times between January 1, 1987 and June 19, 2007.

With over 650,000 individually written summaries and 1.5 million manually tagged articles, The New York Times Annotated Corpus has the potential to be a valuable resource for a number of natural language processing research areas, including document summarization, document categorization and automatic content extraction.

The corpus is provided as a collection of XML documents in the News Industry Text Format (NITF). Developed by a consortium of the world’s major news agencies, NITF is an internationally recognized standard for representing the content and structure of news documents. To learn more about NITF please visit the NITF website.

Highlights of The New York Times Annotated Corpus include:

  • Over 1.8 million articles written and published between January 1, 1987 and June 19, 2007.
  • Over 650,000 article summaries written by the staff of The New York Times Index Department.
  • Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
  • Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at NYTimes.com.
  • Java tools for parsing corpus documents from XML into a memory-resident object.

To learn more about The New York Times Annotated Corpus please read the PDF Overview.

Yes, 1.8 million articles, in richly annotated XML, with summaries, with hierarchically categorized articles, and with verified annotations of people, locations, and organizations! Expect the corpus to become a de facto standard for many text-centric research efforts! Hopefully more organizations will follow the example of The New York Times, and we will see such publicly available corpora from other high-quality sources. (I know that the Associated Press has an archive of almost 1 TB of text in computerized form, and hopefully we will see something similar from them as well.)
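For a rough idea of what reading such a document could look like, here is a minimal Python sketch. The corpus itself ships with Java tools, so this is a swapped-in illustration and not the official parser. The sample document below is hand-written and hypothetical: the element names (doc-id, identified-content, person, org, location, hl1, block, p) follow NITF conventions, but the exact layout of the real corpus files is an assumption, and parse_article is a name I made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written NITF-style document used only for illustration.
# The layout of real corpus files is assumed here, not confirmed.
SAMPLE_NITF = """<nitf>
  <head>
    <title>Example Headline</title>
    <docdata>
      <doc-id id-string="0000001"/>
      <identified-content>
        <person>Doe, Jane</person>
        <org>Example Corporation</org>
        <location>New York City</location>
      </identified-content>
    </docdata>
  </head>
  <body>
    <body.head>
      <hedline><hl1>Example Headline</hl1></hedline>
    </body.head>
    <body.content>
      <block class="full_text">
        <p>First paragraph of the article.</p>
        <p>Second paragraph of the article.</p>
      </block>
    </body.content>
  </body>
</nitf>
"""


def parse_article(xml_text):
    """Parse one NITF-style document into a plain dictionary (illustrative helper)."""
    root = ET.fromstring(xml_text)
    doc_id_el = root.find(".//doc-id")
    return {
        # The document identifier is stored as an attribute, not as element text.
        "doc_id": doc_id_el.get("id-string") if doc_id_el is not None else None,
        "headline": root.findtext(".//hl1"),
        # Tagged entities, wherever they appear in the document.
        "people": [e.text for e in root.iter("person")],
        "organizations": [e.text for e in root.iter("org")],
        "locations": [e.text for e in root.iter("location")],
        # Body paragraphs from the block assumed to hold the full text.
        "paragraphs": [
            p.text for p in root.findall(".//block[@class='full_text']/p")
        ],
    }


if __name__ == "__main__":
    article = parse_article(SAMPLE_NITF)
    print(article["headline"])
    print(article["people"], article["organizations"], article["locations"])
    print(len(article["paragraphs"]), "paragraphs")
```

Once every article is reduced to a dictionary like this, the tagged people, organizations, and locations can be fed into standard data mining tools without any custom scraping.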

How can you get the corpus? It is available from LDC, for 300 USD for non-members; members should get this for free.

I am looking forward to receiving the corpus and starting to play with it!

5 comments:

daniela barbosa said...

got here via @dtunkelang twitt

It is indeed good news that the New York Times is doing this and i am looking forward to what people come up with so i am keeping an eye on your blog!

I happen to work for Dow Jones (in the Synaptica/Taxonomy group)- so i am very familiar with Factiva's database and the investment we made in normalization and metadata enrichment across over 12,000 sources in 22 languages.

The issue with text mining is not that Factiva (now Dow Jones) doesn't allow you to do it. We do; we just charge for it, and many customers who need a diverse content set do purchase rights to data mine large chunks (some sources even go back to the late 60s, and we have the ability to slice and dice it as needed).
Why charge? Well, for starters, we don't really own the content. It belongs to the content providers that we have contracts with (this also goes for DJ media properties, since that is a separate division), so we need explicit permission from them, and we essentially pay them for access to their content so we can provide it to our customers with all the bells and whistles we put on top of it. The value-add that the Factiva platform brings, as you know, is the normalization and enrichment, as well as all the tools built on top of the platform to get to the content.

PS> this is not an official statement, i just happen to be an employee who thinks what the NYTs is doing is pretty cool and wanted to 'clarify' the Factiva 'prohibits data mining' statement.

Panos Ipeirotis said...

@Daniela: No, when Factiva provides data to universities then Factiva does not pay royalties. At least this is what our head librarian told me.

Still, we are prohibited from downloading large chunks of data for data mining.

If you believe that I am wrong in any of the above, please let me know.

Factiva is a great resource, but paying 1 USD per article (or even 0.1 USD) does not cut it for data mining purposes. I truly wish that I could use Factiva, but my attempts over the last year to get a license to data mine the content met some rather unreasonable demands from Factiva.

daniela barbosa said...

Yes, and you are right: the licensing model is tiered to large corporations with in-house investments for projects that need to mine data.

I would be interested to know who at factiva/dow jones you spoke to - you can drop me a line at daniela.barbosa@dowjones.com

Brendan said...

Very interesting. I once wrote my own scraper and got hundreds of thousands of articles out of the search.nytimes.com interface -- this is of course much nicer. I'm really impressed by how much metadata and manual annotations have been done.

$300 is a bit annoying for someone used to downloading things for free -- their bandwidth/storage costs should definitely be no more than $5-$10 per download (S3 pricing for 30GB or so). I guess the preparation work that went into it justifies it...

Bob Carpenter said...

I'm guessing that organizations like Google (for their n-grams) and the NY Times (for their annotated articles) like LDC because LDC collects physical signed licenses from people (by post, e-mail or fax) and deals with the distribution and record-keeping.

What I always find unclear is this language in LDC's generic non-member license: "User agrees to use this material only for non-commercial linguistic education and research purposes."

As with many of the news-derived LDC corpora, the NYT corpus isn't available under the standard LDC membership license, but only through a more restrictive member license which says: "The Data may only be used for non-commercial linguistic education, research and technology development, including but not limited to information retrieval, document understanding, machine translation or speech recognition."
