Thursday, November 13, 2008

Social Annotation of the NYT Corpus?

While I am waiting for the arrival of the New York Times Annotated Corpus, I have been thinking about the different tasks that we could use the corpus for. For some tasks, we might have to run additional extraction systems, to identify entities that are not currently marked. So, for example, we could use the OpenCalais system to extract patent issuances, company legal issues, and so on.

And then I realized that, most probably, dozens of other groups will end up doing the same, over and over again. So, why not run such tasks once, and store the results for others to use? In other words, we could have a "wiki-style" contribution site, where different people could submit their annotations, letting other people use them. This would save a significant amount of computational and human resources. (Freebase is a good example of such an effort.)

Extending the idea even more, we could have reputational metrics around these annotations, where other people provide feedback on the accuracy, comprehensiveness, and general quality of the submitted annotations.
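To make the idea a bit more concrete, here is a minimal sketch of what a shared annotation record with feedback could look like. Everything in it is an assumption made for illustration: the class names (`Annotation`, `AnnotationStore`), the field names, the 1-to-5 rating scale, and the choice to store only character offsets into a document rather than the article text itself. It is not a description of any existing system.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Annotation:
    """Hypothetical standoff annotation: points into a document by
    character offsets and never stores the article text itself."""
    doc_id: str   # hypothetical identifier of a corpus document
    start: int    # character offset where the annotated span begins
    end: int      # character offset where the span ends (exclusive)
    label: str    # e.g. "PatentIssuance"
    tool: str     # extraction system (or person) that produced it
    ratings: list = field(default_factory=list)  # feedback scores, 1 to 5

    def rate(self, score: int) -> None:
        """Record one piece of feedback from another user."""
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        self.ratings.append(score)

    def reputation(self):
        """Mean rating, or None if nobody has rated this annotation yet."""
        return mean(self.ratings) if self.ratings else None


class AnnotationStore:
    """Hypothetical in-memory store of submitted annotations."""

    def __init__(self):
        self._by_doc = {}

    def submit(self, ann: Annotation) -> None:
        self._by_doc.setdefault(ann.doc_id, []).append(ann)

    def for_document(self, doc_id: str, min_reputation=None):
        """Annotations for one document, optionally keeping only those
        whose mean rating is at least min_reputation."""
        anns = self._by_doc.get(doc_id, [])
        if min_reputation is None:
            return list(anns)
        return [
            a for a in anns
            if a.reputation() is not None and a.reputation() >= min_reputation
        ]
```

A group would call `submit` once per extracted entity or relation, other users would call `rate`, and consumers could ask `for_document` for only the annotations above some quality threshold. A real site would of course need persistence, user accounts, and a richer notion of reputation than a mean score.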

Is there any practical problem with the implementation of this idea? I understand that someone needs access to the corpus to start with, but I am trying to think of more high-level obstacles (e.g., copyright, or conflict with the interests of publishers).

4 comments:

Daniel Tunkelang said...

Isn't this basically what sites like delicious do for the web? Of course, there's the issue of access, but a lot of content is publicly accessible. I suppose the question is what would motivate people to be annotators. Do we need an ESP game for annotating text corpora?

Panos Ipeirotis said...

Admittedly, if we talk about humans contributing annotations, we lack the motivational aspects of Wikipedia and Freebase (people contributing material for a niche that they are passionate about).

I was thinking, though, mainly along the lines of providing annotation for the whole corpus (not only for a few documents).

In principle, if someone ran an ESP-game incarnation for text, then the submitted annotations could become part of the archive.

But I would also like to see annotations and relations derived by various extraction tools.

Some public statistics about usage (+ "helpful" votes?) for the submitted annotations can be a motivation to submit content. Perhaps this could be a "hall of fame" for researchers testing their tools :-)

Brendan said...

Yeah, let's save it all into Freebase! Or at least post as download.

I'm reminded of "semantically annotated Wikipedia", where researchers posted dependency parses, named entity extractions, and other machine-derived semantic annotations of a Wikipedia dump. (Though I really wish they did a quality/error evaluation.) http://www.yr-bcn.es/semanticWikipedia ... http://www.lrec-conf.org/proceedings/lrec2008/slides/581.pdf

One concern is -- do different researchers agree on which machine-derived annotations they like? There are many different extraction and NLP systems out there. Maybe if someone is aggressive at running one particular parser or something -- and very importantly, does a quality evaluation! -- it could become a de facto standard. (For example, before today I always used the Stanford Parser for doing dependency parses. But now I want to look into DeSR that they used.)

Nancy Ide said...

I am not sure which NYT corpus you are referring to--the one NYT just gave to the Linguistic Data Consortium to distribute for $300? (Sorry, I've just seen this blog for the first time.)

A heads up--one comment said that a lot of web content is publicly accessible. Accessible, yes--but (by law) copyrighted unless specifically stated to be in the public domain (or under certain Creative Commons licenses). So you can annotate it and re-distribute the annotations, but not the text itself. We looked into Wikipedia and that, too, is copyrighted and cannot be re-distributed.

We know this because we are creating an annotated "open" American National Corpus that we distribute freely via download for any purpose. (There is a version of 22 million words available from LDC, portions of which are restricted in use.) We are desperately trying to get people to contribute texts and annotations, so if you are interested please help us out. Contributions of either text or annotations can be easily uploaded via http://www.anc.org.

Right now the downloadable version of the Open ANC includes 15 million words of American English across a range of written and spoken genres that is freely available, with annotations for various linguistic phenomena (in a standard format so they can be combined etc.). We are in the process of preparing another 15 or 20 million words for release as soon as we can get it out there, but we have very little funding for that activity so things progress slowly.

If you have any ideas about how we can get people to contribute annotations and/or texts to the open ANC (or any ideas about making ourselves more visible and useful), we would be very grateful to hear them. I am reading the suggestions some of you have already made concerning the NYT with interest. Hoping to hear more!
