Publish Faster with Cornell's LDC Language Corpora

Corpora Overview:

Want To Publish Faster?

Then use Cornell’s collection of Linguistic Data Consortium (LDC) corpora! The collection contains more than 750 high-quality text, audio, and video corpora in more than 60 languages. Access is free to Cornell linguistics and natural language processing researchers, including students, staff, faculty, postdocs, and visiting scholars.

  • 751 LDC and five non-LDC corpora as of July 25, 2019
  • LDC Corpora (many languages and situations)
    • 472 standard-license LDC corpora
    • 111 special-license LDC corpora
    • 168 experimental corpora
  • Non-LDC Corpora (some are licensed)
    • Historical American English
    • Spontaneous Japanese
    • Google 1-, 2-, and 3-grams
    • Project Gutenberg English Language Books (downloaded Nov. 2-3, 2016)
    • Buckeye Corpus, 2nd Release (Conversational English)

What kind of corpora?

  • Text, audio, and video
  • Multilingual (>60 languages)
  • Written, spoken, signed language (with annotations)
  • Sourced from:
    • Newspapers and Newswires (the New York Times Annotated Corpus is very popular)
    • Telephone conversations
    • Interviews and meetings
    • Broadcast programming
    • Transcripts, websites, books, periodicals, etc.

Speech Audio Corpora: Not just recorded speech!

  • Sound-based corpora contain sound files (various formats), and may contain one or more of the following:
    • Word transcriptions
    • Phonetic transcriptions
    • Annotations
    • Morphology
    • Tokenization
  • There are special corpora for study of:
    • Emotion
    • Prosody
    • Anomaly analysis
    • Elephant vocalizations

Linguistic Data Consortium:

LDC Corpora are "gold-standard" corpora

  • Distributed by the Linguistic Data Consortium (www.ldc.upenn.edu)

  • An open consortium of universities, libraries, corporations, and government research labs

    • Founded in 1992; creates and distributes language resources

    • LDC holds CoreTrustSeal certification as a trustworthy, secure data repository

    • Issues 36-40 new corpora yearly

    • Every corpus undergoes quality-control testing

LDC Corpora are used worldwide

  • LDC has distributed more than 140,000 corpora since 1992
  • LDC's monthly newsletter has more than 22,000 readers
  • 36-40 new corpora per year
  • Over 18,000 citations

Cornell's LDC Holdings

  • We have most LDC corpora issued since 1992.
    • See https://www.ldc.upenn.edu/ for all LDC corpora
    • Cornell has everything from the last 12 years
    • Contact Bruce McKee (bwm55@cornell.edu) to ask if we have a particular corpus
    • If we don't have it, you must purchase it "a la carte" from LDC (Cornell gets half off the list price)
  • Individual corpora range in size from less than 1 MB to 1.2 TB compressed

Linguistics, CS, and InfoSci faculty fund Cornell's LDC Membership

  • Linguistics: Mats Rooth and Sam Tilsen
  • Computer Science: Yoav Artzi, Claire Cardie, and Lillian Lee
  • Information Science: David Mimno

Limits on Cornell LDC Usage

  • Only for non-commercial linguistics and natural language processing research at the Ithaca and NYC campuses
  • Typical users are Linguistics, CS, and InfoSci faculty and graduate students
  • Exceptions are made for researchers in other Cornell departments
    • The research must be related to linguistics or natural language processing
    • Dr. Mats Rooth and Dr. Claire Cardie decide on access
  • Researchers in these other Cornell departments have arranged research access:
    • Electrical and Computer Engineering
    • Psychology
    • Statistical Science

All LDC Corpora are stored on a Linguistics Department server

  • Currently about 6.3 TB of corpus files
  • Corpora are distributed via the Cornell Box service
  • Certain faculty/graduate students can get:
    • a corpora server account and VPN login
    • unlimited access to standard-license corpora

A word on licenses

  • Standard LDC License (most corpora):
    • Signed by Dr. Mats Rooth for Cornell
    • You must co-sign; the license covers all standard-license LDC corpora
  • Special LDC Licenses (some corpora):
    • Signed by Dr. Mats Rooth or other faculty for Cornell
    • You must co-sign; each license applies only to that corpus
    • Each special license is unique - read the details as they can affect publication!
  • Experimental Corpora (DEFT agreement):
    • DEFT = Deep Exploration and Filtering of Text
    • Not listed on the LDC website
    • Only available through Dr. Claire Cardie (Computer Science)

The Four "Can'ts" of LDC Corpora

  • Can't be used for any commercial work
  • Can't be shared with colleagues outside Cornell
  • Can't be shared with Cornell students, faculty, staff, postdocs, or visiting scholars who have not signed the LDC agreement(s)
  • Can't take corpora with you after you graduate or end your Cornell employment

Access to LDC Corpora:

How do I get Corpora access?
