Corpora Overview:
Want To Publish Faster?
Then use Cornell’s database of Linguistic Data Consortium (LDC) corpora! The LDC collection contains more than 750 high-quality text, audio, and video corpora in more than 60 languages. Access is free to Cornell linguistics and natural language processing researchers, including students, staff, faculty, postdocs, and visiting scholars.
- 751 LDC and five non-LDC corpora as of July 25, 2019
- LDC Corpora (many languages and situations)
- 472 standard-license LDC corpora
- 111 special-license LDC corpora
- 168 experimental corpora
- Non-LDC Corpora (some are licensed)
- Historical American English
- Spontaneous Japanese
- Google 1-, 2-, and 3-grams
- Project Gutenberg English Language Books (downloaded Nov. 2-3, 2016)
- Buckeye Corpus, 2nd Release (Conversational English)
What kind of corpora?
- Text, audio, and video
- Multilingual (>60 languages)
- Written, spoken, signed language (with annotations)
- Sourced from:
- Newspapers and Newswires (the New York Times Annotated Corpus is very popular)
- Telephone conversations
- Interviews and meetings
- Broadcast programming
- Transcripts, websites, books, periodicals, etc.
Speech Audio Corpora: Not just recorded speech!
- Sound-based corpora contain sound files (various formats), and may contain one or more of the following:
- Word transcriptions
- Phonetic transcription
- Annotations
- Morphology
- Tokenization
- There are special corpora for study of:
- Emotion
- Prosody
- Anomaly analysis
- Elephant vocalizations
Linguistic Data Consortium:
LDC Corpora are "gold-standard" corpora
- Distributed by the Linguistic Data Consortium (www.ldc.upenn.edu)
- An open consortium of universities, libraries, corporations, and government research labs
- Started in 1992; creates and distributes language resources
- LDC holds the CoreTrustSeal certification as a trustworthy, secure data repository
- Issues 36-40 new corpora yearly
- Every corpus undergoes quality control tests
- Strict requirements on data quality, organization, metadata, and documentation
LDC Corpora are used worldwide
- LDC has distributed more than 140,000 corpora since 1992
- LDC's monthly newsletter has more than 22,000 readers
- Sign up at https://www.ldc.upenn.edu/communications/ldc-newsletter
- Announces and describes all new corpora issued that month
- 36-40 new corpora per year
- Over 18,000 citations
Cornell's LDC Holdings
- We have most LDC corpora issued since 1992.
- See https://www.ldc.upenn.edu/ for all LDC corpora
- Cornell has everything from the last 12 years
- Contact Bruce McKee (bwm55@cornell.edu) to ask if we have a particular corpus
- If we don't have it, it must be purchased "a la carte" from LDC (Cornell gets half off the list price)
- Individual corpus sizes range from less than 1 MB to 1.2 TB compressed
Linguistics, CS, and InfoSci faculty fund Cornell's LDC Membership
- Linguistics: Mats Rooth and Sam Tilsen
- Computer Science: Yoav Artzi, Claire Cardie, and Lillian Lee
- Information Science: David Mimno
Limits on Cornell LDC Usage
- Only for non-commercial Linguistics and Natural Language Processing research at the Ithaca and NYC campuses
- Typical users are Linguistics, CS, and InfoSci faculty and graduate students
- Exceptions made for researchers in other Cornell departments
- Must be related to Linguistics or Natural Language Processing
- Dr. Mats Rooth and Dr. Claire Cardie decide on access
- Researchers in these other Cornell departments have arranged research access:
- Electrical and Computer Engineering
- Psychology
- Statistical Science
All LDC Corpora are stored on a Linguistics Department server
- Currently about 6.3 TB of corpus files
- Corpora are distributed via the Cornell Box service
- Certain faculty/graduate students can get:
- a corpora server account and VPN login
- unlimited access to standard-license corpora
A word on licenses
- Standard LDC License (most corpora):
- Signed by Dr. Mats Rooth for Cornell
- You must co-sign; one signature is good for all standard-license LDC corpora
- Special LDC Licenses (some corpora):
- Signed by Dr. Mats Rooth or other faculty for Cornell
- You must co-sign; each license applies only to that corpus
- Each special license is unique; read the details, as they can affect publication!
- Experimental Corpora (DEFT agreement):
- DEFT = Deep Exploration and Filtering of Text
- Not listed on the LDC website
- Only available through Dr. Claire Cardie (Computer Science)
The Four "Can'ts" of LDC Corpora
- Can't be used for any commercial work
- Can't be shared with colleagues outside Cornell
- Can't be shared with Cornell students, faculty, staff, postdocs, or visiting scholars who have not signed the LDC agreement(s)
- Can't take corpora with you after you graduate or end your Cornell employment
Access to LDC Corpora:
How do I get Corpora access?
- Visit the LDC Corpora page on Cornell's Confluence wiki
- Contact Bruce McKee at bwm55@cornell.edu if you are unable to access the wiki