Available Language Corpora
Cornell Linguistics researchers who need access to any of these corpora should contact system administrator Eric Evans (eje4@cornell.edu) for assistance.
Phonetics-related corpora (available from Phonetics Lab server)
Buckeye Corpus
CELEX Lexical Database
Sounds of the World's Languages
Spoken Karaim
TIMIT
University of Victoria Phonetic Database
University of South Florida Phoneme Database
LDC corpora
Corpus types: In a corpus identifier, the letter immediately following the year indicates the corpus type.
"T" indicates a text corpus.
"S" indicates a speech audio corpus.
"V" indicates a video corpus.
"L" indicates a lexicon.
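The identifier scheme described above can be split apart mechanically. The following is a minimal sketch, not an official LDC tool: the names parse_corpus_id, CorpusId, and CORPUS_TYPES are hypothetical, and it assumes that two-digit years (as in LDC98S73) belong to the 1900s. Type letters not described above are reported as None.

```python
import re
from typing import NamedTuple, Optional

# Type letters as described in the list above.
CORPUS_TYPES = {
    "T": "text",
    "S": "speech audio",
    "V": "video",
    "L": "lexicon",
}

# "LDC", a four- or two-digit year, one type letter, then a serial number.
_ID_PATTERN = re.compile(r"^LDC(\d{4}|\d{2})([A-Z])(\d+)$")


class CorpusId(NamedTuple):
    year: int
    type_letter: str
    corpus_type: Optional[str]  # None when the letter is not in CORPUS_TYPES
    number: int


def parse_corpus_id(identifier: str) -> CorpusId:
    """Split an identifier such as 'LDC2011S01' into its parts."""
    match = _ID_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognizable corpus identifier: {identifier!r}")
    year_text, letter, number_text = match.groups()
    # Assumption: two-digit years in this list all fall in the 1900s.
    year = int(year_text) if len(year_text) == 4 else 1900 + int(year_text)
    return CorpusId(year, letter, CORPUS_TYPES.get(letter), int(number_text))


if __name__ == "__main__":
    for example in ("LDC2011S01", "LDC98T24", "LDC2005L01"):
        print(example, parse_corpus_id(example))
```

Looking up the letter with dict.get keeps identifiers with an undocumented type letter parseable while signalling that their type is unknown.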
2011
LDC2011S01 2005 NIST Speaker Recognition Evaluation Training Data
LDC2011S02 2006 NIST Spoken Term Detection Development Set
LDC2011S03 2006 NIST Spoken Term Detection Evaluation Set
LDC2011S04 2005 NIST Speaker Recognition Evaluation Test Data
LDC2011S05 2008 NIST Speaker Recognition Evaluation Training Set Part 1
LDC2011S06 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set
LDC2011S07 2008 NIST Speaker Recognition Evaluation Training Set Part 2
LDC2011S08 2008 NIST Speaker Recognition Evaluation Test Set
LDC2011S09 2006 NIST Speaker Recognition Evaluation Training Set
LDC2011S10 2006 NIST Speaker Recognition Evaluation Test Set Part 1
LDC2011S11 2008 NIST Speaker Recognition Evaluation Supplemental Set
LDC2011T01 SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
LDC2011T02 ACE 2005 English SpatialML Annotations Version 2
LDC2011T04 Indian Language Part-of-Speech Tagset: Sanskrit
LDC2011T05 2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set
LDC2011T06 Broadcast News Lattices
LDC2011T07 English Gigaword Fifth Edition
LDC2011T08 Datasets for Generic Relation Extraction (reACE)
LDC2011T09 Arabic Treebank: Part 2 v 3.1
LDC2011T10 French Gigaword Third Edition
LDC2011T11 Arabic Gigaword Fifth Edition
LDC2011T12 Spanish Gigaword Third Edition
LDC2011T13 Chinese Gigaword Fifth Edition
LDC2011V01 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 1
LDC2011V02 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2
LDC2011V03 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
LDC2011V04 Indian Language Part-of-Speech Tagset: Sanskrit
LDC2011V05 2006 NIST/USF Evaluation Resources for the VACE Program, Meeting Data Test Set Part 1
LDC2011V06 2006 NIST/USF Evaluation Resources for the VACE Program, Meeting Data Test Set Part 2
2010
LDC2010S01 Fisher Spanish Speech
LDC2010S02 WTIMIT 1.0
LDC2010S03 2003 NIST Speaker Recognition Evaluation
LDC2010S07 Asian Spoken Language Sampler
LDC2010T01 NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations
LDC2010T02 Czech Broadcast News MDE Transcripts
LDC2010T03 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
LDC2010T04 Fisher Spanish - Transcripts
LDC2010T05 NPS Internet Chatroom Conversations, Release 1.0
LDC2010T07 Chinese Treebank 7.0
LDC2010T08 Arabic Treebank: Part 3 v 3.2
LDC2010T09 ACE 2005 Mandarin SpatialML Annotations
LDC2010T10 NIST 2002 Open Machine Translation (OpenMT) Evaluation
LDC2010T11 NIST 2003 Open Machine Translation (OpenMT) Evaluation
LDC2010T12 NIST 2004 Open Machine Translation (OpenMT) Evaluation
LDC2010T13 Arabic Treebank: Part 1 v 4.1
LDC2010T14 NIST 2005 Open Machine Translation (OpenMT) Evaluation
LDC2010T15 Message Understanding Conference 7 Timed (MUC7_T)
LDC2010T17 NIST 2006 Open Machine Translation (OpenMT) Evaluation
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
LDC2010T19 Korean Newswire Second Edition
LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
LDC2010T22 Manually Annotated Sub-Corpus First Release
LDC2010T23 NIST 2009 Open Machine Translation (OpenMT) Evaluation
LDC2010V01 TRECVID 2004 Keyframes & Transcripts
LDC2010V02 TRECVID 2006 Keyframes
2009
LDC2009E58 TAC 2009 KBP Evaluation Reference Knowledge Base
LDC2009L01 An English Dictionary of the Tamil Verb Second Edition
LDC2009S01 CSLU: Numbers Version 1.3
LDC2009S02 Czech Broadcast Conversation Speech
LDC2009S03 CSLU: S4X Release 1.2
LDC2009T01 English CTS Treebank with Structural Metadata
LDC2009T02 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
LDC2009T03 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
LDC2009T04 2007 NIST Language Recognition Evaluation Test Set
LDC2009T05 2007 NIST Language Recognition Evaluation Supplemental Training Set
LDC2009T06 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
LDC2009T07 Unified Linguistic Annotation Text Collection
LDC2009T08 Japanese Web N-gram Version 1
LDC2009T09 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
LDC2009T10 Language Understanding Annotation Corpus
LDC2009T11 REFLEX Entity Translation Training/DevTest
LDC2009T12 2008 CoNLL Shared Task Data
LDC2009T13 English Gigaword Fourth Edition
LDC2009T14 Tagged Chinese Gigaword Version 2.0
LDC2009T15 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
LDC2009T20 Czech Broadcast Conversation MDE Transcripts
LDC2009T21 Spanish Gigaword Second Edition
LDC2009T22 Arabic Newswire English Translation Collection
LDC2009T23 FactBank 1.0
LDC2009T24 OntoNotes Release 3.0
LDC2009T25 Web 1T 5-gram, 10 European Languages Version 1
LDC2009T26 NXT Switchboard Annotations
LDC2009T27 Chinese Gigaword Fourth Edition
LDC2009T28 French Gigaword Second Edition
LDC2009T29 ACL Anthology Reference Corpus
LDC2009T30 Arabic Gigaword Fourth Edition
LDC2009V01 Audiovisual Database of Spoken American English
2008
LDC2008T13 BLLIP North American News Text, Complete
2007
LDC2007T22 2001 Topic Annotated Enron Email Data Set
LDC2007T36 Chinese Treebank 6.0
2006
LDC2006S42 Korean Broadcast News Speech
LDC2006T03 Korean Propbank
LDC2006T09 Korean Treebank Annotations Version 2.0
LDC2006T13 Web 1T 5-gram Version 1
LDC2006T14 Korean Broadcast News Transcripts
2005
LDC2005L01 Mawukakan Lexicon
LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech
LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
LDC2005S13 Fisher English Training Part 2, Speech
LDC2005S14 Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
LDC2005S15 HKUST Mandarin Telephone Speech, Part 1
LDC2005S22 Articulation Index
LDC2005T01 Chinese Treebank 5.0
LDC2005T02 Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
LDC2005T03 Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
LDC2005T05 Multiple-Translation Arabic (MTA) Part 2
LDC2005T06 Chinese News Translation Text Part 1
LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
LDC2005T08 Discourse Graphbank
LDC2005T09 ACE 2004 Multilingual Training Corpus
LDC2005T10 Chinese English News Magazine Parallel Text
LDC2005T12 English Gigaword Second Edition
LDC2005T13 CCGbank
LDC2005T14 Chinese Gigaword Second Edition
LDC2005T19 Fisher English Training Part 2, Transcripts
LDC2005T20 Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
LDC2005T23 Chinese Proposition Bank 1.0
LDC2005T24 RT-04 MDE Training Data Text/Annotations
LDC2005T28 HARD 2004 Text
LDC2005T29 HARD 2004 Topics and Annotations
LDC2005T30 Arabic Treebank: Part 4 v 1.0 (MPG Annotation)
LDC2005T32 HKUST Mandarin Telephone Transcript Data, Part 1
LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
LDC2005T35 American National Corpus (ANC) Second Release
2004
LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
LDC2004S01 Czech Broadcast News Speech
LDC2004S02 ICSI Meeting Speech
LDC2004S04 2002 NIST Speaker Recognition Evaluation
LDC2004S05 ISL Meeting Speech Part 1
LDC2004S09 NIST Meeting Pilot Corpus Speech
LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
LDC2004T01 Czech Broadcast News Transcripts
LDC2004T02 Arabic Treebank: Part 2 v 2.0
LDC2004T03 Morphologically Annotated Korean Text
LDC2004T04 ICSI Meeting Transcripts
LDC2004T05 Chinese Treebank 4.0
LDC2004T07 Multiple-Translation Chinese (MTC) Part 3
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
LDC2004T10 ISL Meeting Transcripts Part 1
LDC2004T11 Arabic Treebank: Part 3 v 1.0
LDC2004T13 NIST Meeting Pilot Corpus Transcripts and Metadata
LDC2004T14 Proposition Bank I
LDC2004T15 2000 Communicator Dialogue Act Tagged
LDC2004T16 2001 Communicator Dialogue Act Tagged
LDC2004T17 Arabic News Translation Text Part 1
LDC2004T18 Arabic English Parallel News Part 1
LDC2004T19 Fisher English Training Speech Part 1 Transcripts
LDC2004V01 FORM1 Kinematic Gesture
2003
LDC2003S01 2001 Communicator Evaluation
LDC2003S03 Korean Telephone Conversations Speech
LDC2003T08 Korean Telephone Conversations Transcripts
LDC2003T11 ACE-2 Version 1.0
2002
LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0
LDC2002S13 2001 HUB5 English Evaluation
LDC2002S56 2000 Communicator Evaluation
2000
LDC2000T45 Korean Newswire
1998
LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
LDC98T29 1997 Spanish Broadcast News Transcripts (HUB4-NE)
1996
LDC96L17 CALLHOME Spanish Transcripts
LDC96S37 CALLHOME Japanese Speech
LDC96S53 CALLFRIEND Japanese
LDC96T18 CALLHOME Japanese Transcripts
1995
LDC95S26 ATIS3 Test Data
LDC95T7 Treebank-2
LDC95T20 Hansard French/English
1994
LDC94S19 ATIS3 Training Data