Department of Linguistics, Cornell University

Language Corpora


Available Language Corpora

Cornell Linguistics researchers who need access to any of these corpora should contact system administrator Eric Evans (eje4@cornell.edu) for assistance. 

Phonetics-related corpora (available from the Phonetics Lab server)

Buckeye Corpus
CELEX Lexical Database
Sounds of the World's Languages
Spoken Karaim
TIMIT
University of Victoria Phonetic Database
University of South Florida Phoneme Database

 

LDC corpora

Corpus types: In the corpus identifier, the letter immediately following the year indicates the type of corpus.
"T" indicates a text corpus.
"S" indicates a speech audio corpus.
"V" indicates a video corpus.
"L" indicates a lexicon.  
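
As a rough illustration of the naming scheme above, the following Python sketch splits a corpus identifier into its year, type, and sequence number. The function name parse_corpus_id and the layout of the returned dictionary are hypothetical and not part of any LDC tool. Two-digit years are assumed to fall in the 1990s, as with the older entries listed on this page, and any type letter not described above is reported as "unknown".

```python
import re

# Type letters as described in the list above.
CORPUS_TYPES = {
    "T": "text",
    "S": "speech audio",
    "V": "video",
    "L": "lexicon",
}

# "LDC", a four- or two-digit year, one type letter, then a sequence number.
_ID_PATTERN = re.compile(r"^LDC(\d{4}|\d{2})([A-Z])(\d+)$")


def parse_corpus_id(corpus_id):
    """Split an identifier such as 'LDC2011S03' into year, type, and number."""
    match = _ID_PATTERN.match(corpus_id.strip())
    if match is None:
        raise ValueError(f"not a recognizable corpus identifier: {corpus_id!r}")
    year_text, letter, number = match.groups()
    year = int(year_text)
    if len(year_text) == 2:
        # Assumption: two-digit years belong to the 1990s (e.g. LDC95T7).
        year += 1900
    return {
        "year": year,
        "type": CORPUS_TYPES.get(letter, "unknown"),
        "number": int(number),
    }


if __name__ == "__main__":
    for example in ("LDC2011S03", "LDC2010V01", "LDC95T7"):
        print(example, parse_corpus_id(example))
```

A check of this kind can also catch mistyped identifiers in a list, since an identifier whose digits or letter are out of place will not match the pattern.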

2011

LDC2011S01        2005 NIST Speaker Recognition Evaluation Training Data
LDC2011S02        2006 NIST Spoken Term Detection Development Set
LDC2011S03        2006 NIST Spoken Term Detection Evaluation Set
LDC2011S04        2005 NIST Speaker Recognition Evaluation Test Data
LDC2011S05        2008 NIST Speaker Recognition Evaluation Training Set Part 1
LDC2011S06        2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set
LDC2011S07        2008 NIST Speaker Recognition Evaluation Training Set Part 2
LDC2011S08        2008 NIST Speaker Recognition Evaluation Test Set
LDC2011S09        2006 NIST Speaker Recognition Evaluation Training Set
LDC2011S10        2006 NIST Speaker Recognition Evaluation Test Set Part 1
LDC2011S11        2008 NIST Speaker Recognition Evaluation Supplemental Set
LDC2011T01        SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
LDC2011T02        ACE 2005 English SpatialML Annotations Version 2
LDC2011T04        Indian Language Part-of-Speech Tagset: Sanskrit
LDC2011T05        2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set
LDC2011T06        Broadcast News Lattices
LDC2011T07        English Gigaword Fifth Edition
LDC2011T08        Datasets for Generic Relation Extraction (reACE)
LDC2011T09        Arabic Treebank: Part 2 v 3.1
LDC2011T10        French Gigaword Third Edition
LDC2011T11        Arabic Gigaword Fifth Edition
LDC2011T12        Spanish Gigaword Third Edition
LDC2011T13        Chinese Gigaword Fifth Edition
LDC2011V01        NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 1
LDC2011V02        NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2
LDC2011V03        NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
LDC2011V04        Indian Language Part-of-Speech Tagset: Sanskrit
LDC2011V05        2006 NIST/USF Evaluation Resources for the VACE Program, Meeting Data Test Set Part 1
LDC2011V06        2006 NIST/USF Evaluation Resources for the VACE Program, Meeting Data Test Set Part 2

2010

LDC2010S01        Fisher Spanish Speech
LDC2010S02        WTIMIT 1.0
LDC2010S03        2003 NIST Speaker Recognition Evaluation
LDC2010S07        Asian Spoken Language Sampler
LDC2010T01        NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations
LDC2010T02        Czech Broadcast News MDE Transcripts
LDC2010T03        GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
LDC2010T04        Fisher Spanish - Transcripts
LDC2010T05        NPS Internet Chatroom Conversations, Release 1.0
LDC2010T07        Chinese Treebank 7.0
LDC2010T08        Arabic Treebank: Part 3 v 3.2
LDC2010T09        ACE 2005 Mandarin SpatialML Annotations
LDC2010T10        NIST 2002 Open Machine Translation (OpenMT) Evaluation
LDC2010T11        NIST 2003 Open Machine Translation (OpenMT) Evaluation
LDC2010T12        NIST 2004 Open Machine Translation (OpenMT) Evaluation
LDC2010T13        Arabic Treebank: Part 1 v 4.1
LDC2010T14        NIST 2005 Open Machine Translation (OpenMT) Evaluation
LDC2010T15        Message Understanding Conference 7 Timed (MUC7_T)
LDC2010T17        NIST 2006 Open Machine Translation (OpenMT) Evaluation
LDC2010T18        ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
LDC2010T19        Korean Newswire Second Edition
LDC2010T21        NIST 2008 Open Machine Translation (OpenMT) Evaluation
LDC2010T22        Manually Annotated Sub-Corpus First Release
LDC2010T23        NIST 2009 Open Machine Translation (OpenMT) Evaluation
LDC2010V01        TRECVID 2004 Keyframes & Transcripts
LDC2010V02        TRECVID 2006 Keyframes

2009

LDC2009E58        TAC 2009 KBP Evaluation Reference Knowledge Base
LDC2009L01        An English Dictionary of the Tamil Verb Second Edition
LDC2009S01        CSLU: Numbers Version 1.3
LDC2009S02        Czech Broadcast Conversation Speech
LDC2009S03        CSLU: S4X Release 1.2
LDC2009T01        English CTS Treebank with Structural Metadata
LDC2009T02        GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
LDC2009T03        GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
LDC2009T04        2007 NIST Language Recognition Evaluation Test Set
LDC2009T05        2007 NIST Language Recognition Evaluation Supplemental Training Set
LDC2009T06        GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
LDC2009T07        Unified Linguistic Annotation Text Collection
LDC2009T08        Japanese Web N-gram Version 1
LDC2009T09        GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
LDC2009T10        Language Understanding Annotation Corpus
LDC2009T11        REFLEX Entity Translation Training/DevTest
LDC2009T12        2008 CoNLL Shared Task Data
LDC2009T13        English Gigaword Fourth Edition
LDC2009T14        Tagged Chinese Gigaword Version 2.0
LDC2009T15        GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
LDC2009T20        Czech Broadcast Conversation MDE Transcripts
LDC2009T21        Spanish Gigaword Second Edition
LDC2009T22        Arabic Newswire English Translation Collection
LDC2009T23        FactBank 1.0
LDC2009T24        OntoNotes Release 3.0
LDC2009T25        Web 1T 5-gram, 10 European Languages Version 1
LDC2009T26        NXT Switchboard Annotations
LDC2009T27        Chinese Gigaword Fourth Edition
LDC2009T28        French Gigaword Second Edition
LDC2009T29        ACL Anthology Reference Corpus
LDC2009T30        Arabic Gigaword Fourth Edition
LDC2009V01        Audiovisual Database of Spoken American English

2008

LDC2008T13        BLLIP North American News Text, Complete

2007

LDC2007T22        2001 Topic Annotated Enron Email Data Set
LDC2007T36        Chinese Treebank 6.0

2006

LDC2006S42        Korean Broadcast News Speech
LDC2006T03        Korean Propbank
LDC2006T09        Korean Treebank Annotations Version 2.0
LDC2006T13        Web 1T 5-gram Version 1
LDC2006T14        Korean Broadcast News Transcripts

2005

LDC2005L01        Mawukakan Lexicon
LDC2005S07        Arabic CTS Levantine Fisher Training Data Set 3, Speech
LDC2005S08        BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
LDC2005S13        Fisher English Training Part 2, Speech
LDC2005S14        Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
LDC2005S15        HKUST Mandarin Telephone Speech, Part 1
LDC2005S22        Articulation Index
LDC2005T01        Chinese Treebank 5.0
LDC2005T02        Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
LDC2005T03        Arabic CTS Levantine Fisher Training Data Set 3, Transcripts         
LDC2005T05        Multiple-Translation Arabic (MTA) Part 2
LDC2005T06        Chinese News Translation Text Part 1
LDC2005T07        ACE Time Normalization (TERN) 2004 English Training Data v 1.0
LDC2005T08        Discourse Graphbank
LDC2005T09        ACE 2004 Multilingual Training Corpus
LDC2005T10        Chinese English News Magazine Parallel Text
LDC2005T12        English Gigaword Second Edition
LDC2005T13        CCGbank
LDC2005T14        Chinese Gigaword Second Edition
LDC2005T19        Fisher English Training Part 2, Transcripts
LDC2005T20        Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
LDC2005T23        Chinese Proposition Bank 1.0
LDC2005T24        RT-04 MDE Training Data Text/Annotations
LDC2005T28        HARD 2004 Text
LDC2005T29        HARD 2004 Topics and Annotations
LDC2005T30        Arabic Treebank: Part 4 v 1.0 (MPG Annotation)
LDC2005T32        HKUST Mandarin Telephone Transcript Data, Part 1
LDC2005T33        BBN Pronoun Coreference and Entity Type Corpus
LDC2005T35        American National Corpus (ANC) Second Release

2004

LDC2004L01        Klex: Finite-State Lexical Transducer for Korean
LDC2004S01        Czech Broadcast News Speech
LDC2004S02        ICSI Meeting Speech
LDC2004S04        2002 NIST Speaker Recognition Evaluation
LDC2004S05        ISL Meeting Speech Part 1
LDC2004S09        NIST Meeting Pilot Corpus Speech
LDC2004S11        2002 Rich Transcription Broadcast News and Conversational Telephone Speech
LDC2004T01        Czech Broadcast News Transcripts
LDC2004T02        Arabic Treebank: Part 2 v 2.0
LDC2004T03        Morphologically Annotated Korean Text
LDC2004T04        ICSI Meeting Transcripts
LDC2004T05        Chinese Treebank 4.0
LDC2004T07        Multiple-Translation Chinese (MTC) Part 3
LDC2004T09        TIDES Extraction (ACE) 2003 Multilingual Training Data
LDC2004T10        ISL Meeting Transcripts Part 1
LDC2004T11        Arabic Treebank: Part 3 v 1.0
LDC2004T13        NIST Meeting Pilot Corpus Transcripts and Metadata
LDC2004T14        Proposition Bank I
LDC2004T15        2000 Communicator Dialogue Act Tagged
LDC2004T16        2001 Communicator Dialogue Act Tagged
LDC2004T17        Arabic News Translation Text Part 1
LDC2004T18        Arabic English Parallel News Part 1
LDC2004T19        Fisher English Training Speech Part 1 Transcripts
LDC2004V01        FORM1 Kinematic Gesture

2003

LDC2003S01        2001 Communicator Evaluation
LDC2003S03        Korean Telephone Conversations Speech
LDC2003T08        Korean Telephone Conversations Transcripts
LDC2003T11        ACE-2 Version 1.0

2002

LDC2002L49        Buckwalter Arabic Morphological Analyzer Version 1.0
LDC2002S13        2001 HUB5 English Evaluation
LDC2002S56        2000 Communicator Evaluation

2000

LDC2000T45        Korean Newswire

1998

LDC98S73          1997 Mandarin Broadcast News Speech (HUB4-NE)
LDC98T24          1997 Mandarin Broadcast News Transcripts (HUB4-NE)
LDC98T29          1997 Spanish Broadcast News Transcripts (HUB4-NE)

1996

LDC96L17          CALLHOME Spanish Transcripts
LDC96S37          CALLHOME Japanese Speech
LDC96S53          CALLFRIEND Japanese
LDC96T18          CALLHOME Japanese Transcripts

1995

LDC95S26          ATIS3 Test Data
LDC95T7           Treebank-2
LDC95T20          Hansard French/English

1994

LDC94S19          ATIS3 Training Data