Language Corpora

Access

LDC corpora are available to Cornell undergraduates, graduate students, faculty, post-docs, and visiting scholars for faculty-supervised non-commercial research. The procedures for accessing corpora are listed on this Confluence web page. For all other corpora, please contact Linguistics system administrator Bruce McKee (mckee2@cornell.edu).

Cornell researchers have access to a wide range of licensed language corpora, including:

  • A CQPweb server for fast linguistic searches on very large corpora
  • Phonetics-Related Corpora
  • Brigham Young University/English-Corpora.org Corpora
  • Linguistic Data Consortium Corpora

For non-Cornell researchers seeking language corpora, please visit the following sites:

Brigham Young University/English-Corpora.org Corpora

Corpus of American Soaps - contains 100 million words of data from 22,000 transcripts of American soap operas from the early 2000s. It is a great resource for studying very informal language.

TV Corpus - contains 325 million words of data in 75,000 TV episodes from the 1950s to the present. Every episode is linked to its IMDb entry, which means that you can create Virtual Corpora using extensive metadata: year, country, series, rating, genre, plot summary, etc.

Corpus del Español - contains about two billion words of Spanish, taken from about two million web pages from 21 different Spanish-speaking countries from the past three to four years.

Corpus of Contemporary American English - contains more than one billion words of text (25+ million words each year, 1990-2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, and (with the March 2020 update) TV and movie subtitles, blogs, and other web pages.

Corpus of Historical American English - contains more than 475 million words of text from the 1820s-2010s, which makes it 50-100 times as large as other comparable historical corpora of English. The corpus is balanced by genre decade by decade.

News of the World Corpus - contains 15.6 billion words of data from web-based newspapers and magazines from 2010 to the present. The corpus will be updated monthly through March 2024; each update adds about 180-200 million words (from about 300,000 new articles), or about two billion words per year.

The Pile

The Pile is an 825 GiB dataset of diverse text for language modeling. Dr. Mats Rooth and graduate student Andrea Hummel created code that produces a cleaned and filtered subset (of your choosing) of The Pile, parsed into CoNLL-U format. This GitHub repository contains the code, and the Department has a Cornell Box folder containing 1.6 TB of cleaned, tagged Pile data.
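For readers unfamiliar with CoNLL-U, the sketch below shows one way such files could be read in Python. It assumes only the standard ten tab-separated columns of the CoNLL-U format. The function name `read_conllu` and the sample sentence are illustrative and are not taken from the Pile data or from the code mentioned above.

```python
# Column names follow the standard CoNLL-U layout (ten tab-separated fields).
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts.

    Comment lines (starting with '#') are skipped and blank lines end a
    sentence. Multiword-token and empty-node lines are kept as ordinary
    rows; a fuller reader would treat them separately.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample sentence, written by hand for illustration only.
SAMPLE = (
    "# text = Corpora grow.\n"
    "1\tCorpora\tcorpus\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tgrow\tgrow\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        print([(tok["form"], tok["upos"]) for tok in sentence])
```

Reading the data into plain dictionaries keeps the example dependency-free. For large files, a dedicated CoNLL-U library or a streaming reader would likely be a better fit.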

Linguistic Data Consortium Corpora

Cornell maintains a Linguistic Data Consortium (LDC) membership, and we currently have more than 870 language corpora available free to Cornell students, staff, post-docs, visiting scholars, and faculty working in Linguistics, Natural Language Processing, or Psychology. This corpora database dates back to 1995, and it grows by 3-4 corpora per month as the LDC distributes new corpora. See our corpora holdings list towards the end of this page.

For instructions on how Cornell researchers can access LDC language corpora, please visit our How to Access LDC webpage and read "Publish Faster with Cornell's LDC Language Corpora".

For corpora not found in the LDC database, we suggest visiting the Open Language Archives Community (OLAC) webpage.

Corpus types: In the corpus identifier the letter immediately following the year indicates the type of corpus.
"T" indicates a text corpus
"S" indicates a speech audio corpus
"V" indicates a video corpus
"L" indicates a lexicon
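As a rough illustration of this naming scheme, the following Python sketch splits an identifier into its year, type, and serial number. The function name `parse_ldc_id` and the regular expression are assumptions made for this example. Type letters not described above (such as the "E" that appears in some 2013 entries) are reported as "unknown" rather than guessed.

```python
import re

# Type letters as described above; any other letter maps to "unknown".
CORPUS_TYPES = {
    "T": "text",
    "S": "speech audio",
    "V": "video",
    "L": "lexicon",
}

# Assumed shape: "LDC", a four-digit year, one type letter, a serial number.
LDC_ID = re.compile(r"^LDC(\d{4})([A-Z])(\d+)$")


def parse_ldc_id(identifier):
    """Return the year, corpus type, and serial number of an LDC identifier."""
    match = LDC_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an LDC identifier: {identifier!r}")
    year, letter, serial = match.groups()
    return {
        "year": int(year),
        "type": CORPUS_TYPES.get(letter, "unknown"),
        "serial": int(serial),
    }


if __name__ == "__main__":
    for corpus_id in ["LDC2012T21", "LDC2023V01", "LDC2016L01"]:
        print(corpus_id, parse_ldc_id(corpus_id))
```

For example, `parse_ldc_id("LDC2012T21")` yields year 2012, type "text", and serial 21, which matches the Annotated English Gigaword entry in the holdings list.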

Cornell Library Corpora Resources

The Cornell Library's Michael Engle (Reference/Instruction/Collection Development Librarian) and Phil Robinson (Software Development Director) have created a website titled "Linguistics and Language: A Research Guide: Library Support for Linguistics", which contains information about the Library's unique language corpora resources.

 

CQPweb Server

The Linguistics Department has a CQPweb server for conducting fast linguistic searches in very large corpora. It does this by combining a query language with pre-indexed corpora that carry annotations such as parts of speech. The searches are comparatively fast because the system uses an efficient internal representation for words and annotations and maintains indexes of both.

This means that you can focus on the linguistic questions, rather than having to learn a language like Perl before you can analyze a text corpus. The interface is good for exploratory analysis, such as finding examples of some syntactic or semantic phenomenon you are interested in. There are also advanced modes involving statistics.

We have lots of data ready for use: specifically, 17 years and 2 billion words of NY Times, Associated Press, and Agence France-Presse news text, extracted from the LDC Annotated English Gigaword corpus (https://catalog.ldc.upenn.edu/LDC2012T21) for the years 1994 through 2010. This is a lot of data, but this is the scale needed for linguistic research. We are also indexing more corpora for inclusion in CQPweb, so if you have a particular corpus you'd like to work with, please let us know!

If this capability interests you, then we suggest that you visit the CQPweb author's YouTube tutorial at https://www.youtube.com/watch?v=Yf1KxLOI8z8.  

If you'd like to try our CQPweb server, then please contact Dr. Mats Rooth (Professor of Linguistics) or Bruce McKee (System Administrator), Department of Linguistics, Cornell University.

2016 to 2023

2023

  • LDC2023V01  2019 NIST Speaker Recognition Evaluation Test Set -- Audio-Visual
  • LDC2023S03  2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge
  • LDC2023S06  2019 OpenSAT Public Safety Communications Simulation
  • LDC2023S01  AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
  • LDC2023T04  DEFT English Light and Rich ERE Annotation
  • LDC2023S07  LDC Spoken Language Sampler - Sixth Release
  • LDC2023T07  LORELEI Indonesian Representative Language Pack
  • LDC2023T01  LORELEI Swahili Representative Language Pack
  • LDC2023T02  LORELEI Tagalog Representative Language Pack
  • LDC2023T03  LORELEI Tamil Representative Language Pack
  • LDC2023T08  LORELEI Thai Representative Language Pack
  • LDC2023T06  LORELEI Zulu Representative Language Pack
  • LDC2023S02  Mixer 3 Speech
  • LDC2023S04  Mixer 7 Spanish Speech
  • LDC2023L01  Moroccan Arabic - English Lexical Database
  • LDC2023T05  Penn Korean Universal Dependency Treebank
  • LDC2023S05  Samrómur Queries Icelandic Speech 1.0

2022

  • LDC2022S10  2017 NIST Language Recognition Evaluation Training and Development Sets
  • LDC2022S01  2017 NIST OpenSAT Pilot - SSSF
  • LDC2022T02  AttImam
  • LDC2022T06  BOLT English Translation Treebank - Egyptian Arabic SMS/Chat
  • LDC2022T07  CAMIO Transcription Languages
  • LDC2022S13  Global TIMIT Thai
  • LDC2022V01  HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation
  • LDC2022V02  HAVIC MED Novel 2 Test -- Videos, Metadata and Annotation
  • LDC2022T05  LORELEI Bengali Representative Language Pack
  • LDC2022T01  LORELEI Kinyarwanda Incident Language Pack
  • LDC2022T03  LORELEI Wolof Representative Language Pack
  • LDC2022S08  MASRI Synthetic
  • LDC2022S04  NUBUC
  • LDC2022T04  Qatari Corpus of Argumentative Writing
  • LDC2022L01  Rime-Cantonese: A Normalized Cantonese Jyutping Lexicon
  • LDC2022S11  Samrómur Children Icelandic Speech 1.0
  • LDC2022S05  Samrómur Icelandic Speech 1.0
  • LDC2022S06  Second DIHARD Challenge Evaluation - Eleven Sources
  • LDC2022S07  Second DIHARD Challenge Evaluation - SEEDLingS
  • LDC2022S03  Spoken Digits in Hindi and Indian English
  • LDC2022S02  The Child Subglottal Resonances Database
  • LDC2022S12  Third DIHARD Challenge Development
  • LDC2022S14  Third DIHARD Challenge Evaluation
  • LDC2022S09  Xi'an Guanzhong Object Naming

2021

  • LDC2021S01  Althingi Parliamentary Speech
  • LDC2021T04  ATIS - Seven Languages
  • LDC2021T07  BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
  • LDC2021T11  BOLT Chinese SMS/Chat Parallel Training Data
  • LDC2021T14  BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
  • LDC2021T18  BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
  • LDC2021T15  BOLT Egyptian Arabic SMS/Chat Parallel Training Data
  • LDC2021T12  BOLT Egyptian Arabic Treebank - Conversational Telephone Speech
  • LDC2021T17  BOLT Egyptian Arabic Treebank - SMS/Chat
  • LDC2021T19  BOLT English Translation Treebank - Chinese SMS/Chat
  • LDC2021T03  BOLT English Treebank - SMS/Chat
  • LDC2021T13  Chinese Abstract Meaning Representation 2.0
  • LDC2021L01  Classical Arabic Dictionary
  • LDC2021S02  Columbia Games Corpus
  • LDC2021T16  DiscAlign for Penn and RST Discourse Treebanks
  • LDC2021T10  ESPADA
  • LDC2021S06  Ethnobotanical Research and Language Documentation of Nahuatl
  • LDC2021S03  Global TIMIT Mandarin Chinese
  • LDC2021V01  HAVIC MED Training Data -- Videos, Metadata and Annotation
  • LDC2021T02  LORELEI Akan Representative Language Pack
  • LDC2021S05  MyST Children's Conversational Speech
  • LDC2021T05  Penn Discourse Treebank Version 2.0 - German Translation
  • LDC2021S08  RATS Speaker Identification
  • LDC2021S10  Second DIHARD Challenge Development - Eleven Sources
  • LDC2021S11  Second DIHARD Challenge Development - SEEDLingS
  • LDC2021T08  TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014
  • LDC2021T06  TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010
  • LDC2021S04  The SSNCE Database of Tamil Dysarthric Speech
  • LDC2021S09  UCLA Speaker Variability Database
  • LDC2021S07  Wikipedia Spanish Speech and Transcripts
  • LDC2021T09  X-SRL: Parallel Cross-lingual Semantic Role Labeling

2020

  • LDC2020S04 2018 NIST Speaker Recognition Evaluation Test Set
  • LDC2020T02 Abstract Meaning Representation (AMR) Annotation Release 3.0
  • LDC2020T07 Abstract Meaning Representation 2.0 - Four Translations
  • LDC2020T15 BOLT Chinese-English Word Alignment and Tagging - Conversational Telephone Speech Training
  • LDC2020T05 BOLT Egyptian Arabic-English Word Alignment - Conversational Telephone Speech Training
  • LDC2020T20 BOLT English Co-Reference - Discussion Forum, SMS/Chat, and Conversational Telephone Speech
  • LDC2020T21 BOLT English PropBank and Sense - Discussion Forum, SMS/Chat, and Conversational Telephone Speech
  • LDC2020T09 BOLT English Translation Treebank - Chinese Discussion Forum
  • LDC2020S08 CALLFRIEND American English-Southern Dialect Second Edition
  • LDC2020S06 CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition
  • LDC2020T01 Chinese CogBank
  • LDC2020L02 Chinese Lexical Resources for Gender, Number, Animacy
  • LDC2020T23 Corpus of Law, Academic, and News
  • LDC2020L01 Database of Word Level Statistics - Mandarin
  • LDC2020T19 DEFT Chinese Light and Rich ERE Annotation
  • LDC2020T06 EVALution
  • LDC2020S11 Global TIMIT Learner Simple English
  • LDC2020S09 Global TIMIT Learner Treebank English
  • LDC2020S12 Global TIMIT Mandarin Chinese-Guanzhong Dialect
  • LDC2020S02 IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b
  • LDC2020S07 IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b
  • LDC2020S10 IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b
  • LDC2020S01 LibriVox Spanish
  • LDC2020T10 LORELEI Entity Detection and Linking Knowledge Base
  • LDC2020T11 LORELEI Oromo Incident Language Pack
  • LDC2020T22 LORELEI Tigrinya Incident Language Pack
  • LDC2020T24 LORELEI Ukrainian Representative Language Pack
  • LDC2020T17 LORELEI Vietnamese Representative Language Pack
  • LDC2020T04 Machine Reading Phase 1 IC Training Data
  • LDC2020S03 Mixer 4 and 5 Speech
  • LDC2020S05 Multi-Language Conversational Telephone Speech 2011 - Mandarin Chinese
  • LDC2020T16 Penn Parsed Corpora of Historical English
  • LDC2020S13 Phonemes of Arabic
  • LDC2020T12 SemTransCNC
  • LDC2020T14 Speech Sentiment Annotations
  • LDC2020T03 TAC KBP English Event Argument - Training and Evaluation Data 2014-2015
  • LDC2020T13 TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015
  • LDC2020T08 TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013
  • LDC2020T18 TAC KBP Event Argument - Comprehensive Training and Evaluation Data 2016-2017

2019

  • LDC2019S20 2016 NIST Speaker Recognition Evaluation Test Set
  • LDC2019T01 BOLT Arabic Discussion Forum Parallel Training Data
  • LDC2019T13 BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training
  • LDC2019T18 BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training
  • LDC2019T06 BOLT Egyptian-English Word Alignment -- Discussion Forum Training
  • LDC2019T15 BOLT English Treebank - Discussion Forum
  • LDC2019S21 CALLFRIEND American English-Non-Southern Dialect Second Edition
  • LDC2019S18 CALLFRIEND Canadian French Second Edition
  • LDC2019S04 CALLFRIEND Egyptian Arabic Second Edition
  • LDC2019T07 Chinese Abstract Meaning Representation 1.0
  • LDC2019S07 CIEMPIESS Experimentation
  • LDC2019T11 Corpus of Conversational Persian Transcripts
  • LDC2019T03 DEFT Chinese Committed Belief Annotation
  • LDC2019T16 DEFT English Committed Belief Annotation
  • LDC2019T09 DEFT Spanish Committed Belief Annotation
  • LDC2019S09 First DIHARD Challenge Development - Eight Sources
  • LDC2019S10 First DIHARD Challenge Development - SEEDLingS
  • LDC2019S12 First DIHARD Challenge Evaluation - Nine Sources
  • LDC2019S13 First DIHARD Challenge Evaluation - SEEDLingS
  • LDC2019V01 HAVIC MED Progress Test -- Videos, Metadata and Annotation
  • LDC2019S22 IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b
  • LDC2019S08 IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c
  • LDC2019S16 IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c
  • LDC2019S03 IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b
  • LDC2019S17 LDC Spoken Language Sampler - Fifth Release
  • LDC2019T14 Machine Reading Phase 1 NFL Scoring Training Data
  • LDC2019S23 Magic Data Chinese Mandarin Conversational Speech
  • LDC2019S02 Multi-Language Conversational Telephone Speech 2011 -- Arabic Group
  • LDC2019S15 Multi-Language Conversational Telephone Speech 2011 -- East Asian
  • LDC2019S06 Multi-Language Conversational Telephone Speech 2011 -- English Group
  • LDC2019T04 Multilingual ATIS
  • LDC2019T05 Penn Discourse Treebank Version 3.0
  • LDC2019T10 Phrase Detectives Corpus Version 2
  • LDC2019S19 Polish Speech Database
  • LDC2019S01 SRI Speech-Based Collaborative Learning Corpus
  • LDC2019T08 TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014
  • LDC2019T17 TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017
  • LDC2019T19 TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017
  • LDC2019T02 TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015
  • LDC2019T12 TAC KBP Evaluation Source Corpora 2016-2017
  • LDC2019S14 The DKU-JNU-EMA Electromagnetic Articulography Database
  • LDC2019S11 USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition
  • LDC2019S05 VAST Chinese Speech and Transcripts

2018

  • LDC2018T08 2007 CoNLL Shared Task - Arabic & English
  • LDC2018T06 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish
  • LDC2018T07 2007 CoNLL Shared Task - Greek, Hungarian & Italian
  • LDC2018S06 2011 NIST Language Recognition Evaluation Test Set
  • LDC2018S14 AISHELL-1
  • LDC2018S15 Avatar Education Portuguese
  • LDC2018T10 BOLT Arabic Discussion Forums
  • LDC2018T15 BOLT Chinese SMS/Chat
  • LDC2018T23 BOLT Egyptian Arabic Treebank - Discussion Forum
  • LDC2018T19 BOLT English SMS/Chat
  • LDC2018T18 BOLT Information Retrieval Comprehensive Training and Evaluation
  • LDC2018S09 CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition
  • LDC2018S11 CIEMPIESS Balance
  • LDC2018T20 Concretely Annotated English Gigaword
  • LDC2018T12 Concretely Annotated New York Times
  • LDC2018T01 DEFT Spanish Treebank
  • LDC2018S01 DIRHA English WSJ Audio
  • LDC2018S05 GALE Phase 4 Arabic Broadcast News Speech
  • LDC2018T14 GALE Phase 4 Arabic Broadcast News Transcripts
  • LDC2018T05 H2, E2, ERK1 Children's Writing
  • LDC2018V01 HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation
  • LDC2018S07 IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b
  • LDC2018S13 IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a
  • LDC2018S16 IARPA Babel Telugu Language Pack IARPA-babel303-v1.0a
  • LDC2018S02 IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e
  • LDC2018T04 LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text
  • LDC2018T11 LORELEI Somali Representative Language Pack - Monolingual and Parallel Text
  • LDC2018S03 Multi-Language Conversational Telephone Speech 2011 -- Central Asian
  • LDC2018S08 Multi-Language Conversational Telephone Speech 2011 -- Central European
  • LDC2018S12 Multi-Language Conversational Telephone Speech 2011 -- Spanish
  • LDC2018S17 Nautilus Speaker Characterization
  • LDC2018S10 RATS Language Identification
  • LDC2018S04 Rhythm and Pitch
  • LDC2018T09 SPADE
  • LDC2018T03 TAC KBP Comprehensive English Source Corpora 2009-2014
  • LDC2018T16 TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013
  • LDC2018T22 TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014
  • LDC2018T13 TRAD Arabic-French Parallel Text -- Newsgroup
  • LDC2018T21 TRAD Arabic-French Parallel Text -- Newswire
  • LDC2018T02 TRAD Chinese-French Parallel Text -- Blog
  • LDC2018T17 TRAD Chinese-French Parallel Text -- Broadcast News

2017

  • LDC2017L01 Arabic Speech Recognition Pronunciation Dictionary
  • LDC2017S01 IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7
  • LDC2017S02 GALE Phase 3 Arabic Broadcast News Speech Part 2
  • LDC2017S03 IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b
  • LDC2017S04 Noisy TIMIT Speech
  • LDC2017S05 IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
  • LDC2017S06 2010 NIST Speaker Recognition Evaluation Test Set
  • LDC2017S07 CHiME2 Grid
  • LDC2017S08 IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a
  • LDC2017S09 Multi-Language Conversational Telephone Speech 2011 -- Turkish
  • LDC2017S10 CHiME2 WSJ0
  • LDC2017S11 Metalogue Multi-Issue Bargaining Dialogue
  • LDC2017S12 KSUEmotions
  • LDC2017S13 IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b
  • LDC2017S14 Multi-Language Conversational Telephone Speech 2011 -- South Asian
  • LDC2017S15 GALE Phase 4 Arabic Broadcast Conversation Speech
  • LDC2017S16 LDC Spoken Language Sampler - Fourth Release
  • LDC2017S17 Vehicle City Voices Corpus – Part I
  • LDC2017S18 SRI-FRTIV
  • LDC2017S19 IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e
  • LDC2017S20 RATS Keyword Spotting
  • LDC2017S21 ASpIRE Development and Development Test Sets
  • LDC2017S22 IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a
  • LDC2017S23 CIEMPIESS Light
  • LDC2017S24 CHiME3
  • LDC2017S25 GALE Phase 4 Chinese Broadcast News Speech
  • LDC2017T01 MWE-Aware English Dependency Corpus
  • LDC2017T02 GALE Phase 3 and 4 Chinese Web Parallel Text
  • LDC2017T03 First-Year Law Students' Court Memoranda
  • LDC2017T04 GALE Phase 3 Arabic Broadcast News Transcripts Part 2
  • LDC2017T05 BOLT Chinese Discussion Forum Parallel Training Data
  • LDC2017T06 GALE English-Chinese Parallel Aligned Treebank -- Training
  • LDC2017T07 BOLT Egyptian Arabic SMS/Chat and Transliteration
  • LDC2017T08 Phrase Detectives Corpus
  • LDC2017T09 The EventStatus Corpus
  • LDC2017T10 Abstract Meaning Representation (AMR) Annotation Release 2.0
  • LDC2017T11 BOLT English Discussion Forums
  • LDC2017T12 GALE Phase 4 Arabic Broadcast Conversation Transcripts
  • LDC2017T13 2015-2016 CoNLL Shared Task
  • LDC2017T14 Ancient Chinese Corpus
  • LDC2017T15 English Web Treebank Propbank
  • LDC2017T16 MWE-Aware English Dependency Corpus 2.0
  • LDC2017T17 TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014
  • LDC2017T18 GALE Phase 4 Chinese Broadcast News Transcripts
  • LDC2017V01 UCLA High-Speed Laryngeal Video and Audio

2016

  • LDC2016L01 Bamanankan Lexicon
  • LDC2016S01 GALE Phase 3 Arabic Broadcast Conversation Speech Part 2
  • LDC2016S02 IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c
  • LDC2016S03 GALE Phase 4 Chinese Broadcast Conversation Speech
  • LDC2016S04 CHM150
  • LDC2016S05 Digital Archive of Southern Speech - NLP Version
  • LDC2016S06 IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a
  • LDC2016S07 GALE Phase 3 Arabic Broadcast News Speech Part 1
  • LDC2016S08 IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b
  • LDC2016S09 IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY
  • LDC2016S10 IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
  • LDC2016S11 Multi-Language Conversational Telephone Speech 2011 -- Slavic Group
  • LDC2016S12 IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a
  • LDC2016S13 IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g
  • LDC2016T01 H1 Children's Writing
  • LDC2016T02 Arabic Treebank - Weblog
  • LDC2016T03 NewSoMe Corpus of Opinion in Blogs
  • LDC2016T04 GALE Phase 4 Chinese Weblog Parallel Sentences
  • LDC2016T05 BOLT Chinese Discussion Forums
  • LDC2016T06 GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2
  • LDC2016T07 DEFT Narrative Text
  • LDC2016T08 GALE Phase 3 and 4 Arabic Web Parallel Text
  • LDC2016T09 GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text
  • LDC2016T10 SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing
  • LDC2016T11 GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences
  • LDC2016T12 GALE Phase 4 Chinese Broadcast Conversation Transcripts
  • LDC2016T13 Chinese Treebank 9.0
  • LDC2016T14 GALE Phase 4 Arabic Weblog Parallel Sentences
  • LDC2016T15 GALE Phase 3 and 4 Chinese Broadcast News Parallel Text
  • LDC2016T16 English Speed Networking Conversational Transcripts
  • LDC2016T17 GALE Phase 3 Arabic Broadcast News Transcripts Part 1
  • LDC2016T18 ARL Arabic Dependency Treebank
  • LDC2016T19 BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training
  • LDC2016T20 GALE Phase 4 Arabic Broadcast News Parallel Sentences
  • LDC2016T21 KAFD: Arabic Font Database
  • LDC2016T22 Chinese-English Parallel Sentences Extracted from Patents
  • LDC2016T23 Richer Event Description
  • LDC2016T24 JANA: A Human-Human Dialogues Corpus for Egyptian Dialect
  • LDC2016T25 GALE Phase 3 and 4 Chinese Newswire Parallel Text
  • LDC2016T26 TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation
  • LDC2016T27 GALE Phase 4 Arabic Newswire Parallel Sentences
  • LDC2016V01 HAVIC Pilot Transcription

2011 to 2015

2015

LDC2015L01 SenSem Lexicons
LDC2015S01 GALE Phase 2 Arabic Broadcast News Speech Part 2
LDC2015S02 RATS Speech Activity Detection
LDC2015S03 The Subglottal Resonances Database
LDC2015S04 Mandarin-English Code-Switching in South-East Asia
LDC2015S05 Mandarin Chinese Phonetic Segmentation and Tone
LDC2015S06 GALE Phase 3 Chinese Broadcast Conversation Speech Part 2
LDC2015S07 CIEMPIESS
LDC2015S08 The Walking Around Corpus
LDC2015S09 LDC Spoken Language Sampler - Third Release
LDC2015S10 Arabic Learner Corpus
LDC2015S11 GALE Phase 3 Arabic Broadcast Conversation Speech Part 1
LDC2015S12 Articulation Index LSCP
LDC2015S13 GALE Phase 3 Chinese Broadcast News Speech
LDC2015T01 GALE Phase 2 Arabic Broadcast News Transcripts Part 2
LDC2015T02 SenSem Databank
LDC2015T03 Avocado Research Email Collection
LDC2015T04 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3
LDC2015T05 GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text
LDC2015T06 GALE Chinese-English Parallel Aligned Treebank -- Training
LDC2015T07 GALE Phase 3 and 4 Arabic Broadcast News Parallel Text
LDC2015T08 Coordination Annotation for the Penn Treebank
LDC2015T09 GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2
LDC2015T10 RST Signalling Corpus
LDC2015T11 2006 CoNLL Shared Task - Ten Languages
LDC2015T12 2006 CoNLL Shared Task - Arabic & Czech
LDC2015T13 English News Text Treebank: Penn Treebank Revised
LDC2015T14 GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences
LDC2015T15 TS Wikipedia
LDC2015T16 GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1
LDC2015T17 NewSoMe Corpus of Opinion in News Reports
LDC2015T18 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4
LDC2015T19 GALE Phase 3 and 4 Arabic Newswire Parallel Text
LDC2015T20 ACE 2007 Spanish DevTest - Pilot Evaluation
LDC2015T21 GALE Phase 4 Chinese Broadcast News Parallel Sentences
LDC2015T22 Karlsruhe Children's Text
LDC2015T23 KHATT: Handwritten Arabic Text
LDC2015T24 GALE Phase 4 Chinese Newswire Parallel Sentences
LDC2015T25 GALE Phase 3 Chinese Broadcast News Transcripts

2014

LDC2014S01 CALLFRIEND Farsi Second Edition Speech
LDC2014S02 King Saud University Arabic Speech Database
LDC2014S03 Multi-Channel WSJ Audio
LDC2014S04 USC-SFI MALACH Interviews and Transcripts Czech
LDC2014S05 Hispanic-English Database
LDC2014S06 NIST Language Recognition Evaluation Test Set
LDC2014S07 GALE Phase 2 Arabic Broadcast News Speech Part 1
LDC2014S08 United Nations Proceedings Speech
LDC2014T01 CALLFRIEND Farsi Second Edition Transcripts
LDC2014T02 NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source
LDC2014T03 GALE Arabic-English Parallel Aligned Treebank - Broadcast News Part 2
LDC2014T04 GALE Phase 2 Chinese Broadcast News Parallel Text Part 1
LDC2014T05 GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web
LDC2014T06 ETS Corpus of Non-Native Written English
LDC2014T07 Domain-Specific Hyponym Relations
LDC2014T08 GALE Arabic-English Parallel Aligned Treebank -- Web Training
LDC2014T09 HyTER Networks of Selected OpenMT08/09 Sentences
LDC2014T11 GALE Phase 2 Chinese Broadcast News Parallel Text Part 2
LDC2014T12 Abstract Meaning Representation (AMR) Annotation Release 1.0
LDC2014T13 MADCAT Chinese Pilot Training Set
LDC2014T14 GALE Arabic-English Word Alignment Training Part 3 -- Web
LDC2014T15 GALE Phase 2 Chinese Newswire Parallel Text Part 1
LDC2014T16 TAC KBP Reference Knowledge Base
LDC2014T17 GALE Phase 2 Arabic Broadcast News Transcripts Part 1
LDC2014T18 ACE 2007 Multilingual Training Corpus
LDC2014T19 GALE Arabic-English Word Alignment - Broadcast Training Part 1
LDC2014T20 GALE Phase 2 Chinese Newswire Parallel Text Part 2
LDC2014T21 Chinese Discourse Treebank 0.5
LDC2014T22 GALE Arabic-English Word Alignment Broadcast Training Part 2
LDC2014T25 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2
LDC2014T26 GALE Phase 2 Chinese Web Parallel Text

2013

LDC2013E07 Deep NLU Exploration - DEFT Pilot Source Text and Annotations
LDC2013E11 DEFT Phase 1 Sample Narrative Text Creation
LDC2013E19 DEFT Phase 1 Narrative Text Source Data R1
LDC2013E28 DEFT Phase 1 ERE Annotation Sample
LDC2013E29 DEFT Phase 1 Narrative Text Source Data R2
LDC2013E30 DEFT Phase 1 AMR Annotation Sample
LDC2013E44 DEFT Phase 1 ERE Annotation R1
LDC2013E47 DEFT Phase 1 AMR Annotation R1
LDC2013E64 DEFT Phase 1 ERE Annotation R3
LDC2013E65 TAC 2011 KBP English Temporal Slot Filling Assessment Results
LDC2013E88 DEFT - Source Data for Event Anomaly Pilot Annotation Exercise
LDC2013E93 DEFT - Source Data for Speech and Belief Anomaly Pilot Annotation Exercises
LDC2013E95 DEFT Narrative Text Source Data R3
LDC2013E117 DEFT Phase 1 AMR Annotation R3
LDC2013L01 Maninkakan Lexicon
LDC2013S03 Mixer 6 Speech
LDC2013S04 GALE Phase 2 Chinese Broadcast Conversation Speech
LDC2013S07 GALE Phase 2 Arabic Broadcast Conversation Speech Part 2
LDC2013S08 GALE Phase 2 Chinese Broadcast News Speech
LDC2013T01 Chinese-English Biology and Chemistry Abstract Parallel Text
LDC2013T02 GALE Phase 2 Arabic Web Parallel Text
LDC2013T03 NIST 2012 Open Machine Translation (OpenMT) Evaluation
LDC2013T04 GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1
LDC2013T05 GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web
LDC2013T06 1993-2007 United Nations Parallel Text
LDC2013T07 NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets
LDC2013T08 GALE Phase 2 Chinese Broadcast Conversation Transcripts
LDC2013T10 GALE Arabic-English Parallel-Aligned Treebank - Newswire
LDC2013T11 GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1
LDC2013T12 Manually Annotated Sub-Corpus Third Release
LDC2013T13 Chinese Proposition Bank 3.0
LDC2013T14 GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1
LDC2013T15 MADCAT Phase 3 Training Set
LDC2013T16 GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2
LDC2013T17 GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2
LDC2013T18 Semantic Textual Similarity (STS) 2013 Machine Translation
LDC2013T19 OntoNotes Release 5.0
LDC2013T20 GALE Phase 2 Chinese Broadcast News Transcripts
LDC2013T21 Chinese Treebank 8.0
LDC2013T22 The ARRAU Corpus of Anaphoric Information
LDC2013T23 GALE Chinese-English Word Alignment and Tagging - Broadcast Training Part 1

2012

LDC2012S01 2006 NIST Speaker Recognition Evaluation Test Set Part 2
LDC2012S02 TORGO Database of Dysarthric Articulation
LDC2012S03 Digital Archive of Southern Speech
LDC2012S05 USC-SFI MALACH Interviews and Transcripts English
LDC2012S06 Turkish Broadcast News Speech and Transcripts
LDC2012T01 ModeS TimeBank 1.0
LDC2012T02 English Translation Treebank: An-Nahar Newswire
LDC2012T05 Chinese Dependency Treebank 1.0
LDC2012T07 Arabic Treebank - Broadcast News v1.0
LDC2012T08 Prague Czech-English Dependency Treebank 2.0
LDC2012T09 Arabic-Dialect/English Parallel Text
LDC2012T10 Catalan TimeBank 1.0
LDC2012T11 American English Nickname Collection
LDC2012T12 Spanish TimeBank 1.0
LDC2012T13 English Web Treebank
LDC2012T14 GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2
LDC2012T15 Multilingual Automatic Document Classification Analysis and Translation (MADCAT) Phase 1 Training
LDC2012T16 GALE Chinese-English Word Alignment and Tagging Training Part 1 - Newswire and Web
LDC2012T17 GALE Phase 2 Arabic Newswire Parallel Text
LDC2012T18 GALE Phase 2 Arabic Broadcast News Parallel Text
LDC2012T20 GALE Chinese-English Word Alignment and Tagging Training Part 2 - Newswire
LDC2012T21 Annotated English Gigaword
LDC2012T22 Chinese-English Semiconductor Parallel Text
LDC2012T23 Russian-English Computer Security Parallel Text
LDC2012T24 GALE Chinese-English Word Alignment and Tagging Training Part 3 - Web
LDC2012V01 2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News

2011

LDC2011S01 2005 NIST Speaker Recognition Evaluation Training Data
LDC2011S02 2006 NIST Spoken Term Detection Development Set
LDC2011S03 2006 NIST Spoken Term Detection Evaluation Set
LDC2011S04 2005 NIST Speaker Recognition Evaluation Test Data
LDC2011S05 2008 NIST Speaker Recognition Evaluation Training Set Part 1
LDC2011S06 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set
LDC2011S07 2008 NIST Speaker Recognition Evaluation Training Set Part 2
LDC2011S08 2008 NIST Speaker Recognition Evaluation Test Set
LDC2011S09 2006 NIST Speaker Recognition Evaluation Training Set
LDC2011S10 2006 NIST Speaker Recognition Evaluation Test Set Part 1
LDC2011S11 2008 NIST Speaker Recognition Evaluation Supplemental Set
LDC2011T01 SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
LDC2011T02 ACE 2005 English SpatialML Annotations Version 2
LDC2011T04 Indian Language Part-of-Speech Tagset: Sanskrit
LDC2011T05 2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set
LDC2011T06 Broadcast News Lattices
LDC2011T07 English Gigaword Fifth Edition
LDC2011T08 Datasets for Generic Relation Extraction (reACE)
LDC2011T09 Arabic Treebank: Part 2 v 3.1
LDC2011T10 French Gigaword Third Edition
LDC2011T11 Arabic Gigaword Fifth Edition
LDC2011T12 Spanish Gigaword Third Edition
LDC2011T13 Chinese Gigaword Fifth Edition
LDC2011V01 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 1
LDC2011V02 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2
LDC2011V03 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
LDC2011V04 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2
LDC2011V05 2006 NIST/USF Evaluation Resources for the VACE Program, Meeting Data Test Set Part 1
LDC2011V06 2006 NIST/USF Evaluation Resources for the VACE Program, Meeting Data Test Set Part 2

2006 to 2010

2010

LDC2010S01 Fisher Spanish Speech
LDC2010S02 WTIMIT 1.0
LDC2010S03 2003 NIST Speaker Recognition Evaluation
LDC2010S07 Asian Spoken Language Sampler
LDC2010T01 NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations
LDC2010T02 Czech Broadcast News MDE Transcripts
LDC2010T03 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
LDC2010T04 Fisher Spanish - Transcripts
LDC2010T05 NPS Internet Chatroom Conversations, Release 1.0
LDC2010T07 Chinese Treebank 7.0
LDC2010T08 Arabic Treebank: Part 3 v 3.2
LDC2010T09 ACE 2005 Mandarin SpatialML Annotations
LDC2010T10 NIST 2002 Open Machine Translation (OpenMT) Evaluation
LDC2010T11 NIST 2003 Open Machine Translation (OpenMT) Evaluation
LDC2010T12 NIST 2004 Open Machine Translation (OpenMT) Evaluation
LDC2010T13 Arabic Treebank: Part 1 v 4.1
LDC2010T14 NIST 2005 Open Machine Translation (OpenMT) Evaluation
LDC2010T15 Message Understanding Conference 7 Timed (MUC7_T)
LDC2010T17 NIST 2006 Open Machine Translation (OpenMT) Evaluation
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
LDC2010T19 Korean Newswire Second Edition
LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
LDC2010T22 Manually Annotated Sub-Corpus First Release
LDC2010T23 NIST 2009 Open Machine Translation (OpenMT) Evaluation
LDC2010V01 TRECVID 2004 Keyframes & Transcripts
LDC2010V02 TRECVID 2006 Keyframes

2009

LDC2009E58 TAC 2009 KBP Evaluation Reference Knowledge Base
LDC2009L01 An English Dictionary of the Tamil Verb Second Edition
LDC2009S01 CSLU: Numbers Version 1.3
LDC2009S02 Czech Broadcast Conversation Speech
LDC2009S03 CSLU: S4X Release 1.2
LDC2009S04 2007 NIST Language Recognition Evaluation Test Set
LDC2009S05 2007 NIST Language Recognition Evaluation Supplemental Training Set
LDC2009T01 English CTS Treebank with Structural Metadata
LDC2009T02 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
LDC2009T03 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
LDC2009T06 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
LDC2009T07 Unified Linguistic Annotation Text Collection
LDC2009T08 Japanese Web N-gram Version 1
LDC2009T09 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
LDC2009T10 Language Understanding Annotation Corpus
LDC2009T11 REFLEX Entity Translation Training/DevTest
LDC2009T12 2008 CoNLL Shared Task Data
LDC2009T13 English Gigaword Fourth Edition
LDC2009T14 Tagged Chinese Gigaword Version 2.0
LDC2009T15 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
LDC2009T20 Czech Broadcast Conversation MDE Transcripts
LDC2009T21 Spanish Gigaword Second Edition
LDC2009T22 Arabic Newswire English Translation Collection
LDC2009T23 FactBank 1.0
LDC2009T24 OntoNotes Release 3.0
LDC2009T25 Web 1T 5-gram, 10 European Languages Version 1
LDC2009T26 NXT Switchboard Annotations
LDC2009T27 Chinese Gigaword Fourth Edition
LDC2009T28 French Gigaword Second Edition
LDC2009T29 ACL Anthology Reference Corpus
LDC2009T30 Arabic Gigaword Fourth Edition
LDC2009V01 Audiovisual Database of Spoken American English

2008

LDC2008L02 Hindi WordNet
LDC2008L03 Global Yoruba Lexical Database v. 1.0
LDC2008T02 GALE Phase 1 Arabic Blog Parallel Text
LDC2008T06 GALE Phase 1 Chinese Blog Parallel Text
LDC2008T07 Chinese Proposition Bank 2.0
LDC2008T08 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
LDC2008T09 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
LDC2008T13 BLLIP North American News Text, Complete
LDC2008T17 CALLHOME Mandarin Chinese Transcripts - XML Version
LDC2008T18 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
LDC2008T20 PennBioIE CYP 1.0

2007

LDC2007E37 CoNLL 2007 Shared Task English Test Set, Part 1
LDC2007E38 CoNLL 2007 Shared Task Arabic Test Set, Part 1
LDC2007E39 CoNLL 2007 Shared Task Czech Test Set, Part 1
LDC2007E40 CoNLL 2007 Shared Task English Test Set, Part 2
LDC2007E41 CoNLL 2007 Shared Task Arabic Test Set, Part 2
LDC2007E42 CoNLL 2007 Shared Task Czech Test Set, Part 2
LDC2007T22 2001 Topic Annotated Enron Email Data Set
LDC2007T36 Chinese Treebank 6.0

2006

LDC2006S33 Middle East Technical University Turkish Microphone Speech v. 1.0
LDC2006S42 Korean Broadcast News Speech
LDC2006T03 Korean Propbank
LDC2006T06 ACE 2005 Multilingual Training Corpus
LDC2006T09 Korean Treebank Annotations Version 2.0
LDC2006T13 Web 1T 5-gram Version 1
LDC2006T14 Korean Broadcast News Transcripts

2001 to 2005

2005

LDC2005L01 Mawukakan Lexicon
LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech
LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
LDC2005S13 Fisher English Training Part 2, Speech
LDC2005S14 Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
LDC2005S15 HKUST Mandarin Telephone Speech, Part 1
LDC2005S22 Articulation Index
LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
LDC2005T01 Chinese Treebank 5.0
LDC2005T02 Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
LDC2005T03 Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
LDC2005T05 Multiple-Translation Arabic (MTA) Part 2
LDC2005T06 Chinese News Translation Text Part 1
LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
LDC2005T08 Discourse Graphbank
LDC2005T09 ACE 2004 Multilingual Training Corpus
LDC2005T10 Chinese English News Magazine Parallel Text
LDC2005T12 English Gigaword Second Edition
LDC2005T13 CCGbank
LDC2005T14 Chinese Gigaword Second Edition
LDC2005T19 Fisher English Training Part 2, Transcripts
LDC2005T20 Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
LDC2005T23 Chinese Proposition Bank 1.0
LDC2005T24 RT-04 MDE Training Data Text/Annotations
LDC2005T28 HARD 2004 Text
LDC2005T29 HARD 2004 Topics and Annotations
LDC2005T30 Arabic Treebank: Part 4 v 1.0 (MPG Annotation)
LDC2005T32 HKUST Mandarin Telephone Transcript Data, Part 1
LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
LDC2005T35 American National Corpus (ANC) Second Release

2004

LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
LDC2004S01 Czech Broadcast News Speech
LDC2004S02 ICSI Meeting Speech
LDC2004S04 2002 NIST Speaker Recognition Evaluation
LDC2004S05 ISL Meeting Speech Part 1
LDC2004S07 Switchboard Cellular Part 2 Audio
LDC2004S09 NIST Meeting Pilot Corpus Speech
LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
LDC2004T01 Czech Broadcast News Transcripts
LDC2004T02 Arabic Treebank: Part 2 v 2.0
LDC2004T03 Morphologically Annotated Korean Text
LDC2004T04 ICSI Meeting Transcripts
LDC2004T05 Chinese Treebank 4.0
LDC2004T07 Multiple-Translation Chinese (MTC) Part 3
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
LDC2004T10 ISL Meeting Transcripts Part 1
LDC2004T11 Arabic Treebank: Part 3 v 1.0
LDC2004T13 NIST Meeting Pilot Corpus Transcripts and Metadata
LDC2004T14 Proposition Bank I
LDC2004T15 2000 Communicator Dialogue Act Tagged
LDC2004T16 2001 Communicator Dialogue Act Tagged
LDC2004T17 Arabic News Translation Text Part 1
LDC2004T18 Arabic English Parallel News Part 1
LDC2004T19 Fisher English Training Speech Part 1 Transcripts
LDC2004T23 Prague Arabic Dependency Treebank 1.0
LDC2004V01 FORM1 Kinematic Gesture

2003

LDC2003S01 2001 Communicator Evaluation
LDC2003S03 Korean Telephone Conversations Speech
LDC2003S06 Santa Barbara Corpus of Spoken American English Part II
LDC2003T08 Korean Telephone Conversations Transcripts
LDC2003T11 ACE-2 Version 1.0

2002

LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0
LDC2002S06 Switchboard-2 Phase III Audio
LDC2002S13 2001 HUB5 English Evaluation
LDC2002S56 2000 Communicator Evaluation

2001

(2001 corpora not purchased)

1993 to 2000

2000

LDC2000S85 Santa Barbara Corpus of Spoken American English Part I
LDC2000T45 Korean Newswire

1999

LDC99S79 Switchboard-2 Phase II
LDC99T42 Treebank-3

1998

LDC98S70 HUB5 Spanish Telephone Speech Corpus
LDC98S71 1997 English Broadcast News Speech (HUB4)
LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
LDC98S74 1997 Spanish Broadcast News Speech (HUB4-NE)
LDC98S75 Switchboard-2 Phase I
LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
LDC98T27 HUB5 Spanish Transcripts
LDC98T29 1997 Spanish Broadcast News Transcripts (HUB4-NE)

1997

LDC97S62 Switchboard-1 Release 2

1996

LDC96L14 CELEX2
LDC96S36 Boston University Radio Speech
LDC96S37 CALLHOME Japanese Speech
LDC96S53 CALLFRIEND Japanese
LDC96T17 CALLHOME Spanish Transcripts
LDC96T18 CALLHOME Japanese Transcripts

1995

LDC95S26 ATIS3 Test Data
LDC95T7 Treebank-2
LDC95T20 Hansard French/English

1994

LDC94S19 ATIS3 Training Data

1993

LDC93S1 TIMIT Acoustic-Phonetic Continuous Speech Corpus
LDC93S3A Resource Management 2.0 Complete
