ETS Corpus of Non-Native Written English

Blanchard, Daniel; Tetreault, Joel; Higgins, Derrick; Cahill, Aoife; Chodorow, Martin


Use of this dataset is restricted to the UNT Community. Off-campus users must log in to view.


Creation Information

Blanchard, Daniel; Tetreault, Joel; Higgins, Derrick; Cahill, Aoife & Chodorow, Martin. June 16, 2014.

Context

This dataset is part of the collection entitled Linguistic Corpora and was provided by the UNT Libraries to the UNT Digital Library, a digital repository hosted by the UNT Libraries. It has been viewed 119 times.

Provided By

UNT Libraries

The UNT Libraries serve the university and community by providing access to physical and online collections, fostering information literacy, supporting academic research, and much, much more.

Titles

Main Title: ETS Corpus of Non-Native Written English
Alternate Title: Educational Testing Service Corpus of Non-Native Written English

Description

The ETS Corpus of Non-Native Written English was developed by Educational Testing Service and comprises 12,100 English essays written by speakers of 11 non-English native languages as part of the TOEFL (Test of English as a Foreign Language), an international test of academic English proficiency. The test includes reading, writing, listening, and speaking sections and is delivered by computer in a secure test center. This release contains 1,100 essays for each of the 11 native languages, sampled from eight topics, with information about the score level (low/medium/high) for each essay.

The corpus was developed with the specific task of native language identification in mind, but it is also likely to support tasks and studies in the educational domain, including grammatical error detection and correction and automatic essay scoring, as well as a broad range of research in natural language processing and corpus linguistics. For native language identification, the following division is recommended: 82% as training data, 9% as development data, and 9% as test data, split according to the file IDs accompanying the data set.
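As an illustration of the recommended division, the following is a minimal Python sketch of partitioning essays by file ID. It assumes the development and test file IDs are already available as plain lists; the function names and the example IDs are hypothetical and do not reflect the actual file naming or ID lists shipped with the corpus.

```python
def split_by_file_ids(essay_ids, dev_ids, test_ids):
    """Partition essay IDs into train/dev/test using supplied ID lists.

    Any ID not listed as dev or test is treated as training data
    (an assumption made for this sketch).
    """
    dev_ids, test_ids = set(dev_ids), set(test_ids)
    overlap = dev_ids & test_ids
    if overlap:
        raise ValueError(f"IDs assigned to both dev and test: {sorted(overlap)}")

    splits = {"train": [], "dev": [], "test": []}
    for essay_id in essay_ids:
        if essay_id in test_ids:
            splits["test"].append(essay_id)
        elif essay_id in dev_ids:
            splits["dev"].append(essay_id)
        else:
            splits["train"].append(essay_id)
    return splits


def split_shares(splits):
    """Return each split's share of the total as a whole-number percentage."""
    total = sum(len(ids) for ids in splits.values())
    return {name: round(100 * len(ids) / total) for name, ids in splits.items()}


# Toy data: 100 made-up IDs, not real corpus file names.
ids = [f"essay_{i:03d}.txt" for i in range(100)]
splits = split_by_file_ids(ids, dev_ids=ids[82:91], test_ids=ids[91:])
shares = split_shares(splits)
```

Splitting on the supplied IDs rather than sampling at random keeps the partition reproducible, so results can be compared across studies that use the same division.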

Physical Description

The data is sampled from essays written in 2006 and 2007 by test takers whose native languages were Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. The essays are provided in both original raw and tokenized forms as UTF-8 encoded text files. Also included are the prompts (topics) for the essays and metadata about the test takers' proficiency level.
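Because the essays are distributed as UTF-8 text files, reading them with an explicit encoding avoids platform-dependent defaults. The sketch below is a minimal Python example under two assumptions: that each essay is stored in its own file, and that the tokenized form separates tokens with whitespace. The file name and sample sentence are invented for illustration and are not taken from the corpus.

```python
import tempfile
from pathlib import Path


def read_essay(path):
    """Read one essay file, decoding it explicitly as UTF-8."""
    return Path(path).read_text(encoding="utf-8")


def whitespace_tokens(text):
    """Split tokenized text on whitespace (assumed token separator)."""
    return text.split()


# Usage with a throwaway file standing in for one tokenized essay.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_essay.txt"  # hypothetical file name
    sample.write_text("I agree with this statement .\n", encoding="utf-8")
    tokens = whitespace_tokens(read_essay(sample))
```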

Language

English

Item Type

Dataset

Identifier

Unique identifying numbers for this dataset in the Digital Library or other systems.

ISBN: 1-58563-675-4
Accession or Local Control No: LDC2014T06
Archival Resource Key: ark:/67531/metadc1610667

Publication Information

Preferred Citation: Blanchard, Daniel, et al. ETS Corpus of Non-Native Written English LDC2014T06. Web Download. Philadelphia: Linguistic Data Consortium, 2014.

Collections

This dataset is part of the following collections of related materials.

Linguistic Corpora

Sets of representative written or oral texts annotated for use in corpus linguistics.

UNT Libraries Licensed Content

A selection of materials licensed for use by members of the UNT community. Access to these items is restricted to the UNT community.

Creation Date

June 16, 2014

Added to The UNT Digital Library

Feb. 3, 2020, 9:48 a.m.

Description Last Updated

Jan. 18, 2022, 7:55 p.m.

Usage Statistics

When was this dataset last used?

Yesterday: 0

Past 30 days: 3

Total Uses: 119

Blanchard, Daniel; Tetreault, Joel; Higgins, Derrick; Cahill, Aoife & Chodorow, Martin. ETS Corpus of Non-Native Written English, dataset, June 16, 2014; (https://digital.library.unt.edu/ark:/67531/metadc1610667/: accessed May 26, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu.
