ETS Corpus of Non-Native Written English

Use of this dataset is restricted to the UNT Community. Off-campus users must log in to view.

Creation Information

Blanchard, Daniel; Tetreault, Joel; Higgins, Derrick; Cahill, Aoife & Chodorow, Martin. June 16, 2014.

Context

This dataset is part of the collection entitled: Linguistic Corpora and was provided by the UNT Libraries to the UNT Digital Library, a digital repository hosted by the UNT Libraries. It has been viewed 119 times. More information about this dataset can be viewed below.

Who

People and organizations associated with either the creation of this dataset or its content.

Provided By

UNT Libraries

The UNT Libraries serve the university and community by providing access to physical and online collections, fostering information literacy, supporting academic research, and much, much more.

What

Descriptive information to help identify this dataset. Follow the links below to find similar items on the Digital Library.

Titles

  • Main Title: ETS Corpus of Non-Native Written English
  • Alternate Title: Educational Testing Service Corpus of Non-Native Written English

Description

ETS Corpus of Non-Native Written English was developed by Educational Testing Service and comprises 12,100 English essays written by speakers of 11 non-English native languages as part of the TOEFL (Test of English as a Foreign Language), an international test of academic English proficiency. The test includes reading, writing, listening, and speaking sections and is delivered by computer in a secure test center. This release contains 1,100 essays for each of the 11 native languages, sampled from eight topics, with the score level (low/medium/high) given for each essay.

The corpus was developed with the specific task of native language identification in mind, but it is also likely to support tasks and studies in the educational domain, including grammatical error detection and correction and automatic essay scoring, as well as a broad range of research in natural language processing and corpus linguistics. For native language identification, the following division is recommended: 82% as training data, 9% as development data, and 9% as test data, split according to the file IDs accompanying the data set.
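
The recommended division can be illustrated with a minimal Python sketch. It assumes the development and test file IDs have already been read into sets; the function names and the essay IDs used here are hypothetical and do not reflect the actual file naming or layout of the corpus.

```python
from typing import Dict, Iterable, List, Set


def split_by_file_ids(
    essay_ids: Iterable[str],
    dev_ids: Set[str],
    test_ids: Set[str],
) -> Dict[str, List[str]]:
    """Assign each essay ID to train, dev, or test using fixed ID lists.

    Any ID that appears in neither dev_ids nor test_ids goes to train.
    (Hypothetical helper; the real corpus ships its own ID files.)
    """
    overlap = dev_ids & test_ids
    if overlap:
        raise ValueError(f"IDs listed in both dev and test: {sorted(overlap)}")

    splits: Dict[str, List[str]] = {"train": [], "dev": [], "test": []}
    for essay_id in essay_ids:
        if essay_id in test_ids:
            splits["test"].append(essay_id)
        elif essay_id in dev_ids:
            splits["dev"].append(essay_id)
        else:
            splits["train"].append(essay_id)
    return splits


def split_shares(splits: Dict[str, List[str]]) -> Dict[str, float]:
    """Return the fraction of all essays that falls into each split."""
    total = sum(len(ids) for ids in splits.values())
    if total == 0:
        return {name: 0.0 for name in splits}
    return {name: len(ids) / total for name, ids in splits.items()}


if __name__ == "__main__":
    # Made-up IDs for illustration only: 100 essays, 9 dev, 9 test.
    ids = [f"essay_{i:03d}" for i in range(100)]
    dev = set(ids[:9])
    test = set(ids[9:18])
    result = split_by_file_ids(ids, dev, test)
    print({name: len(members) for name, members in result.items()})
    print(split_shares(result))
```

Splitting by fixed ID lists rather than by random sampling keeps the partition identical across studies, which is what makes results on the test portion comparable.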

Physical Description

The data is sampled from essays written in 2006 and 2007 by test takers whose native languages were Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. The essays are provided in both original raw and tokenized forms as UTF-8 encoded text files. Also included are the prompts (topics) for the essays and metadata about the test takers' proficiency level.

Publication Information

  • Preferred Citation: Blanchard, Daniel, et al. ETS Corpus of Non-Native Written English LDC2014T06. Web Download. Philadelphia: Linguistic Data Consortium, 2014.

Collections

This dataset is part of the following collections of related materials.

Linguistic Corpora

Sets of representative written or oral texts annotated for use in corpus linguistics.

UNT Libraries Licensed Content

A selection of materials licensed for use by members of the UNT community. Access to these items is restricted to the UNT community.

When

Dates and time periods associated with this dataset.

Creation Date

  • June 16, 2014

Added to The UNT Digital Library

  • Feb. 3, 2020, 9:48 a.m.

Description Last Updated

  • Jan. 18, 2022, 7:55 p.m.

Usage Statistics

When was this dataset last used?

Yesterday: 0
Past 30 days: 3
Total Uses: 119

Blanchard, Daniel; Tetreault, Joel; Higgins, Derrick; Cahill, Aoife & Chodorow, Martin. ETS Corpus of Non-Native Written English, dataset, June 16, 2014; (https://digital.library.unt.edu/ark:/67531/metadc1610667/: accessed May 26, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu.
