Collecting the corpus
Our research team contacted schools from across the country, briefing them on the project and inviting them to participate. All writing was obtained subject to the students’ voluntary informed consent, with additional consent obtained from the head teacher, the relevant subject teachers, and the students’ legal guardians.
Teachers collected texts from participating students and either photocopied these texts and mailed them to us or invited us into their schools to make photocopies ourselves.
All of the texts were received in handwritten form, so we employed a small team of transcribers to type them up. Transcribers received two days of training and worked closely with a member of the core project team to deal with issues that arose during the process.
Transcription proceeded in two phases. In the first phase, each transcriber was assigned a set of photocopies to type up in accordance with our transcription conventions. They were also asked to make two types of change to the original texts: 1) replace any proper names which might compromise participants’ or institutions’ anonymity with anonymisation markers; 2) where a word was misspelled, wrongly capitalized, or abbreviated, insert a tag recording both the original form and a ‘correction’ giving the correct spelling, the correct capitalization, or the expanded form of the abbreviation.
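To make the two types of change concrete, the following is a minimal sketch in Python. The markup it produces (a `<name/>` anonymisation marker and a `<corr orig="...">` correction tag) and the function names are hypothetical, chosen only for illustration; the project's actual conventions are those set out in the transcription manuals, and the real work was done by human transcribers rather than by a script.

```python
from html import escape

# Hypothetical marker; the real anonymisation markers are defined in the manual.
ANON_MARKER = "<name/>"


def correction_tag(original: str, corrected: str) -> str:
    """Build a tag that keeps the original form alongside its correction."""
    return f'<corr orig="{escape(original, quote=True)}">{escape(corrected)}</corr>'


def annotate(tokens, corrections, proper_names):
    """Apply both changes to a list of word tokens.

    corrections: maps an original form to its corrected form
                 (spelling, capitalization, or expanded abbreviation).
    proper_names: set of names that could identify a person or institution.
    """
    out = []
    for tok in tokens:
        if tok in proper_names:
            out.append(ANON_MARKER)          # change 1: anonymise
        elif tok in corrections:
            out.append(correction_tag(tok, corrections[tok]))  # change 2: tag
        else:
            out.append(tok)
    return " ".join(out)
```

For example, calling `annotate(["i", "went", "to", "Hogwarts"], {"i": "I"}, {"Hogwarts"})` would keep the student's original "i" inside the tag while recording "I" as the correction, and would replace the school name with the marker.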
In the second phase, each transcriber was assigned texts which had originally been transcribed by someone else. They both reviewed the original transcription for accuracy and added annotations related to punctuation and grammar.
At each stage, transcribers followed a manual which set out our transcription conventions and principles to be followed during the process. The manual for stage one can be found here and the manual for stage two can be found here.
The conventions set out above describe the ‘basic’ version of the corpus. For the purposes of analysis, further versions were created incorporating different types of additional linguistic information.
We used the CLAWS tagger to automatically add information about the part-of-speech of each word in the corpus. To achieve more accurate classifications, prior to tagging, misspelled words were corrected and unclear/illegible material removed. Material appearing inside tables was also removed.
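The clean-up applied before tagging can be sketched as follows. This is an illustration only: the tag names (`<corr orig="...">`, `<unclear>`, `<table>`) and the function `prepare_for_tagging` are assumptions made for the example, not the project's actual markup or tooling, and the sketch does not call CLAWS itself.

```python
import re

# Hypothetical markup, assumed for this example only:
#   <corr orig="dogg">dog</corr>   a correction tag holding the original form
#   <unclear>...</unclear>         unclear or illegible material
#   <table>...</table>             material appearing inside a table
CORR = re.compile(r'<corr orig="[^"]*">(.*?)</corr>', re.DOTALL)
UNCLEAR = re.compile(r"<unclear>.*?</unclear>", re.DOTALL)
TABLE = re.compile(r"<table>.*?</table>", re.DOTALL)


def prepare_for_tagging(text: str) -> str:
    """Return plain text suitable for passing to a part-of-speech tagger."""
    text = TABLE.sub(" ", text)      # drop table material
    text = UNCLEAR.sub(" ", text)    # drop unclear/illegible material
    text = CORR.sub(r"\1", text)     # keep only the corrected form
    return re.sub(r"\s+", " ", text).strip()  # tidy leftover whitespace
```

Keeping the corrected form rather than the original is what allows the tagger to classify a misspelled word as the word the student intended.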
The corpus was tagged with syntactic information in two ways. First, the entire corpus was automatically tagged for part-of-speech and grammatical relations using the Stanford CoreNLP suite of tools. As with the CLAWS part-of-speech tagging, misspellings were corrected and unclear/illegible material and tables were removed prior to parsing.
Second, a subset of the corpus was manually tagged by a team of trained annotators. This analysis focused specifically on tagging syntactic elements within noun phrases and subordinate clauses. Procedures and conventions used in this process are described in full here. The hand-parsed version of the corpus is available upon request. Please contact Phil Durrant for more information.