Parsed BAWE corpus

Follow the links below to download a parsed version of the British Academic Written English (BAWE) corpus of university student writing.

The corpus was originally collected by Hilary Nesi, Sheena Gardner, Paul Thompson, and Paul Wickens as part of the project ‘An Investigation of Genres of Assessed Writing in British Higher Education’, funded by the Economic and Social Research Council (2004-2007, project number RES-000-23-0800). It comprises approximately 3,000 texts written by university students in England across four broad disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) and across four levels of study (undergraduate years 1-3 and taught masters level). The original corpus, metadata and additional information can be accessed at the Oxford Text Archive (http://hdl.handle.net/20.500.12024/2539).

The parsed versions of the corpus linked below were created using the Stanford CoreNLP parser.

For ease of use, I’ve created three versions, as follows:

Version 1: .conll files: these are the raw files output by the parser, in .conll format.

Version 2: .csv files: these are the parsed texts saved in .csv format. These can be more easily opened in spreadsheet software and with programs like R. I have also added column headings and sentence numbers.

Version 3: .csv files, categorised: these are the same .csv files found in Version 2, but organised into separate folders for each disciplinary group (Arts and Humanities, Life Sciences, Physical Sciences, and Social Sciences) and each academic level (undergraduate years 1-3 and taught postgraduate).
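
For anyone who would rather start from the raw .conll files, the kind of conversion described for Version 2 (adding column headings and sentence numbers) might look roughly like the Python sketch below. This is not the script used to produce the downloads. It assumes the usual CoNLL conventions of one tab-separated token per line and a blank line between sentences. The column names in ASSUMED_COLUMNS are a guess and should be checked against the actual files, and the function names are purely illustrative.

```python
import csv
import io

# Assumed column names: check these against the real files before relying on them.
ASSUMED_COLUMNS = ["index", "word", "lemma", "pos", "ner", "head", "deprel"]


def conll_to_rows(conll_text, columns=ASSUMED_COLUMNS):
    """Turn CoNLL-style text into a list of dicts, adding a sentence number."""
    rows = []
    sentence_id = 1
    seen_token = False  # True once the current sentence has at least one token
    for line in conll_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence; repeated blanks are ignored.
            if seen_token:
                sentence_id += 1
                seen_token = False
            continue
        fields = line.split("\t")
        row = {"sentence": sentence_id}
        row.update(zip(columns, fields))  # extra fields beyond `columns` are dropped
        rows.append(row)
        seen_token = True
    return rows


def rows_to_csv(rows, columns=ASSUMED_COLUMNS):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["sentence"] + list(columns), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up two-sentence sample for illustration; not taken from the corpus.
    sample = (
        "1\tThe\tthe\tDT\tO\t2\tdet\n"
        "2\tessay\tessay\tNN\tO\t0\tROOT\n"
        "\n"
        "1\tIt\tit\tPRP\tO\t2\tnsubj\n"
        "2\targues\targue\tVBZ\tO\t0\tROOT\n"
    )
    print(rows_to_csv(conll_to_rows(sample)))
```

To apply this to a downloaded file, read its contents into a string and pass that string to conll_to_rows. The resulting CSV text can be written to disk and opened in spreadsheet software or R in the same way as the Version 2 files.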
