Academic Collocations

The files below contain lists of collocations retrieved from the British Academic Written English (BAWE) corpus.

The lists are organized according to the syntactic relationship between the collocation elements. Because collocations can vary characteristically between different disciplines (see here and here for relevant research), I have created three sets of lists:

  • collocations in the corpus as a whole;
  • collocations in the ‘hard disciplines’ of Engineering, Chemistry, Physics, Biological Sciences, Food Sciences and Computer Science;
  • collocations in the ‘soft disciplines’ of Law, English, Classics, Philosophy, History, Sociology and Politics.

Click on the links below to download the relevant files:

Full corpus

Lemmatized collocations

Word form collocations

Hard disciplines

Lemmatized collocations

Soft disciplines

Lemmatized collocations

How the lists were created

Traditionally, collocations have been identified as words which co-occur frequently within a set span of words. However, this approach can be problematic as the distance between words is not always an accurate guide to whether words are in a syntactic relationship with each other. This is illustrated in the following examples. In example 1, the highlighted words are in a collocational relationship, in spite of being separated by eight intervening words. By contrast, in example 2, the highlighted words are not in a collocational relationship, though they are separated by only six words.

  1. The old dream of wireless communication through space has now been
  2. she realizes that the buzzing sound from her dream is present in her bedroom
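To make the limitation of the span-based approach concrete, here is a minimal Python sketch of a window test applied to example 2. It assumes that the highlighted words in that example are "realizes" and "dream" (the highlighting is not visible in plain text), and the function names are hypothetical. It illustrates the traditional method only, not the procedure used to build the lists.

```python
def token_distance(tokens, word_a, word_b):
    """Smallest positional distance between word_a and word_b, or None if either is absent."""
    positions_a = [i for i, tok in enumerate(tokens) if tok == word_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == word_b]
    distances = [abs(i - j) for i in positions_a for j in positions_b if i != j]
    return min(distances) if distances else None


def cooccur_within_span(tokens, word_a, word_b, span):
    """Span-based test: True if the two words occur within `span` positions of each other."""
    distance = token_distance(tokens, word_a, word_b)
    return distance is not None and distance <= span


# Example 2 from the text, split on whitespace (a simplification of real tokenization).
tokens = "she realizes that the buzzing sound from her dream is present in her bedroom".split()

# "realizes" and "dream" are 7 positions apart, i.e. six intervening words.
print(token_distance(tokens, "realizes", "dream"))
print(cooccur_within_span(tokens, "realizes", "dream", span=7))
```

A window of seven positions would count this pair as a co-occurrence even though the two words are not syntactically related, while a narrower window would miss genuinely related pairs that are further apart.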

To overcome this problem, these collocation lists were created on the basis of a dependency parse, performed through the Stanford CoreNLP pipeline (https://nlp.stanford.edu). The Stanford parser tokenizes and lemmatizes the corpus and tags it for both part-of-speech and syntactic dependencies between words. This enables us to identify words which are in specific types of syntactic relationship with each other.
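As a rough sketch of what extracting word pairs from a dependency parse might look like, the Python below reads a parse in a CoNLL-U-style layout (ten tab-separated columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). Several things here are assumptions for illustration: that the parser output has been exported in this layout, the hand-written sample sentence (not actual parser output), the relation labels in `TARGET_RELATIONS`, and the function names.

```python
# Hypothetical set of relations of interest; the real target relations may differ.
TARGET_RELATIONS = {"nsubj", "obj", "amod"}

# Hand-written sample parse, for illustration only.
SAMPLE_ROWS = [
    "1 The the DET DT _ 2 det _ _",
    "2 study study NOUN NN _ 3 nsubj _ _",
    "3 provides provide VERB VBZ _ 0 root _ _",
    "4 strong strong ADJ JJ _ 5 amod _ _",
    "5 evidence evidence NOUN NN _ 3 obj _ _",
]
SAMPLE = "\n".join(row.replace(" ", "\t") for row in SAMPLE_ROWS)


def pairs_from_sentence(rows, relations):
    """Return (head lemma, relation, dependent lemma) for each target relation in one sentence."""
    lemma_by_id = {row[0]: row[2] for row in rows}
    result = []
    for row in rows:
        head_id, relation = row[6], row[7]
        if relation in relations and head_id in lemma_by_id:
            result.append((lemma_by_id[head_id], relation, row[2]))
    return result


def extract_pairs(conllu_text, relations=TARGET_RELATIONS):
    """Collect lemma pairs from every sentence; sentences are separated by blank lines."""
    pairs, sentence = [], []
    for line in conllu_text.splitlines() + [""]:
        line = line.strip()
        if line.startswith("#"):  # comment lines carry sentence metadata
            continue
        if not line:  # blank line closes the current sentence
            pairs.extend(pairs_from_sentence(sentence, relations))
            sentence = []
            continue
        fields = line.split("\t")
        if fields[0].isdigit():  # skip multiword-token ranges and empty nodes
            sentence.append(fields)
    return pairs


print(extract_pairs(SAMPLE))
```

Because each pair is defined by a head-dependent link rather than by distance, "provide" and "evidence" are paired regardless of how many words separate them.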

I wrote a simple R script to go through the parsed corpus, identifying every case of words in one of the five target syntactic relationships shown above. For each of these relationships, any word pair occurring more than five times was retained as a potential collocation. Association measures for each collocation were then calculated, based on the frequencies of words in the BAWE corpus.
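The filtering and scoring step can be sketched along the following lines. This is a Python illustration, not the R script itself. The text does not say which association measures were used, so pointwise mutual information (log2 of observed over expected frequency) stands in here as one common choice; the function name, the data layout and the toy frequencies are all assumptions.

```python
import math
from collections import Counter


def score_pairs(pairs, word_freq, corpus_size, min_freq=5):
    """Keep pairs occurring more than `min_freq` times and score them with mutual information.

    pairs       -- iterable of (head lemma, relation, dependent lemma) tuples
    word_freq   -- dict mapping each lemma to its corpus frequency
    corpus_size -- total number of tokens in the corpus
    """
    scored = {}
    for (head, relation, dependent), observed in Counter(pairs).items():
        if observed <= min_freq:  # retain only pairs occurring MORE than min_freq times
            continue
        expected = word_freq[head] * word_freq[dependent] / corpus_size
        scored[(head, relation, dependent)] = math.log2(observed / expected)
    return scored


# Toy data, invented for illustration.
toy_pairs = [("provide", "obj", "evidence")] * 6 + [("make", "obj", "claim")] * 5
toy_freq = {"provide": 100, "evidence": 60, "make": 200, "claim": 50}

for pair, mi in score_pairs(toy_pairs, toy_freq, corpus_size=6000).items():
    print(pair, round(mi, 3))
```

In the toy data, the pair that occurs exactly five times is dropped, since the threshold requires more than five occurrences.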