Frequency of Grammatical Disjuncts

(This is a copy of my blog post on the WordPress OpenCog Brainwave blog, originally made on 7 June 2009 -- Linas Vepstas).

The link-grammar parser uses labeled links to connect together pairs of words. In order to capture the idea of proper grammatical construction, any given word is only allowed to have very specific links to its right or left: for example, verbs have their subject on the left, and an object on the right. Link-grammar defines hundreds of different link types, and there are typically dozens or even hundreds of ways that these can attach to a word. Each allowed set of links is called a “disjunct”. So, for example:

MVp- Js+

is a disjunct that says “there must be an MVp link from this word, going to the left, and a Js link, going to the right”. This disjunct commonly connects prepositions to a verb on their left (the MV- link) and the object of the preposition on the right (the J+ link).
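The notation can be made concrete with a small sketch. It is not part of the link-grammar code base: the names `Connector`, `parse_disjunct` and `describe` are made up for illustration, and the sketch assumes the simplest written form, in which a disjunct is a space-separated list of link types, each ending in `-` (links to the left) or `+` (links to the right). Real dictionary entries can carry extra markers, such as the `@` multi-connector prefix, which are ignored here.

```python
from typing import List, NamedTuple


class Connector(NamedTuple):
    """One half of a link: a link type plus the side it attaches on."""
    label: str      # link type, e.g. "MVp" or "Js"
    direction: str  # "-" links to the left, "+" links to the right


def parse_disjunct(text: str) -> List[Connector]:
    """Split a disjunct such as "MVp- Js+" into its connectors.

    Hypothetical helper for illustration only; it handles just the plain
    "label followed by + or -" form used in this post.
    """
    connectors = []
    for token in text.split():
        label, direction = token[:-1], token[-1]
        if direction not in ("+", "-") or not label:
            raise ValueError(f"not a connector: {token!r}")
        connectors.append(Connector(label, direction))
    return connectors


def describe(connectors: List[Connector]) -> str:
    """Render the connectors as a short English phrase."""
    side = {"-": "left", "+": "right"}
    return ", ".join(
        f"{c.label} link to the {side[c.direction]}" for c in connectors
    )


print(describe(parse_disjunct("MVp- Js+")))
# prints: MVp link to the left, Js link to the right
```

Treating the whole list of connectors as a single unit, rather than looking at each connector separately, is what makes the disjunct behave like a fine-grained tag.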

A good way to think about disjuncts is to imagine them as very fine-grained part-of-speech tags. Thus, when one sees “MVp- Js+” associated to a word, one knows not only that the word is a preposition, but even a bit more: it's a preposition that took a singular object. Disjuncts classify words not just into crude part-of-speech categories, but into much finer categories: thus verbs are not just transitive or intransitive verbs, but might be transitive verbs that take both direct and indirect objects, or participles, etc.

Siva Reddy, a GSOC 2009 summer student, prepared a table of the frequency of occurrence of different disjuncts in a large collection of text. The top six entries are

Ds+ 950275.635843
Xp- 838569.90527
A+ 616522.664867
AN+ 566658.997313
MVp- Js+ 563082.649325
MVp- Jp+ 446487.310222

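and these are exactly what one might expect: determiners of singular nouns (Ds+), sentence-ending punctuation (Xp-), adjectives (A+), noun modifiers (AN+), and prepositions taking a singular (Js+) or plural (Jp+) object.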

A graph of rank vs. frequency is shown below:

[Figure: Disjunct rank vs. frequency of occurrence]

As can be seen, the distribution is more or less Zipfian, with a power-law exponent of 1.5. The fact that the long tail appears to be linear on a log-log plot indicates that grammatical construction in the English language is more or less scale-free: difficult and awkward constructions are increasingly rare. The fact that the graph is not purely Zipfian, but instead has a knee for the most common grammatical connections, suggests that the most common grammatical constructions are “less common than they should be”: almost as if English speakers are resisting the use of formulaic sentence constructions. So, for example, since adjectives and noun-modifiers appear near the top of the ranking, this suggests that English speakers “could have” used more adjectives and noun-modifiers, but didn’t. Quite why this is so is not clear. Perhaps the use of anaphora, and of references in general, reduces the need for lots of modifiers.
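For readers who want to check an exponent like this on their own counts, here is a minimal sketch of one common way to estimate it: an ordinary least-squares line through log(frequency) against log(rank). This is not necessarily how the figure of 1.5 above was obtained. The function name `fit_power_law_exponent` and its `min_rank` parameter are made up for illustration, and the data in the example is synthetic, generated to follow an exact power law, not the measured table.

```python
import math
from typing import Sequence


def fit_power_law_exponent(frequencies: Sequence[float], min_rank: int = 1) -> float:
    """Estimate s in frequency ~ C * rank**(-s) by least squares in log-log space.

    Frequencies are sorted into rank order first. Setting min_rank > 1 drops
    the top-ranked items, so that only the tail past a knee is fitted.
    """
    ranked = sorted(frequencies, reverse=True)
    points = [
        (math.log(rank), math.log(freq))
        for rank, freq in enumerate(ranked, start=1)
        if rank >= min_rank
    ]
    if len(points) < 2:
        raise ValueError("need at least two ranked frequencies to fit a line")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    # The fitted slope is negative for a decaying power law; report it as positive.
    return -sxy / sxx


# Synthetic counts that follow rank**-1.5 exactly (illustration only).
synthetic = [1.0e6 * rank ** -1.5 for rank in range(1, 1001)]
exponent = fit_power_law_exponent(synthetic)
tail_exponent = fit_power_law_exponent(synthetic, min_rank=10)
print(round(exponent, 3), round(tail_exponent, 3))
```

On real counts, the whole-range fit and the tail-only fit would be expected to differ when there is a knee, which gives one rough way to see how far the top-ranked disjuncts sit from the tail's straight line.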

The open questions are then:

  1. Why a power law of 1.5?
  2. Why is there a knee?
  3. Does this result hold for other languages?

The corpus used here consists of approximately 1 million sentences, obtained by parsing entire Wikipedia articles, Voice of America news stories, and 10 books from Project Gutenberg, including War and Peace, Jane Austen, and some scientific or medical texts.