LINGUIST611: Surprisal

Class goals

Two big ideas

There are two big ideas that we will explore today, as we discuss Hale (2001). They are:

Idea #1: The relationship between the parser and the grammar is one of 'strong competence' - that is, the rules of grammar are directly engaged in the moment by moment processing of words.

Idea #2: Our lifetime of experience gives us sharp expectations about the likelihood of different linguistic events, and we use this knowledge to infer the structure of the strings we are hearing.

Expectation-based approaches to comprehension

The approach to sentence comprehension that encodes these two big ideas in a specific parsing model has come to be called Surprisal theory. Rather than focusing on describing the principles that underlie the moment by moment processes of understanding the input, expectation-based approaches highlight the key role of our linguistic experience in guiding processing. The leading idea is that comprehenders use expectations built up from that experience to infer the structure of the input as it unfolds.

The resulting view is one that Hale (2001) calls strong competence: we adopt the minimum amount of cognitive overhead needed to explain the processing of sentences. Rather than positing extra machinery, we simply assume that comprehenders have access to their grammar and to knowledge of how often different structures occur in the input, and that they treat the processing problem as an inference problem. Word by word, we ask: what's the most probable syntactic structure now?

Discussion point: How does strong competence contrast with other approaches we have seen to sentence and word comprehension so far? Does the adoption of this principle seem justified / motivated to you?

Let's introduce one key concept to see how this works: a probabilistic context free grammar. A PCFG is just like a vanilla context free grammar, with a twist: the rules now have weights appended to them:

These weights are conditional probabilities. They specify the probability that a given node will be rewritten as the sequence of symbols to its right. For example, note that NP has three different possible expansions in this grammar. They are all equally frequent; therefore, each has a probability of 1/3.
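
To make this concrete, here is a minimal sketch in Python of how weighted rules can be represented. The rules and probabilities below are purely illustrative (they are not Hale's actual grammar); the only property carried over is that NP has three equally probable expansions.

```python
# A toy PCFG: each left-hand side maps to a list of (right-hand side, probability)
# pairs. The probabilities attached to a given left-hand side must sum to 1,
# because they are conditional probabilities of rewriting that symbol.
# NOTE: illustrative rules and weights only, not the grammar from Hale (2001).
TOY_PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "N"), 1 / 3),    # e.g. "the reporter"
           (("NP", "RC"), 1 / 3),   # an NP modified by a relative clause
           (("PRO",), 1 / 3)],      # a bare pronoun
    "VP": [(("V", "NP"), 0.7),
           (("V",), 0.3)],
}

def check_pcfg(pcfg):
    """Verify that the expansions of every symbol form a probability distribution."""
    for lhs, expansions in pcfg.items():
        total = sum(p for _, p in expansions)
        assert abs(total - 1.0) < 1e-9, f"{lhs} probabilities sum to {total}"

check_pcfg(TOY_PCFG)
```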

This grammar comes from Hale (2001). It is largely a toy grammar, except in one key respect: it represents the distributional fact that subject relative clauses (SRCs) are much more frequent than object relative clauses (ORCs). How does surprisal theory link this observation to moment by moment processing?

Probabilistic grammars like the one above can be used to define language models - probabilistic models that can assign a probability to any string licensed by the grammar. In many modern NLP systems, language models are used to estimate the conditional probability of words in context. This could be useful, for example, in speech recognition systems, where an estimate of the probability of a word in context provides top-down information that can facilitate recognition.
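
A useful way to relate whole-string probabilities to word-in-context probabilities is the chain rule, which factorizes the probability of a string into a product of conditional word probabilities:

$$P(w_1 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})$$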

The probability of a word in context is often quantified with surprisal, that is, the negative log probability of a word in context. Low probability, high surprisal; high probability, low surprisal. Surprisal theory states that the difficulty of recognizing / integrating a word into context is a monotonic function of its surprisal: the higher a word's surprisal, the more difficult it is to process:
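
In symbols, using the standard definition:

$$\mathrm{surprisal}(w_i) = -\log P(w_i \mid w_1 \ldots w_{i-1})$$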

Amazingly, Levy (2008) showed that this value is formally equivalent to the relative entropy over syntactic structures before and after seeing a word. In other words, more surprising words are more surprising because they lead to a more dramatic update of our beliefs about which structures are likely.
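
Schematically, writing T for syntactic structures, Levy's result can be stated as the following equivalence: the surprisal of a word equals the relative entropy of the updated distribution over structures with respect to the previous one.

$$\mathrm{surprisal}(w_i) = D_{\mathrm{KL}}\big(P(T \mid w_1 \ldots w_i) \,\|\, P(T \mid w_1 \ldots w_{i-1})\big)$$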

How do we know how probable a word is? Surprisal theory says that the probability of a word comes from its probability under the possible parses. All parses consistent with the string are activated in proportion to their probability. So we need to know (i) how to compute a tree's probability for a given string and (ii) how to combine those probabilities to derive an estimate of the surprisal of a word in context.

Let's see how we get (i). It comes right from our PCFG:
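
In symbols, the probability of a tree under a PCFG is simply the product of the probabilities of all the rules used in its derivation:

$$P(T) = \prod_{r \in \mathrm{rules}(T)} P(r)$$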

As for (ii), it comes from combining all the activated parses: each contributes in proportion to its activation (that is, its probability). More technically, this is known as marginalization:
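
In symbols, the probability of a prefix w_1 ... w_i is obtained by summing over the parses consistent with it, and the surprisal of word w_i is the log-ratio of successive prefix probabilities:

$$P(w_1 \ldots w_i) = \sum_{T \,\text{consistent with}\, w_1 \ldots w_i} P(T), \qquad \mathrm{surprisal}(w_i) = -\log \frac{P(w_1 \ldots w_i)}{P(w_1 \ldots w_{i-1})}$$

Here is a minimal brute-force sketch of this computation in Python. It is not Hale's actual algorithm (Hale computes prefix probabilities incrementally with a probabilistic Earley parser); instead, for a tiny non-recursive grammar we can simply enumerate every complete parse and marginalize directly. The grammar and probabilities are invented for illustration.

```python
import math
from itertools import product

# Illustrative, non-recursive PCFG (not Hale's grammar). Lowercase symbols are terminals.
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("the", "man"), 0.5), (("the", "dog"), 0.5)],
    "VP": [(("sleeps",), 0.6), (("sees", "NP"), 0.4)],
}

def expand(symbol):
    """Return (yield, probability) pairs for every complete expansion of a symbol.
    Assumes the grammar has no recursion, so the enumeration terminates."""
    if symbol not in PCFG:                      # terminal symbol
        return [((symbol,), 1.0)]
    results = []
    for rhs, rule_prob in PCFG[symbol]:
        child_expansions = [expand(child) for child in rhs]
        for combo in product(*child_expansions):
            words = tuple(w for ws, _ in combo for w in ws)
            prob = rule_prob
            for _, p in combo:
                prob *= p                       # tree probability = product of rule probabilities
            results.append((words, prob))
    return results

def prefix_probability(prefix, parses):
    """Marginalization: sum the probabilities of all parses whose yield starts with the prefix."""
    return sum(p for words, p in parses if words[:len(prefix)] == prefix)

def word_by_word_surprisal(sentence):
    """Surprisal of each word = -log2 of the ratio of successive prefix probabilities.
    (An ungrammatical prefix has probability 0, i.e. infinite surprisal.)"""
    parses = expand("S")
    surprisals = []
    prev = 1.0                                  # probability of the empty prefix
    for i in range(1, len(sentence) + 1):
        cur = prefix_probability(tuple(sentence[:i]), parses)
        surprisals.append(-math.log2(cur / prev))
        prev = cur
    return surprisals

# Prints approximately [0.0, 1.0, 1.32, 0.0, 1.0]: surprisal is highest at "sees",
# the less expected VP continuation under this toy grammar.
print(word_by_word_surprisal(["the", "dog", "sees", "the", "man"]))
```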

Interestingly, Hale (2001) showed that this explains the SRC/ORC asymmetry. Let's look at Hale's example, and work it out for ourselves.

Note that the point in the sentence where difficulty is predicted to occur is very different on this theory!

Staub (2010) argued that self-paced reading is too coarse to get a precise measurement of exactly which word causes the difficulty. He proposed instead to use eye-tracking-while-reading to measure processing difficulty:

Consistent with the predictions of both expectation-based and memory-based theories, difficulty is seen at both words, although perhaps more clearly in the embedded subject region.

Surprisal theory also makes accurate predictions about where garden path difficulty will arise. Consider this toy grammar, and the word by word surprisals:

Discussion point: What's the difference between a 'garden path' effect under Hale's theory, and a 'garden path' effect under the classical Garden Path Theory? Are they similar types of mental events? In what ways are they different?

Ambiguity advantage effect

One surprising prediction that immediately follows from Hale and Levy's surprisal theory is that ambiguous sentences should in some cases be easier than unambiguous sentences.

Consider the following triplet of sentences:

[AMBIGUOUS]: If you flipped the channel, you would see the accomplices of the thieves who were indicted for stealing the Mona Lisa.

[HIGH ATTACH]: If you flipped the channel, you would see the accomplices of the thief who were indicted for stealing the Mona Lisa.

[LOW ATTACH]: If you flipped the channel, you would see the accomplice of the thieves who were indicted for stealing the Mona Lisa.

The first sentence has a relative clause whose attachment site is globally ambiguous. In the second and third sentence, the attachment site of the relative clause is unambiguous: the number marking on the NPs forces it to attach 'high' (i.e. to the first NP) or 'low' (i.e. to the second NP).

Somewhat counterintuitively, it is the ambiguous condition that is easier to process than the two unambiguous conditions, a phenomenon known as the ambiguity advantage effect (first discovered by Traxler, Clifton & Pickering, 1998):

This is exactly the response pattern predicted by surprisal theory. The critical region (and especially the inflected auxiliary were) is more predictable or expected in the ambiguous condition. This is because, whether the relative clause attaches high or low, the critical region is a fairly probable continuation, and both parses contribute probability mass to it. In the high and low attachment conditions, by contrast, the critical region becomes less probable, because it is grammatically consistent with only one of the two potential parses.
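
To see the logic with some invented numbers: suppose that, just before the relative clause, the parser assigns probability 0.4 to high attachment and 0.6 to low attachment, and that the relative clause itself is equally likely under either attachment. In the ambiguous sentence, the plural were is consistent with both attachments, so both contribute probability mass and the critical region is supported by 0.4 + 0.6 = 1.0 of that mass. In the high-attachment sentence only the high parse is consistent, so only 0.4 of the mass supports the critical region, making its surprisal about -log2(0.4) ≈ 1.3 bits higher than in the ambiguous case (and about -log2(0.6) ≈ 0.7 bits higher in the low-attachment sentence). The numbers are made up, but the direction of the prediction is the point.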

Anti-locality

Up to now we have focused on locality effects in sentence processing, which have a pleasing intuitive quality to them: closer should be easier!

It turns out that expectation-based theories make a unique, and very interesting prediction: in certain cases, we should be able to see anti-locality effects. One of the earliest demonstrations of this was Konieczny & Döring (2003):

These two sentences differ in a single letter: s vs. m. But this one-letter difference conveys a case difference. In one example, 'friend' is a genitive modifier of the preceding noun; in the other, it is a dative dependent of the verb. The crucial difference is that the dative conveys more information about the upcoming verb and (critically, from the point of view of Surprisal) reduces the space of possible upcoming words. Once the parser has seen the dative, the number of possible continuations has been dramatically reduced. The genitive, on the other hand, is not very informative about the structure of the VP, and so it constrains expectations very little.
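
A toy illustration of this logic, with numbers invented for exposition: suppose that after a genitive modifier roughly 20 verbs remain about equally likely, each with probability 0.05, so the verb's surprisal is about -log2(0.05) ≈ 4.3 bits. If a dative argument has already been seen, only verbs compatible with a dative remain plausible, say 5 of them at probability 0.2 each, giving a surprisal of about -log2(0.2) ≈ 2.3 bits. The extra dependent makes the verb more predictable, so Surprisal predicts faster reading there.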

In a striking finding, Konieczny & Döring found that reading times following the dative argument were faster: having an additional dependent did not increase reading time, but decreased it, as predicted by Surprisal theory: