Individual results
View in-depth performance of a single language model on a single test suite.
Region-by-region surprisal
Sample item for Filler-Gap Dependencies (subject extraction)
The first item of the test suite is shown below for quick reference. Please visit the page for Filler-Gap Dependencies (subject extraction) to see the full list of items.
Item | Condition | prefix | comp | np1 | verb | np2 | prep | np3 | end |
---|---|---|---|---|---|---|---|---|---|
1 | what_nogap | I know | who | our uncle | grabbed | the food | in front of | the guests | at the holiday party |
1 | that_nogap | I know | that | our uncle | grabbed | the food | in front of | the guests | at the holiday party |
1 | what_gap | I know | who | | grabbed | the food | in front of | the guests | at the holiday party |
1 | that_gap | I know | that | | grabbed | the food | in front of | the guests | at the holiday party |
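Region-by-region surprisal is obtained by summing token-level surprisals, -log2 p(token | context), over the tokens that make up each region. A minimal sketch of that summation (the probability values below are toy numbers for illustration, not actual GPT-2 XL output):

```python
import math

def region_surprisal(token_probs):
    """Sum of token-level surprisals, -log2 p(token | context),
    over the tokens of one region.

    token_probs: conditional probabilities (under the model) of each
    token in the region, in order. Toy values here, not model output.
    """
    return sum(-math.log2(p) for p in token_probs)

# Toy probabilities for the two tokens of the np1 region "our uncle":
# -log2(0.25) = 2 bits, -log2(0.125) = 3 bits, so the region gets 5 bits.
print(region_surprisal([0.25, 0.125]))  # → 5.0
```

Because surprisals add across tokens, longer regions are compared only against the same region in another condition, never across regions.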
Prediction performance for GPT-2 XL on Filler-Gap Dependencies (subject extraction)
Accuracy | Prediction | Description |
---|---|---|
66.67% | (676,what_nogap/3,np1) > (677,that_nogap/3,np1) | We expect NP1 to be less surprising in the that_nogap condition than in the what_nogap condition, because an upstream wh-word should set up an expectation for a gap. |
79.17% | (678,what_gap/4,verb) < (675,that_gap/4,verb) | We expect the “verb” region to be lower in surprisal in the what_gap condition than in the that_gap condition, because gaps must be licensed by upstream wh-words (such as “what”). |
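Each prediction is an inequality over one region's surprisal in two conditions, evaluated item by item; the reported accuracy is the fraction of items for which the inequality holds. A hedged sketch of that scoring (the surprisal values and the `prediction_accuracy` helper are invented for illustration, not GPT-2 XL's actual numbers):

```python
def prediction_accuracy(items, region, cond_low, cond_high):
    """Fraction of items where the region's surprisal in cond_low is
    strictly lower than in cond_high (i.e., the prediction holds).

    items: list of dicts mapping condition -> {region: surprisal in bits}.
    """
    hits = sum(1 for it in items if it[cond_low][region] < it[cond_high][region])
    return hits / len(items)

# Invented "verb"-region surprisals (bits) for two items: the prediction
# what_gap < that_gap holds for the first item but fails for the second.
items = [
    {"what_gap": {"verb": 4.1}, "that_gap": {"verb": 9.8}},
    {"what_gap": {"verb": 7.5}, "that_gap": {"verb": 6.2}},
]
print(prediction_accuracy(items, "verb", "what_gap", "that_gap"))  # → 0.5
```

Under this scheme an accuracy of 66.67% means the inequality held on two thirds of the suite's items, chance performance being 50%.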