Individual results
View in-depth performance of a single language model on a single test suite.
Region-by-region surprisal
Sample item for Filler-Gap Dependencies (object extraction)
The first item of the test suite is shown below for quick reference. Please visit the page for Filler-Gap Dependencies (object extraction) to see the full list of items.
Item | Condition | prefix | comp | np1 | verb | np2 | prep | np3 | end |
---|---|---|---|---|---|---|---|---|---|
1 | what_nogap | I know | what | our uncle | grabbed | the food | in front of | the guests | at the holiday party |
1 | that_nogap | I know | that | our uncle | grabbed | the food | in front of | the guests | at the holiday party |
1 | what_gap | I know | what | our uncle | grabbed | | in front of | the guests | at the holiday party |
1 | that_gap | I know | that | our uncle | grabbed | | in front of | the guests | at the holiday party |
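
As a minimal sketch, the item above can be held as an ordered list of regions per condition. The names `REGIONS`, `ITEM_1`, `region`, and `sentence` are hypothetical and chosen only for illustration. The sketch assumes that the gap conditions leave the np2 region empty, as the region names and the position of "in front of" suggest.

```python
# Region names in the order used by the table header.
REGIONS = ["prefix", "comp", "np1", "verb", "np2", "prep", "np3", "end"]

# Item 1, copied from the table. Assumption: the gap conditions have an
# empty np2 region (the extracted object), so it is stored as "".
ITEM_1 = {
    "what_nogap": ["I know", "what", "our uncle", "grabbed", "the food",
                   "in front of", "the guests", "at the holiday party"],
    "that_nogap": ["I know", "that", "our uncle", "grabbed", "the food",
                   "in front of", "the guests", "at the holiday party"],
    "what_gap": ["I know", "what", "our uncle", "grabbed", "",
                 "in front of", "the guests", "at the holiday party"],
    "that_gap": ["I know", "that", "our uncle", "grabbed", "",
                 "in front of", "the guests", "at the holiday party"],
}


def region(condition: str, name: str) -> str:
    """Return the text of one named region in one condition."""
    return ITEM_1[condition][REGIONS.index(name)]


def sentence(condition: str) -> str:
    """Join the non-empty regions of a condition into the full sentence."""
    return " ".join(text for text in ITEM_1[condition] if text)


print(sentence("what_gap"))
print(repr(region("what_gap", "np2")))
```

Keeping the empty region in place means that a region name such as prep refers to the same position in every condition, which is what lets surprisal be compared region by region across conditions.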
Prediction performance for GPT-2 XL on Filler-Gap Dependencies (object extraction)
Accuracy | Prediction | Description |
---|---|---|
100.00% | (640,what_nogap/5,np2) > (641,that_nogap/5,np2) | We expect the np2 region to be less surprising in the that_nogap condition than in the what_nogap condition, because an upstream wh-word should set up an expectation for a gap, so an overt object noun phrase is unexpected there. |
95.83% | (642,what_gap/6,prep) < (639,that_gap/6,prep) | We expect surprisal at the “prep” region to be lower in the what_gap condition than in the that_gap condition, because gaps must be licensed by upstream wh-words (such as “what”). |
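
Each prediction compares the surprisal of one region across two conditions, and the accuracy is the share of items for which the inequality holds. The sketch below shows that calculation under stated assumptions. The surprisal values are made up for illustration and are not GPT-2 XL outputs, the helper names are hypothetical, and the leading numeric identifiers in the formulas are ignored because this page does not state their meaning.

```python
# Hypothetical per-item surprisals (in bits), keyed by (condition, region).
# These numbers are invented for illustration only.
items = [
    {("what_nogap", "np2"): 9.1, ("that_nogap", "np2"): 6.4,
     ("what_gap", "prep"): 3.2, ("that_gap", "prep"): 7.8},
    {("what_nogap", "np2"): 8.7, ("that_nogap", "np2"): 5.9,
     ("what_gap", "prep"): 4.0, ("that_gap", "prep"): 6.5},
    {("what_nogap", "np2"): 10.2, ("that_nogap", "np2"): 7.3,
     ("what_gap", "prep"): 5.1, ("that_gap", "prep"): 4.6},
    {("what_nogap", "np2"): 7.9, ("that_nogap", "np2"): 6.8,
     ("what_gap", "prep"): 2.9, ("that_gap", "prep"): 8.3},
]


def nogap_prediction(item):
    """what_nogap np2 should be more surprising than that_nogap np2."""
    return item[("what_nogap", "np2")] > item[("that_nogap", "np2")]


def gap_prediction(item):
    """what_gap prep should be less surprising than that_gap prep."""
    return item[("what_gap", "prep")] < item[("that_gap", "prep")]


def accuracy(items, prediction):
    """Fraction of items for which the prediction holds."""
    hits = sum(1 for item in items if prediction(item))
    return hits / len(items)


print(f"{accuracy(items, nogap_prediction):.2%}")
print(f"{accuracy(items, gap_prediction):.2%}")
```

With these invented values the first prediction holds for every item and the second fails for one of the four, which mirrors how a single failing item lowers the reported accuracy.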