Individual results

View in-depth performance of a single language model on a single test suite.

Region-by-region surprisal
Sample item for Filler-Gap Dependencies (hierarchy)
Each sentence is segmented into nine scored regions: prefix, subj_subj, subj_wh, subj_embed, subject_gap, filler, matrix_verb, matrix_gap, continuation.

| Item | Condition | Sentence |
| --- | --- | --- |
| 1 | what_nogap | The fact that my brother said who his friend trusted our uncle at the party surprised my daughter yesterday afternoon |
| 1 | that_nogap | The fact that my brother said that his friend trusted our uncle at the party surprised my daughter yesterday afternoon |
| 1 | what_subjgap | The fact that my brother said who his friend trusted at the party surprised my daughter yesterday afternoon |
| 1 | that_subjgap | The fact that my brother said that his friend trusted at the party surprised my daughter yesterday afternoon |
| 1 | what_matrixgap | The fact that my brother said who his friend trusted our uncle at the party surprised yesterday afternoon |
| 1 | that_matrixgap | The fact that my brother said that his friend trusted our uncle at the party surprised yesterday afternoon |
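The region-by-region view aggregates model surprisal over these regions: a region's score is the sum of the surprisals of the tokens inside it, and an empty region (such as a gap site) scores zero. Below is a minimal sketch of that aggregation in Python, assuming per-token surprisals have already been obtained from the model under evaluation; the function name, the surprisal values, and the region split are illustrative assumptions, not the suite's actual segmentation.

```python
def region_surprisals(tokens, token_surprisals, regions):
    """Sum per-token surprisals (in bits, -log2 p) over each named region.

    tokens / token_surprisals: parallel lists for one condition of one item.
    regions: (region_name, token_count) pairs in sentence order; a count of
             zero marks an empty region such as a gap site.
    """
    assert len(tokens) == len(token_surprisals)
    totals = {}
    cursor = 0
    for name, count in regions:
        # sum() over an empty slice yields 0, so empty regions score zero.
        totals[name] = sum(token_surprisals[cursor:cursor + count])
        cursor += count
    return totals

# Toy sentence fragment with made-up surprisal values and an illustrative split.
tokens = ["who", "his", "friend", "trusted", "at", "the", "party"]
surps = [3.1, 2.0, 4.5, 3.8, 7.2, 1.1, 2.6]
regions = [("subj_wh", 1), ("subj_embed", 3), ("subject_gap", 0), ("filler", 3)]
print(region_surprisals(tokens, surps, regions))
# e.g. {'subj_wh': 3.1, 'subj_embed': 10.3, 'subject_gap': 0, 'filler': 10.9}
# (modulo float rounding)
```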
Prediction performance for GPT-2 on Filler-Gap Dependencies (hierarchy)
| Accuracy | Formula | Description |
| --- | --- | --- |
| 70.83% | (573,what_nogap/6,filler) > (572,that_nogap/6,filler) | We expect the “filler” region to be less surprising in the that_nogap condition than in the what_nogap condition, because an upstream wh-word should set up an expectation for a gap. |
| 91.67% | (575,what_subjgap/6,filler) < (576,that_subjgap/6,filler) | We expect the “filler” region to have lower surprisal in the what_subjgap condition than in the that_subjgap condition, because gaps must be licensed by an upstream wh-word (such as “what”). |
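A prediction is scored per item: the formula's inequality is evaluated on each item's region surprisals, and the reported accuracy is the fraction of items for which it holds. Here is a minimal sketch of that scoring, assuming surprisals have already been collected into an item → condition → region mapping; the data layout, function name, and values are assumptions for illustration, not the evaluation platform's actual API.

```python
import operator

def prediction_accuracy(surprisals, cond_a, cond_b, region, op):
    """Fraction of items where op(surprisal[cond_a][region],
    surprisal[cond_b][region]) holds."""
    hits = sum(
        1
        for item in surprisals.values()
        if op(item[cond_a][region], item[cond_b][region])
    )
    return hits / len(surprisals)

# Two hypothetical items with made-up filler-region surprisals.
data = {
    1: {"what_nogap": {"filler": 9.8}, "that_nogap": {"filler": 7.4}},
    2: {"what_nogap": {"filler": 6.1}, "that_nogap": {"filler": 6.5}},
}

# First prediction above: the filler region should be *more* surprising
# after "who" (what_nogap) than after "that" (that_nogap).
acc = prediction_accuracy(data, "what_nogap", "that_nogap", "filler", operator.gt)
print(f"{acc:.2%}")  # 50.00% -- the inequality holds for item 1 only
```

The second prediction is scored the same way with `operator.lt`, comparing the what_subjgap and that_subjgap conditions at the same region.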