A growing movement within natural language processing (NLP) and cognitive science asks how we can gain a deeper understanding of the generalizations that are learned by neural language models. While a language model may achieve high performance on certain benchmarks, another measure of success may be the degree to which its predictions agree with human intuitions about grammatical phenomena. To this end, an emerging line of work has begun evaluating language models as "psycholinguistic subjects" (e.g. Linzen et al. 2016, Futrell et al. 2018). This approach has shown certain models to be capable of learning a wide range of phenomena, while failing at others.

However, as this subfield grows, it becomes increasingly difficult to compare and replicate results. Test suites from existing papers have been published in a variety of formats, making them difficult to adapt in new studies. It has also been notoriously challenging to reproduce model output due to differences in computing environments and resources.

Furthermore, this research demands nuanced knowledge about both natural language syntax and machine learning. This has made it difficult for experts on both sides to engage in discussion: linguists may have trouble running language models, and computer scientists may have trouble designing robust suites of test items.

This is why we created SyntaxGym: a unified platform where language and NLP researchers can design psycholinguistic tests and visualize the performance of language models. Our goal is to make psycholinguistic assessment of language models more standardized, reproducible, and accessible to a wide variety of researchers.
SyntaxGym has three main components:
1. Browse or create psycholinguistic test suites.
2. Browse containerized language models.
3. View performance across models and test suites through interactive visualizations.
In information theory, the surprisal of a word $w$ measures the amount of information gained from observing $w$, conditioned on the context in which it occurs. Surprisal is commonly used in psycholinguistics, as it has been shown to correlate with human behavioral measures (e.g. Smith & Levy 2013).

Formally, the surprisal of a word is the negative log probability of that word given its context: $S(w|\text{context}) = - \log_2 p(w|\text{context})$. As a rule of thumb, we expect ungrammatical (or unexpected) constructions to have high surprisal and grammatical (or predictable) constructions to have low surprisal.
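As a small worked illustration of the definition above, the following Python sketch converts probabilities into surprisal measured in bits. The probabilities are made-up values rather than output from any real model, and the function name `surprisal` is chosen here for illustration; it is not part of the SyntaxGym tools.

```python
import math


def surprisal(prob: float) -> float:
    """Return the surprisal, in bits, of an event with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(prob)


# Hypothetical conditional probabilities p(w | context) for two continuations.
p_expected = 0.5      # a predictable word
p_unexpected = 0.125  # a less predictable word

print(surprisal(p_expected))    # 1.0 bit
print(surprisal(p_unexpected))  # 3.0 bits
```

The less probable continuation receives the higher surprisal, which is the pattern the rule of thumb describes.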
Each test suite consists of a set of sentences known as items. Each item is split into chunks called regions, and the content of a region can differ from one condition to another.
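To make this terminology concrete, here is a minimal Python sketch of a single item with four regions and two conditions. The class `Item`, the region and condition names, and the example sentence are all illustrative assumptions; this is not the file format that SyntaxGym itself uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    """One test item: every condition fills the same ordered list of regions."""
    number: int
    region_names: List[str]
    conditions: Dict[str, List[str]]  # condition name -> content of each region

    def sentence(self, condition: str) -> str:
        """Join the non-empty regions of one condition into a sentence."""
        return " ".join(r for r in self.conditions[condition] if r)


# Hypothetical subject-verb agreement item; only the "verb" region differs.
item = Item(
    number=1,
    region_names=["subject", "modifier", "verb", "continuation"],
    conditions={
        "match": ["The keys", "to the cabinet", "are", "on the table."],
        "mismatch": ["The keys", "to the cabinet", "is", "on the table."],
    },
)

for name in item.conditions:
    print(f"{name}: {item.sentence(name)}")
```

In a setup like this, one might compare a model's surprisal at the `verb` region between the two conditions, expecting the mismatching verb to be more surprising.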

Unlike NLP benchmarks such as GLUE, the syntactic generalization benchmark provided by SyntaxGym is meant for testing only. That means you can train your language model however you'd like; the test suites can be used to evaluate any pre-trained or off-the-shelf model.
We are glad that you find SyntaxGym useful! If you use the website or command-line tools, we ask that you please cite the ACL 2020 system demonstration paper:

@inproceedings{gauthier-etal-2020-syntaxgym,
title = "{S}yntax{G}ym: An Online Platform for Targeted Evaluation of Language Models",
author = "Gauthier, Jon and Hu, Jennifer and Wilcox, Ethan and Qian, Peng and Levy, Roger",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
month = jul,
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-demos.10",
pages = "70--76",
abstract = "Targeted syntactic evaluations have yielded insights into the generalizations learned by neural network language models. However, this line of research requires an uncommon confluence of skills: both the theoretical knowledge needed to design controlled psycholinguistic experiments, and the technical proficiency needed to train and deploy large-scale language models. We present SyntaxGym, an online platform designed to make targeted evaluations accessible to both experts in NLP and linguistics, reproducible across computing environments, and standardized following the norms of psycholinguistic experimental design. This paper releases two tools of independent value for the computational linguistics community: 1. A website, syntaxgym.org, which centralizes the process of targeted syntactic evaluation and provides easy tools for analysis and visualization; 2. Two command-line tools, syntaxgym and lm-zoo, which allow any user to reproduce targeted syntactic evaluations and general language model inference on their own machine.",
}

If you use the original test suites, models, or results presented on the website, please cite the ACL 2020 long paper:

@inproceedings{hu-etal-2020-systematic,
title = "A Systematic Assessment of Syntactic Generalization in Neural Language Models",
author = "Hu, Jennifer and Gauthier, Jon and Qian, Peng and Wilcox, Ethan and Levy, Roger",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.158",
pages = "1725--1744",
abstract = "While state-of-the-art neural network models continue to achieve lower perplexity scores on language modeling benchmarks, it remains unknown whether optimizing for broad-coverage predictive performance leads to human-like syntactic knowledge. Furthermore, existing work has not provided a clear picture about the model properties required to produce proper syntactic generalizations. We present a systematic evaluation of the syntactic knowledge of neural language models, testing 20 combinations of model types and data sizes on a set of 34 English-language syntactic test suites. We find substantial differences in syntactic generalization performance by model architecture, with sequential models underperforming other architectures. Factorially manipulating model architecture and training dataset size (1M-40M words), we find that variability in syntactic generalization performance is substantially greater by architecture than by dataset size for the corpora tested in our experiments. Our results also reveal a dissociation between perplexity and syntactic generalization performance.",
}

We would love to hear your feedback. Please email us at contact@syntaxgym.org.