CRoW is a manually curated, multi-task benchmark that evaluates whether models can apply commonsense reasoning in the context of six real-world NLP tasks. It is constructed with a multi-stage data collection pipeline that rewrites examples from existing datasets using commonsense-violating perturbations. The study reveals a significant performance gap between NLP systems and humans on CRoW, indicating that commonsense reasoning is far from solved in real-world task settings.
This page was last edited on 2024-02-20.