Quotebank

Corpus of quotations from a decade of news

Quotebank is a large-scale dataset of 235 million quotations extracted from 162 million English news articles published between 2008 and 2020. Speakers are attributed using an entity-linking pipeline tied to Wikidata and a probabilistic model trained on Wikipedia-derived supervision. The result is a quotation-to-speaker corpus suitable for research in NLP, social science, and computational journalism.

Natural Language
Maturity
Support
C4DT: Inactive
Lab: Unknown

Data Science Lab

Prof. Robert West

Our research aims to make sense of large amounts of data. The data we analyze is frequently collected on the Web, for example from server logs, social media, wikis, online news, and online games. We distill heaps of raw data into meaningful insights by developing and applying algorithms and techniques in areas including social and information network analysis, machine learning, computational social science, data mining, natural language processing, and human computation.

This page was last edited on 2024-04-16.