Accurate attribution of quotes is essential. Quotes transmit information directly, bring stories to life, and play a critical role in accurate reporting. Extracting information from quotes can also provide valuable insight into public opinion and societal trends. However, attributing quotes correctly can be a complex task.
To tackle the challenge of quote attribution, researchers from UCL’s Centre for Doctoral Training in Data Intensive Science joined forces with The Guardian. Combining their expertise in deep learning and natural language processing, they explored the application of machine learning techniques, specifically coreference resolution, to accurately attribute quotes.
Coreference resolution is the task of grouping together all mentions in a piece of text that refer to the same entity. It is difficult for several reasons. An anaphor, such as a pronoun, is an expression that refers back to an earlier mention (its antecedent), and it becomes ambiguous when more than one candidate antecedent is plausible. The text also contains entities that are irrelevant to the quote, and these must be ruled out.
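The grouping step described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the collaboration: it assumes that some earlier step has already decided which pairs of mentions corefer, and it only shows how pairwise links become clusters. The mentions, the links, and the function name `cluster_mentions` are all hypothetical.

```python
from collections import defaultdict


def cluster_mentions(mentions, links):
    """Group mentions into coreference clusters from pairwise links (union-find).

    Assumes each mention string is unique; a real system would key on
    character or token offsets instead.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the cluster representative.
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving keeps chains short
            m = parent[m]
        return m

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for m in mentions:
        clusters[find(m)].append(m)
    # Order clusters by the first appearance of their earliest mention.
    return sorted(clusters.values(), key=lambda c: mentions.index(c[0]))


# Hypothetical mentions in document order, with hypothetical pairwise links.
mentions = ["Jane Smith", "the minister", "she", "the report", "it"]
links = [
    ("the minister", "Jane Smith"),
    ("she", "the minister"),
    ("it", "the report"),
]

clusters = cluster_mentions(mentions, links)
print(clusters)
```

Each inner list is one entity. The hard part of coreference resolution is producing the links, which is where the learned model comes in.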
Traditional rule-based methods alone are insufficient for this task. Machine learning techniques offer a more effective approach. By using language models, which are probability distributions over sequences of words, researchers can extract features and train a model to identify coreferent mentions.
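To make "a probability distribution over sequences of words" concrete, here is a toy count-based bigram model. It is only an illustration of the definition: the models used in the project are neural and far more capable, and the three-sentence corpus and function names below are invented for the example.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and word pairs, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])  # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def sequence_probability(sentence, unigrams, bigrams):
    """P(sentence) as a product of P(word | previous word), without smoothing."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen context: the toy model assigns zero probability
        probability *= bigrams[(prev, word)] / unigrams[prev]
    return probability


# Hypothetical miniature corpus.
corpus = [
    "the minister said yes",
    "the minister said no",
    "the report said no",
]
unigrams, bigrams = train_bigram(corpus)

p_seen = sequence_probability("the minister said no", unigrams, bigrams)
p_unseen = sequence_probability("the minister said maybe", unigrams, bigrams)
print(p_seen, p_unseen)
```

The sentence that resembles the training data receives a higher probability than the one containing an unseen word. Neural language models follow the same principle but learn it from far richer context than adjacent word pairs.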
In this collaboration, language models developed by ExplosionAI were employed. These models utilize word embeddings, which are mappings of words to points in a semantic space, to understand the contextual meaning of text. Training the language models involved manually labeling over a hundred Guardian articles to create a robust dataset for accurate attribution.
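The idea of words as points in a semantic space can be illustrated with cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors below are invented for the example. Real embeddings have hundreds of dimensions and are learned from data, and nothing here reflects the actual vectors in the models used by the project.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, hand-picked so related words point the same way.
embeddings = {
    "minister": [0.9, 0.1, 0.0],
    "politician": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

related = cosine_similarity(embeddings["minister"], embeddings["politician"])
unrelated = cosine_similarity(embeddings["minister"], embeddings["banana"])
print(f"minister vs politician: {related:.3f}")
print(f"minister vs banana:     {unrelated:.3f}")
```

In this toy space "minister" sits much closer to "politician" than to "banana". That kind of signal helps a model judge whether two different expressions could plausibly name the same entity.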
The successful application of AI and coreference resolution techniques in quote attribution provides exciting possibilities for the field of journalism. By leveraging the power of machine learning, news organizations can enhance the accuracy and reliability of their reporting, gaining a deeper understanding of public sentiments and societal shifts.
What is coreference resolution?
Coreference resolution is the task of grouping together all mentions in a text that refer to the same entity. It involves identifying the antecedent (the original mention of an entity) and the later expressions that refer back to it, known as anaphora. This is challenging because anaphoric expressions can be ambiguous and because the text also contains entities that are irrelevant to the quote.
Why is coreference resolution difficult?
Coreference resolution is complex because it requires linking ambiguous anaphora to unambiguous antecedents, which may be several sentences or even paragraphs away. Word choice and semantics also play a crucial role in understanding what a passage conveys, so grammar-based methods alone are rarely enough for accurate resolution.
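A small sketch shows why simple rules struggle. The heuristic below links every pronoun to the nearest preceding non-pronoun mention, a deliberately naive rule written for this illustration. It is not a baseline reported by the researchers, and the example sentence, token offsets, and function name are hypothetical.

```python
PRONOUNS = {"he", "she", "they", "it"}


def nearest_antecedent(mentions):
    """Link each pronoun to the closest preceding non-pronoun mention.

    `mentions` is a list of (token_offset, text) pairs in document order.
    """
    links = {}
    last_candidate = None
    for offset, text in mentions:
        if text.lower() in PRONOUNS:
            if last_candidate is not None:
                links[(offset, text)] = last_candidate
        else:
            last_candidate = (offset, text)
    return links


# Hypothetical sentence:
# "Jane Smith criticised the report by Tom Brown. 'It is flawed,' she said."
mentions = [
    (0, "Jane Smith"),
    (3, "the report"),
    (6, "Tom Brown"),
    (8, "It"),
    (11, "she"),
]

links = nearest_antecedent(mentions)
for pronoun, antecedent in links.items():
    print(pronoun[1], "->", antecedent[1])
```

Both pronouns are linked to "Tom Brown", yet a reader would resolve "It" to "the report" and "she" to "Jane Smith". Resolving them correctly requires knowing what the words mean, not only where they appear.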
How does AI help in coreference resolution?
AI, specifically language models, can use word embeddings and contextual meaning to identify coreferent mentions. Once trained on labeled examples, such a model can identify mentions that refer to the same entity and attribute quotes to the right speaker. This improves the accuracy and reliability of quote attribution in journalism.