NORMA eResearch @NCI Library

Bounded Memory Coreference Resolution Using SpanBERT on Litbank Dataset

Taneja, Mandeep Kaur (2022) Bounded Memory Coreference Resolution Using SpanBERT on Litbank Dataset. Masters thesis, Dublin, National College of Ireland.


Abstract

One of the key tasks in Natural Language Processing (NLP) is Coreference Resolution (CR), which identifies and resolves the various references to an entity in a document. It attempts to find all linguistic expressions ("mentions") that refer to a single "entity". CR is a fundamental step in numerous NLP benchmark tasks, such as question answering, natural language inference, and named entity recognition. These benchmarks have shown large improvements in overall efficiency and accuracy with modern transformer-based BERT (Bidirectional Encoder Representations from Transformers) models. SpanBERT is an extension of the BERT model that predicts spans of text more accurately and is especially good at separating related but distinguishable entities (e.g., President and CEO). Although BERT and SpanBERT models perform well on short sentences, they have very large runtime, memory, and computational resource requirements for training and modelling on long documents, because they require keeping every token/entity in memory at all times. To address these resource requirements, this paper proposes storing only a limited number of tokens at any given time (a bounded memory architecture) and heuristically "forgetting" a previously tracked entity whenever a new token is introduced. Most CR research has used a classical dataset called OntoNotes. However, this dataset was created in 2012 and lacks annotations of sufficient quality for present-day use. Hence, this paper performs its analysis on the newer Litbank dataset, a collection of 100 classic literary novels that can be categorised as long text documents. This dataset is annotated and maintained using Automatic Content Extraction guidelines, making it a better choice than the OntoNotes dataset.
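
The bounded memory idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the class name BoundedEntityMemory, the capacity parameter, and the least-recently-updated eviction rule are all assumptions chosen for illustration, and entity ids are assumed to be given, whereas a real system would score each mention against the tracked entities with a model such as SpanBERT.

```python
from collections import OrderedDict


class BoundedEntityMemory:
    """Hypothetical tracker that keeps at most `capacity` entities in memory.

    Assumption: when the memory is full, the least recently updated entity
    is "forgotten". The abstract only says forgetting is heuristic; this
    particular rule is an illustrative choice.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.clusters = OrderedDict()  # entity id -> list of mention strings
        self.forgotten = []            # ids of evicted entities, in order

    def add_mention(self, entity_id, mention):
        """Attach a mention to an entity, evicting an old entity if needed."""
        if entity_id in self.clusters:
            # Known entity: extend its cluster and mark it as recently used.
            self.clusters[entity_id].append(mention)
            self.clusters.move_to_end(entity_id)
            return
        if len(self.clusters) >= self.capacity:
            # Memory is full: forget the least recently updated entity.
            old_id, _ = self.clusters.popitem(last=False)
            self.forgotten.append(old_id)
        self.clusters[entity_id] = [mention]


# Usage with made-up mentions and a memory of two entities.
memory = BoundedEntityMemory(capacity=2)
stream = [
    ("emma", "Emma"),
    ("knightley", "Mr. Knightley"),
    ("emma", "she"),
    ("harriet", "Harriet"),
]
for entity_id, mention in stream:
    memory.add_mention(entity_id, mention)

tracked = dict(memory.clusters)
```

With a capacity of two, the fourth mention introduces a third entity, so the entity that was updated least recently is dropped and the memory size stays constant regardless of document length.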

Item Type: Thesis (Masters)
Supervisors: Horn, Christian (email: UNSPECIFIED)
Subjects: P Language and Literature > PN Literature (General)
Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing
Divisions: School of Computing > Master of Science in Data Analytics
Depositing User: Tamara Malone
Date Deposited: 27 May 2023 10:52
Last Modified: 27 May 2023 10:52
URI: https://norma.ncirl.ie/id/eprint/6674
