Distributed Representation of Entity Mentions Within and Across Multiple Text Documents
Keywords:
Coreference Resolution, Cross-Document Coreference Resolution, Distributed Representation of Words, Information Extraction, Natural Language Processing

Abstract
Given the importance of entities as a source of information for several NLP applications, Cross-Document Coreference Resolution (CDCR) provides techniques for identifying textual mentions of entities and clustering co-referent mentions across multiple documents. In this context, prior works employ Knowledge Bases (KBs) as a structured information resource to enrich the context of mentions; however, these methods struggle with entities unknown to the KB, which affects the accuracy and performance of the task. Accordingly, this paper presents a new approach that improves on the state of the art by concentrating on the knowledge provided by the input text of the mentions, independent of any external knowledge resource. For this purpose, we first construct the context of each mention from the sequence of informative words around it (known as content-words). Furthermore, by abstracting the mention representation to a fixed-size vector using a neural technique for continuous word representation (i.e., Word2Vec), we reduce the computational cost of the mention co-reference sub-task. Experiments on two datasets show significant gains in both CDCR accuracy and run-time efficiency compared to the best prior methods.
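The idea of representing a mention by its surrounding content-words, abstracted to a fixed-size vector, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the context window size, and the toy embedding table (standing in for pre-trained Word2Vec vectors) are all assumptions made for the example.

```python
import numpy as np

def mention_vector(tokens, mention_idx, embeddings, window=3):
    """Average the embeddings of content-words in a window around a mention.

    tokens      : list of tokens in the document
    mention_idx : index of the mention's head token
    embeddings  : dict mapping word -> fixed-size numpy vector
    window      : tokens considered on each side of the mention (assumed value)
    """
    lo = max(0, mention_idx - window)
    hi = min(len(tokens), mention_idx + window + 1)
    vecs = [embeddings[t] for t in tokens[lo:hi] if t in embeddings]
    if not vecs:  # no known content-words: fall back to a zero vector
        return np.zeros(next(iter(embeddings.values())).shape)
    return np.mean(vecs, axis=0)

def cosine(u, v):
    """Cosine similarity between two mention vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 50-dimensional embedding table; in practice these would come from
# a Word2Vec model trained on a large corpus.
rng = np.random.default_rng(0)
vocab = ["president", "obama", "spoke", "senate", "barack", "addressed"]
emb = {w: rng.normal(size=50) for w in vocab}

doc1 = ["president", "obama", "spoke", "senate"]
doc2 = ["barack", "obama", "addressed", "senate"]
v1 = mention_vector(doc1, 1, emb)  # mention "obama" in document 1
v2 = mention_vector(doc2, 1, emb)  # mention "obama" in document 2
sim = cosine(v1, v2)  # high similarity is evidence of co-reference
```

Because every mention is reduced to one fixed-size vector, comparing two mentions costs a single vector operation regardless of how long their original contexts were, which is the source of the run-time saving the abstract refers to.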