Models and Methods for Extracting Named Entities in Legal Documents in Russian

Student: Bissenbayeva Sakura

Supervisor: Dmitry Ilvovsky

Faculty: Graduate School of Business

Educational Programme: Big Data Systems (Master)

Year of Graduation: 2020

LegalTech refers to technologies applied to a wide range of legal business tasks. Because language and law have always been closely connected, NLP has many applications in this field, such as document storage and processing and information retrieval. NLP in the legal domain is still developing, and there is a business need to automate lawyers' work, so the field is actively introducing NLP, machine learning and deep learning into its workflows. Named entity recognition (NER) models can process large volumes of legal text in minutes rather than days, which creates a daily need for NER systems. The NER model was built using Bidirectional LSTM-CRF and RuBERT (an adaptation of the BERT language model for Russian). In my thesis I analyzed the markup errors of this model, which recognizes Conceptual Units (that is, named entities) and their Attributes in election protocols. Within this framework, a method was developed to improve the model through the analysis and correction of markup errors. To meet the objectives of the thesis, the following steps were carried out:

• Data preprocessing. The data are legal documents in Russian; I used annotation guidelines created specifically for marking up Conceptual Units in legal texts.
• Exploring existing methods for extracting named entities from texts in several languages (English, German, Spanish and Russian).
• Analyzing the advantages and disadvantages of each method.
• Developing a new method based on the existing methods.
• Testing the new method on real data, comparing it with the baseline model, and evaluating the experimental results using NER evaluation metrics.
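The summary mentions NER evaluation metrics and the analysis of markup errors without giving details. As an illustration only, the sketch below computes entity-level precision, recall and F1 from BIO-tagged sequences and lists the entities the model missed or predicted spuriously. It assumes a BIO tagging scheme and exact-match scoring, which the summary does not confirm; the labels `ORG` and `DATE` and the function names `extract_spans` and `score` are hypothetical and not taken from the thesis.

```python
from typing import Dict, List, Set, Tuple

# (entity type, start token index, end token index, end exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes the last span
        # The open span continues only on an "I-" tag of the same type.
        inside = tag.startswith("I-") and tag[2:] == label
        if label is not None and not inside:
            spans.add((label, start, i))
            start, label = None, None
        # Open a new span on "B-", or leniently on an orphan "I-".
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return spans


def score(gold: List[List[str]], pred: List[List[str]]) -> Dict[str, object]:
    """Exact-match entity-level metrics plus lists of markup errors."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with the sentence index so spans stay distinct.
        gold_spans |= {(idx, *s) for s in extract_spans(g)}
        pred_spans |= {(idx, *s) for s in extract_spans(p)}
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "missed": sorted(gold_spans - pred_spans),    # in gold, not predicted
        "spurious": sorted(pred_spans - gold_spans),  # predicted, not in gold
    }


# Hypothetical example: the model finds the organization but misses the date.
gold = [["B-ORG", "I-ORG", "O", "B-DATE"]]
pred = [["B-ORG", "I-ORG", "O", "O"]]
result = score(gold, pred)
```

In this toy example the single predicted entity is correct, so precision is 1.0, while only one of the two gold entities is found, so recall is 0.5. The `missed` and `spurious` lists are one simple way to surface markup errors for manual review; a partial-match or type-only scheme would give different numbers.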

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.