Models Based on Credit Bureau Data

Student: Polyakov Alexander

Educational Programme: Financial Technology and Data Analysis (Master)

Final Grade: 8

Year of Graduation: 2021

This thesis examines modeling with alternative data sources and asks how credit bureau data can be useful to a client's business. It discusses the business formulation of the machine learning problem and its decomposition into the necessary steps, and gives examples of features, of the metrics that are optimized, and of the metrics used for evaluation. It shows which machine learning algorithm handles the task best, and describes the work pipeline, the problems solved along the way, model quality control, and the benefits of implementing the model for the business. In particular, it describes in detail how gradient boosting solves the problem of identifying target customers who have visited an online loan application form. The thesis presents every step of solving this problem, from formulation to implementation. The task raises several questions: how to match clients by an incomplete set of identifiers; how to increase the share of matched records using heuristics and additional information; how to avoid over-matching, which starts to degrade quality; and which features suit the task best, how to collect them, and how to select them. The thesis shows how to choose the best model and what role statistics plays in that choice. It describes how models should be validated to prevent drops in quality and which metrics should be monitored. It then describes the pilot that tested the model's usefulness and explains the difference between offline and online metrics. The purpose of the work is to demonstrate a solved problem on real tabular data, including the complexity of data collection and process construction, and to show the pros and cons of well-known metrics such as ROC AUC and of well-known models such as gradient boosting. The following technology stack is used: Python, SQL, Docker, Pandas, Scikit-Learn, LightGBM. The thesis also explains why this particular implementation of gradient boosting was chosen.
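
One of the problems named in the summary, matching clients by an incomplete set of identifiers without over-matching, can be illustrated with a minimal sketch. It is not the author's method: the identifier fields (`phone`, `email`, `passport`), the thresholds `min_agree` and `max_conflict`, and the sample records are all hypothetical and chosen only to show the trade-off between finding more matches and accepting wrong ones.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical identifier fields; the summary does not list the real ones.
ID_FIELDS = ("phone", "email", "passport")


def normalize(value: Optional[str]) -> Optional[str]:
    """Strip and lowercase a raw identifier; empty values become None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def match_score(visitor: Dict, record: Dict) -> Tuple[int, int]:
    """Count agreeing and conflicting fields among those present in both."""
    agree = conflict = 0
    for field in ID_FIELDS:
        a = normalize(visitor.get(field))
        b = normalize(record.get(field))
        if a is None or b is None:
            continue  # a missing identifier is neither evidence for nor against
        if a == b:
            agree += 1
        else:
            conflict += 1
    return agree, conflict


def find_match(visitor: Dict, records: List[Dict],
               min_agree: int = 1, max_conflict: int = 0) -> Optional[Dict]:
    """Return the record with the most agreeing fields that passes both
    thresholds, or None if no record qualifies."""
    best, best_agree = None, 0
    for record in records:
        agree, conflict = match_score(visitor, record)
        if agree >= min_agree and conflict <= max_conflict and agree > best_agree:
            best, best_agree = record, agree
    return best


# Hypothetical bureau-side records.
records = [
    {"id": 1, "phone": "79990001122", "email": "a@example.com"},
    {"id": 2, "phone": "79990003344", "email": None},
]

full = find_match({"phone": " 79990001122 ", "email": "A@Example.com"}, records)
partial = find_match({"phone": "79990003344", "email": "b@example.com"}, records)
risky = find_match({"phone": "79990001122", "email": "other@example.com"}, records)
```

Here `full` matches on two identifiers and `partial` on the only identifier both sides have, while `risky` is rejected because one identifier agrees and another conflicts. Raising `max_conflict` would increase the match rate at the cost of more false matches, which is the over-matching risk the summary mentions.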

Full text (added April 5, 2021)

Student theses at HSE must be completed in accordance with the University rules and the regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
