Practical Deployment of End-to-End Machine Learning Pipeline: Analyzing Sentiments in Sephora Beauty Product Reviews

Student: Pavlova Ol`ga

Educational Programme: Master of Data Science (Master)

Year of Graduation: 2024

This thesis presents the practical deployment of an end-to-end machine learning pipeline for sentiment analysis of customer reviews of Sephora beauty products. The pipeline comprises three iterative phases: data ingestion and transformation, model training, and deployment of the trained models as a web service using cloud technology. The data ingestion phase automates the collection and storage of large datasets (up to 4 million text reviews with metadata) through web scraping and establishes a database architecture. Statistical analysis showed significant variation in the distribution of the target variable across review types and product categories, so 8 subsets were formed, each split into train, validation, and test parts with equal target proportions to allow a fair comparison. During the model training phase, 17 experiments were conducted using both simple and advanced algorithms. The DistilBERT model was identified as the most efficient, balancing technical constraints and predictive accuracy, while the Naïve Bayes model served as a lightweight alternative, although not the most accurate one. Using the review title, an important summary and source of sentiment, together with the review text significantly improved model accuracy. Challenges in sentiment differentiation, particularly in the “Fragrance” category, and limitations in linguistic processing are discussed. The deployment phase involved creating a web-service architecture with frontend and backend components for both the main and the fallback solutions. The main application uses the DistilBERT model for predictions, and the fallback application uses the Naïve Bayes model. Key technological implementations include Docker image build optimization, deployment to a Kubernetes cluster in the cloud, and use of a load balancer for error handling with a default fallback. A CI/CD pipeline ensures reliable code updates.
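
The idea of splitting a subset into train, validation, and test parts with equal target proportions can be illustrated with a minimal sketch. This is not the thesis code: the function name `stratified_split`, the 80/10/10 fractions, and the toy labels are assumptions made only for illustration, since the summary does not state the actual split sizes or tooling.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists that keep per-class proportions.

    `fractions` is a hypothetical 80/10/10 split; the thesis summary
    does not specify the real proportions.
    """
    rng = random.Random(seed)

    # Group example indices by their target label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * fractions[0])
        n_val = round(len(idxs) * fractions[1])
        # Each class contributes the same share to every part;
        # whatever remains after train and val goes to test.
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


# Toy data (assumed): 80 positive and 20 negative reviews.
labels = [1] * 80 + [0] * 20
train, val, test = stratified_split(labels)
```

Splitting each class separately is what keeps the share of positive reviews the same in all three parts, so differences in model scores between subsets are not caused by different class balances.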
