
Visual-Linguistic Encoders for Multimodal Fashion Search

Student: Kravczov Denis

Supervisor: Elena Kantonistova

Faculty: Faculty of Computer Science

Educational Programme: Machine Learning and Data-Intensive Systems (Master)

Year of Graduation: 2025

This thesis investigates and develops visual-language encoder architectures for multimodal search in one e-commerce domain: fashion. The objective was to develop and compare approaches based on self-supervised learning (SSL) and supervised learning for image-based product retrieval using image-text pairs from the target Wildberries dataset. The SigLIP2 visual encoder and the USER-bge-m3 text encoder served as the base models, and the VISTA and MagicLens multimodal architectures were analyzed, implemented, and adapted. Key research directions included: 1) developing and evaluating methods for reducing the number of visual tokens output by the SigLIP2 encoder, to decrease computational complexity and memory requirements; 2) applying an optimized implementation of the CLIP loss function based on the Triton framework, which enables larger batch sizes; and 3) integrating a custom projection matrix into the MagicLens architecture to improve the alignment of representations from different modalities. Additionally, an SSL modification of the MagicLens architecture was proposed and evaluated for the image-to-image search task. The main result is that the proposed MagicLens modification with an integrated projection matrix achieved an mAP@10 of 0.9075 on the validation pipeline in the "image-to-image-text" search mode, surpassing the existing production model. The findings indicate that the investigated architectural solutions, together with methods for increasing batch size in contrastive learning, are promising for improving the quality of multimodal search.
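For readers unfamiliar with the reported metric, the following is a minimal sketch of how mAP@10 is commonly computed for a retrieval task. It is not taken from the thesis: the function names, the toy data, and the normalisation convention (dividing by the smaller of k and the number of relevant items) are all assumptions, and the thesis's validation pipeline may define the metric differently.

```python
from typing import Sequence, Set


def average_precision_at_k(ranked_ids: Sequence[str],
                           relevant_ids: Set[str],
                           k: int = 10) -> float:
    """AP@k for a single query.

    Assumes ranked_ids contains no duplicates and is ordered from the
    most to the least similar retrieved item.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(ranked_ids[:k], start=1):
        if item_id in relevant_ids:
            hits += 1
            # Precision measured at the rank of each relevant hit.
            precision_sum += hits / rank
    # Assumed convention: normalise by min(k, number of relevant items).
    return precision_sum / min(k, len(relevant_ids))


def mean_average_precision_at_k(all_ranked: Sequence[Sequence[str]],
                                all_relevant: Sequence[Set[str]],
                                k: int = 10) -> float:
    """mAP@k: the mean of AP@k over all queries."""
    scores = [average_precision_at_k(ranked, relevant, k)
              for ranked, relevant in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries with ranked retrieval results.
toy_ranked = [["a", "b", "c"], ["x", "y"]]
toy_relevant = [{"a", "c"}, {"y"}]
toy_map = mean_average_precision_at_k(toy_ranked, toy_relevant, k=10)
```

In this toy example the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the metric rewards placing relevant products near the top of the ranking rather than merely including them in the top 10.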

Student theses at HSE must be completed in accordance with the University rules and the regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
