The first sub-project, “Mapping ethnic attitudes in the Russian language LiveJournal with advanced topic modeling”, investigates representations of ethnic groups in Russian blogs and addresses the stability of the topic modeling algorithm used for mining ethnic discourse. The second sub-project, “Structure of communities in social networking websites”, studies a range of network, structural and content features of certain types of groups in the VKontakte social network, including professional groups of software developers, groups of the St. Petersburg observers social movement and anti-medical groups. The third sub-project, “Online recommender systems: analysis of publications and new developments”, focuses on the development of recommender systems for sparse data and on publications in this sphere.
Research goals. The first sub-project maps the attitudes of Russian-language bloggers towards various ethnic groups; it also optimizes the stability of a topic modeling algorithm aimed at finding ethnicity-related topics. The second sub-project aims at finding the relationship between the network structure of communities in the VKontakte SNS and the socio-demographic and other properties of these communities. The third sub-project develops new algorithms for recommender systems and analyses the latest publications in this sphere.
Empirical base of the research. The empirical base of the first sub-project includes: (a) 363,579 posts by the top 2,000 users of the LiveJournal blogging platform according to the social capital rating; time period: 11 weeks from February 4 to May 19, 2013; 990 posts selected for manual analysis; (b) a dataset of 101,481 posts for testing regularizers for the Latent Dirichlet Allocation topic modeling algorithm. The empirical base of the second sub-project consists of: (a) 11 groups of software developers in the VKontakte SNS with over 10,000 users, including one group of 15,451 users selected for in-depth analysis; (b) 17 district groups of St. Petersburg observers and one all-city group, whose data were collected at 16 time points, with over 13 thousand users in total; (c) a 2.0 level hyperlink ego-network of an AIDS-denialist group in the VKontakte SNS, consisting of 11 groups. The empirical base of the third sub-project contains: (a) a dataset from the FMhost online radio broadcasting service consisting of 4,266 users, 3,618 tags, 2,209 radio stations and 4,165 tracks; (b) an automatically generated collection of research papers about recommender systems drawn from the top 18 relevant conferences.
Sub-project 1. Two types of ethnic groups are discussed most intensively in blogs: most often, distant “geopolitical foes” (e.g. Americans) and, less often, proximate but socially problematic groups (e.g. Tajiks). Three quarters of the texts discuss ethnic groups in either political or cultural / ritual contexts, and the former prevails. Some nations are discussed with high probability in one particular context, while others are associated with another. The five most negatively described nations are “Caucasian”, Tajik, Dagestani, American and African/Negro. Tajik and Chechen are also among the ethnicities most often described as inferior. Dagestani, American, British and Caucasian are among the six described as most dangerous; Dagestani, American, British, German and Chechen are among the six described as most alien. It has also been found that already in winter and early spring 2013, well before the Ukrainian crisis, two Ukrainian topics were present in the blogosphere that included all the main parties, characters and problematic points of the future conflict.
The study of the stability of three topic modeling algorithms has shown that the proposed method of granulated sampling leads to the highest increase in the number of stable topics as compared to the other two algorithms: to 135 of 200 against 84 and 135, when measured with the normalized Kullback-Leibler metric. It also gives a much higher value of the Jaccard index (0.6 against 0.3).
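To illustrate how stability between two runs of a topic model might be measured with the two metrics named above, here is a minimal sketch in Python. It is not the project's implementation: the function names, the normalization (one minus the symmetric Kullback-Leibler divergence divided by the largest divergence observed between any pair of topics), the 0.9 stability threshold and the top-word cutoff are all assumptions made for the example.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence between two word distributions."""
    # Smooth and renormalize so that zero probabilities do not break the logarithm.
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sum_p, sum_q = sum(p), sum(q)
    p = [x / sum_p for x in p]
    q = [x / sum_q for x in q]
    return 0.5 * sum(a * math.log(a / b) + b * math.log(b / a)
                     for a, b in zip(p, q))


def normalized_kl_similarity(run_a, run_b):
    """Similarity matrix in [0, 1] between topics of two runs.

    Assumed normalization: 1 - KL / max(KL) over all topic pairs.
    Each run is a list of topics; each topic is a list of word probabilities.
    """
    kl = [[symmetric_kl(a, b) for b in run_b] for a in run_a]
    max_kl = max(max(row) for row in kl)
    if max_kl == 0:
        return [[1.0 for _ in row] for row in kl]
    return [[1.0 - value / max_kl for value in row] for row in kl]


def count_stable_topics(run_a, run_b, threshold=0.9):
    """Count topics of run_a that have a close counterpart in run_b.

    The threshold of 0.9 is a hypothetical choice for this sketch.
    """
    similarity = normalized_kl_similarity(run_a, run_b)
    return sum(1 for row in similarity if max(row) >= threshold)


def jaccard_top_words(topic_a, topic_b, top_n=10):
    """Jaccard index of the sets of top_n most probable word indices."""
    def top(topic):
        order = sorted(range(len(topic)), key=lambda i: topic[i], reverse=True)
        return set(order[:top_n])

    top_a, top_b = top(topic_a), top(topic_b)
    return len(top_a & top_b) / len(top_a | top_b)


# Toy example: the second run contains the same two topics in swapped order.
run_1 = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
run_2 = [[0.1, 0.2, 0.7], [0.7, 0.2, 0.1]]
stable = count_stable_topics(run_1, run_2)
overlap = jaccard_top_words(run_1[0], run_2[0], top_n=2)
```

In this toy case both topics of the first run find an identical counterpart in the second, so both count as stable, while the Jaccard index compares the top words of one specific pair of topics. With real models the threshold and the number of top words would need to be chosen to match the study's own definitions.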
Sub-project 2. The research has revealed that the district groups of St. Petersburg observers did not emerge as independent movements, but are branches of the all-city movement, albeit affiliated with it to a varying degree. Their activity peaks at the very start, during the 2011-2012 national electoral cycle; however, those groups that were “alive” from the beginning do not die, but stabilize. Group moderators, that is, movement leaders, set the agenda, while the community expresses opinions on it (comments) and approval / solidarity (likes). Group size depends on the number of moderators’ posts, but not on the number of individual posts. This indicates the central role of leadership in a group’s success. Offline leaders of the movement are well predicted by their online properties, in particular by their centrality in the overall friendship network, the number of district groups they belong to, and the volume of feedback they receive.
It has also been found that no ties in the studied professional community of software developers emerge on the basis of users’ geolocation; this confirms the hypothesis that geo-independent communities exist online. The study of the ego-network of the AIDS-denialist movement has not found sufficient evidence that it is part of a broader anti-medical movement.
Sub-project 3. Three new algorithms for recommender systems with tags have been developed, including TagLDA. Experiments have shown that they perform better than traditional algorithms on relatively small sparse datasets. An overview of results and trends in the sphere of recommender systems has been prepared based on the latest relevant publications. Software that trains the TagLDA algorithm has been developed.
Implementation of research results. Algorithms tested in the first sub-project are implemented in software that is used in other projects of the Laboratory for Internet Studies. The recommender algorithms may be used in any small-size commercial recommender system. The methods of analysis of ethnic discourse may be used for mapping other types of user attitudes and thus serve as an analytical base for policies in the relevant areas.