This study combines three interdisciplinary subprojects that are related to one another but each have their own objectives, methods, and results.
The goal of the first subproject, "Behaviour of Vkontakte Users: a Regional Analysis Based on Improved Big Data Methods", is to establish the relations between network, textual, and sociodemographic characteristics of users or groups of users and track their regional differences and similarities.
Empirical base: a random sample of users of the Vkontakte social network who declare their location to be in Russia; it includes data on "wall" activity, texts, friendship data, and metadata (gender, region, and so on). The sample contained 7827384 statuses from the "walls" of 42459 users from 69 regions of Russia.
Methods: data collection: automated scraping via the official API. Analysis: descriptive statistics, correlational and regression analysis, topic modeling of stable topics on a subsample of texts from 36396 users, with each user's texts merged into a single document.
Results: the majority of content in the social network relates to mundane or recreational topics and does not touch on sensitive social issues, with the exception of the Ukrainian conflict and religious topics (Christianity and Islam). The most important differences in content depend on the user's gender, and these differences reproduce gender stereotypes ("female" topics include cooking, beauty, and children; "male" topics include soccer and politics). Differences in topical content, activity, and user metadata due to the user's region and town size are almost nonexistent, which indicates that online behaviour is territorially and geographically independent. However, at the individual level the rates of activity, feedback, and connectivity between users differ greatly and are distributed according to a power law. These results have direct implications for the methodology of constructing samples from social networks.
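The power-law claim above can be checked on any per-user count (posts, likes, friends) by estimating the tail exponent. The sketch below is only an illustration: it uses the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)), and it is an assumption that this matches the fitting procedure used in the study. The function names and the cutoff `x_min` are hypothetical.

```python
import math
import random


def powerlaw_alpha_mle(values, x_min):
    """Continuous maximum-likelihood estimate of the power-law exponent.

    Only observations at or above x_min (the assumed start of the
    power-law tail) are used.
    """
    tail = [v for v in values if v >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("all tail observations equal x_min; exponent undefined")
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(alpha, x_min, n, seed=0):
    """Draw n synthetic values from a continuous power law (inverse transform)."""
    rng = random.Random(seed)
    # 1 - u lies in (0, 1], so the power below is always defined.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]
```

On synthetic data drawn with `sample_powerlaw(2.5, 1.0, 20000)`, the estimate returned by `powerlaw_alpha_mle` should land close to 2.5; on real activity counts, the choice of `x_min` strongly affects the result and would need separate justification.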
The objective of the second subproject, ``Inter-Country Comparison of the Influence of Internet Consumption on Protest Behaviour'', was to establish a connection between the consumption of online news and participation in peaceful street protests.
Empirical base: approximately 50 thousand respondents from 49 countries in the sixth wave of the international World Values Survey project, collected in 2011-2014, and country-level parameters from official international sources (e.g., World Bank).
Methods: data collection: the database is freely available. Analysis: multilevel logit and probit regression.
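For orientation, a multilevel logit model for respondent i in country j can be written as below. This is a generic random-intercept form given as an assumption; the source does not state the exact specification. The symbols are illustrative: x_{ij} is online news consumption, z_{ij} the individual-level controls, w_j the country-level parameters, and u_j the country random effect.

```latex
\[
\Pr(y_{ij} = 1) = \operatorname{logit}^{-1}\!\left(
  \beta_0 + \beta_1 x_{ij}
  + \boldsymbol{\gamma}^{\top} \mathbf{z}_{ij}
  + \boldsymbol{\delta}^{\top} \mathbf{w}_{j}
  + u_j
\right),
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
\]
```

The probit variant replaces the inverse logit with the standard normal cumulative distribution function.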
Results: reading news online is statistically significantly (but not strongly) positively related to the probability of participating in protests in all countries, and there is no country where the relation turns out to be negative. The effect of online news is robust to a number of control variables; it is stronger than the effect of receiving news from friends and from newspapers, but weaker than the effect of interest in politics. Moreover, people who combine interest in politics with reading news online are significantly more inclined to protest than people who are only interested in politics or only consume news on the Web.
The objective of the third subproject, ``Developing Probabilistic Models for Natural Language Processing and Internet User Behaviour'', was to develop new probabilistic models and algorithms for processing texts generated by Internet users and to model behaviour and ranking in the context of Internet users.
Empirical base: this project analyzes the properties of new algorithms. Topic modeling algorithms were tested on a collection of blog posts from the LiveJournal social network, in total 101481 posts and 172939 unique words. Ranking algorithms were tested on the data on team tournament results in the form of vector representations of players, in total 680 tournaments and approximately 50000 participants.
Methods: data collection: automated scraping via the official API. Analysis: new topic models were tested for stability with normalized Kullback-Leibler divergence and Jaccard similarity; for quality (topic coherence), with AUC and tf-idf coherence. The predictive power of the new ranking algorithm based on training a deep neural network was evaluated by comparing the predictions with real results of tournaments.
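The two stability measures named above can be sketched as follows: Jaccard similarity compares the top-word sets of two topics, and Kullback-Leibler divergence compares their word distributions. This is a minimal illustration under stated assumptions: the source does not specify how the divergence is normalized, so the sketch uses a plain symmetrized KL divergence with additive smoothing, and all function names and the `eps` parameter are hypothetical.

```python
import math


def jaccard(top_words_a, top_words_b):
    """Jaccard similarity between the top-word sets of two topics."""
    a, b = set(top_words_a), set(top_words_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two topic-word distributions.

    p and q are sequences of probabilities over the same vocabulary.
    eps is an assumed smoothing constant that avoids log(0); the
    normalization used in the source is not reproduced here.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share one vocabulary")
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    z_p, z_q = sum(p), sum(q)
    p = [x / z_p for x in p]
    q = [x / z_q for x in q]
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)
```

A topic found in one run can then be called stable if some topic in another run is close to it, for example with a low `symmetric_kl` value or a high `jaccard` score; the thresholds are a design choice that the source does not report.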
Results: in experiments with topic modeling algorithms, the GLDA algorithm with a new regularizer based on defining local density functions was proposed; it led to significant improvements in topic modeling stability while preserving quality. It has also been established that existing regularizers can either somewhat improve stability (e.g., a regularizer in the form of Dirichlet distributions) or reduce it (e.g., a sparsity regularizer for the "documents-topics" matrix). In experiments with the new neural network architecture, after two to three training epochs the network achieved an accuracy of 65-69%, which remained at the same level in further training. This level corresponds to the basic TrueSkill model and is only slightly worse than a more complex Bayesian model developed previously, which suggests that the Bayesian rating problem can be solved without developing complicated Bayesian models with complex custom optimization algorithms.