The goal of the first subproject—Social Networks as a Space for Social Capital Formation—is to study the influence of user behavior, privacy concerns, and propensity to make connections on users' perceived social capital.
The study is based on two types of data, each with its own dataset: 1) server-level data, representing natural user behavior, and 2) self-reported survey data. The server-level data are a sample of users from VK, an online social network, with a declared location in the city of Vologda, and include data on user activity on 'walls', friendship relations, and metadata (gender, age, etc.). The server-level sample contains 193,335 users and 9,800,107 edges in the friendship network. The survey dataset contains complete responses from a representative sample of 375 VK users from Vologda.
Methods: The user data were collected with VKMiner, software developed at the Laboratory for Internet Studies. The survey was conducted through a specially designed application for the VK website. The analysis was performed with various statistical methods (multiple regression analysis, correlation analysis, etc.) implemented in R, an environment for statistical computing (the stats, igraph, mediation, lavaan, sem, and semTools packages).
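The regression step can be illustrated with a minimal sketch. The original analysis was carried out in R; the Python version below uses simulated data and invented variable names, and only shows the shape of the model (perceived social capital regressed on propensity to connect and online relationship maintenance).

```python
import numpy as np

# Hypothetical illustration with simulated data: regress perceived social
# capital on propensity to connect and relationship maintenance online.
rng = np.random.default_rng(0)
n = 375  # matches the survey sample size reported above
propensity = rng.normal(size=n)
maintenance = 0.6 * propensity + rng.normal(scale=0.5, size=n)
social_capital = 0.4 * propensity + 0.5 * maintenance + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column; ordinary least squares via lstsq.
X = np.column_stack([np.ones(n), propensity, maintenance])
beta, *_ = np.linalg.lstsq(X, social_capital, rcond=None)
print(beta)  # [intercept, b_propensity, b_maintenance]
```

With the simulated coefficients above, the fitted slopes recover the true effects of propensity and maintenance up to sampling noise; the actual study additionally estimated mediation paths with the R mediation and lavaan packages.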
Results: The study shows the importance of the propensity to make connections for predicting users' social capital, with online relationship maintenance acting as a mediating factor. We found no supporting evidence for the idea that privacy concerns influence social capital; they only influence the informational dimension of privacy (who can see users' profiles and messages). Another finding is that younger people and people with higher self-esteem tend to perceive their bridging social capital as higher.
Recommendations: The scales we used, imported from Facebook studies, proved reliable and can be used in studies of Russian-language online social networks. The effect size and significance of predictors in social capital modeling are sensitive to sample features, as we found that users' social capital depends on age and on the purposes of using the VK website.
The objective of the second subproject—Validation of Sentiment Lexicon for Analysis of Socio-Political Messages of Social Media Users—is to compare the quality of two Russian-language lexicons on a socio-political document collection sampled from user posts on social media. The empirical basis of the subproject consists of (1) a large database of LiveJournal blog posts and their comments by the top 2,000 users for the period March 2013 to March 2014, and (2) a sample of public user messages collected from all Russian social media by the commercial aggregator IqBuzz for the period January 2014 to December 2015.
Methods: Lexicon validation and comparison were conducted in a series of experiments with machine learning algorithms (SVM, KNN, and Naïve Bayes) as well as one rule-based algorithm (SentiStrength). The results were analyzed in terms of algorithm performance on the sentiment classification task, using macro-F1, accuracy, precision, and recall metrics.
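The experimental set-up can be sketched as follows. This is not the subproject's actual pipeline or data: the tiny corpus below is invented, the evaluation is in-sample, and scikit-learn defaults stand in for whatever configurations the experiments used; the sketch only shows how the three classifier families are compared on macro-F1.

```python
# Hypothetical sketch: train SVM, KNN, and Naive Bayes classifiers on a toy
# corpus and score each with macro-F1, mirroring the comparison described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

texts = ["good reform", "bad policy", "great decision", "terrible law",
         "good law", "bad reform", "great policy", "terrible decision"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(texts)
scores = {}
for clf in (LinearSVC(), KNeighborsClassifier(n_neighbors=3), MultinomialNB()):
    pred = clf.fit(X, labels).predict(X)  # in-sample only, for illustration
    scores[type(clf).__name__] = f1_score(labels, pred, average="macro")
print(scores)
```

In the real experiments the same metric (macro-F1, alongside accuracy, precision, and recall) was computed for each algorithm–lexicon combination on the Russian socio-political collections.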
Results: The experiments show that, on average, the rule-based algorithm outperforms every machine learning (ML) algorithm for sentiment classification of socio-political messages in Russian. LINIS-CROWD (PolSentiLex), developed by the Laboratory for Internet Studies, is better than RuSentiLex, showing higher or comparable results while being only a third of RuSentiLex's size.
Recommendation: The results suggest that for sentiment analysis of socio-political messages on Russian-language social media, given the available resources, a social scientist is better off using a rule-based algorithm with LINIS-CROWD (PolSentiLex). Along with being more accurate, this combination is cheaper and more accessible than any ML solution or proprietary service, and much faster, achieving better results with a smaller lexicon and without costly statistical computation.
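The rule-based approach can be illustrated with a toy scorer in the spirit of SentiStrength: each token contributes its lexicon weight and the message is classified by the sign of the total. The mini-lexicon and thresholding below are invented for illustration and do not reproduce SentiStrength's actual rules or the LINIS-CROWD weights.

```python
# Toy lexicon-based (rule-based) sentiment classifier. The lexicon maps
# words to polarity weights; a message's score is the sum over its tokens.
lexicon = {"good": 2, "great": 3, "bad": -2, "terrible": -3}

def score(text):
    """Sum the lexicon weights of all tokens; unknown words contribute 0."""
    return sum(lexicon.get(tok, 0) for tok in text.lower().split())

def classify(text):
    s = score(text)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"

print(classify("a great reform"))   # positive
print(classify("terrible policy"))  # negative
```

This also shows why the approach is cheap: classification is a dictionary lookup per token, with no model training or statistical computation.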
The goal of the third subproject—Estimating the Effect of Semantic Stability and the Topic Number Choice on Topic Modeling of Internet Content—is to develop a generalized version of the thermodynamic approach to choosing the optimal number of topics for topic modeling, one that accounts for the semantic stability of topic models. The empirical basis of the subproject consists of two publicly available document collections, in Russian and English, each with a known number of topics.
The subproject is realized through the development of the necessary mathematical formulations as well as through computational experiments: (1) a mathematical formulation of the thermodynamic approach to choosing the optimal topic number, based on the Jaynes formalism and the two-parameter entropy; (2) a mathematical analysis of the limitations of the thermodynamic approach with respect to different definitions of average values; (3) computational experiments estimating the two-parameter entropy of various topic models on the two datasets.
Results: The two-parameter Sharma-Mittal entropy in the '2-q' formulation makes it possible, on the one hand, to find the entropy minimum by varying the deformation parameter q = 1/T and, on the other hand, to account for the semantic aspect of topic models by using the Jaccard coefficient as the second deformation parameter r. We formulated an algorithm for finding stability zones and optimal parameters of topic models. The criterion for choosing the optimal topic number is the minimum of the Sharma-Mittal entropy in the '2-q' formulation, which corresponds to the minimum of the Renyi entropy. The criterion for finding zones of semantic stability is likewise the entropy minimum: entropy peaks correspond to low Jaccard values, which indicate a loss of semantic stability. Thus, the stability zone should be chosen based on the minimum of the two-parameter entropy.
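For concreteness, the quantity being minimized can be sketched from the standard definition of the Sharma-Mittal entropy. The subproject's specific '2-q' parameterization and the use of the Jaccard coefficient as the parameter r are not reproduced here; the code below implements only the textbook two-parameter entropy of a discrete distribution.

```python
import numpy as np

# Standard Sharma-Mittal entropy of a discrete distribution p with
# deformation parameters q and r (q != 1, r != 1). In the limit r -> 1 it
# reduces to the Renyi entropy of order q; as both q, r -> 1 it reduces
# to the Shannon entropy.
def sharma_mittal_entropy(p, q, r):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()  # normalize to a probability distribution
    s = np.sum(p ** q)
    return (s ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

# Sanity check: for a uniform distribution over 8 states, with q and r near 1
# the value should be close to the Shannon entropy ln(8) ≈ 2.079.
print(sharma_mittal_entropy(np.full(8, 1 / 8), q=0.5, r=0.999))
```

In the subproject's approach, an entropy of this family is computed for topic models with different topic numbers, and the minimum over that curve marks the optimal model.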
Recommendations for implementation: For the analysis of large document collections in the social sciences, it is important to choose the right number of topics to look for in a dataset. It is therefore necessary to use an algorithm for finding the optimal topic number and the zones of semantic stability. Our results show that the Sharma-Mittal entropy in the two-parameter formulation is the best quality measure for topic models and should be used in such algorithms.