This study combines three interrelated interdisciplinary subprojects, each with its own objectives, methods, and results.
The goal of the first subproject, "Social Network Sites, Social Capital, and Privacy", is to examine the factors influencing the accumulation and perception of social capital in online social networks.
Empirical base: The study uses two types of data in two corresponding datasets: server-level data, representing natural user behavior, and self-reported survey data. The server-level data is a sample of users of the Vkontakte social network site who declare Vologda as their city; it includes wall activity, friendship relations, and metadata (gender, age, and so on). The final server-level sample contained 193,335 users and 9,800,107 edges in the friendship network. The survey sample contained 375 completed responses from Vologda users.
Methods: data collection: automated scraping via the official Vkontakte API and an online survey delivered through a questionnaire application. Analysis: social network analysis, descriptive statistics, correlation analysis, and OLS and logistic regression.
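A minimal sketch of this kind of collection via the official Vkontakte API is given below. The token is a placeholder; the `friends.get` method, the `fields` list, and API version `5.131` follow the public VK API, while the pagination and rate-limit handling a real crawler needs are omitted.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.vk.com/method/friends.get"

def build_params(user_id, access_token="YOUR_TOKEN"):
    """Assemble parameters for the VK friends.get method; the requested
    fields correspond to the metadata used in the study (gender, age, city)."""
    return {
        "user_id": user_id,
        "fields": "sex,bdate,city",
        "access_token": access_token,  # placeholder: a real token is required
        "v": "5.131",
    }

def fetch_friends(user_id, access_token):
    """Download one user's friend list as a list of dicts."""
    url = API_URL + "?" + urllib.parse.urlencode(build_params(user_id, access_token))
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp).get("response", {}).get("items", [])
```

Iterating `fetch_friends` over all users located in a given city yields the edge list of the local friendship network.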
Results: the number of joined online communities, the share of others' posts and the number of likes on the user's wall (as instances of relationship maintenance and social grooming behavior), and the number of shared images (compensating for the lack of social context) are statistically significantly (though not strongly) related to indicators of structural social capital in the local friendship network: degree centrality, Burt's network constraint index, and the local clustering coefficient. At the same time, the indicators of structural social capital on the social network site are almost unrelated to perceptions of social capital. Finally, users who protect their profiles with privacy settings tend to seek new social contacts, participate in online communities, and engage in online relationship maintenance behavior to a greater extent.
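The three structural indicators named above can be sketched in plain Python on a toy friendship network stored as an adjacency dict (illustrative only; the study computed them on the full 193,335-node network):

```python
def degree(adj, i):
    """Degree centrality: the number of friendship ties of node i."""
    return len(adj[i])

def local_clustering(adj, i):
    """Local clustering coefficient: the share of realized ties among i's friends."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))

def burt_constraint(adj, i):
    """Burt's network constraint index for an unweighted, undirected graph.
    p(a, b) is the proportion of a's ties invested in b (= 1/degree(a))."""
    nbrs = adj[i]
    if not nbrs:
        return 0.0
    p = lambda a, b: 1.0 / len(adj[a]) if b in adj[a] else 0.0
    total = 0.0
    for j in nbrs:
        indirect = sum(p(i, q) * p(q, j) for q in nbrs if q != j)
        total += (p(i, j) + indirect) ** 2
    return total
```

For a closed triad (three mutual friends) each node has degree 2, clustering 1.0, and constraint 1.125, the textbook value for a fully constrained ego.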
Recommendations for implementation: The results of this study could be applied in marketing research for detecting opinion leaders in the networks of local communities and for identifying brokers who bridge separated social groups in urban settings.
The objective of the second subproject, "News coverage of the Ukrainian crisis on the websites of Russian and Ukrainian TV channels: comparative analysis 2013 - 2014", is to identify and explain differences and similarities in agenda setting on the websites of Russian and Ukrainian channels.
Empirical base: a collection of news texts from the websites of the First Channel (Russia) and the Fifth Channel (Ukraine) covering the period from September 1, 2013 to September 1, 2014. The collection contained 44,989 texts: 20,025 from the Fifth Channel and 24,964 from the First. Text preprocessing included the following steps: automatic translation of the Ukrainian texts into Russian, lemmatization (MyStem), and stop-word removal.
Methods: data collection: directed parsing of the TV channels' websites. Analysis: LDA topic modeling with Gibbs sampling (5 solutions with 100 topics each). Topical similarity was measured by the Kullback–Leibler divergence.
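The similarity measure can be sketched as follows for two topic-word distributions over a shared vocabulary. The epsilon smoothing and the symmetrized variant are common conventions, assumed here rather than taken from the subproject itself:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same vocabulary.
    eps guards against zero probabilities (a common smoothing choice)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """Symmetrized variant often used to compare topics: KL(p||q) + KL(q||p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```

Plain KL is asymmetric (it depends on which channel's topic is taken as the reference), which is why the symmetrized form is frequently preferred when matching topics between two models.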
Results: Differences in the news agendas of the Russian and Ukrainian TV channels were revealed; each channel kept silent on some unique topics. The Fifth Channel kept silent on topics concerning refugees from the areas of armed clashes in south-eastern Ukraine and the activity of the «Right Sector». The First Channel kept silent on the topics of international sanctions, the 2014 Ukrainian presidential election campaign, the release of Yulia Tymoshenko, and problems with gas supply to Ukraine. These silenced topics reflect the political interests of the conflicting countries. Thus, we cannot conclude that either channel (Russian or Ukrainian) is more objective than the other.
Recommendations for implementation: Topic modeling could be used in the development of software applications for automatic content analysis (monitoring) of news media and for agenda and framing detection in news.
The objective of the third subproject, "Optimization of topic models for analysis of Internet content", was to develop a thermodynamic approach to determining the optimal number of topics in a mixture of distributions and to examine the hypothesized relation between the stability of topic modeling and the choice of the optimal number of topics.
Empirical base: two datasets, in Russian and English. The Russian dataset is a collection of 101,481 posts from the LiveJournal social media platform. The English collection is the well-known '20 Newsgroups' dataset (News20), consisting of 15,404 posts. Text preprocessing included the following steps: lemmatization, stop-word removal, and conversion to 'crc32' format.
Methods: data collection: the Russian dataset was collected in previous projects; the English dataset was downloaded from an open data source. Analysis: topic modeling based on four algorithms (pLSA, LDA (E-M algorithm), LDA and GLDA (Gibbs sampling)) with variation of the 'number of topics' parameter; computation of the free energy and two versions of entropy (Renyi and Tsallis) for each model and each chosen number of topics; calculation of the Jaccard coefficient for each topic model as a function of the 'number of topics' parameter.
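The stability measure can be sketched as the Jaccard coefficient between two topics represented by their top-word sets. Comparing matched topics from two runs by top-word overlap is a common convention and an assumption here, not necessarily the subproject's exact procedure:

```python
def jaccard(top_words_a, top_words_b):
    """Jaccard coefficient between two topics, each given as its set of
    top words (the top-word representation is an assumption here)."""
    a, b = set(top_words_a), set(top_words_b)
    return len(a & b) / len(a | b) if a or b else 1.0
```

A stable solution is one where re-running the model with the same number of topics yields topic pairs with Jaccard coefficients close to 1.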
Results: This is an important step towards improving the topic modeling algorithms needed for the analysis of large collections of Internet texts ('Big Social Data'). An entropy approach to the analysis of complex text systems is formulated on the basis of Renyi and Tsallis entropy, accounting for zones of semantic stability. The developed approach allows determining the optimal number of topics in topic models: the optimal number corresponds to the minimum of non-extensive entropy. The study shows that GLDA topic models and the E-M algorithm yield similar results in terms of the global minimum of non-extensive entropy. In addition, GLDA topic models exhibit additional local minima that may be of interest for sociological analysis.
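A minimal sketch of the selection criterion: the non-extensive entropies of order q, computed over a probability distribution derived from each candidate model, with the optimal number of topics taken as the argmin. The subproject's exact free-energy formulation is not reproduced here; these are the generic textbook forms of the Renyi and Tsallis entropies, and deriving each distribution from the word-topic matrix is an assumption:

```python
import math

def renyi_entropy(probs, q=2.0):
    """Renyi entropy of order q (q != 1): ln(sum p_i^q) / (1 - q)."""
    s = sum(p ** q for p in probs if p > 0)
    return math.log(s) / (1.0 - q)

def tsallis_entropy(probs, q=2.0):
    """Tsallis (non-extensive) entropy of order q (q != 1): (1 - sum p_i^q) / (q - 1)."""
    s = sum(p ** q for p in probs if p > 0)
    return (1.0 - s) / (q - 1.0)

def optimal_num_topics(candidates, q=2.0):
    """Choose the number of topics whose distribution minimizes Renyi entropy.
    `candidates` maps a candidate topic count to a probability distribution."""
    return min(candidates, key=lambda t: renyi_entropy(candidates[t], q))
```

For the uniform distribution both entropies reach their maximum (the Renyi entropy equals ln n for any q), so sharper, more concentrated solutions score lower and are preferred by the argmin rule.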
Degree of implementation: The obtained results were reported at the Yandex School of Data Analysis (Moscow) on October 4, 2017 as a first step towards implementing the results in industry practice.