The project is multidisciplinary and has three parts with distinct yet complementary goals.
(1) To study, in the case of VK users, the relationship between online ego-network features connected with structural online social capital, privacy attitudes, and the cognitive limits identified for offline relationship networks. Then, to formulate a research design and a data collection procedure for an in-depth study of the structure of online friendship networks, patterns of online communication, and their limits in the case of VK users.
(2) To improve previously obtained results in the automatic detection of ethnic hate speech in informal user messages by using linguistic rules at the sentence level.
(3) To adapt methods of renormalization theory to parameter optimization (selecting the number of topics) in topic modeling.
(1) Correlation analysis, multiple regression analysis, path analysis;
(2) Machine learning experiments (SVM, Naïve Bayes, logistic regression, LSTM/GRU, Word2Vec);
(3) Search for the necessary mathematical formulations, and machine learning experiments (VLDA, LDA with Gibbs sampling, PLSA).
Empirical base of research
(1) First, we used data collected for our subproject in 2017, “Social Capital and privacy online: urban communities on online social networks.” These data were collected from August to October 2017 and include the results of a socio-psychological survey (n=357), information on the social network of an urban community (used to compute structural capital on the friendship networks), and data on user ego-networks (a subgraph of the urban network composed of the survey participants). Second, we constructed a sample of VK users (n=35) and collected survey data on the structure of their ego-networks as well as data on the volume of directed communication between participants and their online friends.
(2) First, we used a corpus (n=2.7 million) of all messages from Russian-language online social media over an entire year that contain at least one ethnonym from a dictionary of post-Soviet ethnic groups. These data were collected from the database of the social media aggregator IQBuzz. Second, we used a sample (n=15,000) of this corpus marked up by three independent coders with estimates of the author's sentiment towards the mentioned ethnicity; we also used verb-ethnicity pairs marked up by 14 coders.
(3) Experiments were based on three corpora: a Russian corpus (n=8,624) coded by human coders into ten topics; an English corpus (n=15,404) coded by human coders into 15 topics; and a French corpus (n=25,000) without any human markup.
(1) It was found that the propensity to make connections defines a user's social network. In addition, the relationship between several types of privacy behaviors, privacy attitudes, and structural capital was found to be significant. This result contributes to the understanding of the mechanism involved in the formation of online social capital. Moreover, during this project, a new research design and a data collection tool were developed and used to collect data (n=41) for testing hypotheses on patterns of online communication.
(2) The quality of ethnic hate speech detection improved by 15% compared with the results of our 2017 project, and the average quality of detecting all types of sentiment toward ethnic groups improved by 16%, in terms of the F-macro measure. Moreover, the latter is at least 20% higher than the results reported for similar tasks on similar data, including a solution based on a neural network. Additionally, the neural network solution, when reproduced on our data, produced results comparable to previous ones. This suggests that neural networks might not be optimal for cases similar to ours, with relatively small data.
(3) A renormalization procedure was adapted and applied to parameter optimization in topic modeling, which improved the computation speed of our approach by at least 68% compared with the traditional approach.
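For readers unfamiliar with the F-macro measure cited in (2), the following is a minimal sketch assuming the standard definition of macro-averaged F1: the unweighted mean of per-class F1 scores, so that rare classes count as much as frequent ones. The report does not state how the measure was computed in the project; the function name `macro_f1` and the sentiment labels below are hypothetical and serve only as an illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list.

    Assumes the standard macro-F1 definition; not taken from the project code.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed

    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall for this class.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only.
y_true = ["negative", "negative", "neutral", "neutral", "positive", "positive"]
y_pred = ["negative", "neutral", "neutral", "neutral", "positive", "negative"]
print(round(macro_f1(y_true, y_pred), 3))
```

In this toy example the per-class F1 scores are 0.5 (negative), 0.8 (neutral), and 2/3 (positive), and the macro average is their plain mean.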
Recommendations and area of application
(2) The new models could be used for automatic ethnic hate speech detection in large collections of informal user messages posted on social media. Such detection could support the moderation of user-generated content in accordance with laws on the spread of extremist information.
(3) The developed approach could be used for fast tuning of topic models, which is particularly valuable for the social sciences, where topic modeling is primarily used for empirical estimation of the topics present in data. This approach reduces the time and cost of the manual labor involved in coding data.