Year of Graduation
Data Mining and Machine Learning for social networking websites
School of Applied Mathematics and Information Science
The aim of this work was to predict the number of likes a post will receive by analyzing not only its content but also its potential audience, using Data Mining and Machine Learning.

The data came from the social network "Odnoklassniki" and was provided by the SNA Hackathon competition (St. Petersburg, Russia, April 2014, http://sh2014.org/). Initially, the problem was solved by linear regression. Various regression factors were considered, such as the presence of images and links, the ratio of capital letters, whether the words of the post belong to a frequency dictionary compiled specifically for this purpose, the day of the week, and the average number of likes in the group. From these, the factors that gave the most accurate result were chosen. After that, another predictor was added: cluster membership. All experiments were carried out in «IPython Notebook».

The split into two clusters was produced by modularity-based methods: "Fast greedy community" and "Edge betweenness community". Both methods are available in «Pajek», which was used for the clustering.

The main result is a solution to the SNA Hackathon contest problem with a score of 0.231, which is not far behind the competition leader's score of 0.303. In addition, a linear regression model was constructed and its predictors were selected. The most notable result is the successful use of clustering methods in a predictive problem of this type. Hypotheses on how to improve the results were also proposed, and directions for future research were indicated.
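To make the kind of predictors listed above concrete, the following is a minimal sketch of how a single post might be turned into a numeric feature vector for linear regression. It is an illustration only, not the code used in this work: the function name `post_features`, its parameters, the URL pattern, and the one-hot encoding of the weekday are all assumptions, and the dictionary-based and group-average predictors are omitted.

```python
import re
from datetime import datetime

# Hypothetical pattern for detecting links; the original work may have used another rule.
URL_RE = re.compile(r"https?://\S+")


def post_features(text, has_image, posted_at, cluster_id):
    """Turn one post into a flat list of numeric regression predictors (illustrative)."""
    has_link = 1.0 if URL_RE.search(text) else 0.0

    # Ratio of capital letters among all letters, ignoring any URLs in the text.
    letters = [ch for ch in URL_RE.sub("", text) if ch.isalpha()]
    caps_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0

    # One-hot encode the day of the week (Monday is index 0).
    weekday = [0.0] * 7
    weekday[posted_at.weekday()] = 1.0

    # cluster_id stands for the cluster label (0 or 1) obtained from the clustering step.
    return [float(has_image), has_link, caps_ratio, float(cluster_id)] + weekday


features = post_features(
    "SALE today http://example.com", True, datetime(2014, 4, 7, 12, 0), 1
)
```

Each post then contributes one such row to the design matrix, and the number of likes is the regression target.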