Machine Learning and Data Mining

Master 2018/2019

Category 'Best Course for Career Development'

Category 'Best Course for New Knowledge and Skills'

Type: Elective course (Modern Social Analysis)

Area of studies: Sociology

Delivered by: Department of Sociology

Where: Saint-Petersburg School of Social Sciences

When: 1st year, 4th module

Mode of studies: offline

Instructors: Sergei Koltsov

Master’s programme: Modern Social Analysis

Language: English

ECTS credits: 2

Contact hours: 28

Full Syllabus

Abstract

The rapid growth of social networking sites, online media and other internet-generated data is making machine learning an essential analytical tool for social scientists and industrial analysts of social data. Today, social researchers should not only be able to work with different types of data, such as textual or relational data, but should also have the skills to interpret results obtained with complex mathematical algorithms. In this course students will, first, get to know basic machine learning algorithms and their main advantages and limitations for social science goals. Second, they will obtain skills to work with machine learning software and code. Third, by the end of the course all students will produce a small-scale research project that may be used in their Master's theses. Depending on the level of the student group, the course may be based on one of the following software tools: Orange, R, or Python.

Learning Objectives

Learn basic machine learning algorithms and their main advantages and limitations for social science goals
Obtain skills to work with machine learning software and code
Be able to work with different types of data, such as textual or relational data

Expected Learning Outcomes

Analyze textual and numerical data
Preprocess text (lemmatization and tokenization)
Analyze data with machine learning tools
Visualize results of the analysis

Course Contents

Topic 1. Introduction to machine learning.
Introduction to machine learning and software review. Overview of the application of machine learning methods in various industries and in social science. A discussion of how modern methods of machine learning and artificial intelligence change approaches in many scientific fields, and why knowledge of these methods is becoming part of the researcher’s general scientific culture, regardless of the specific subject area. Discussion of data types, quality metrics, and the methodology for conducting experiments on data of various types.
Topic 4. Regression (overview of models).
The application of linear, logistic and non-linear regression to text and tabular data. Impact of categorical variables on regression results. Basic quality measures for regression. Discussion of the results of applying regression to different datasets.
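As an illustration of the simplest case named above, the following is a minimal sketch of one-variable linear regression by ordinary least squares, with the coefficient of determination (R squared) as a basic quality measure. It is written in plain Python so the arithmetic is visible; the function names and the toy data are assumptions made for this example and are not taken from the course materials.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance of y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical toy data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
```

An R squared close to 1 means the line explains almost all of the variation in the toy data; on real social data much lower values are common.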
Topic 5. Feature selection.
An overview of methods for selecting and extracting useful features in a dataset (Univariate Selection, Recursive Feature Elimination, Principal Component Analysis, Random Forest). Implementation of these methods in Python (using the sklearn library).
Topic 6. Cluster analysis (K-means, C-means, hierarchical clustering).
Implementation of cluster analysis in Jupyter notebook. Development of a cluster analysis pipeline based on K-means. Ways to initialize the algorithm. Visualization of cluster analysis results. Development of a pipeline for hierarchical cluster analysis. Visualization of the results of hierarchical clustering. Discussion of clustering problems. Discussion of the results of clustering on different datasets.
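The K-means step of such a pipeline can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the course's own notebook code: the initial centroids are passed in explicitly (one of several possible initialization strategies), and the function names and toy points are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, init_centroids, n_iter=20):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = [tuple(c) for c in init_centroids]
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                # New centroid is the coordinate-wise mean of its members.
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Final labels are recomputed so they match the returned centroids.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical toy data: two well-separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, init_centroids=[(1, 1), (8, 8)])
```

Because the result depends on the starting centroids, rerunning with different initial values is a simple way to see why initialization matters.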
Topic 7. Linear models for classification and regression.
Introduction to the classification procedure. Discussion of the difference between classification and regression. Mathematical model of logistic regression for classification purposes. Optimization of classification results based on regularization procedures. Discussion of the quality metrics of classifiers (precision, recall, F-measure, ROC, confusion matrix). Implementation of logistic regression in Jupyter notebook. Example of classification on real data.
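The quality metrics listed above follow directly from the four cells of a binary confusion matrix. The sketch below computes them in plain Python; the helper names and the toy labels are assumptions made for this illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1; each is 0.0 when its denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels: 4 positive and 6 negative examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here one positive example is missed and one negative example is wrongly flagged, so precision and recall are both 3/4.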
Topic 8. KNN and SVM classification.
Discussion of the KNN algorithm. Analysis of the advantages and disadvantages of KNN. The problem of choosing the number of neighbors. Evaluation of methods for selecting the number of neighbors. Discussion of the SVM (Support Vector Machines) algorithm. Analysis of the advantages and disadvantages of this algorithm. Discussion of parameters in linear and polynomial SVM models. Implementation of the KNN model in Jupyter notebook. Quality assessment of the KNN model. Implementation of the SVM model in Jupyter notebook. Evaluation of the quality of KNN and SVM. Comparison of KNN and SVM results on real data (text and non-text data). Discussion of the results.
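To make the problem of choosing the number of neighbors concrete, here is a minimal plain-Python sketch of KNN with leave-one-out evaluation for several values of k. The function names, the toy data and the use of leave-one-out accuracy as the selection criterion are assumptions made for this example.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)


# Hypothetical toy data: two groups of three points each.
X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
y = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(X, y, (2, 2), k=3)
accuracy_by_k = {k: loo_accuracy(X, y, k) for k in (1, 3, 5)}
```

On this tiny dataset k = 5 fails completely, because after one point is held out its own class is always outvoted by the other group; this is the kind of effect the choice of k has to guard against.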
Topic 9. Naïve Bayes classifier.
Introduction to probability theory. The classical and Bayesian approaches to calculating the probability of an event. Discussion of Bayes' rule. A priori and a posteriori judgments. The use of the naive Bayes algorithm for classification purposes. Discussion of the advantages and disadvantages of the Bayes classifier. Implementation of a classification pipeline based on the naive Bayes algorithm in Jupyter notebook. Comparison of the Bayesian classifier with the KNN and SVM classifiers on real datasets (textual and non-textual data).
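A naive Bayes text classifier of the kind described above can be sketched in a few lines. The version below is a multinomial model with add-one (Laplace) smoothing, written in plain Python; the function names, the smoothing choice and the toy documents are assumptions made for this illustration rather than the course's own pipeline.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Pick the class with the highest log prior plus log likelihood."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)  # a priori probability of the class
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][t] + 1) / (total + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical, already tokenized toy documents.
docs = [
    ["election", "vote", "party"],
    ["party", "parliament", "vote"],
    ["match", "goal", "team"],
    ["team", "coach", "goal"],
]
labels = ["politics", "politics", "sport", "sport"]
model = train_nb(docs, labels)
label_1 = predict_nb(model, ["vote", "party"])
label_2 = predict_nb(model, ["goal", "coach"])
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log probabilities are simply added.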
Topic 10. Topic modeling.
Introduction to topic modeling. Probabilistic formulation of the classification problem. Discussion of various models in the field of topic modeling (E-M and Gibbs sampling algorithms). Discussion of the problem of choosing the number of topics. Assessment of the similarities and differences between topic solutions. Review of software tools in the field of topic modeling. Implementation of a topic modeling pipeline in Jupyter notebook and the ‘TopicMiner’ software. Evaluation of the effect of preprocessing on the results of topic modeling. Application of topic modeling to the analysis of socio-political data. Review of hierarchical topic models and models with word embeddings, and their implementation in Python (gensim, tomotopy).
Topic 2. Overview of mathematical formalism necessary for understanding of machine learning.
Overview of the mathematical model of machine learning. Overview of basic concepts from linear algebra. Overview of elements of mathematical analysis. Overview of Jupyter notebook and the general principles of working with notebooks.
Topic 3. Data preprocessing.
Vector model of text collections. Discussion of the data cleansing process, text lemmatization, and stop-word removal. Implementation of preprocessing in Python (Jupyter notebook) for Russian and English. Discussion of lemmatization tools: 1. Mystem. 2. Pymorphy. 3. NLTK.
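The vector model and the cleaning steps mentioned above can be illustrated with a minimal plain-Python sketch: lowercasing, tokenization, stop-word removal and a bag-of-words count matrix. Lemmatization is deliberately left out, since it needs an external tool such as those listed above. The stop-word list, the function names and the sample documents are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def preprocess(text):
    """Lowercase, split into runs of Latin or Cyrillic letters, drop stop words."""
    tokens = re.findall(r"[a-zа-яё]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


docs = ["The analysis of social data", "Social networks and social media"]
vocab, vectors = bag_of_words(docs)
# vocab   -> ['analysis', 'data', 'media', 'networks', 'social']
# vectors -> [[1, 1, 0, 0, 1], [0, 0, 1, 1, 2]]
```

Each document becomes a vector of word counts over a shared vocabulary, which is the input format that the classification, clustering and topic modeling methods in the later topics expect.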

Assessment Elements

Presentation of project
A project is a written self-study on a topic offered by the teacher or proposed by the student and approved by the teacher. The project develops skills in critical thinking and written argumentation. A project should include a clear statement of a research problem and an analysis of that problem that uses concepts and analytical tools from the subject and summarizes the author's point of view.
Homework

Interim Assessment

Interim assessment (4th module)
0.7 * Homework + 0.3 * Presentation of project

Bibliography

Recommended Core Bibliography

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge, Mass: The MIT Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=480968

Recommended Additional Bibliography

A Tutorial on Machine Learning and Data Science Tools with Python. (2017). Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.E5F82B62
