Delivered at: Faculty of World Economy and International Affairs

Course type: Compulsory course

When: 3rd year, modules 3 and 4

Instructor

Karpov, Maksim

Abstract

The course gives students basic knowledge of statistics and data analysis techniques. It consists of three parts. The first part covers the general ideas of statistics and data analysis, focusing on descriptive statistics and basic data manipulation. The second part moves on to inferential statistics and hypothesis testing. The third part applies machine learning techniques to data analysis. All practical work is done in Python. The course is worth 3 credits.

Learning Objectives

Students will acquire a solid foundation in data manipulation and visualization.

Expected Learning Outcomes

Ability to perform exploratory data analysis, hypothesis testing and visualization.
Ability to preprocess text and retrieve information with modern approaches.
Intermediate proficiency in Python libraries for data analysis and visualization (NumPy, Pandas, Matplotlib, Plotly, Scikit-Learn, etc.).
Skill in evaluating model predictive accuracy.
Skill in interpreting text and media as big data.
Skill in using logistic regression.
Knowledge and skills for implementing their own projects in Python.

Course Contents

Review of the basic data manipulation and visualization R packages: tidyverse, ggplot2. Summary statistics of a dataset.
Text and media as big data. Concepts of structuring text and assessing the sentiment of an expression.
Introduction to graph theory. Euler's 'Seven Bridges of Königsberg' problem. Using the 'rgraph' library to manipulate a network structure. Assigning additional properties to edges and vertices.
Visualizing the network structure of VK friends. Accessing a VK account from R via the API, obtaining individual account data and building a graph. Plotting and labeling the result.
Introduction to logistic regression. Evaluation of model significance. P-value, confidence intervals, pseudo-R-squared.
Evaluation of model predictive accuracy. Contingency table. ROC curve. Selecting an optimal separation threshold.
Review of Python basics, concepts and syntax for data manipulation.
Exploratory data analysis and descriptive statistics using Python packages (Pandas, NumPy).
Data visualization using matplotlib, seaborn, plotly.
Hypothesis testing (t-test, z-test, etc.). Confidence intervals.
Linear regression. Metrics for quality evaluation (MSE, RMSE, MAE, R², etc.).
Logistic regression. Metrics for quality evaluation (Accuracy, Precision, Recall, AUC-ROC, etc.).
k-Nearest Neighbours. Model selection, validation and analysis. Cross-validation, train-test split. Parameter tuning.
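
As an illustration of the classification topics listed above (contingency table, accuracy, precision, recall and selecting a separation threshold), here is a minimal sketch in plain Python. It is not taken from the course materials: the function names, the toy labels and scores, and the use of Youden's J (TPR minus FPR) as the threshold criterion are all assumptions made for this example. In practice a library such as Scikit-Learn would normally be used for these computations.

```python
from typing import NamedTuple, Sequence


class Confusion(NamedTuple):
    """Cells of a 2x2 contingency (confusion) table."""
    tp: int
    fp: int
    fn: int
    tn: int


def confusion_at(y_true: Sequence[int], scores: Sequence[float],
                 threshold: float) -> Confusion:
    """Build the table when class 1 is predicted for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


def metrics(c: Confusion) -> dict:
    """Accuracy, precision and recall computed from the table."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "precision": c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0,
        "recall": c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0,
    }


def best_threshold(y_true: Sequence[int], scores: Sequence[float]):
    """Return the candidate threshold with the largest Youden's J = TPR - FPR.

    Assumes both classes are present in y_true. Ties keep the lowest threshold.
    """
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        c = confusion_at(y_true, scores, t)
        tpr = c.tp / (c.tp + c.fn)   # true positive rate (recall)
        fpr = c.fp / (c.fp + c.tn)   # false positive rate
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j


# Toy labels and predicted probabilities, invented for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

threshold, j = best_threshold(y_true, scores)
print(threshold, j)
print(metrics(confusion_at(y_true, scores, threshold)))
```

Each point of a ROC curve is one (FPR, TPR) pair from the loop in `best_threshold`, so the same loop could be reused to plot the curve.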

Assessment Elements

Quizzes
There will be short in-class quizzes throughout the course. Each quiz will take 5-10 minutes and will cover the material of the previous weeks (details will be communicated at least one week in advance). Questions may be multiple-choice or short-answer. The sum of all quiz grades will count towards the final grade with a weight of 10%.
Exam (Project Defence)
At the end of the course, students complete a group project in groups of 2. Each group gathers data from the Internet using Python, writes it to a file and then calculates some statistics. Students submit their code and project description during the exam week and defend the project on the day of the exam, answering questions about the code they have submitted. The total grade consists of a grade for the written part and a grade for the Q&A. All students in a group receive the Q&A grade earned by the weakest student in the group (e.g. if one of the participants cannot answer any question, the entire group gets 0 for the defence part). Details of the project will be announced in the second part of the 4th module. The project grade will count towards the final grade with a weight of 30%.
Home Assignments
There will be homework assignments involving data analysis in Python. Solutions should be submitted via the SmartLMS platform. Each assignment will have its own deadline and will be graded from 0 to 10 points. The mean of all assignment grades will count towards the final grade with a weight of 20%. Students may be selected for an oral examination on a home assignment. In case of plagiarism (or if the student cannot explain their solutions), the grades for all homework assignments submitted by the student are set to zero.
Midterm Test
There will be a midterm test at the end of the third module. It will be conducted via the SmartLMS platform. The test will consist of a quiz and data analysis problems. A mock test will be published a few weeks in advance. The test is graded from 0 to 10 and will count towards the final grade with a weight of 20%.
Seminar Participation
There will be mini-tasks during the seminars. Students will need to continue a snippet of code for a given task or answer a question. Half points and zero points may be awarded for a task. The total grade will be normalised to the maximum score in the group. Seminar participation will count towards the final grade with a weight of 20%.
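
One possible reading of the seminar normalisation rule can be sketched as follows. This is an assumption, not the official procedure: the sketch supposes that raw seminar points are scaled linearly so that the best total in the group maps to 10, and the function name and student labels are hypothetical.

```python
def normalise_to_group_max(raw_points: dict[str, float],
                           scale: float = 10.0) -> dict[str, float]:
    """Scale raw seminar points so the group's top total equals `scale`.

    Assumption: linear scaling relative to the group maximum.
    """
    top = max(raw_points.values(), default=0.0)
    if top == 0:
        # Nobody earned any points, so every normalised grade is zero.
        return {name: 0.0 for name in raw_points}
    return {name: scale * points / top for name, points in raw_points.items()}


# Hypothetical raw totals, including a half point, for illustration only.
raw = {"student_a": 8.0, "student_b": 4.0, "student_c": 0.5}
normalised = normalise_to_group_max(raw)
print(normalised)
```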

Interim Assessment

2023/2024 4th module
0.3 * Exam (Project Defence) + 0.2 * Home Assignments + 0.2 * Midterm Test + 0.1 * Quizzes + 0.2 * Seminar Participation
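
The weighted sum above can be checked with a short Python sketch. The weights come from the formula. The dictionary keys, the function name and the example grades are hypothetical, and the sketch assumes every component is on a 0 to 10 scale. No rounding rule is applied to the result because the syllabus does not state one.

```python
# Weights taken from the interim assessment formula.
WEIGHTS = {
    "exam_project_defence": 0.3,
    "home_assignments": 0.2,
    "midterm_test": 0.2,
    "quizzes": 0.1,
    "seminar_participation": 0.2,
}


def interim_grade(grades: dict[str, float]) -> float:
    """Weighted sum of the five assessment elements (each assumed to be 0-10)."""
    missing = WEIGHTS.keys() - grades.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)


# Hypothetical component grades, for illustration only.
example = {
    "exam_project_defence": 8,
    "home_assignments": 9,
    "midterm_test": 7,
    "quizzes": 6,
    "seminar_participation": 10,
}
# 0.3*8 + 0.2*9 + 0.2*7 + 0.1*6 + 0.2*10 = 8.2
print(round(interim_grade(example), 2))
```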

Bibliography

Recommended Core Bibliography

Garrett, N. (2015). Textbooks for Responsible Data Analysis in Excel. Journal of Education for Business, 90(4), 169–174. https://doi.org/10.1080/08832323.2015.1007908
Eisenstein, J. (2019). Introduction to Natural Language Processing.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
Cady, F. (2017). The Data Science Handbook.

Recommended Additional Bibliography

De-Arteaga, M., & Boecking, B. (2019). Killings of social leaders in the Colombian post-conflict: Data analysis for investigative journalism. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsarx&AN=edsarx.1906.08206
Döbler, M., & Grössmann, T. (2019). Data Visualization with Python : Create an Impact with Meaningful Data Insights Using Interactive and Engaging Visuals. Packt Publishing.
Houston, B. (2019). Data for Journalists: A Practical Guide for Computer-Assisted Reporting (5th ed.). New York, NY: Routledge. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1989291
Natural Language Processing and Information Systems: 18th International Conference on Applications of Natural Language to Information Systems, NLDB 2013, Salford, UK, June 19-21, 2013: Proceedings (2013).

Bachelor’s Programme: International Program 'International Relations and Global Studies'

Data Analysis