Data Analysis

Academic Year: 2023/2024
Language of instruction: English
ECTS credits: 3
Course type: Compulsory course
When: 3rd year, modules 3 and 4

Course Syllabus

Abstract

The course gives students basic knowledge of statistics and data analysis techniques. It consists of three parts. The first part covers general ideas of statistics and data analysis, focusing on descriptive statistics and basic data manipulation. The second part moves on to inferential statistics and hypothesis testing. The third part applies machine learning techniques to data analysis. All practical work is done in Python. The course is worth 3 credits.
Learning Objectives

  • Students will acquire a solid foundation in data manipulation and visualization.
Expected Learning Outcomes

  • Ability to perform exploratory data analysis, hypothesis testing and visualization.
  • Ability to preprocess text and retrieve information with modern approaches.
  • Intermediate proficiency in Python libraries for data analysis and visualization (NumPy, Pandas, Matplotlib, Plotly, Scikit-Learn, etc.).
  • Ability to evaluate the predictive accuracy of a model.
  • Ability to interpret text and media as big data.
  • Ability to use logistic regression.
  • Knowledge and skills needed to implement one's own projects in Python.
Course Contents

  • Review of the basic data manipulation and visualization R packages: tidyverse, ggplot2. Summary statistics of a dataset.
  • Text and media as big data. Concepts of structuring text and assessing the sentiment of an expression.
  • Introduction to graph theory. Euler's 'Seven Bridges of Königsberg' problem. Using the 'rgraph' library to manipulate a network structure. Assigning additional properties to edges and vertices.
  • Visualizing the network structure of VK friends. Accessing a VK account from R via the API, obtaining individual account data and building a graph. Plotting and labeling the result.
  • Introduction to logistic regression. Evaluation of model significance. P-value, confidence intervals, pseudo-R-squared.
  • Evaluation of model predictive accuracy. Contingency table. ROC curve. Selecting an optimal separation threshold.
  • Review of Python basics, concepts and syntax for data manipulation.
  • Exploratory data analysis and descriptive statistics using Python packages (Pandas, NumPy).
  • Data visualization using matplotlib, seaborn, plotly.
  • Hypothesis testing (t-test, z-test, etc.). Confidence intervals.
  • Linear regression. Metrics for quality evaluation (MSE, RMSE, MAE, R², etc.).
  • Logistic regression. Metrics for quality evaluation (Accuracy, Precision, Recall, AUC-ROC, etc.).
  • k-Nearest Neighbours. Model selection, validation and analysis. Cross-validation, train-test split. Parameter tuning.
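
The quality metrics named in the regression topics above can be illustrated with a minimal sketch in plain Python. This is not course material: the function names `regression_metrics` and `classification_metrics` are hypothetical, the sample data is made up, and in practice the course libraries (for example Scikit-Learn) would normally be used instead of hand-written formulas.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for two equal-length numeric sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,  # assumes y_true is not constant
    }


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision and recall are undefined when their denominator is zero;
        # this sketch reports 0.0 in that case.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


if __name__ == "__main__":
    # Made-up illustrative data, not taken from the course.
    print(regression_metrics([1, 2, 3, 4], [1, 2, 3, 6]))
    print(classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Writing the formulas out once makes it easier to check what a library call reports, in particular that R² compares the residual sum of squares with the variance around the mean of the true values.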
Assessment Elements

  • non-blocking Quizzes
    There will be short in-class quizzes distributed throughout the course. Each quiz will take 5-10 minutes and will cover the material of the previous weeks (details will be announced at least one week in advance). Questions may be multiple-choice or short-answer. The sum of all quiz grades will count towards the final grade with a weight of 10%.
  • blocking Exam (Project Defence)
    At the end of the course, students will have to complete a group project. Groups will consist of 2 students. They will have to gather data from the Internet via Python, write it to a file and then calculate some statistics. Students will have to submit their code and project description during the exam week and then defend the project on the day of the exam. Students will be asked questions about the code they have submitted. The total grade will consist of a grade for the written part and a grade for the Q&A. All students in the group receive the Q&A grade based on the performance of the weakest student in the group (e.g. if one of the participants cannot answer any question, the entire group gets a 0 for the defence part). Details of the project will be announced in the second half of the 4th module. The project grade will count towards the final grade with a weight of 30%.
  • non-blocking Home Assignments
    There will be homework assignments involving data analysis in Python. Solutions should be submitted via the SmartLMS platform. Each assignment will have its own deadline and will be graded from 0 to 10 points. The mean of all assignment grades will count towards the final grade with a weight of 20%. Home assignments are selectively examined orally. In case of plagiarism (or if the student cannot explain their solutions), the grades for all homework assignments submitted by the student are set to zero.
  • non-blocking Midterm Test
    There will be a midterm test at the end of the third module. It will be conducted via the SmartLMS platform. The test will consist of a quiz and data analysis problems. A mock test will be published a few weeks in advance. The test is graded from 0 to 10 and will count towards the final grade with a weight of 20%.
  • non-blocking Seminar Participation
    There will be mini-tasks during the seminars. The student needs to continue a snippet of code for a given task or answer a question. Half points and zero points may be awarded when assessing the students' performance. The total grade will be normalised by the maximum score in the group and will count towards the final grade with a weight of 20%.
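
The normalisation of seminar participation described above could look like the following minimal sketch. The syllabus does not specify the target scale or the exact procedure, so the 10-point scale, the function name `normalise_to_group_max` and the sample scores are all assumptions made for illustration.

```python
def normalise_to_group_max(raw_scores, scale=10.0):
    """Rescale raw seminar scores so that the top score in the group maps to `scale`.

    `raw_scores` maps a student identifier to the sum of seminar points.
    The 10-point default scale is an assumption, not stated in the syllabus.
    """
    top = max(raw_scores.values())
    if top == 0:
        # Nobody earned any points, so every normalised grade is zero.
        return {name: 0.0 for name in raw_scores}
    return {name: scale * score / top for name, score in raw_scores.items()}


if __name__ == "__main__":
    # Hypothetical raw totals, including a half point.
    print(normalise_to_group_max({"student_a": 7.5, "student_b": 5, "student_c": 0}))
```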
Interim Assessment

  • 2023/2024 4th module
    0.3 * Exam (Project Defence) + 0.2 * Home Assignments + 0.2 * Midterm Test + 0.1 * Quizzes + 0.2 * Seminar Participation
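
As a worked example of the formula above, the following sketch computes the weighted sum for one student. The weights are taken from the formula; the dictionary keys, the function name `final_grade` and the sample grades are hypothetical. The sketch assumes every component is on a 0 to 10 scale and does not model rounding or the consequences of failing the blocking exam, since the formula itself does not define them.

```python
# Weights copied from the interim assessment formula.
WEIGHTS = {
    "exam": 0.3,              # Exam (Project Defence)
    "home_assignments": 0.2,
    "midterm": 0.2,
    "quizzes": 0.1,
    "seminar": 0.2,           # Seminar Participation
}


def final_grade(grades):
    """Return the weighted sum of component grades (each assumed to be 0-10)."""
    missing = set(WEIGHTS) - set(grades)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical grades for one student.
    sample = {
        "exam": 8,
        "home_assignments": 7,
        "midterm": 6,
        "quizzes": 10,
        "seminar": 9,
    }
    # 0.3*8 + 0.2*7 + 0.2*6 + 0.1*10 + 0.2*9 = 7.8
    print(round(final_grade(sample), 2))
```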
Bibliography

Recommended Core Bibliography

  • Garrett, N. (2015). Textbooks for Responsible Data Analysis in Excel. Journal of Education for Business, 90(4), 169–174. https://doi.org/10.1080/08832323.2015.1007908
  • Eisenstein, J. (2019). Introduction to Natural Language Processing.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
  • Cady, F. (2017). The Data Science Handbook.

Recommended Additional Bibliography

  • De-Arteaga, M., & Boecking, B. (2019). Killings of social leaders in the Colombian post-conflict: Data analysis for investigative journalism. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsarx&AN=edsarx.1906.08206
  • Döbler, M., & Grössmann, T. (2019). Data Visualization with Python: Create an Impact with Meaningful Data Insights Using Interactive and Engaging Visuals. Packt Publishing.
  • Houston, B. (2019). Data for Journalists: A Practical Guide for Computer-Assisted Reporting (5th ed.). New York, NY: Routledge. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1989291
  • Natural Language Processing and Information Systems: 18th International Conference on Applications of Natural Language to Information Systems, NLDB 2013, Salford, UK, June 19-21, 2013: Proceedings. (2013).