Research Seminar "Data Scraping"

Master 2022/2023

Category 'Best Course for Career Development'

Category 'Best Course for Broadening Horizons and Diversity of Knowledge and Skills'

Category 'Best Course for New Knowledge and Skills'

Type: Compulsory course

Area of studies: Applied Mathematics and Informatics

Delivered by: Big Data and Information Retrieval School

Where: Faculty of Computer Science

When: Year 1, Module 3

Mode of studies: offline

Open to: students of one campus

Instructors: Anastasia Maximovskaya

Master’s programme: Master of Data Science (extramural)

Language: English

ECTS credits: 9

Contact hours: 16

Abstract

Data scraping is the import of information from websites, spreadsheets, PDFs, and other data sources. Machine learning methods do not produce good results without a well-prepared dataset, and high-quality datasets that are ready for machine learning algorithms are rare. Data scraping automates the preparation of such datasets. The course covers text file encodings, network interaction with web servers, the basics of HTML, the XML and JSON data storage and exchange formats, interaction with servers through APIs, and work with dynamic (non-static) sites. The course uses Python and its libraries to access data. At the end of the course, students implement a data scraping project.

Learning Objectives

Learn to process Excel, XML, JSON, and PDF files using Python
Learn IP, DNS, and HTTP, including GET and POST requests
Learn HTML basics
Learn to use the BeautifulSoup library and to automate browsing with Selenium
Learn to use APIs
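
As an illustration of the static-page scraping objective, here is a minimal sketch that extracts links from an HTML string. It uses the standard-library html.parser module as a stand-in for BeautifulSoup, so it is not the course's own tooling. The sample page, the LinkCollector class, and the extract_links helper are hypothetical names made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical static page; a real scraper would first download it over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Course list</h1>
  <a href="/courses/scraping">Data Scraping</a>
  <a href="/courses/ml">Machine Learning</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links(SAMPLE_HTML)
```

With BeautifulSoup the same task is usually shorter, but the idea is the same: parse the markup, then walk the tags of interest.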

Expected Learning Outcomes

Learn the most popular encodings
Change the encoding of a text from one to another
Navigate through JSON & XML
Extract text and images from PDF
Apply regular expressions
Understand HTML
Create a simple HTML page
Understand CSS
Analyze the connection between HTML and CSS
Create a more complicated HTML page
Apply CSS to add style to an HTML page
Analyze HTTP protocol message format
Learn about Python web tools
Apply the Python requests module
Apply the Python requests module to deal with headers, user sessions, POST requests, and files
Apply Python BeautifulSoup module to scrape static pages
Analyze the difference between static and dynamic pages
Understand the capabilities of the Selenium library, its functions and methods
Apply the Selenium library to scrape data from a dynamic page
Recognize the concept of a web API
Contrast the process of scraping via a web API and via page source
Examine the process of web development
Create your own simple web service and web API
Implement a scraping script from scratch
Understand legal & ethical nuances of data scraping
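
A few of the outcomes above (changing encodings, navigating JSON and XML, applying regular expressions) can be sketched with the Python standard library alone. The sample strings and variable names below are hypothetical and chosen only for illustration. This is a minimal sketch, not course material.

```python
import json
import re
import xml.etree.ElementTree as ET

# 1. Change the encoding of a text from one to another.
# Bytes as they might arrive from a legacy Windows-1251 file (hypothetical sample).
raw_cp1251 = "Привет".encode("cp1251")
text = raw_cp1251.decode("cp1251")   # bytes -> str
raw_utf8 = text.encode("utf-8")      # str -> UTF-8 bytes

# 2. Navigate through JSON and XML holding the same hypothetical record.
record_json = '{"course": {"title": "Data Scraping", "topics": ["HTML", "API"]}}'
record_xml = (
    "<course><title>Data Scraping</title>"
    "<topic>HTML</topic><topic>API</topic></course>"
)

json_topics = json.loads(record_json)["course"]["topics"]
xml_topics = [node.text for node in ET.fromstring(record_xml).findall("topic")]

# 3. Apply a regular expression: pull four-digit numbers out of free text.
years = re.findall(r"\b\d{4}\b", "Master 2022/2023, 3rd module")
```

The same six-letter word takes 6 bytes in Windows-1251 and 12 bytes in UTF-8, which is why reading a file with the wrong encoding produces garbled text rather than an obvious error.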

Course Contents

1. Character Encodings
2. Popular File Formats
3. Regular Expressions and HTML
4. HTML and CSS
5. Internet
6. Scraping HTML
7. Selenium
8. Web API
9. Web development 101
10. Practice

Assessment Elements

Final Project
Programming Assignments
Quizzes
Peer Review
Staff-graded Assignment

Interim Assessment

2022/2023 3rd module
0.02 * Staff-graded Assignment + 0.04 * Peer Review + 0.14 * Quizzes + 0.42 * Programming Assignments + 0.38 * Final Project
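
As a worked example of the formula above, the sketch below computes the weighted grade for one set of scores. The weights come from the formula. The scores, the 10-point scale, and the interim_grade function name are assumptions made for illustration.

```python
# Weights copied from the interim assessment formula.
WEIGHTS = {
    "Staff-graded Assignment": 0.02,
    "Peer Review": 0.04,
    "Quizzes": 0.14,
    "Programming Assignments": 0.42,
    "Final Project": 0.38,
}


def interim_grade(scores):
    """Weighted sum of per-element scores (all scores on the same scale)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores on an assumed 10-point scale.
example_scores = {
    "Staff-graded Assignment": 10,
    "Peer Review": 8,
    "Quizzes": 9,
    "Programming Assignments": 7,
    "Final Project": 8,
}

grade = interim_grade(example_scores)
```

For these scores the terms are 0.20 + 0.32 + 1.26 + 2.94 + 3.04, giving about 7.76. The weights sum to 1, and the programming assignments and the final project together account for 80% of the grade.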

Bibliography

Recommended Core Bibliography

Matt West - HTML5 Foundations - John Wiley & Sons, Incorporated, 2012-386 - Electronic text - https://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=1120310

Recommended Additional Bibliography

Ian Pouncey and Richard York - Beginning CSS: Cascading Style Sheets for Web Design - John Wiley & Sons, Incorporated, 2011-466 - Electronic text - https://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=693510
