Master 2022/2023

Research Seminar "Data Scraping"

Category 'Best Course for Career Development'
Category 'Best Course for Broadening Horizons and Diversity of Knowledge and Skills'
Category 'Best Course for New Knowledge and Skills'
Type: Compulsory course
Area of studies: Applied Mathematics and Informatics
When: Year 1, Module 3
Mode of studies: offline
Open to: students of one campus
Master’s programme: Master of Data Science (distance learning)
Language: English
ECTS credits: 9
Contact hours: 16

Course Syllabus

Abstract

Data scraping is the import of information from websites, spreadsheets, PDFs, and other data sources. Machine learning methods do not produce good results without a well-prepared dataset, and well-prepared datasets suitable for machine learning algorithms are rare. Automating the preparation of such datasets is the task of data scraping. The course covers text file encodings, network interaction with web servers, the basics of HTML, the XML and JSON data storage and exchange formats, interaction with servers through APIs, and work with dynamic (non-static) sites. The course uses Python and its libraries to access data. At the end of the course, students implement a data scraping project.
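As a minimal illustration of the kind of task described in the abstract, the sketch below extracts link targets from a small HTML string. The sample HTML and the `LinkCollector` helper are hypothetical. The course names BeautifulSoup and Selenium for this work; the standard-library `html.parser` module is swapped in here only to keep the example free of dependencies.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample page used only for illustration.
SAMPLE_HTML = """
<html><body>
  <a href="/courses">Courses</a>
  <p>No link here</p>
  <a href="https://example.com/data.json">Data</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

The same idea carries over to real pages: fetch the page source first, then pass it to a parser that keeps only the elements of interest.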
Learning Objectives

  • Learn to process Excel, XML, JSON, and PDF files using Python
  • Learn the basics of IP, DNS, and HTTP, including GET and POST requests
  • Learn HTML basics
  • Learn to use the BeautifulSoup library and browser automation with Selenium
  • Learn to use APIs
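The difference between GET and POST requests mentioned above can be sketched without touching the network. The example below only builds request objects and inspects them; the URL and parameters are hypothetical, and the standard-library `urllib` is used here as a stand-in for the `requests` module that the course covers.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://example.com/search"  # hypothetical endpoint
params = {"q": "data scraping", "page": 1}  # hypothetical parameters

# GET: the parameters travel in the URL query string.
get_request = Request(f"{BASE_URL}?{urlencode(params)}")

# POST: the same parameters travel in the request body as bytes.
post_request = Request(BASE_URL, data=urlencode(params).encode("ascii"))

if __name__ == "__main__":
    print(get_request.get_method(), get_request.full_url)
    print(post_request.get_method(), post_request.data)
```

Nothing is sent until a request object is passed to `urllib.request.urlopen`, so the sketch is safe to run offline.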
Expected Learning Outcomes

  • Learn the most popular encodings
  • Change encoding of a text from one to another
  • Navigate through JSON & XML
  • Extract text and images from PDF
  • Apply regular expressions
  • Understand HTML
  • Create a simple HTML-page
  • Understand CSS
  • Analyze the connection between HTML and CSS
  • Create a more complicated HTML page
  • Apply CSS to add style to HTML page
  • Analyze HTTP protocol message format
  • Learn about Python web tools
  • Apply Python requests module
  • Apply Python requests module to deal with headers, user-sessions, POST-requests, files
  • Apply Python BeautifulSoup module to scrape static pages
  • Analyze the difference between static and dynamic pages
  • Understand the capabilities of the Selenium library, its functions and methods
  • Apply the Selenium library to scrape data from a dynamic page
  • Recognize the concept of Web-API
  • Contrast the process of scraping via Web-API and via page source
  • Examine the process of web-development
  • Create your own simple web-service & web-API
  • Implement a scraping script from scratch
  • Understand legal & ethical nuances of data scraping
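Two of the outcomes above, changing the encoding of a text and navigating JSON, can be sketched with the standard library alone. The sample word, the JSON document, and the `reencode` helper are hypothetical and chosen only for illustration.

```python
import json


def reencode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and encode them in the target one."""
    return data.decode(source).encode(target)


# Hypothetical sample: the Russian word for "data" stored as Windows-1251 bytes.
cp1251_bytes = "данные".encode("cp1251")
utf8_bytes = reencode(cp1251_bytes, "cp1251", "utf-8")

# Hypothetical JSON document; nested values are reached by chaining keys and indexes.
doc = json.loads('{"course": {"title": "Data Scraping", "topics": ["HTML", "API"]}}')
first_topic = doc["course"]["topics"][0]

if __name__ == "__main__":
    # Cyrillic letters take one byte each in cp1251 and two bytes each in UTF-8.
    print(len(cp1251_bytes), len(utf8_bytes))
    print(first_topic)
```

Decoding with the wrong source encoding either raises `UnicodeDecodeError` or silently yields garbled text, which is why identifying the original encoding comes first.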
Course Contents

  • 1. Character Encodings
  • 2. Popular File Formats
  • 3. Regular Expressions and HTML
  • 4. HTML and CSS
  • 5. Internet
  • 6. Scraping HTML
  • 7. Selenium
  • 8. Web API
  • 9. Web development 101
  • 10. Practice
Assessment Elements

  • non-blocking Final Project
  • non-blocking Programming Assignments
  • non-blocking Quizzes
  • non-blocking Peer Review
  • non-blocking Staff-graded Assignment
Interim Assessment

  • 2022/2023 3rd module
    0.02 * Staff-graded Assignment + 0.04 * Peer Review + 0.14 * Quizzes + 0.42 * Programming Assignments + 0.38 * Final Project
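The weighting above can be checked with a short calculation. The sketch below uses the weights from the formula; the element scores are hypothetical, and the 10-point scale is an assumption that the syllabus does not state.

```python
# Weights taken from the interim assessment formula.
WEIGHTS = {
    "Staff-graded Assignment": 0.02,
    "Peer Review": 0.04,
    "Quizzes": 0.14,
    "Programming Assignments": 0.42,
    "Final Project": 0.38,
}


def interim_grade(scores):
    """Return the weighted sum of the element scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores on an assumed 10-point scale.
example_scores = {
    "Staff-graded Assignment": 10,
    "Peer Review": 8,
    "Quizzes": 9,
    "Programming Assignments": 7,
    "Final Project": 8,
}

if __name__ == "__main__":
    print(round(interim_grade(example_scores), 2))
```

The weights sum to 1, so the result stays on the same scale as the element scores. Programming Assignments and the Final Project together account for 80% of the grade.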
Bibliography

Recommended Core Bibliography

  • Matt West - HTML5 Foundations - John Wiley & Sons, Incorporated, 2012-386 - Electronic text - https://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=1120310

Recommended Additional Bibliography

  • Ian Pouncey and Richard York - Beginning CSS: Cascading Style Sheets for Web Design - John Wiley & Sons, Incorporated, 2011-466 - Electronic text - https://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=693510