
Project seminar "Data Scraping"

Academic year: 2020/2021
Language of instruction: English
ECTS credits: 7
Course type: Compulsory course
When: 1st year, 2nd semester

Course Syllabus

Abstract

Data scraping is the process of importing information from websites, spreadsheets, PDF files and other data sources. Machine learning will not yield good results without a well-prepared dataset, yet datasets of a quality suitable for machine learning are very hard to find. Data scraping addresses this problem by automating the preparation of such datasets. This course examines text file encodings, network interaction with web servers, the fundamentals of HTML, the XML and JSON data storage and exchange formats, interaction with servers through APIs, and work with non-static sites. Python and its libraries are used to retrieve the data. At the end of the course, you will be expected to complete a data scraping project.

Course topics:
  • Processing Excel, XML, JSON and PDF files using Python
  • IP, DNS and HTTP; GET and POST requests
  • HTML basics
  • The BeautifulSoup library; automation with Selenium
  • Using APIs
  • Project preparation

The Internet is a great source of information, and nowadays it is within arm's reach. However, the amount of data may seem overwhelming: it comes in many forms, tends to grow rapidly and can be hard to cope with. In this course, we will help you master the tools needed to turn this seemingly immense ocean of data into meaningful, useful information. We will examine the most common data formats, study the architecture of the Internet, investigate the structure of a web page and learn how to create one of our own, and dive into the concept of an API. Finally, we will consolidate this knowledge by implementing a project. By the end of the course, you will know how to approach a complex practical task such as data scraping and will have completed your own project.
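To give a flavour of the HTML part of the course, here is a minimal sketch of extracting links from a page. It uses only Python's standard-library html.parser rather than BeautifulSoup, so that it runs without extra packages. The sample page, the LinkCollector class and the extract_links helper are hypothetical names made up for this illustration and are not taken from the course materials.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would first download this text.
SAMPLE_HTML = """
<html><body>
  <h1>Films</h1>
  <a href="/film/1">First film</a>
  <a href="/film/2">Second film</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links(html_text):
    """Return a list of (href, text) tuples found in html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


links = extract_links(SAMPLE_HTML)
for href, text in links:
    print(href, text)
```

A library such as BeautifulSoup wraps the same idea in a more convenient interface, which is one reason the course relies on it for real pages.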
Learning Objectives

  • The goal of the course is to teach students the concepts and tools necessary for data scraping. We will cover Python libraries for communicating over the Web and learn how to work with popular file formats. By the end of the course, a student will be able to use Python to collect data published on the Web and to use that data for further analysis. Numerous programming assignments provide hands-on experience. In the final project, a student will demonstrate skills in scraping, parsing, analysing and presenting data.
Expected Learning Outcomes

  • You will learn the concept of character encodings: what they are and why we need them.
  • You will study the Unicode standard, which defines a universal character set, and UTF-8, the most popular character encoding today.
  • You will learn the structure of popular file formats for data exchange (XML, Excel, JSON, PDF) and how to process them in Python.
  • You will learn regular expressions, a powerful tool for working with text, and you will start learning HTML, an essential part of the WWW and of data scraping.
  • You will continue learning HTML, and you will learn CSS to make your web pages look better.
  • You will get a high-level view of how the Internet works and will study HTTP requests in more detail.
  • You will start learning Python web-scraping tools: the requests module and the BeautifulSoup library.
  • You will continue learning Python web-scraping tools: the Selenium library, which comes in handy with dynamic web pages.
  • You will study the concept of an API and practice on real-life examples.
  • You will learn how to create your own simple web service.
  • You will start working on the final project and will watch demonstration videos of the lecturer working on a separate project.
  • You will continue working on the final project and will watch more demonstration videos of the lecturer working on a separate project.
  • You will finish working on the final project and will watch more demonstration videos of the lecturer working on a separate project.
  • You will learn about the legal and ethical aspects of data scraping.
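The first outcomes in the list above, character encodings and regular expressions, can be illustrated with a short sketch. The sample strings and variable names are assumptions chosen for this example only. The code shows that the number of characters in a string differs from the number of UTF-8 bytes, that decoding with the wrong encoding silently produces garbled text, and that a regular expression can pull structured values out of free text.

```python
import re

# Hypothetical sample text mixing Latin, accented and Cyrillic characters.
text = "Café, Москва"

# Encoding turns characters into bytes; UTF-8 uses 1 byte for ASCII
# characters and 2 bytes for "é" and for each Cyrillic letter here.
utf8_bytes = text.encode("utf-8")
print(len(text))        # 12 characters
print(len(utf8_bytes))  # 19 bytes

# Decoding with the encoding that was used to encode restores the text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with a different encoding does not fail for Latin-1,
# but the result is garbled ("mojibake").
garbled = utf8_bytes.decode("latin-1")

# A regular expression that finds four-digit years starting with 19 or 20.
sentence = "Released in 1994, remastered in 2020/2021."
years = re.findall(r"\b(?:19|20)\d{2}\b", sentence)
print(years)            # ['1994', '2020', '2021']
```

In a scraping context, the encoding to use normally comes from the HTTP response headers or from the page itself. Guessing it wrong is a common source of garbled output.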
Course Contents

  • Character Encodings
  • Regular Expressions
  • HTML & CSS
  • Popular File Formats
  • Internet
  • Scraping HTML
  • Selenium
  • Web API
  • Web development 101
  • Project Demo 1
  • Project Demo 2
  • IMDB Visualization
Assessment Elements

  • non-blocking Programming Assignments
  • non-blocking Quizzes
  • non-blocking Staff-Graded Assignments
Interim Assessment

  • Interim assessment (2nd semester)
    0.58 * Programming Assignments + 0.17 * Quizzes + 0.25 * Staff-Graded Assignments
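As a worked example of the formula above, the following sketch computes an interim grade from three component scores. The function name, the dictionary of weights and the sample scores are hypothetical, and the sketch assumes that all three components are graded on the same scale (for instance, 10 points). The syllabus does not state any rounding rule, so none is applied.

```python
# Weights taken from the interim assessment formula.
WEIGHTS = {
    "programming": 0.58,
    "quizzes": 0.17,
    "staff_graded": 0.25,
}


def interim_grade(programming, quizzes, staff_graded):
    """Weighted sum of the three assessment elements (no rounding)."""
    return (
        WEIGHTS["programming"] * programming
        + WEIGHTS["quizzes"] * quizzes
        + WEIGHTS["staff_graded"] * staff_graded
    )


# Hypothetical scores, assumed to be on a 10-point scale:
# 0.58 * 8 + 0.17 * 6 + 0.25 * 9 = 4.64 + 1.02 + 2.25 = 7.91
grade = interim_grade(programming=8, quizzes=6, staff_graded=9)
print(f"{grade:.2f}")
```

Because the weights sum to 1, equal scores in all three components give that same score as the interim grade.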
Bibliography

Recommended Core Bibliography

  • Dynamic HTML: The Definitive Reference, Goodman, D., 2007

Recommended Additional Bibliography

  • Classification, Clustering, and Data Analysis: Recent Advances and Applications, 2002