Bachelor
2021/2022
Distributed Databases and Data Warehouses
Category 'Best Course for Career Development'
Category 'Best Course for Broadening Horizons and Diversity of Knowledge and Skills'
Category 'Best Course for New Knowledge and Skills'
Type:
Elective course (Software Engineering)
Area of studies:
Software Engineering
Delivered by:
School of Software Engineering
Where:
Faculty of Computer Science
When:
3rd year, modules 3 and 4
Mode of studies:
offline
Open to:
students of one campus
Instructors:
Alexander Breyman
Language:
English
ECTS credits:
5
Contact hours:
60
Course Syllabus
Abstract
The course presents a detailed introduction to distributed data processing, relational data warehouses, multidimensional OLAP tools and massively parallel data processing systems (Hadoop, Cassandra, MongoDB). Students will develop an understanding of the design methodology for distributed databases and data warehouses. Practical work includes implementing databases and application software in the map/reduce paradigm and in several NoSQL data models.
Learning Objectives
- The objective of the course is to develop professional competencies related to the design and implementation of several kinds of distributed databases, including data warehouses, online analytical data processing and big data management tools. Students will learn the strengths and weaknesses of a wide spectrum of approaches to data storage, search and retrieval, so that they can make an informed choice of database model. The course studies different conceptual database models and their properties. The models discussed are: relational data warehouse; multidimensional data warehouse; online analytical processing; map/reduce massively parallel data processing; key/value, document, graph and wide-column database models; data stream processing. For each of these conceptual models the course concentrates on the following questions: Why was the database model introduced? Which shortcomings of other models does it address? What are the most important concepts and notions of the database model? How is the model implemented? What are the main techniques? Understanding the internals of a particular database model is essential, because the internals are closely connected to the model's limitations.
Expected Learning Outcomes
- Compare databases and data warehouses
- Construct data cleaning pipelines
- Construct MapReduce programs for execution under YARN
- Construct stream processing programs
- Create data models for Cassandra
- Create data models for streams
- Create document databases, fill them with data, and retrieve documents
- Create multidimensional models
- Describe implementation of in-memory database operations
- Describe storage details of in-memory databases
- Design data integration solutions
- Discuss applicability of in-memory databases
- Discuss Cassandra concepts and architecture
- Discuss the basic concepts of Data warehousing and OLAP technology
- Discuss the multidimensional model and present its main characteristics, components and operations
- Employ HDFS for file storage and retrieval
- Explain distributed database systems architecture and design
- Understand use cases for HDFS and Hadoop
- Use key/value data stores
- Write queries and statements in CQL
Course Contents
- Data Warehousing and Big Data Management
- Data Warehousing Architectures and Models
- Data Cleaning and Integration
- Key/Value and Document Databases
- Map/Reduce and Hadoop
- Large-scale Distributed Databases
- In-memory Databases
- Data Streams Management
Assessment Elements
- Work on Seminars
- Quiz
- Midterm Test
- Essay
- Exam: The exam is held in MS Teams. A Zoom meeting will run at the same time as a backup channel. Written exam. Asynchronous proctoring.
- Group Project
Interim Assessment
- 2021/2022 4th module: 0.07 * Quiz + 0.15 * Work on Seminars + 0.07 * Essay + 0.5 * Exam + 0.07 * Midterm Test + 0.14 * Group Project
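The weighted sum in the interim assessment formula can be checked with a short script. This is only an illustrative sketch: the function name `interim_grade` and the example scores are hypothetical, the scores are assumed to share one common scale, and no rounding is applied because the syllabus does not state a rounding rule.

```python
import math

# Weights copied from the interim assessment formula for the 4th module.
WEIGHTS = {
    "Quiz": 0.07,
    "Work on Seminars": 0.15,
    "Essay": 0.07,
    "Exam": 0.5,
    "Midterm Test": 0.07,
    "Group Project": 0.14,
}


def interim_grade(scores):
    """Return the weighted sum of the assessment element scores.

    `scores` maps each assessment element name to a numeric score.
    All scores are assumed to be on the same scale (an assumption,
    not something the syllabus states).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Sanity check: the weights in the formula add up to 1.
assert math.isclose(sum(WEIGHTS.values()), 1.0)

# Hypothetical scores, for illustration only.
example_scores = {
    "Quiz": 8,
    "Work on Seminars": 9,
    "Essay": 7,
    "Exam": 6,
    "Midterm Test": 8,
    "Group Project": 10,
}

print(f"{interim_grade(example_scores):.2f}")  # prints 7.36
```

Because the exam carries half of the total weight, a one-point change in the exam score moves the result by 0.5, while a one-point change in the quiz, essay or midterm test moves it by only 0.07.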
Bibliography
Recommended Core Bibliography
- Berg, B., Silvia, P., & Frye, R. (2016). SAP HANA : An Introduction (Vol. Fourth edition). Bonn: SAP PRESS. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1350145
- Golab, L., & Özsu, M. T. (2010). Data Stream Management. [San Rafael, Calif.]: Morgan & Claypool Publishers. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=440359
- Jukic, N., Vrbsky, S., & Nestorov, S. (2017). Database Systems : Introduction to Databases and Data Warehouses. Burlington, Virginia: Prospect Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562389
- Parsian, M. (2015). Data Algorithms : Recipes for Scaling Up with Hadoop and Spark. [Sebastopol, CA]: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1028927
Recommended Additional Bibliography
- Antony, B., Boudnik, K., Adams, C., Shao, B., Lee, C., & Sasaki, K. (2016). Professional Hadoop. Indianapolis, IN: Wrox. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1233763
- Carpenter, J., & Hewitt, E. (2016). Cassandra: The Definitive Guide : Distributed Data at Web Scale (Vol. Second edition). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1271661
- Deka, G. C. (2017). NoSQL : Database for Storage and Retrieval of Data in Cloud. Boca Raton, FL: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1521297
- Deshpande, T. (2014). Mastering DynamoDB. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=836700
- Doan, A., Halevy, A., & Ives, Z. G. (2012). Principles of Data Integration. [Waltham, MA]: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=465063
- Edward, S. G., & Sabharwal, N. (2015). Practical MongoDB : Architecting, Developing, and Administering MongoDB. [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1124206
- Fowler, A. (2015). NoSQL For Dummies. Hoboken, NJ: For Dummies. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=942547
- Ganti, V., & Das Sarma, A. (2013). Data Cleaning : A Practical Perspective. [San Rafael, California]: Morgan & Claypool Publishers. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=646833
- Haq, Q. M. R. U. (2016). Data Mapping for Data Warehouse Design. Amsterdam: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1115852
- Hows, D., Membrey, P., Plugge, E., & Hawkins, T. (2015). The Definitive Guide to MongoDB : A Complete Guide to Dealing with Big Data Using MongoDB (Vol. Third edition). [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1109647
- Hueske, F., & Kalavri, V. (2019). Stream Processing with Apache Flink : Fundamentals, Implementation, and Operation of Streaming Applications (Vol. First edition). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2102045
- Inmon, W., & Krishnan, K. (2011). Building the Unstructured Data Warehouse : Architecture, Analysis, and Design (Vol. First edition). Westfield: Technics Publications. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1005034
- Kimball, R., & Caserta, J. (2004). The Data Warehouse ETL Toolkit : Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Indianapolis, IN: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=124355
- Kimball, R., & Ross, M. (2013). The Data Warehouse Toolkit : The Definitive Guide to Dimensional Modeling (Vol. 3rd edition). Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=605991
- Korotkevitch, D. (2017). Expert SQL Server In-Memory OLTP (Vol. 2nd ed). Berkeley, CA: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1589522
- Krishnan, K. (2019). Building Big Data Applications. [N.p.]: Academic Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1892146
- Linstedt, D., & Olschimke, M. (2015). Building a Scalable Data Warehouse with Data Vault 2.0. Amsterdam: Morgan Kaufmann. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1065504
- Nabi, Z. (2016). Pro Spark Streaming : The Zen of Real-Time Analytics Using Apache Spark. [Berkeley, CA]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1174432
- Nelson, J. (2016). Mastering Redis. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1243702
- Garg, N. (2014). HBase Essentials. [N.p.]: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=906714
- Perkins, L., Redmond, E., & Wilson, J. R. (2018). Seven Databases in Seven Weeks : A Guide to Modern Databases and the NoSQL Movement (Vol. Second edition). Raleigh, N. C: Pragmatic Bookshelf. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1806794
- Rankins, R., Bertucci, P., Gallelli, C., & Silverstein, A. T. (2015). Microsoft SQL Server 2014 Unleashed. [N.p.]: Sams Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1601720
- Kienzler, R., Karim, M. R., Alla, S., Amirghodsi, S., Rajendran, M., Hall, B., & Mei, S. (2018). Apache Spark 2: Data Processing and Real-Time Analytics : Master Complex Big Data Processing, Stream Analytics, and Machine Learning with Apache Spark. Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1991793
- Shrivastava, A., & Deshpande, T. (2016). Hadoop Blueprints. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1364692
- Tae, K. H., Roh, Y., Oh, Y. H., Kim, H., & Whang, S. E. (2019). Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsarx&AN=edsarx.1904.10761