Principles of Data-Intensive Systems

This course covers the architecture of modern data storage and processing systems, including relational databases, cluster computing systems, streaming systems, and machine learning systems. Topics include database system architecture, storage, query optimization, transaction management, fault recovery, and parallel processing, with a focus on the key design ideas shared across many types of data-intensive systems.

Location: Tuesday/Thursday 3:15 PM in Skillaud or on Zoom. We're on Zoom only for the first two weeks; join our meeting from the Zoom tab in the Canvas sidebar.

Announcements and Questions

We're using Ed Discussion for all course communications. You can access it from the sidebar in Canvas.

Course Staff

Instructor: Matei Zaharia (Office hours: by appointment, please email matei@cs.stanford.edu).

Course Assistants and Office Hours:

Currently, all office hours are held online over Zoom; as the quarter progresses, we may move some office hours in person, depending on the situation.

To join the Zoom meetings, you must be logged into Zoom with your Stanford email. We use QueueStatus to manage the queue: https://queuestatus.com/queues/1786. When entering the queue, please write a bit about the problem you're having and what you've done so far to address it.

The schedule is as follows:

Silvia Gong (Office Hours: Monday 4-5:30, email: silvgong@stanford.edu)

Deepti Raghavan (Office Hours: Tuesday 5-6:30, email: deeptir@stanford.edu)

Tina Li (Office Hours: Friday 4:30-6, email: tinally@stanford.edu)

Sanjari Srivastava (Office Hours: Saturday 1-2:30, email: sanjari4@stanford.edu)

Schedule

1/4 Introduction
1/6

Database System Architecture

Reading: A History and Evaluation of System R (You can skip the sections on pages 643-644 called "The Recovery Subsystem", "The Locking Subsystem", "The Convoy Phenomenon" and "Additional Observations". Question: Why did the storage system change from inversions to B-trees?)
Optional Reading: How to Read a Paper

1/11

Database System Architecture 2 & Storage

Assignment 1 Posted

1/13 Storage Formats and Indexing
1/18

Storage Formats and Indexing 2

Reading: Integrating Compression and Execution in Column-Oriented Database Systems (Question: How might the conclusions of the paper change if it ran on an NVMe SSD?)
Optional Reading: C-Store: A Column-Oriented DBMS

1/20 Query Execution
1/25 Query Optimization
1/27

Query Optimization 2

Reading: Spark SQL: Relational Data Processing in Spark (Question: Can you think of any limitations to the way Catalyst supports external extensions?)
Assignment 1 Due (at 11:59pm)
Assignment 2 Posted

2/1

Transactions and Failure Recovery

2/3

Guest Talk: Tianqi Chen on Abstractions for Machine Learning Compilation

Optional Reading: TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

2/8 Failure Recovery 2
2/10 Concurrency
2/14

Take-Home Midterm Posted (solutions)

You can do the midterm in any 2-hour window until midnight on 2/15.

Past midterms for reference: Winter 2021 (solutions), Winter 2020 (solutions), Spring 2019 (solutions)

2/15

No class, Take-Home Midterm Due (at midnight)

2/17

Concurrency 2

Reading: Granularity of Locks and Degrees of Consistency in a Shared Data Base (Read up to page 372. Question: Draw the data structures in a DBMS that holds a table sorted by primary key, with multiple pages. What locks have to be acquired to change one record's primary key?)

2/22

Concurrency 3 and Distributed Databases

Optional Reading: Time, Clocks and the Ordering of Events in a Distributed System
Assignment 2 Due (at 11:59pm)
Assignment 3 Posted

2/24

Distributed Databases 2

Optional Reading: Lessons from Internet Services: ACID vs. BASE

3/1

Cloud Database Systems

Optional Readings: Amazon Aurora, Dynamo, Delta Lake

3/3

Guest Talk: Rebecca Taft on CockroachDB: The Resilient Geo-Distributed SQL Database

Optional Reading: CockroachDB: The Resilient Geo-Distributed SQL Database

3/8

Streaming Systems

Optional Readings: The CQL Continuous Query Language, The Dataflow Model

3/10

Security and Data Privacy

Reading: Privacy Integrated Queries (Question: Describe a computation on data for which it's hard to provide any differential privacy.)
Optional Readings: Robust De-anonymization of Large Sparse Datasets, Opaque, Splinter

Assignment 3 Due (at 11:59pm)

3/18

Final Exam (Solutions)

Past finals for reference: Winter 2021 (solutions), Winter 2020 (solutions), Spring 2019 (solutions)

Prerequisites

Students should ideally have taken CS 145 and CS 161. If you haven't taken CS 145, the main thing we assume from it is knowledge of basic SQL syntax; you can also read a SQL tutorial to learn about this.

Lectures and Video Recordings

Lectures for the class will be given live on Zoom and recorded. You can find our Zoom link and the lecture recordings on Canvas. Please note that these recordings might be reused in other Stanford courses, viewed by other Stanford students, faculty, or staff, or used for other education and research purposes. If you have questions about video recording, please contact a member of the teaching team.

Assignments and Exams

We will have three programming assignments and two exams. The programming assignments are designed to be runnable on your personal machine and should be submitted through Gradescope.

The exams are open-book, meaning that you can use your course notes, slides, books, or online resources. However, communication with others is not allowed during an exam (e.g., you can't ask a question on Stack Overflow or contact another student). Exams will cover material from the lectures, readings, and assignments.

Readings

We have occasional readings for the lectures. We expect students to complete these and think about the questions we list for each paper on their own (you do not need to turn in answers). Our exams will cover content in the readings.

Optional Textbook

Database Systems: The Complete Book (2nd Edition), by Garcia-Molina, Ullman and Widom, covers a lot of the technical material in the course and may be helpful as a study guide. We focus on chapters 13-20. We will also cover the material in lectures, but this book is a good source of additional information.

Grading

  • Assignments: 16% each (total: 48%)
  • Midterm: 22%
  • Final: 30%
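
As a worked example of how these weights combine, here is a minimal Python sketch. It is only an illustration: the weights come from the list above, but the function name `course_score`, the 0-100 score scale, and the sample scores are assumptions made for this example, and the staff may compute final grades differently.

```python
# Weights taken from the Grading list above, as fractions of the course grade.
WEIGHTS = {
    "assignment1": 0.16,
    "assignment2": 0.16,
    "assignment3": 0.16,
    "midterm": 0.22,
    "final": 0.30,
}


def course_score(scores):
    """Return the weighted course score, assuming each component is scored 0-100.

    `scores` maps each key of WEIGHTS to that component's score (hypothetical format).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up sample scores, purely for illustration.
example = {
    "assignment1": 90,
    "assignment2": 85,
    "assignment3": 100,
    "midterm": 80,
    "final": 75,
}
print(round(course_score(example), 2))
```

With these sample scores, the assignments contribute 0.16 × (90 + 85 + 100) = 44.0, the midterm 0.22 × 80 = 17.6, and the final 0.30 × 75 = 22.5, for a total of 84.1.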

Late Policy

Students each have up to 2 late days that they may use for assignments. Assignments submitted after these late days have been used up will incur a penalty of 10% per extra day late. In addition, we will not accept submissions after March 19th at 11:59 PM Pacific to give the staff enough time for grading.
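
The late-day arithmetic can be sketched as follows. This is a hedged illustration, not the staff's grading script: it assumes lateness is counted in whole days, that any remaining free late days are applied automatically before a penalty kicks in, and that "10% per extra day" means 10 percentage points of the assignment's maximum score per day. The function name `apply_late_policy` and its parameters are hypothetical.

```python
FREE_LATE_DAYS = 2        # total free late days per student, from the policy above
PENALTY_PER_DAY = 0.10    # 10% per extra day late, from the policy above


def apply_late_policy(days_late, free_days_left):
    """Return (penalty_fraction, free_days_used) for one submission.

    Assumption: free late days are consumed first; each remaining late day
    costs PENALTY_PER_DAY of the assignment's maximum score, capped at 100%.
    """
    if days_late < 0 or free_days_left < 0:
        raise ValueError("days_late and free_days_left must be non-negative")
    free_days_used = min(days_late, free_days_left)
    extra_days = days_late - free_days_used
    penalty = min(1.0, PENALTY_PER_DAY * extra_days)
    return penalty, free_days_used


# A submission 3 days late with both free late days still available:
# 2 days are covered by late days and 1 extra day is penalized.
penalty, used = apply_late_policy(days_late=3, free_days_left=FREE_LATE_DAYS)
print(penalty, used)
```

Note that this sketch does not model the hard cutoff: submissions after March 19th at 11:59 PM Pacific are not accepted regardless of late days.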
