Building File Systems and Distributed Data Management Systems for Performance and Reliability
File systems are a cornerstone of any computer system, supporting low-level data management, and distributed file systems play a critical role in providing file services to all kinds of applications. In recent years, management of data in a flat namespace, in the form of key-value (KV) stores, has become widely adopted for its simplicity and efficiency. These systems are often critical layers of the software stack in large-scale data centers supporting data-driven, Internet-wide services, in particular those handling big data. Provisioning these services, such as search, advertising, email, maps, video, chat, and blogging, entails collection, storage, and access of data as well as computation on the data. A unique challenge posed to the software infrastructure is that it runs on a very large number of mostly off-the-shelf hardware components, including processors, network adapters and routers, and disks. This has substantially changed the landscape of research and practice in large-scale distributed computing, which now must assume that failure is the norm rather than the exception. Accordingly, fault tolerance must take first priority in the design. Further, because of huge and ever-growing data sets and system scales, many other issues must also be re-examined to meet a system's requirements on reliability, scalability, availability, and efficiency. Understanding the design challenges, issues, scope, and state of the art is essential not only for systems researchers and practitioners, but also for application developers who access and process (big) data in the cloud.
This course has three sections. It first covers basic concepts and design techniques for file systems, including the data structures and algorithms used in fast file systems and log-structured file systems, as well as journaling and copy-on-write techniques. This is followed by a discussion of distributed file systems, covering issues such as replication, consistency, synchronization, and fault tolerance. The case studies examine well-known systems such as Google's GFS and Ceph. The last section focuses on key-value stores and shows how the issue of read and write amplification is addressed. Example stores include Google's LevelDB and SILT. The section concludes by introducing the LSM-trie KV store, one of the instructor's recent works, whose open-source code is available for learning and adoption.
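To give a flavor of the read and write amplification issue the KV-store section addresses, the following is a minimal, illustrative Python sketch of an LSM-style store. It is an assumption-laden toy, not the actual design of LevelDB, SILT, or LSM-trie: the class name `TinyLSM` and its thresholds are invented here, and real systems add write-ahead logs, Bloom filters, and leveled compaction.

```python
# Toy LSM-style key-value store (illustrative only).
# Writes land in a memtable, get flushed to sorted runs, and are
# rewritten again during compaction -- that repeated rewriting is
# write amplification; probing multiple runs on a lookup is read
# amplification.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}            # in-memory buffer for recent writes
        self.sstables = []            # sorted runs on "disk", newest first
        self.memtable_limit = memtable_limit
        self.bytes_written = 0        # counts every entry (re)written

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        run = dict(sorted(self.memtable.items()))
        self.sstables.insert(0, run)
        self.bytes_written += len(run)     # first rewrite: memtable -> run
        self.memtable = {}
        if len(self.sstables) > 2:
            self._compact()

    def _compact(self):
        # Merge all runs into one; newer runs override older entries.
        merged = {}
        for run in reversed(self.sstables):
            merged.update(run)
        self.bytes_written += len(merged)  # second rewrite: compaction
        self.sstables = [dict(sorted(merged.items()))]

    def get(self, key):
        # Read amplification: may probe the memtable and every run.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:
            if key in run:
                return run[key]
        return None
```

After writing six distinct keys, `bytes_written` exceeds six because each entry is written once at flush time and again at compaction; reducing this rewriting overhead (for example, via the trie-structured compaction of LSM-trie) is exactly the design problem the course examines.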
This course focuses on understanding issues, design choices, and problem-solving skills, rather than simply presenting concepts and facts about the systems. Students will acquire knowledge and skills that are highly relevant to today's big-data practices in the IT industry.
The course has three key objectives for the students:
1. To become aware of specific challenges and issues facing today’s big data processing systems from both system and workload perspectives.
2. To understand why and how some well-known systems address the issues they were designed to attack, as well as their relative weaknesses.
3. To gain hands-on experience working with a data management system.