Datenbanksysteme und moderne CPU-Architekturen

Information

Content

  • This lecture covers the implementation of database systems, including how to leverage modern hardware architectures.
  • The lectures are held in English.
  • In the Exercises for this lecture, you will have the chance to build a tiny database system from scratch.

Prerequisites

  • Grundlagen der Informatik
  • Grundlagen Datenbanksysteme (GDB) (IN0008)

Organization

  • First Lecture: Tuesday, April 25th 2017
  • Lecture + Exercises: 2pm-5pm
  • Bonus: 75% required for the 0.3/0.4 grade bonus
  • Room: 02.09.014 (seminar-room)

Literature

  • Theo Härder, Erhard Rahm. Datenbanksysteme: Konzepte und Techniken der Implementierung. Springer, Berlin; 2nd ed.
  • Hector Garcia-Molina, Jeff Ullman, Jennifer Widom. Database Systems: The Complete Book
  • D. E. Knuth. The Art of Computer Programming Volume III
  • Joseph M. Hellerstein, Michael Stonebraker, James Hamilton. Architecture of a Database System

Lecture Slides



Project

  • Please read this page for some general notes on the programming assignments.
  • Please provide (minimal) documentation and comment your code.
  • You should work on the project in teams of two students. (The first task can be done alone.)
  • Send an email to kohna@in.tum.de with a zip/tar.gz solution file. Please use the prefix [dbimpl] in your submission email’s subject.

Task 1: External Sort (due: May 2, 10am)

  1. Write a function void externalSort(int fdInput, uint64_t count, int fdOutput, uint64_t memSize) that sorts count 64 bit unsigned integer values stored in the file referred to by the file descriptor fdInput using memSize bytes of main memory and stores the result in the file associated with the file descriptor fdOutput. Your function should implement the external merge sort algorithm and should perform a k-way merge during the merge phase, i.e. merge k runs together at once. To sort individual runs, you may use STL’s std::sort. To manage the k-way merge, the STL std::priority_queue (from std:queue) may be helpful. Useful system calls are open/cose, read/write, pread/pwrite, posix_fallocate.
  2. Write a test case that sorts 5GB of data and verifies the order of the output. The command-line interface must be sort inputFile outputFile memoryBufferInMB. Use the input file generator for testing. Your data format must adhere to the format specified in the program.

Task 2: Buffer Manager (due: May 16, 10am)

Write a basic buffer manager that manages buffer frames and controls concurrent access to these frames. The buffer manager should offer the following functionality:
  • BufferManager::BufferManager(unsigned pageCount)
    Create a new instance that keeps up to pageCount frames in main memory.
  • BufferFrame& BufferManager::fixPage(uint64_t pageId, bool exclusive)
    A method to retrieve frames given a page ID and indicating whether the page will be held exclusively by this thread or not. The method can fail (by throwing an exception) if no free frame is available and no used frame can be freed. The pageID variable is split into a segment ID and the actual page ID. Each page is stored on disk in a file with the same name as its segment ID (e.g., "1").
  • void BufferManager::unfixPage(BufferFrame& frame, bool isDirty)
    Return a frame to the buffer manager indicating whether it is dirty or not. If dirty, the page manager must write it back to disk. It does not have to write it back immediately, but must not write it back before unfixPage is called.
  • void* BufferFrame::getData()
    A buffer frame should offer a method giving access to the buffered page. Except for the buffered page, BufferFrame objects can also store control information (page ID, dirtyness, ...).
  • BufferManager::~BufferManager()
    Destructor. Write all dirty frames to disk and free all resources.
Your buffer manager should have the following features:
  • High performance. Release locks as early as possible.
  • Concurrency: It should be able to handle concurrent method invocations efficiently (e.g. using pthread_rwlock_t). Requests to fixPage should block until the requested access (exclusive or shared) can be fulfilled.
  • Buffering: It should use a buffer of size frames to keep pages in memory as long as possible. If no free frames are available, old frames should be reclaimed using some reasonable strategy.
Your buffer manager does not need to flush of pages to disk asynchronously or prefetch pages that are likely to be accessed in the near future. Use this test program to validate your implementation.

Task 3: Slotted Pages (due: May 30, 10am)

Create a metadata/schema segment for your database system. For each relation store its name, segment id, size in pages, and all its attributes and types. You should be able to serialize and deserialize the meta data segment to/from disk. Always store it in segment 0, e.g. in a file called 0. If you want to able to read SQL schemas, you may find the schema parser and the schema example file useful. Alternatively, you can simply create the schema programmatically.

Implement slotted pages for your database system.
  1. Define a segment type SPSegment that operates on slotted pages. A slotted page consists of three parts: A header, the slots and the (variable-length) records. Records are addressed by TIDs (tuple identifier), consisting of a page ID and a slot ID.
  2. Provide an interface to insert, remove, update and lookup "records". Records consist of a size and the data (you could also think of them as strings: they contain a length indicator len followed by a pointer to len characters).
    TID SPSegment::insert(const Record& r)
    Searches through the segment’s pages looking for a page with enough space to store r. Returns the TID identifying the location where r was stored. Note: This can be implemented much more efficiently with a free space bitmap as described in chapter 3, slide 3, but you are not required to do this.
    bool SPSegment::remove(TID tid)
    Deletes the record pointed to by tid and updates the page header accordingly.
    Record SPSegment::lookup(TID tid)
    Returns the read-only record (Record.hpp) associated with TID tid.
    bool SPSegment::update(TID tid, const Record& r)
    Updates the record pointed to by tid with the content of record r.

Test your implemention, for example with this slotted pages test skeleton