Thesis Topics

If you are interested in one of these topics contact me with your transcript and some information about you and your motivation. Topics marked with BA are suitable for both, a bachelor thesis and a guided research. Topic marked with MA are suitable for a masters thesis.

[BA,MA] Network Benchmarks

Implement and compare a bunch of ways to transfer data over networks fast. You can support both TCP/IP and RDMA. Evaluate bandwidth, latency, and CPU usage for several approaches. Technologies to compare: OpenMPI, boost::asio, io_uring, RDMA, ...

[BA, MA] Distributed Protoype

Implement a protoype for a database system that can execute queries in a distributed fashion. Build distributed join and aggregation operators. Connect nodes with data shuffle operators. Add disaggregated storage, to load data from. The system does not have to be as capable or fast as state-of-the art database systems, but should be suitable for some benchmarks.

[BA, MA] Prune DPccp

Prune the search space of DPccp with an upper bound for the final cost. This could be obtained, e.g. by running GOO. Prune all entries of the DP table, that exceed this upper bound. This should make things faster.

[BA, MA] Block Based Cardinality Estimation

Find a semantic representation for data blocks (instead of relations). Use this representation to compute cardinality estimates for operations. This might be done using Deep Learning, but unlike previous work, do not train a model that predicts the cdf of data, but train a model that converts data into a semantic representation. Then train other models, that take semantic representations of any kind and output new semantic representations or cardinalities directly. This should be much faster than training new models all the time.

Previous topics that are no longer available

[MA] Pipeline Performance Prediction (TAKEN)

Common cost models for relational query optimization use the number of tuples of intermediate results. This has proven to be more robust and sufficient for in-memory database systems. For some optimization considerations (e.g. considering distributed computation with limited network bandwidth instead of local computation), this is not sufficient. Creating a cost model that reliably predicts the execution time of queries has proven to be a hard problem in the past [1].

Beischl et al. presented a thorough profiling method for our database system Umbra [2]. Using this detailed information you should create a operator based model that tries to accurately predict running time of individual pipelines.

[BA, MA] Dynamic Programming with Distribution Annotations and Graceful Scaling (TAKEN)

Dynamic Programming (DP) is a powerful method for optimal query optimization. As the problem size becomes larger, it becomes unfeasible to perform DP without restricting the search space in some way [3].

In this work you should model different data layouts for distributed query processing as interesting properties. You should consider the cost of query plans using different data distributions for intermediate results. (Possible are: Present on single node, broadcast to all nodes, hash partitioned and distributed by attribute, ...) This will increase the search space exponentially. You should evaluate the performance of a DP optimizer using these distribution annotations. Then you should think about when and how the search space could be restricted to make this approach scale to larger queries.

Haffner et al. reduced the join ordering problem to the shortest path problem in a graph [5].

In this work you should extend their search strategy using variants of beam search (e.g. bi-directional search, early pruning, ...). You should analyze the result quality and runtime trade-off of these variants. Additionally, you should evaluate the extensibility of this method, e.g. regarding adding interesting properties, with respect to its runtime.