Mining Massive Datasets


Please note the new location for the tutorial (room MW 0001)!

Data has always supported research, but recently there has been a paradigm shift in the way it is used. Today, researchers and practitioners mine data for patterns and trends that lead to new hypotheses. This shift is driven by the huge volumes of data available from sources such as social media websites, web query logs, sensors, and medical devices. "Big Data" has become established as an umbrella term for such high-volume, complex data.

In this course, you will learn data mining and machine learning techniques for processing large datasets and extracting valuable knowledge from them. We will study modern computing frameworks for large-scale data analytics (e.g., Apache Hadoop/Spark) as well as models and algorithms for pattern detection in large data. In particular, we will discuss methods designed for today's complex data, such as networks and temporal data. The practical relevance of these methods will be illustrated by several important applications, such as fraud detection, recommendation, and community detection.

The preliminary syllabus of the course is as follows:

  • Introduction
    • Data Mining and Knowledge Discovery Process, Machine Learning
    • Applications, Tasks
  • Hashing & Sketches
    • Similarity Search
    • Min-Hashing, Locality Sensitive Hashing
    • Bloom Filter
  • Dimensionality Reduction & Matrix Factorization
    • Feature Selection & Random Projections
    • PCA / SVD
    • Non-Negative Matrix Factorization and Extensions
  • (Distributed) Optimization
    • Unconstrained / Constrained Optimization
    • Convex Optimization
    • (Stochastic) Gradient Descent
  • Network Data
    • Laws/Patterns and Generators
    • PageRank and Extensions, HITS
    • Clustering/Community Detection, Spectral Clustering
    • Probabilistic Models: Inference, Distributed Learning, Models for Network Data
  • Temporal Data & Streaming
    • Sampling Techniques
    • Counting Distinct Elements
    • Estimating Moments
  • Systems & Tools
    • MapReduce and Extensions (e.g., Spark)
    • Big Learning Systems
    • Graph Processing Systems


  • Lecture: Wednesdays, 12:00pm - 1:40pm, room 00.13.009A
  • Exercise: Tuesdays, 2:00pm - 3:30pm, room MW 0001
  • For more details, see TUMonline
  • All course material will be made available via Moodle