Code Generation for Data Processing

Contents

Code generation is a key technique for efficient program execution and data processing. This lecture will cover the following topics from a practical perspective with accompanying hands-on exercises:

  • Execution models of programs (interpretation, bytecode, machine code generation, etc.)
  • Program representations (source code, intermediate representations (IRs), different forms of bytecode)
  • Classical techniques of code generation
    • SSA and optimization techniques, exemplary described on LLVM-IR
    • Machine code generation: instruction selection and register allocation
  • Execution of programs in virtual machines (e.g., WebAssembly, BPF, JavaScript)
    • Sandboxing and optimizations for JIT compilation
  • Execution of database queries (e.g., SQL, data frame API)
    • Execution models and code representations
  • Execution of machine code/binary translation (e.g., RISC-V)
    • Specifics when translating machine code

Organization

  • Lecture with integrated exercises: Thu 10–14 (c.t., with break) in 02.11.018
  • Exercises will include hands-on programming tasks
    • (Basic) C++ knowledge is required
  • Language: English
  • Module: CIT3230001, 6 ECTS, Bachelor/Master elective
    • Area "Databases and Information Systems" for Informatics, Wirtschaftsinformatik/Information Systems, Informatics: Games Engineering, Biomedical Engineering
    • B1.2 "Advanced Topics in Data Engineering" for Data Engineering and Analystics
  • Written exam (90 minutes), might change to oral on low registration count. Exam from winter 2022/23: [exam2223.pdf].
  • Zulip stream for this lecture; private contact via e-mail

Prerequisites

The course is aimed at bachelor/master students who have taken the following (or similar) courses:

  • IN0004 Introduction to Computer Architecture
  • IN0008 Fundamentals of Databases

Material

Material and exercises will be regularly provided throughout the semester.

Note: The schedule below is preliminary and is likely to change during the semester.

DateLecture TopicHomework
17.10. Overview, Motivation, Interpretation Techniques
Exercise: (no exercise session)
24.10. Compiler Front-end
Exercise: TBD/discussion of homework
31.10. IR Concepts, Control Flow Graph, SSA Construction
Exercise: IR design
07.11. LLVM-IR and IR Design Considerations
Exercise: LLVM hands-on
14.11. Analyses and Transformations
Exercise: writing an LLVM-IR pass
21.11. Vectorization
Exercise: auto-vectorization
28.11. Instruction Selection
Exercise: TBD
05.12. (no lecture/exercise, Dies Academicus)
12.12. Register Allocation
Exercise: SSA destruction
19.12. Object Files, Linker, Loader
Exercise: ELF file construction
09.01. Unwinding and Debuginfo
Exercise: unwind info generation
16.01. JIT-compilation and Sandboxing
Exercise: WebAssembly hands-on
23.01. Query Compilation
Exercise: SQLite hands-on
30.01. Binary Translation
Exercise: QEMU hands-on
06.02. TBD (possible guest lecture)
Exercise: TBD/wrap up