HexHive research focus

The HexHive laboratory at EPFL researches novel technologies to protect applications in the presence of vulnerabilities along two dimensions. First, discovering exploitable bugs in source code through dynamic analysis techniques, building on sanitization and dynamic testing. Second, detecting the active exploitation of a vulnerability through lightweight mitigations. These technologies are integrated into both compiler toolchains (analyzing the code during compilation and integrating into the developer workflow) and binary rewriting platforms (allowing the protection of binary-only code, e.g., for libraries where the developer or user does not have access to the source code). All code developed in the HexHive laboratory is released as open source.

The implementation prototypes of all published papers are available as open-source on the HexHive GitHub page. We welcome any feedback and comments.

HexHive PhD, MSc, BSc projects

This is a list of available unassigned potential BSc, MSc, or PhD semester research projects in the HexHive group. The projects are designed to be adjustable and scalable according to the type of project (BSc, MSc, or PhD research project) depending on the depth of the evaluation and exploration. For all projects we expect an open mind, good coding skills (especially in C/C++), and the willingness to learn. Previous experience with compilers (LLVM), build systems, and reverse engineering helps but is not required.

If you are interested in any of these topics, then contact Mathias or any of the doctoral students. In the HexHive we welcome independent ideas from students as well, as long as they focus on the topics of system and software security, especially (but not exclusively) in sanitization of unsafe code, interactions between different components, mitigations, compiler-based security analyses, or fuzzing. So if you have an idea for your own project let us know and we can discuss!

API flow graph sanitization

Software components (e.g., libraries) often expose a complex API. When using a component, sequences of API calls will build up deep API state that is custom to the component. These API calls are often highly complex and dependent on each other. API call sequences for high level languages (Java) have been analyzed in software engineering but C/C++ libraries so far were neglected due to lack of type information.

In this project, we will build an API inference graph that, using a whole system analysis, collects all uses of a library (starting with a simple file-based search but potentially moving to a compiler-based analysis). This API graph encodes dependencies between individual API calls (e.g., read can only be called after open) and parameters for API calls (e.g., open may specify if the file can be written or not). The API graph encodes all possible API interactions and defines the parameter space for each method.

Based on the API graph we will then search for violations of the inferred values in real applications by implementing a sanitizer that tracks the API state and searches for violations of the valid state.

  • Keywords: sanitizer, API graph, component analysis, compiler analysis

Transformational Fuzzing

Fuzzing is experiencing a renaissance through coverage-guided greybox fuzzing. AFL and its descendants have been used to find thousands of security bugs. A fuzzing engine consists of an input generator and an execution engine. Coverage guided fuzzing collects coverage information (what parts of the program were executed how many times) from the execution engine to influence the input generator to increase coverage. Existing coverage-guided fuzzing engines often hit a so-called coverage wall where the required input changes are too complex and fuzzing no longer makes progress.

Instead of only mutating the program input, we have recently proposed transformational fuzzing that analyzes the program and mutates the program alongside the input to reach new coverage. Crashes in the mutated program then must be mapped back to the original program to ensure these are real crashes and not false positives, we call this component the crash analyzer.

After publishing the initial results we want to increase the precision of the mutation. The goal of this project are to develop a mutation engine and policy on how to mutate the program to ensure the best possible coverage expansion. Potential directions are the inference of non-conditional functions that may be pruned or co-dependent checks along an execution path that trigger violations. Orthogonally, mutational fuzzing should become context sensitive so that the program is mutated on the fly, depending on the execution context with the help of a feedback-guided process.

  • Keywords: fuzzing, program transformation, exploitation

LLVM-based T-Fuzzing

Similar to the previous project, this project extends T-Fuzz (see the first two paragraphs of the previous project proposal for context). The existing T-Fuzz prototype relies on binary rewriting and is restricted to if conditions (i.e., it cannot support switch statements). The goal of this project is to implement a T-Fuzzing engine on top of LLVM to support a flexible and (runtime) configurable way to readjust checks depending on a transformation policy. Each conditional program location would be extended with a switch that, when triggered would alternate the result.

  • Keywords: fuzzing, compiler-based transformation, exploitation

Automatic ISA/architecture inference

Modern CPU architectures are incredibly complex and the documentation of the architecture may be partial, incomplete, or completely lacking. Examples of architectural details are cache hierarchies, cache sizes, branch target buffers, branch predictor strategies, or cache associativity strategies. Knowledge of the underlying architecture can be used for, e.g., performance optimization, reasoning about the presence/absence of side channels, timing information.

The goal of this project is to develop a fully automatic mechanism that can infer the different layers of caches by generating a set of programs that test sets of hypothesis and thereby infer hardware configurations. As a continuation of this project we will categorize side channels and evaluate any newly found issues.

  • Keywords: hardware security, ISA inference, side channels, low level coding

Code pruning and debloating

Current systems leverage libraries to facilitate code reuse across processes. If a program uses data structures or functions from a library then the runtime linker connects those data structures when the program is instantiated. The intention of using shared libraries was to reduce memory pressure across processes. Nowadays where the amount of loaded code is dwarfed by the amount of data it is time to revisit this assumption.

In this project we will develop a process to statically link binaries, removing any dependencies to other libraries. Libraries will be compiled as LLVM bitcode and inlined during the LTO/thinLTO process. We will measure performance of the newly compiled binaries as well as code reduction and security metrics, e.g., the increased precision of Control-Flow Integrity due to the decreased amount of targets.

Evaluation targets are: security (reducing the amount of code, target locations, e.g., for CFI, and code pointers in general) and performance (relative accesses instead of GOT access, inlining, map pages/TLB optimization)

  • Keywords: debloating, TCB reduction, code analysis, reflowing software

Incremental fuzzing and sanitization

Dynamic software testing is currently an all or nothing approach. Software is compiled with sanitization and fuzzing and then evaluated. As software is developed in small steps, it would be advantageous to fuzz the delta that was added. This project will focus on developing a software testing environment that keeps track about areas that were already fuzzed or tested and which test cases had what coverage. This allows that, whenever a part of the software is changed, we can focus on that new part.

  • Develop a mechanism to track coverage across input and sanitizer checks
  • Correlate the coverage to specific areas of software
  • Focus testing on tests that explored the changed area and see if you can focus testing on only the changed part

Evaluation targets are common software repositories for open source projects.

  • Keywords: fuzzing, sanitization

Evaluating control-flow hijacking defenses

Buffer overflows, use after free attacks, and generally control-flow hijacking due to memory safety and type safety errors have been the cause of countless security vulnerabilities that lead to exploitation of software. Over the past 20 years a large amount of mitigations has been proposed that mitigate control-flow hijacking attacks. The only solution to prohibit the exploitation of this attack vector is memory safety, all other defenses may leave some avenue for attack.

We will evaluate different control-flow hijacking mechanisms such as Control-Flow Integrity, executable but unreadable code, and other mechanisms, comparing them along different metrics. The goal of this project is to evaluate the power of the different mitigations and to rank them according to completeness, power, overhead, or other systematic weaknesses.

  • Keywords: systematization of knowledge, control-flow hijacking, security benchmark

Deep trust chains may violate security expectations

Modern frameworks such as JavaScript, NodeJS/NPM, Docker, Python/PIP rely on a vast distribution of libraries with deep dependencies. Whenever you install a certain library or functionality, it may pull in a lot of other dependencies. The security of your application (and sometimes system) now depends on the security of all these additional libraries.

This project consists of two phases. In the first phase, you will compute the dependency graph across the system, marking individual dependency chains and evaluating common dependencies. In this phase we are interested in the number of dependencies and the different chains and we will evaluate different "hot" packages that are used across large amounts of installations.

In the second phase, we will evaluate the dependencies of common programs/libraries and study how these dependencies change over time. This allows us to assess the threat surface for given programs and evaluate given security risks.

  • Keywords: software ecosystem, components, dependency graph

Information leakage in blockchain protocols

Blockchains are currently a hot topic with many different implementations of distributed, append-only ledgers. At the core, each blockchain protocol relies on a consensus protocol on when and how new data can be appended to the global ledger. Any information that is appended will remain on the ledger forever. Protocols may be designed for efficiency and try to minimize the amount of storage space needed for a single transaction. Due to the fast evolving platforms, the security of implementations may be of secondary concern.

In this project, we will evaluate existing protocols for information leakage. Information leakage happens, for example, if a struct is not packed, leaking padding when an int follows a char. Other sources of information leakage could be due to weak random numbers that are used or different forms of padding. The first step is to identify different implementations, classify their protocols and data structures. In a second step, we will manually search for information leakage. In a third step, we will try to automate the search through a static analysis.

  • Keywords: blockchain, information leakage, static analysis.

Secure Memory Allocator

Memory allocators are primarily geared towards performance. Orthogonal to performance, memory allocators can harden a program against memory safety bugs. Use after free and other buffer overflows are common mistakes that compromise adjacent data structures on the heap. In this project we evaluate memory allocators and design a secure memory allocator alternative. The project consists of the following steps:

  • Design a benchmark for memory allocators
  • Record allocations/deallocations and writes of benchmarks, allow replay
  • Evaluate size classes according to object sizes For the analysis of existing programs, we will check number of allocation sites, number of alloc sizes, number of alloc sizes per site. As benchmarks we will leverage SPEC CPU2017 and parallel programs.
  • Make the allocator C++ aware! (according to type classes)
  • Use classes of allocations (depending on location/type) As part of the optimization, we will discuss how to store the metadata, e.g., per thread, per allocation size, or using a mixed data structure.

Discussion points for the memory allocator are: (i) if meta-data and malloc'd data is separated, (ii) (configurable) canaries between blocks, (iii) (configurable) overwrite on free, (iv) (configurable) no-reuse policy, (v) as a performance optimization: leverage guard pages and efficiently allocate them.

  • Keywords: memory allocation, memory corruption, heap protection

Leveraging Intel PT to enforce online dispatch safety

Intel PT is a technology that allows recording of control flow decisions and evaluation of them on a different thread. The idea of this project is to implement a security monitor on top of Intel PT. By following certain jumps, we can encode messages and send them to the monitor to keep track of them.

  • Encode object allocations into indirect (but controlled) calls to magic
  • location so that they show up in the PT trace.
  • Read out PT trace from second core to build up object graph
  • Translate virtual dispatch to include the object location that is dispatched on (i.e., including it in the PT trace)
  • Verify integrity on second core, pause before I/O or state system calls so that second core can catch up

This approach allows us to keep a shadow object graph in a validator process and check every virtual dispatch at low overhead. The policy is the same as CPI but at much lower overhead as pointer writes no longer need to be checked inline in the original process.

  • Keywords: mitigation, control-flow hijacking

Binary vulnerability inference

Deployed programs may include third party code in inlined form or as customized libraries. Keeping these external code repositories up to date is challenging. Ideally, the program would be updated immediately with the underlying repository. Unfortunately, these updates may be forgotten. In this project, we want to study the prevalence of such misuses and forgotten updates.

  • First, we have to detect the library version and compilation settings by developing a binary analysis that leverages features of a binary to detect its version.
  • Classify vulnerabilities and vulnerable code according to code patterns by evaluating libraries and creating a ground truth data set of code patterns and vulnerabilities.
  • Search for the classified vulnerabilities and code patterns in binaries
  • Study on packaged/prepackaged libraries and track vulnerability provenance

Note that this project could also focus on Android applications.

  • Keywords: binary analysis, vulnerability study, pattern-based search, measurement study

Information Hiding: A New Hope

Many mitigations require some form of meta data to check the integrity of the process and to detect violations. This defines two privilege domains with the application data and the security data. Bugs in the code that computes on application data should not be allowed to access or modify the security data. Information hiding allows low-overhead fault isolation by separating the two types of data and making it unlikely that the data can be accessed. As there is no actual separation, correctly guessing the address of the secure data would allow application code to overwrite secure data. Information leaks are therefore a problem, guessing oracles or allocation oracles probe for the secret area.

  • Study existing attacks against information hiding and their weaknesses.
  • Create a kernel module and protocol that allows the allocation of safe metadata areas (the high bits on x64 are reserved to the kernel; in this vast address space we can hide some memory pages; these pages can be used to store the secure metadata). Allow allocation of memory areas in high level memory. Extend syscall of clone to allocate safe stack when creating a new thread Add syscall to allocate a new hidden memory area Add syscall to disable this feature (on per-thread basis)
  • Develop a compiler pass to move meta data to this memory area.
  • Evaluate: safe stack, OpenSSL to store keys, CFIXX, or type safety

Key idea: Allocate the hidden memory areas in a safe space. High level memory reserved for kernel is the perfect place, most is unmapped and traps. The attacker cannot probe for this area.

  • Keywords: information hiding, kernel, systems security

Other projects

Several other projects are possible in the areas of software and system security. We are open to discuss possible projects around the development of security benchmarks, using machine learning to detect vulnerabilities, secure memory allocation, sanitizer-based coverage tracking, and others.