HexHive PhD, MSc, BSc projects

This is a list of possible open, unassigned BSc or MSc research projects in the HexHive group for EPFL students.

The projects are designed to be adjustable and scalable according to the type of BSc, MSc, or short PhD research project depending on the depth of the evaluation and exploration. For all projects we expect an open mind, good coding skills (especially in C/C++), and the willingness to learn. Previous experience with compilers (LLVM), build systems, and reverse engineering helps but is not required.

If you are interested in any of these topics then contact the doctoral students assigned with the project. Please start your email with "[(MSc|BSc) (Project|Thesis)]", e.g., "[MSc Project] LLVM-based code analysis". Please explain your background such as particular coding skills or classes you took in sufficient detail, also highlight why you're interested in the project.

In the HexHive we welcome independent ideas from students as well, as long as they focus on the topics of system and software security, especially (but not exclusively) in sanitization of unsafe code, interactions between different components, mitigations, compiler-based security analyses, or fuzzing. So if you have an idea for your own project let us know and we can discuss! Reach out to the people most closest to your ideas and we'll let the ideas bubble up.

GUI Fuzzing
  • Point of contact: Mathias Payer
  • Keywords: Android, GUI, fuzzing

Fuzzing has recently seen a lot of use when analyzing libraries and different CLI applications. So far, GUI applications have remained elusive due to the complexity of interactions and different entry points into the code. Especially the complex interaction patterns with the code makes fuzzing GUI applications challenging. Additionally, a driver needs to create events for the individual aspects of the application. Similarly, modern applications frequently interact with a server whose state may be unknown.

The key research questions in this project are:

  • How can we design a fuzzer that handles event-based systems
  • When fuzzing event-based systems, how can the input generation keep track of fired events and interactions with the GUI app
  • If the app has a remote component, how can we model the abstract space of that component

After reading into the topic of fuzzing and event-based state machines, the student will develop a model for event-based fuzzing with GUI interactions as the key part of the project. To evaluate the project, the student will then evaluate the newly generated fuzzer using a set of different GUI applications, targeting the exploration of Linux GUI apps. As a second goal, if time permits, the student will extend the exploration to apps that have remote components.

A candidate should be interested in (or familiar with):

  • GUI programming
  • Android
  • C/C++ programming and event-based systems
Google-Apple Exposure Notification (GAEN) Experiments

Google and Apple joined their efforts to develop a digital Exposure Notification system (GAEN) to help trace and slow-down the spread of SARS-CoV-2 (which can cause COVID-19). The GAEN system is based on several research efforts including the Decentralized Privacy-Preserving Proximity Tracing (DP3T) system developed by EPFL and other European research institutes.

GAEN makes use of Bluetooth Low Energy (BLE) and mobile applications to wirelessly register contacts between application users in proximity. All contacts are registered in a secure and privacy-preserving way and they allow to automatically notify application users that were in contact with COVID-19 positive individuals. GAEN provides a unified API for iOS and Android and the API is evolving according to the developers’ needs and user feedback. Several countries in the EU have already developed and shipped their contact tracing app (e.g., SwissCovid app in Switzerland, Corona-Warn-App in Germany, and Immuni app in Italy).

The GAEN-based contact tracing apps are used a huge set of heterogeneous smartphones. Smartphones’ diversity is due to different factors including the OS (e.g., Android and iOS), the Bluetooth chip (e.g., Broadcom and Qualcomm) and the supported Bluetooth version (e.g., 4.2 and 5.0). At EPFL we are running large scale measurement and calibration experiments to check that the GAEN API works as expected across different dimensions (OS, Bluetooth chip, Bluetooth Version, GAEN version). We are looking for motivated students interested in running such large-scale experiments. The results of those experiments are and will help to shape the GAEN API to stop the spread of SARS-CoV-2.

A candidate should be interested in (or familiar with):

  • Android and iOS
  • Bluetooth
  • Mobile app development

Recommended readings:

Bluetooth Security
  • Point of contact: Daniele Antonioli
  • Keywords: Bluetooth, BR/EDR, BLE, Attacks, Defenses, Testing, RE, IoT

Bluetooth is a pervasive technology for low power wireless communications, and it is used every day by billions of devices, including smartphone, laptops, wearables, cars, medical devices, industrial devices, etc. Bluetooth is specified in an open standard and it provides two main stacks: Bluetooth BR/EDR (aka Classic) and Bluetooth Low Energy (BLE). One (1) vulnerability in the Bluetooth standard results into billions of exploitable devices (e.g. the KNOB attack).

We are interested in all security aspects of Bluetooth BR/EDR and BLE including attacks, defenses, reverse engineering of closed source (proprietary) software and hardware, static and dynamic analysis and testing/fuzzing. A project will likely cover one of the listed aspects but according to the student interest and progresses we can cover more.

A candidate student should be familiar with:

  • Bluetooth Host Controller Interface (HCI), e.g, HCI commands, HCI events, and the HCI protocol.
  • Bluetooth security mechanisms, e.g, pairing, authentication, session establishment, and association
  • Network analysis tools, e.g., ifconfig wireshark
  • Bluetooth tools such as hciconfig, hcitool

And ideally have some knowledge about:

  • Bluetooth host stack codebase and API, e.g., BlueZ
  • Bluetooth controller stack, e.g, LM and LMP
  • Internalblue RE toolkit
  • Bluetooth SoC and development boards

Recommended reading:

  • The KNOB is Broken: Exploiting Low Entropy in the Encryption Key (Daniele Antonioli et al)
  • Low Entropy Key Negotiation Attacks on Bluetooth and Bluetooth Low Energy (Daniele Antonioli et al)
  • Guide to Bluetooth Security (NIST)
Reverse-engineering CPU microarchitecture
  • Point of contact: Atri Bhattacharyya
  • Keywords: microarchitectural attacks, side channels

A number of microarchitectural side-channel attacks have emerged since 2018, of which Spectre and Meltdown surfaced in January. A critical step in all of these attacks has been a reverse-engineering effort to determine the exact behavior of components of the microarchitecture: the cache organization and replacement policy, the branch prediction algorithm and structures.

We would need to characterize the behavior of these components by designing experiments which target specific behavior, and can leak the structure of internal components. An example is the associativity of caches. An expt which leaks this information is the size of eviction sets. By accessing a set of memory addresses which we increase until we have cache evictions, we can figure out associativity. By modifying the access patterns, we can also infer the cache’s eviction policy. A great example on reverse engineering branch predictors on older Intel processors is this blog post series.

We propose a semester project for masters students or dedicated bachelors students in which we tackle this challenge. An ideal candidate have taken a course on computer architecture and be familiar with the microarchitectural components. Experience with coding in x86 or ARM assembly will be very useful (but is not essential).

API flow graph sanitization
  • Point of contact: Nicolas Badoux
  • Keywords: sanitizer, API graph, component analysis, compiler analysis

Software components (e.g., libraries) often expose a complex API. When using a component, sequences of API calls will build up deep API state that is custom to the component. These API calls are often highly complex and dependent on each other. API call sequences for high level languages (Java) have been analyzed in software engineering but C/C++ libraries so far were neglected due to lack of type information.

In this project, we will build an API inference graph that, using a whole system analysis, collects all uses of a library (starting with a simple file-based search but potentially moving to a compiler-based analysis). This API graph encodes dependencies between individual API calls (e.g., read can only be called after open) and parameters for API calls (e.g., open may specify if the file can be written or not). The API graph encodes all possible API interactions and defines the parameter space for each method.

Based on the API graph we will then search for violations of the inferred values in real applications by implementing a sanitizer that tracks the API state and searches for violations of the valid state.

Code pruning and debloating
  • Point of contact: Tesic Uros
  • Keywords: debloating, TCB reduction, code analysis, reflowing software

Current systems leverage libraries to facilitate code reuse across processes. If a program uses data structures or functions from a library then the runtime linker connects those data structures when the program is instantiated. The intention of using shared libraries was to reduce memory pressure across processes. Nowadays where the amount of loaded code is dwarfed by the amount of data it is time to revisit this assumption.

In this project we will develop a process to statically link binaries, removing any dependencies to other libraries. Libraries will be compiled as LLVM bitcode and inlined during the LTO/thinLTO process. We will measure performance of the newly compiled binaries as well as code reduction and security metrics, e.g., the increased precision of Control-Flow Integrity due to the decreased amount of targets.

Evaluation targets are: security (reducing the amount of code, target locations, e.g., for CFI, and code pointers in general) and performance (relative accesses instead of GOT access, inlining, map pages/TLB optimization)

Incremental fuzzing and sanitization

Dynamic software testing is currently an all or nothing approach. Software is compiled with sanitization and fuzzing and then evaluated. As software is developed in small steps, it would be advantageous to fuzz the delta that was added. This project will focus on developing a software testing environment that keeps track about areas that were already fuzzed or tested and which test cases had what coverage. This allows that, whenever a part of the software is changed, we can focus on that new part.

  • Develop a mechanism to track coverage across input and sanitizer checks
  • Correlate the coverage to specific areas of software
  • Focus testing on tests that explored the changed area and see if you can focus testing on only the changed part

Evaluation targets are common software repositories for open source projects.

Evaluating control-flow hijacking defenses
  • Point of contact: Tesic Uros
  • Keywords: systematization of knowledge, control-flow hijacking, security benchmark

Buffer overflows, use after free attacks, and generally control-flow hijacking due to memory safety and type safety errors have been the cause of countless security vulnerabilities that lead to exploitation of software. Over the past 20 years a large amount of mitigations has been proposed that mitigate control-flow hijacking attacks. The only solution to prohibit the exploitation of this attack vector is memory safety, all other defenses may leave some avenue for attack.

We will evaluate different control-flow hijacking mechanisms such as Control-Flow Integrity, executable but unreadable code, and other mechanisms, comparing them along different metrics. The goal of this project is to evaluate the power of the different mitigations and to rank them according to completeness, power, overhead, or other systematic weaknesses.

Secure Memory Allocator
  • Point of contact: Ahmad Hazimeh
  • Keywords: memory allocation, memory corruption, heap protection

Memory allocators are primarily geared towards performance. Orthogonal to performance, memory allocators can harden a program against memory safety bugs. Use after free and other buffer overflows are common mistakes that compromise adjacent data structures on the heap. In this project we evaluate memory allocators and design a secure memory allocator alternative. The project consists of the following steps:

  • Design a benchmark for memory allocators
  • Record allocations/deallocations and writes of benchmarks, allow replay
  • Evaluate size classes according to object sizes For the analysis of existing programs, we will check number of allocation sites, number of alloc sizes, number of alloc sizes per site. As benchmarks we will leverage SPEC CPU2017 and parallel programs.
  • Make the allocator C++ aware! (according to type classes)
  • Use classes of allocations (depending on location/type) As part of the optimization, we will discuss how to store the metadata, e.g., per thread, per allocation size, or using a mixed data structure.

Discussion points for the memory allocator are: (i) if meta-data and malloc’d data is separated, (ii) (configurable) canaries between blocks, (iii) (configurable) overwrite on free, (iv) (configurable) no-reuse policy, (v) as a performance optimization: leverage guard pages and efficiently allocate them.

Leveraging Intel PT to enforce online dispatch safety

Intel PT is a technology that allows recording of control flow decisions and evaluation of them on a different thread. The idea of this project is to implement a security monitor on top of Intel PT. By following certain jumps, we can encode messages and send them to the monitor to keep track of them.

  • Encode object allocations into indirect (but controlled) calls to magic
  • location so that they show up in the PT trace.
  • Read out PT trace from second core to build up object graph
  • Translate virtual dispatch to include the object location that is dispatched on (i.e., including it in the PT trace)
  • Verify integrity on second core, pause before I/O or state system calls so that second core can catch up

This approach allows us to keep a shadow object graph in a validator process and check every virtual dispatch at low overhead. The policy is the same as CPI but at much lower overhead as pointer writes no longer need to be checked inline in the original process.

Binary vulnerability inference
  • Point of contact: Ahmad Hazimeh
  • Keywords: binary analysis, vulnerability study, pattern-based search, measurement study

Deployed programs may include third party code in inlined form or as customized libraries. Keeping these external code repositories up to date is challenging. Ideally, the program would be updated immediately with the underlying repository. Unfortunately, these updates may be forgotten. In this project, we want to study the prevalence of such misuses and forgotten updates.

  • First, we have to detect the library version and compilation settings by developing a binary analysis that leverages features of a binary to detect its version.
  • Classify vulnerabilities and vulnerable code according to code patterns by evaluating libraries and creating a ground truth data set of code patterns and vulnerabilities.
  • Search for the classified vulnerabilities and code patterns in binaries
  • Study on packaged/prepackaged libraries and track vulnerability provenance

Note that this project could also focus on Android applications.

Information Hiding: A New Hope
  • Point of contact: Ahmad Hazimeh
  • Keywords: information hiding, kernel, systems security

Many mitigations require some form of meta data to check the integrity of the process and to detect violations. This defines two privilege domains with the application data and the security data. Bugs in the code that computes on application data should not be allowed to access or modify the security data. Information hiding allows low-overhead fault isolation by separating the two types of data and making it unlikely that the data can be accessed. As there is no actual separation, correctly guessing the address of the secure data would allow application code to overwrite secure data. Information leaks are therefore a problem, guessing oracles or allocation oracles probe for the secret area.

  • Study existing attacks against information hiding and their weaknesses.
  • Create a kernel module and protocol that allows the allocation of safe metadata areas (the high bits on x64 are reserved to the kernel; in this vast address space we can hide some memory pages; these pages can be used to store the secure metadata). Allow allocation of memory areas in high level memory. Extend syscall of clone to allocate safe stack when creating a new thread Add syscall to allocate a new hidden memory area Add syscall to disable this feature (on per-thread basis)
  • Develop a compiler pass to move meta data to this memory area.
  • Evaluate: safe stack, OpenSSL to store keys, CFIXX, or type safety

Key idea: Allocate the hidden memory areas in a safe space. High level memory reserved for kernel is the perfect place, most is unmapped and traps. The attacker cannot probe for this area.

Other projects

Several other projects are possible in the areas of software and system security. We are open to discuss possible projects around the development of security benchmarks, using machine learning to detect vulnerabilities, secure memory allocation, sanitizer-based coverage tracking, and others.