A ground-truth fuzzing benchmark suite based on real programs with real bugs.
We’ve compiled a list of the most frequently asked questions:
This may be due to an "under-fit" trigger condition. The fixes written by the library developers are often more restrictive than what is strictly required to cause a crash. Magma extracts the bug condition from those fixes, instead of using the actual "crash" condition.
It is possible that the fix provided by the developers under-fits the original bug. In that sense, the trigger condition could cover cases that do not crash, but that still violate the criteria set by the developers to fix the bug. A bug can then be triggered without causing a crash.
Another reason could be that the code base has undergone changes that prevent the crash from happening later on, after the trigger condition is evaluated. In that case, reproducing the bug may not yield the same result (a crash), but would still require satisfying complex conditions.
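To illustrate, consider the following self-contained sketch (a made-up bug, not one of Magma's, with a local stub standing in for the canary's reporting call): the trigger condition is lifted from the developers' fix, which rejects more inputs than the overflow strictly requires, so some inputs trigger the oracle without ever corrupting memory.

#include <stdio.h>
#include <string.h>

/* stand-in for the canary's reporting call into Magma's runtime library */
static void magma_report(const char *bug_id, int triggered)
{
    fprintf(stderr, "%s %s\n", bug_id, triggered ? "triggered" : "reached");
}

/* Hypothetical bug: the copy overflows `buf` only when len > 64, but the
 * developers' fix rejects any len > 56, so the injected trigger condition
 * (taken from the fix) also fires for 57 <= len <= 64, where no crash occurs. */
static void parse_chunk(const unsigned char *data, size_t len)
{
    unsigned char buf[64];
    magma_report("XXX000", len > 56);   /* trigger condition derived from the fix */
    memcpy(buf, data, len);             /* actual fault only when len > sizeof(buf) */
}

int main(void)
{
    unsigned char input[60] = {0};
    parse_chunk(input, sizeof(input));  /* oracle triggered (60 > 56), yet the program does not crash */
    return 0;
}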
In contrast to the process of back-porting fixes from newer software versions to previous releases, we coin the term forward-porting for re-purposing bugs from previous releases and injecting them into later versions of the code.
An alternative approach to forward-porting bugs would be back-porting canaries. Given a library libfoo with a previous release A and the latest stable release B, the history of bugs fixed from A to B can be used to identify the bugs present in A, formulate oracles, and inject canaries in an old version of libfoo. However, when we use A as the code base for our target, we could potentially miss some bugs in the back-porting process. This increases the possibility that the instrumented version of A has bugs for which no ground-truth is being collected. In contrast, when we follow the forward-porting approach, B is used as a code base, ensuring that all known bugs are fixed, and the bugs we re-introduce will have ground-truth oracles. It is still possible that with the new fixes and features added to B, more bugs could have been re-introduced, but the forward-porting approach allows the benchmark to constantly evolve with each published bug fix.
An automated approach to injecting bugs would have resulted in an incomplete and error-prone technique, ultimately yielding fewer, lower-quality bugs. Instead, dedicating human effort to this task maximizes the chance that a chosen bug is ported correctly into Magma. Thus, we opted for the manual approach to maximize Magma's utility.
During a semester project, two Bachelor students added over 60 bugs to Magma across three new targets. On average, the students took less than 1.8 hours per bug to analyze the bug location, infer the trigger constraints, create the canary, and encode the necessary metadata. Injecting bugs into Magma requires no expert knowledge (the students are 3rd-year Bachelor students without security experience). Several external collaborators have also added new targets and fuzzer configurations to Magma, and integrated Magma into their research workflows.
The main critique raised in previous reviews is that bug insertion and forward-porting of bugs are not automatic. We argue that an automatic bug insertion technique would be strictly inferior to manual porting due to the difficulty and complexity of automatic forward-porting. Creating an automatic technique would be extremely challenging because of the wide variation in bugs: bug types, bug locations, the constraints required to trigger a bug, the code modifications needed to encode these constraints, and the need to carefully pinpoint the bug location from only a bug report and a patch.
All of these challenges for an automatic technique are unsolved and would require a large amount of time to solve, likely on the order of 5-10 person-years (1-2 PhDs). Additionally, any automatic solution would likely require careful checking of the forward-ported bugs, which would take roughly the same amount of time as manually porting them.
While these problems are interesting (and should be looked at), we argue that they are orthogonal to Magma---a benchmark suite to test fuzzers. Compared to a manual approach that costs less than 2 person-hours per bug, an automatic approach will never be cost-effective (or, arguably, ever be useful). Given these downsides, we strongly favor the manual bug-porting approach.
An oracle evaluates the current program state and determines if it is faulty (i.e., if the bug has been triggered). A canary is responsible for reporting and exporting that knowledge, through Magma's runtime library, to be used by the monitor.
Throughout the code, the documentation, and the paper, the distinction between these terms is not strongly emphasized, and they are often used interchangeably when the distinction is not critical.
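As a rough sketch of the two roles (the reporting function below is a local stub defined for this example, not Magma's actual runtime API): the oracle is the expression that inspects the program state, while the canary is the injected code that exports the verdict for the monitor to collect.

#include <stdio.h>

/* stand-in for the runtime-library call that exports results to the monitor */
static void canary_report(const char *bug_id, int reached, int triggered)
{
    fprintf(stderr, "%s reached=%d triggered=%d\n", bug_id, reached, triggered);
}

static void handle_record(size_t length, size_t capacity)
{
    /* oracle: evaluates the current program state and decides whether it is faulty */
    int faulty = (length > capacity);
    /* canary: reports that knowledge so the monitor can tally it */
    canary_report("XXX000", /*reached=*/1, /*triggered=*/faulty);
}

int main(void)
{
    handle_record(16, 64);   /* reached, not triggered */
    handle_record(96, 64);   /* reached and triggered */
    return 0;
}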
Magma's initial set of targets was chosen to cover different computational domains of applications and standards, to provide an all-around benchmark representative of a major portion of in-the-wild fuzzing targets.
Although the concept behind Magma is not restricted to one class of targets, Magma's implementation does limit the scope of real targets that can be added. Magma currently relies on inline instrumentation written in C, which requires the targets to be written in C/C++. Moreover, Magma currently does not support multi-threaded targets, since its runtime library is not thread-safe.
No specific set of criteria was imposed on the bug selection process. However, throughout our porting efforts, we often prioritized more recent bug reports, since they correspond most closely to the latest code base and are thus more likely to remain valid. Reports marked "critical" were also given higher priority than others.
That said, there are no constraints on bug types. Bugs in Magma can be anything from typical memory safety violations to semantic bugs, allowing for a broad range of possible sanitization and fault detection techniques.
A reached bug refers to a bug whose oracle was called, implying that the executed path reaches the context of the bug, without necessarily triggering a fault. A triggered bug, on the other hand, refers to a bug that was reached, and whose triggering condition was satisfied, indicating that a fault occurred. Whereas triggering a bug implies that the program has transitioned into a faulty state, the symptoms of the fault may not be directly observable at the oracle injection site. When a bug is triggered, the oracle only indicates that the conditions for a fault have been satisfied, but this does not imply that the fault was encountered or detected by the fuzzer.
Another distinction is the difference between triggering and detecting a bug. Whereas most security-critical bugs manifest as a low-level security policy violation for which state-of-the-art sanitizers are well-suited — e.g., memory corruption, data races, invalid arithmetic — some classes of bugs are not easily observable. Resource exhaustion bugs are often detected after the fault has manifested, either through a timeout or an out-of-memory indication. Even more obscure are semantic bugs whose malfunctions cannot be observed without some specification or reference. Different fuzzing techniques have been developed to target such evasive bugs, such as SlowFuzz and NEZHA. Such advancements in fuzzer technologies could benefit from an evaluation which accounts for detection rate as another dimension for comparison.
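The following self-contained sketch (a made-up bug, with a simplified stand-in for a canary) shows how the reached and triggered counters evolve, and why a triggered bug is not necessarily a detected one:

#include <stdio.h>
#include <stdlib.h>

static unsigned long bug_reached, bug_triggered;   /* per-bug counters, akin to the _R/_T columns */

/* simplified stand-in for a canary: bump "reached" unconditionally,
 * and "triggered" only when the bug condition holds */
static void canary(int condition)
{
    bug_reached++;
    if (condition)
        bug_triggered++;
}

/* hypothetical bug: the early-return path leaks `obj`; the fault (a memory
 * leak) is triggered yet produces no crash, so a crash-driven fuzzer would
 * never detect it without a suitable sanitizer */
static int parse(const unsigned char *data, size_t len)
{
    unsigned char *obj = malloc(64);
    canary(len > 0 && data[0] == 0xFF);
    if (len > 0 && data[0] == 0xFF)
        return -1;              /* obj leaked: triggered, not detected */
    free(obj);
    return 0;
}

int main(void)
{
    parse((const unsigned char *)"\x00", 1);   /* reached, not triggered */
    parse((const unsigned char *)"\xff", 1);   /* reached and triggered, but not detected */
    printf("reached=%lu triggered=%lu\n", bug_reached, bug_triggered);
    return 0;
}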
What does the POLL parameter affect? And why does the Magma monitor need to poll?
The instrumented target writes canary results to a file which the monitor can access. To avoid the overhead and complexity of synchronization, the Magma monitor does not synchronously read the results. Instead, it polls the file, meaning that it reads its contents every POLL seconds.
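For illustration only (this is not Magma's actual monitor code, and the file name is made up), the polling loop amounts to re-reading the shared file at a fixed interval:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const unsigned poll = 5;    /* the POLL parameter, in seconds */
    for (;;) {
        FILE *f = fopen("shared/canaries.raw", "rb");   /* hypothetical shared file */
        if (f) {
            /* read the per-bug reached/triggered counters and write a
             * timestamped snapshot, then close the file until the next poll */
            fclose(f);
        }
        sleep(poll);            /* a larger POLL lowers overhead but coarsens timestamps */
    }
}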
ISan is an early alarm system that crashes the program (with a SIGSEGV signal) when the bug trigger conditions are satisfied. It can be used when the detection capabilities of the fuzzer are of no interest to the evaluator, or when they can be evaluated separately in a post-processing step.
One such example is AFL. When using AFL with AddressSanitizer, it is possible to first run the Magma benchmark with ISan and collect crashing test cases for all bugs, then re-compile the target without ISan (but with ASan) and re-run it against the collected test cases, filtering out the bugs which could not be detected by ASan. It is also important to take into account AFL's other fault detection techniques, such as out-of-memory thresholds and execution timeouts. The [fuzzer]/run_once.sh scripts in Magma are intended to emulate the fuzzer's execution environment to detect faults.
In the captain/run.sh script, how should I select the value for WORKERS?
WORKERS specifies the number of logical cores (from 0 up to WORKERS-1) you wish to allocate for running the benchmark. Magma will utilize these cores to run multiple campaigns in parallel. When all allocated cores are busy, Magma queues up the remaining campaigns and dispatches them to the next core that frees up.
In order to obtain an interactive shell session in a Docker container, you must execute the bash program inside a running container.
In the case that you already have a running container (e.g., a campaign that has not finished), first find its container ID:
docker ps
Then, run bash in a foreground TTY terminal:
docker exec -it <CONTAINER_ID> /bin/bash
Alternatively, if you want to launch a bash shell inside a new container, you can use the captain/start.sh script as follows:
cd captain
FUZZER=afl TARGET=php PROGRAM=exif ENTRYPOINT=/bin/bash ./start.sh
How do I get sudo access inside the container?
The magma user inside the image is added to the sudo users group. The default password is amgam.
The captain toolset currently does not provide a means to manually terminate containers. Its scripts have exit handlers which attempt a clean-up before the script exits. However, in case of a malfunction, you can still manually check the status of currently active containers and kill/remove them:
docker ps
docker kill <CONTAINER_ID>
docker rm -f <CONTAINER_ID>
docker ps
docker rm -f `docker ps | grep magma | awk '{print $1}'`
docker rmi -f `docker image ls | grep magma | awk '{print $3}'`
Most programs that we include in Magma targets are derived from Google's OSS-Fuzz project, where the target developers write their own libFuzzer stubs. In Magma, we include a wrapper for those stubs to allow them to be fuzzed by AFL and its likes. This wrapper can either be launched with a file-name argument, which will be read and fed into the libFuzzer stub, or it can be launched without arguments, in which case it would be used by AFL for persistent fuzzing.
We also occasionally include other programs in Magma, including tools (e.g., tiffcp, pdfimages, ...) which require command-line arguments to properly consume the input. In that case, we provide the AFL-style arguments in the configuration, where @@ is replaced by the path to the fuzzer-generated test case.
Each target configuration directory (targets/*) includes a configrc file which specifies the list of programs to fuzz, and the AFL-style arguments to pass to each program.
What does the REPEAT parameter signify? Why do I need multiple repetitions?
REPEAT specifies how many identical campaigns to launch for each fuzzer/target/program combination. Fuzzing is an inherently stochastic process, so a single campaign is not representative; repeating each campaign allows results to be compared with statistical confidence.
Only the patches in the patches/bugs and patches/setup directories (in the target configuration directory) are applied; all files in other subdirectories are ignored. So, to select which bugs to apply, simply make sure that only your chosen bugs are inside those directories, and move the undesired bugs somewhere else. Then, rebuild the image.
This requires creating a new fuzzer configuration that resumes from an existing workdir. For this purpose, we've added an example config, afl_resume, which is a copy of afl where the run.sh script was modified to use the -i - flag when running AFL, instead of using the seed corpus as input.
To resume work from a previous campaign, build the new configuration:
FUZZER=afl_resume TARGET=libtiff ./build.sh
Then, launch the campaign manually, specifying the old workdir (without emptying it):
FUZZER=afl_resume TARGET=libtiff PROGRAM=tiffcp ARGS="-M @@ tmp.out" SHARED=./workdir POLL=5 TIMEOUT=24h ./start.sh
The captain/run.sh script has terminated, but campaigns are still running. What's wrong?
It is likely that the script encountered an error while building the benchmark or processing parameters, and terminated prematurely.
To kill all campaigns:
pkill -SIGTERM 'start\.sh'
Bugs injected into Magma are not verified up front, since the process of manually crafting proof-of-vulnerability (PoV) inputs is arduous and requires domain-specific knowledge about both the input format and the program or library, potentially bringing the bug-injection process to a grinding halt.
Instead, we inject bugs into the targets without first supplying PoVs, then we collect PoVs from the results of the campaigns. When available, we also extract PoVs from public bug reports.
This approach does not guarantee that all injected bugs can be used for evaluation, but it does make the development of the benchmark and contribution to it more streamlined and efficient, leaving it to the fuzzers to do all the heavy lifting.
What do the monitor output logs contain? Does every log file correspond to a new bug discovered?
Throughout the lifetime of the campaign, the monitor keeps track of the cumulative count of all bugs encountered. These logs are not related to one specific crash or run; they are the accumulation of the entire fuzzing process up to the timestamped point.
The monitor folder contains files whose names are timestamps and whose contents are counters. The timestamps are in seconds since the beginning of the campaign. The counters are the number of times each bug has been reached/triggered since the beginning of the campaign.
Consider a timestamped log monitor/24100 which contains the following:
ABC123_R, ABC123_T, XYZ001_R, XYZ001_T
63453, 29060, 23, 3
This means that, up to this timestamp, the fuzzer had generated 63453 inputs that reach ABC123, of which 29060 trigger it. The fuzzer may not have saved all of these inputs, since it deduplicates some crashes.
Even if a triggered bug does not crash the program, the monitor will log it, thanks to the canaries.
Compiling with ISan is not necessary to see whether an input triggers a bug; you can do that with monitor --fetch watch. However, to obtain the test cases that do not crash, you will need ISan, because a typical fuzzer would not have saved non-crashing test cases.
When a canary is inserted at some line of code, it assumes that the input satisfied the conditions to reach that line of code in the original program. Then, to consider the bug triggered, an additional bug condition is supplied to the canary. Thus, the condition for triggering a bug is the AND of the reach condition and the bug condition: trigger = reach AND bug.
However, when the program is modified, the reach condition could have been violated, and the assumption made by the canary would be broken. The canary only evaluates the bug condition, and uses that as a trigger condition. Thus, if T-Fuzz reaches a bug by transforming the program, then triggers it through random mutation, the canary may record a false positive.
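The following hypothetical sketch (the check, the bug, and the reporting stub are all made up) shows the kind of reach condition the canary silently relies on, and which a transformation-based fuzzer may negate:

#include <stdio.h>
#include <string.h>

/* stand-in for the canary's reporting call */
static void magma_report(const char *bug_id, int triggered)
{
    fprintf(stderr, "%s %s\n", bug_id, triggered ? "triggered" : "reached");
}

static void handle_packet(const unsigned char *data, size_t len)
{
    unsigned char header[8];
    /* reach condition: in the original program, short packets never get past here.
     * If a transformation-based fuzzer such as T-Fuzz negates this branch, the
     * assumption baked into the canary below no longer holds. */
    if (len < sizeof(header))
        return;
    memcpy(header, data, sizeof(header));
    /* canary: only the bug condition is evaluated; reaching this line is assumed
     * to imply len >= 8, so flipping the branch above can make the canary report
     * a trigger that is impossible in the original program */
    magma_report("XXX000", header[0] == 0xFF && header[1] == 0xFF);
}

int main(void)
{
    const unsigned char pkt[8] = {0xFF, 0xFF, 0, 0, 0, 0, 0, 0};
    handle_packet(pkt, sizeof(pkt));
    return 0;
}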
One way we suggest to address this would be the following: compile the transformed program such that triggered canaries still crash it (as with ISan), thus informing T-Fuzz to perform its post-analysis. After post-analysis, the fuzzer would take the adapted crashing input and feed it to the original program (with canaries enabled). Bug triggers would then only be recorded in the original program, after T-Fuzz creates/synthesizes valid crashing test cases for it.