How the Atheris Python Fuzzer Works
Posted by Ian Eldred Pudney, Google Information Security
def TestOneInput(data):  # Our entry point
    if data == b"bad":
        raise RuntimeError("Badness!")
Atheris is a native Python extension, and uses libFuzzer to provide its code coverage and input generation capabilities. The entry point passed to atheris.Setup() is wrapped in the C++ entry point that’s actually passed to libFuzzer. This wrapper will then be invoked by libFuzzer repeatedly, with its data proxied back to Python.
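As a rough illustration of this wrapping, the sketch below stands in for both the native wrapper and libFuzzer's loop using plain Python. The names `make_wrapper` and `drive` are hypothetical, the exception raised by the entry point is an assumption, and the fixed input list replaces libFuzzer's real input generation. Atheris itself does this in C++, so treat it as a model of the control flow only.

```python
def TestOneInput(data):  # Entry point under test (the raised exception is an assumption)
    if data == b"bad":
        raise RuntimeError("Badness!")


def make_wrapper(entry_point):
    """Hypothetical stand-in for the native wrapper handed to libFuzzer."""
    def wrapper(raw):
        try:
            entry_point(bytes(raw))  # Proxy the raw buffer back to Python as bytes
        except Exception:
            return 1  # Simplification: signal a crash with a nonzero result
        return 0
    return wrapper


def drive(wrapper, inputs):
    """Hypothetical stand-in for libFuzzer's loop: call the wrapper repeatedly."""
    for candidate in inputs:
        if wrapper(candidate) != 0:
            return candidate  # First input that made the entry point fail
    return None


crashing = drive(make_wrapper(TestOneInput), [b"", b"b", b"ba", b"bad", b"badd"])
print(crashing)
```

The point of the split is that the fuzzing engine only ever sees the wrapper; the Python function is reached through it on every iteration.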
Python Code Coverage
Atheris is typically compiled with libFuzzer linked in. When you initialize Atheris, it registers a tracer with CPython to collect information about Python code flow. This tracer can keep track of every line reached and every function executed.
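The same kind of information is visible from pure Python through `sys.settrace`, which makes for a convenient way to see what such a tracer observes. The sketch below is not Atheris's native tracer; `classify`, `lines_seen`, and `functions_seen` are illustrative names.

```python
import sys

lines_seen = set()      # (filename, line number) pairs reached
functions_seen = set()  # Names of functions entered


def tracer(frame, event, arg):
    if event == "call":
        functions_seen.add(frame.f_code.co_name)
    elif event == "line":
        lines_seen.add((frame.f_code.co_filename, frame.f_lineno))
    return tracer  # Keep receiving events for this frame


def classify(n):
    if n < 0:
        return "negative"
    return "non-negative"


sys.settrace(tracer)
try:
    classify(5)
finally:
    sys.settrace(None)  # Always uninstall the tracer

print(sorted(functions_seen), len(lines_seen))
```

Only the lines that actually ran are recorded, so the `return "negative"` branch is absent until some input reaches it. That gap is exactly the signal a coverage-guided fuzzer feeds on.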
We need to get this trace information to libFuzzer, which is responsible for generating code coverage information. There’s a problem, however: libFuzzer assumes that the amount of code is known at compile-time. The two primary code coverage mechanisms are __sanitizer_cov_pcs_init (which registers a set of program counters that might be visited) and __sanitizer_cov_8bit_counters_init (which registers an array of 8-bit counters, one of which is incremented each time its basic block is visited). Both of these need to know at initialization time how many program counters or basic blocks exist. But in Python, that isn’t possible, since code isn’t loaded until well after Python starts. We can’t even know it when we start the fuzzer: it’s possible to dynamically import code later, or even generate code on the fly.
Thankfully, libFuzzer supports fuzzing shared libraries loaded at runtime. Both __sanitizer_cov_pcs_init and __sanitizer_cov_8bit_counters_init can safely be called from a shared library’s constructor (which runs when the library is loaded). So, Atheris simulates loading shared libraries! When tracing is initialized, Atheris first calls those functions with an array of 8-bit counters and completely made-up program counters. Then, whenever a new Python line is reached, Atheris allocates a PC and 8-bit counter to that line; Atheris will always report that line the same way from then on. Once Atheris runs out of PCs and 8-bit counters, it simply loads a new “shared library” by calling those functions again. Exponential growth in the size of each new “library” ensures that the number of shared libraries doesn’t become excessive.
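The allocation scheme described above can be modelled in a few lines of Python. This is a sketch under stated assumptions: `CounterRegistry`, the initial region size of 4, and the doubling factor are all hypothetical, and a `bytearray` stands in for each region that real code would register through the `__sanitizer_cov_*` init functions.

```python
class CounterRegistry:
    """Hands out one counter slot per Python line, in regions that double in size."""

    def __init__(self, first_region_size=4):
        self.regions = []   # Each bytearray models one simulated "shared library"
        self.slot_of = {}   # line key -> (region index, offset); never changes once set
        self._next_size = first_region_size
        self._used = 0
        self._register_region()

    def _register_region(self):
        # Real code would call the sanitizer coverage init functions here.
        self.regions.append(bytearray(self._next_size))
        self._used = 0
        self._next_size *= 2  # Assumed growth factor; keeps the region count small

    def hit(self, line_key):
        if line_key not in self.slot_of:
            if self._used == len(self.regions[-1]):
                self._register_region()  # Out of slots: "load" another library
            self.slot_of[line_key] = (len(self.regions) - 1, self._used)
            self._used += 1
        region, offset = self.slot_of[line_key]
        counters = self.regions[region]
        counters[offset] = (counters[offset] + 1) % 256  # Wrap like an 8-bit counter
        return region, offset


registry = CounterRegistry(first_region_size=4)
for line_number in range(1, 29):
    registry.hit(("example.py", line_number))
registry.hit(("example.py", 1))  # A repeat visit reuses the original slot
print([len(region) for region in registry.regions])
```

With doubling, 28 distinct lines fit in three regions of 4, 8, and 16 slots, whereas fixed regions of 4 slots would have needed seven.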
What’s Special about Python 3.8+?
In the README, we advise users to use Python 3.8+ where possible. This is because Python 3.8 added a new feature: opcode tracing. Not only can we monitor when every line is visited and every function is called, but we can actually monitor every operation that Python performs, and what arguments it uses. This allows Atheris to find its way through if statements much better.
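Opcode events can also be observed from pure Python, which gives a feel for what the tracer sees. The sketch below sets `f_trace_opcodes` on each new frame and looks up the name of the instruction about to run. `check` and `opcodes_seen` are illustrative names, this is not Atheris's native implementation, and the exact events delivered may differ between CPython releases.

```python
import dis
import sys

opcodes_seen = []  # Names of the opcodes reported while tracing


def tracer(frame, event, arg):
    if event == "call":
        frame.f_trace_opcodes = True  # Request per-opcode events for this frame
    elif event == "opcode":
        opcode = frame.f_code.co_code[frame.f_lasti]  # Instruction about to execute
        opcodes_seen.append(dis.opname[opcode])
    return tracer


def check(data):
    if data == b"bad":
        return True
    return False


sys.settrace(tracer)
try:
    check(b"good")
finally:
    sys.settrace(None)

print(opcodes_seen)
```

A line-level tracer would only learn that the `if` line ran. At opcode granularity the comparison itself becomes visible, and so do the operands sitting on the interpreter stack, which is what a native tracer can inspect.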
When a COMPARE_OP opcode is encountered, indicating a boolean comparison between two values, Atheris inspects the types of the values. If the values are bytes or Unicode, Atheris is able to report the comparison to libFuzzer via __sanitizer_weak_hook_memcmp. For integers, Atheris reports the comparison through the appropriate integer hook, such as __sanitizer_cov_trace_cmp8.
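The type-based dispatch can be summarised as a small decision function. This is only a sketch: `choose_hook` is a hypothetical helper that returns the hook's name rather than calling it, and the 64-bit range check is an assumption about what can reasonably be passed to an 8-byte comparison hook.

```python
MEMCMP_HOOK = "__sanitizer_weak_hook_memcmp"
CMP8_HOOK = "__sanitizer_cov_trace_cmp8"


def choose_hook(left, right):
    """Return the name of the hook a tracer might use for this comparison, or None."""
    if isinstance(left, bytes) and isinstance(right, bytes):
        return MEMCMP_HOOK
    if isinstance(left, str) and isinstance(right, str):
        return MEMCMP_HOOK
    if type(left) is int and type(right) is int:  # Excludes bool, a subclass of int
        # Assumption: only values representable in 64 bits are reported.
        if all(-2**63 <= value < 2**64 for value in (left, right)):
            return CMP8_HOOK
    return None  # Other types are not reported in this sketch


print(choose_hook(b"bad", b"good"), choose_hook(3, 7), choose_hook(1.5, 2))
```

Comparisons that return `None` here simply contribute no extra guidance; the fuzzer still sees ordinary line coverage for them.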
In recent Python versions, a Unicode string is actually represented as an array of 1-byte, 2-byte, or 4-byte characters, based on the size of the largest character in the string. The obvious solution for coverage is to:
- First, compare the character sizes of the two strings and report that as an integer comparison with __sanitizer_cov_trace_cmp8
- Second, if the sizes are equal, call __sanitizer_weak_hook_memcmp to report the actual string comparison
However, performance measurements showed that, surprisingly, the best strategy is to convert both strings to UTF-8 and then compare those with __sanitizer_weak_hook_memcmp. Even with the performance overhead of conversion, libFuzzer makes progress much faster.
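To make the two strategies concrete, the sketch below takes two strings that CPython stores with different character widths and shows what a byte-level comparison sees after UTF-8 encoding. The helper names are hypothetical, and `internal_char_width` only approximates the storage rule described above. The example illustrates the mechanics and does not reproduce the performance measurements.

```python
def internal_char_width(text):
    """Approximate bytes per character CPython uses to store this string."""
    largest = max(map(ord, text), default=0)
    if largest < 0x100:
        return 1
    if largest < 0x10000:
        return 2
    return 4


def utf8_operands(left, right):
    """Encode both strings so they can be reported as a single byte comparison."""
    return (left.encode("utf-8", "surrogatepass"),
            right.encode("utf-8", "surrogatepass"))


def shared_prefix_length(a, b):
    """Count the leading bytes that the two operands have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


left, right = "caf\u00e9", "caf\u20ac"  # Stored with 1-byte and 2-byte characters
a, b = utf8_operands(left, right)
print(internal_char_width(left), internal_char_width(right), shared_prefix_length(a, b))
# prints "1 2 3"
```

Under the two-step approach these strings would differ at the width check, so the byte comparison would never be reported. After encoding, the shared `caf` prefix is visible in a single memcmp-style report, which may be part of why the UTF-8 approach guides libFuzzer better.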