Translating binaries to LLVM with Revgen

Revgen is a tool that takes a binary and turns (“lifts”) it into LLVM bitcode. It does so in four steps:

Disassemble the binary using IDA Pro
Recover the control flow graph (CFG) using McSema
Translate each basic block in the CFG into a chunk of LLVM bitcode by using QEMU’s translator
Stitch together translated basic blocks into LLVM functions

Revgen relies on QEMU’s translator to extract the semantics of machine instructions into a simple intermediate representation (IR) and S2E’s LLVM translator to turn this IR into LLVM bitcode. This brings two main advantages over competing approaches:

Better precision. QEMU developers spent 15+ years building translators that are precise enough to emulate actual operating systems. Even a seemingly simple x86 instruction such as call needs more than ten pages of pseudo code in the Intel instruction set manual. Revgen effortlessly captures the complex behavior of all machine instructions, including system instructions not normally used in user programs.
Better reliability. Implementing a translator from scratch brings bugs and incompleteness. Revgen, however, can handle pretty much every instruction that QEMU supports, even the most exotic ones.

In the rest of this document, we will show how to use Revgen, how it works under the hood, and give a brief overview of its performance. As we will see, the main drawback of relying on QEMU’s translator is the high overhead of the generated code (15-30x larger binaries). Revgen would need to reuse various optimization passes in order to bring this overhead down.

Using Revgen

In this section, we show how to use Revgen on DARPA CGC binaries. We cleaned up and made stable our initial prototype of Revgen in order to analyze CGC binaries, so it supports these binaries the best. Other types of binaries will be supported in the future.

Warning

Revgen is not supported anymore and has been removed from the master branch of S2E. Use Remill instead if you want to translate binaries to LLVM.

1. Prerequisites

Before starting, make sure that you have a functional S2E environment and that you have a working IDA Pro setup. Check that:

IDA can disassemble CGC binaries. You will need to install the CGC plugins, which you can find here. If you cannot find pre-compiled plugins for your IDA version, you will need to compile the plugin yourself using the IDA Pro SDK.

IDA has a working Python environment with the protobuf library installed. Running pip install protobuf should work in most cases.

The IDA_ROOT variable is set to the IDA Pro path, e.g., export IDA_ROOT=/opt/ida-6.8.

Revgen uses McSema to recover the CFG of the binary, and that requires IDA Pro. Note that this is an older version of McSema scripts (from 2016). McSema2 has been released in the meantime, which should have much better CFG recovery. We plan to port Revgen to the latest version.

2. Build the CGC binaries

S2E comes with a Docker image and instructions on how to build all DARPA CyberGrandChallenge binaries. After you have completed this step, you should have a samples folder that contains ~280 binaries:

ls -1 $S2E_ENV/source/decree/samples
CADET_00001
CADET_00003
CROMU_00001
CROMU_00002
CROMU_00003
CROMU_00004
CROMU_00005
CROMU_00006
...

3. Translating a binary

Let us first translate the CADET_00001 binary. For this, set the S2E_PREFIX and IDA_ROOT variables, then run revgen.sh. This tutorial assumes that your S2E environment is in $S2E_ENV.

export IDA_ROOT=/opt/ida-6.8
export S2E_PREFIX=$S2E_ENV/install
$S2E_PREFIX/bin/revgen.sh $S2E_ENV/source/decree/samples/CADET_00001

You should get the following console output:

stat: cannot stat '$S2E_ENV/source/decree/samples/CADET_00001.pbcfg': No such file or directory
[IDA    ] Writing CFG to $S2E_ENV/source/decree/samples/CADET_00001.pbcfg...

[REVGEN ] Translating $S2E_ENV/source/decree/samples/CADET_00001 to $S2E_ENV/source/decree/samples/CADET_00001.bc...
warning: Linking two modules of different data layouts: '$S2E_ENV/install/lib/X86BitcodeLibrary.bc' is 'e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128' whereas 'tcg-llvm' is 'e-m:e-i64:64-f80:128-n8:16:32:64-S128'

[4] RevGen:generateFunctionCall -   >Function 0x804860c does not exist
[LLVMDIS] Generating LLVM disassembly to $S2E_ENV/source/decree/samples/CADET_00001.ll...
[CLANG  ] Compiling LLVM bitcode of CGC binary to native binary $S2E_ENV/source/decree/samples/CADET_00001.rev...

You will find the following output files in the $S2E_ENV/source/decree/samples folder. The meaning of each file is explained in the revgen.sh script, but here is an explanation of the most important ones:

CADET_00001: the original binary

CADET_00001.pbcfg: the CFG extracted by IDA Pro / McSema

CADET_00001.bc: the LLVM bitcode file created by RevGen

CADET_00001.rev: the LLVM bitcode file compiled to an ELF binary that you can run on your Linux host

4. Running a translated CGC binary

Revgen comes with a runtime library that translates Decree system calls to their Linux counterparts. This allows you to run the translated Decree binaries on your Linux host. For example, running CADET_00001 as follows:

user@ubuntu:~$ $S2E_ENV/source/decree/samples/CADET_00001.rev

Welcome to Palindrome Finder

    Please enter a possible palindrome: sdf
            Nope, that's not a palindrome

    Please enter a possible palindrome: aaa
            Yes, that's a palindrome!

    Please enter a possible palindrome:

Warning

Revgen currently supports only CGC binaries. It may or may not be able to generate a bitcode file for other kinds of binaries (e.g., Linux or Windows) and cannot run non-CGC binaries.

Some CGC binaries may fail to translate because of various limitations of the (old) McSema script that Revgen uses.

Design and implementation

Revgen’s design is straightforward: it takes a list of basic blocks, calls a translator to turn them to equivalent pieces of LLVM bitcode, then stitches these pieces of bitcode together in order to reconstruct original functions.

At a high level, the translator takes a block of machine code (e.g., x86) and turns it into a QEMU-specific intermediate representation (IR). The translator then transforms this IR to the desired target instruction set (in Revgen’s case, LLVM). The translator is composed of the CPU emulation library (libcpu), which generates the IR, and of the Tiny Code Generator library (libtcg), which handles the IR to LLVM conversion. We extracted libcpu and libtcg from QEMU and made both available as standalone libraries. We added LLVM translation capabilities to libtcg, which you can find here.

In the rest of this section, we will explain in more details how the translator works and how Revgen uses it to build an LLVM version of an entire binary. We will also see what it takes to run such binaries and discuss the assumptions that Revgen makes about them.

Translating basic blocks to LLVM

Revgen takes the binary file and the CFG recovered my McSema, and turns every basic block in that CFG into a piece of LLVM code. Revgen stops when it has translated all basic blocks in the CFG. The result is a set of independent LLVM functions, one for each basic block. Revgen’s translator handles basic blocks in two steps: (1) it turns a basic block into a sequence of micro-operations and then (2) converts them to LLVM instructions. We will see next this process in more details.

First, the translator converts machine instructions into an equivalent sequence of micro-operations. For example, the translator decomposes the x86 instruction inc [eax] into a load to a temporary register, an increment of that register, and a memory store. This implements the effects of incrementing the memory location stored in the eax register. The resulting sequence of micro-operations forms a translation block.

Second, the translator maps each micro-operation to LLVM instructions, using a code dictionary. The dictionary associates each micro-operation with a sequence of LLVM instructions that implement the operation. Most conversions are one-to-one mappings between micro-operations and LLVM instructions (e.g., arithmetic, shift, load/store operations).

The translator also handles instructions that manipulate system state. Revgen accurately translates to LLVM instructions like fsave or mov cr0, eax. The former saves the state of the floating point unit, while the latter sets the control register (e.g., to enable 32-bit protected mode, which changes the behavior of many instructions).

For this, the translator uses emulation helpers. An emulation helper is a piece of C code that emulates complex machine instructions that do not have equivalent micro-operations. Revgen compiles emulation helpers to LLVM and adds them to the code dictionary, transparently enabling the support of machine instructions that manipulate system state. Helpers are implemented in libcpu and you can find them here.

Third, the translator packages the sequence of LLVM instructions into an LLVM function that is equivalent to the original basic block taken from the binary. More precisely, given the same register and memory input, the translated code produces the same output as what the original binary does if executed on a real processor.

To illustrate this process, let us consider the following function. This function invokes the exit system call with a status code passed as a parameter on the stack. The function is composed of two basic blocks: one starting at address 0x804860C and another one at 0x8048618.

.text:0804860C ; int __cdecl sub_804860C(int status)
.text:0804860C sub_804860C     proc near
.text:0804860C
.text:0804860C
.text:0804860C status          = dword ptr  4
.text:0804860C
.text:0804860C                 mov     eax, 1
.text:08048611                 push    ebx
.text:08048612                 mov     ebx, [esp+4+status] ; status
.text:08048616                 int     80h             ; LINUX - sys_exit
.text:08048616 sub_804860C     endp
.text:08048616
.text:08048618 ; ---------------------------------------------------------------------------
.text:08048618                 pop     ebx
.text:08048619                 retn

Revgen turns these two blocks into two LLVM functions that look like this:

define i64 @tcg-llvm-tb-804860c-c-a3-0-4000b7(%struct.CPUX86State* nocapture) local_unnamed_addr #17 {
  %2 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 5

  ; mov eax, 1
  %3 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 0, i64 0
  store i32 1, i32* %3, align 4

  ; push ebx
  %4 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 0, i64 3
  %ebx = load i32, i32* %4, align 4, !s2e.pc !377
  %5 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 0, i64 4
  %esp = load i32, i32* %5, align 4, !s2e.pc !377
  %6 = add i32 %esp, -4, !s2e.pc !378
  tail call void @__stl_mmu(i32 %6, i32 %ebx, i32 1), !s2e.pc !377

  ; mov ebx, [esp+4+status]
  store i32 %6, i32* %5, align 4
  %7 = add i32 %esp, 4, !s2e.pc !378
  %8 = tail call i32 @__ldl_mmu(i32 %7, i32 1), !s2e.pc !378
  store i32 %8, i32* %4, align 4

  ; int 0x80
  store i32 134514198, i32* %2, align 4
  tail call void @helper_raise_interrupt(i32 128, i32 2)
  ret i64 0
}

define i64 @tcg-llvm-tb-8048618-2-99-0-4000b7(%struct.CPUX86State* nocapture) local_unnamed_addr #17 {
  ; pop ebx
  %2 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 5
  %3 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 0, i64 4
  %esp = load i32, i32* %3, align 4, !s2e.pc !379
  %4 = tail call i32 @__ldl_mmu(i32 %esp, i32 1), !s2e.pc !379
  %5 = add i32 %esp, 4, !s2e.pc !380
  store i32 %5, i32* %3, align 4

  ; retn
  %6 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 0, i64 3
  store i32 %4, i32* %6, align 4
  %7 = tail call i32 @__ldl_mmu(i32 %5, i32 1), !s2e.pc !380
  %8 = add i32 %esp, 8, !s2e.pc !380
  store i32 %8, i32* %3, align 4
  store i32 %7, i32* %2, align 4
  ret i64 0
}

Each function takes a pointer to a CPUX86State structure. This structure models the CPU’s register file. All machine instructions are translated into LLVM instructions that operate on this CPU state structure. To handle memory accesses, the translator emits calls to __stX_mmu and __ldX_mmu helpers. We will explain later why the translator generates these instead of native LLVM load/store instructions. The int 0x80 instruction is complex and the translator calls the helper_raise_interrupt function to handle it.

Stitching basic blocks into functions

Now that Revgen created a set of LLVM functions that represent individual basic blocks of the binary, it needs to assemble them into a bigger function that represents the original function of the binary. This is straightforward: Revgen creates a new LLVM function and fills it with calls to the translated basic blocks. So our example above would look like this:

define i64 @__revgen_sub_804860c_804860c() local_unnamed_addr #0 {
  %1 = getelementptr %struct.CPUX86State, %struct.CPUX86State* @myenv, i64 0
  br label %2

; <label>:2:                                      ; preds = %0
  %9 = call i64 @tcg-llvm-tb-804860c-c-a3-0-4000b7(%struct.CPUX86State* %1)
  br label %10

; <label>:3:                                     ; preds = %2
  %11 = call i64 @tcg-llvm-tb-8048618-2-99-0-4000b7(%struct.CPUX86State* %1)
  ret i64 %11
}

__revgen_sub_804860c_804860c is an LLVM function that represents the function called sub_804860c in the original binary.

Notice how basic blocks are connected together with branch instructions. The example above shows a simple case where control goes directly from the first basic block to the second (which assumes that the int 0x80 instruction actually returns).

Handling basic blocks that have several successors is more complex. There can be as few as two successors for simple direct conditional branches, and many more for switch statements. Luckily for Revgen, IDA Pro and McSema perform the (very) complex task of computing successors. All Revgen does is read the program counter and call the basic block associated with it, like this:

eip = tcg-llvm-tb-abc(env);

if (eip == 0xabc) {
    tcg-llvm-tb-abc(env);
} else if (eip == 0xdef) {
    tcg-llvm-tb-def(env);
} else {
    abort();
}

The call to abort() is important to terminate the translated program cleanly in case of unexpected program counters. This may happen in cases where the binary’s CFG was not recovered properly and the program modifies the program counter to an unexpected value, e.g., in case of self modifying code.

As an exercise, open the CADET_00001.ll file that Revgen generated and try to find the translated basic blocks and functions.

Assumptions

The current implementation of Revgen makes several assumptions about the input binary.

First, the binary is statically linked. Revgen does not currently handle calls to dynamically linked library functions. An approach to handle calls is to read the emulated stack content and pass its data to LLVM call instructions, as well as store the return values to the appropriate register (e.g., to env->regs[R_EAX] for x86 programs). Doing this requires to know the calling convention of the API and to assume that the API is not sensitive to the aspect of the translated binary (dual stack, different program counters, etc.).

Second, an x86 binary runs in a 32/64-bit protected mode environment with a flat memory model and in user space. This is important, as the translator may disassemble instructions differently depending on the execution mode. For example, attempting to translate the x86 sysret instruction outside protected mode will cause the translator to emit a general protection fault exception, aborting the translation process. This behavior is inherited from QEMU’s dynamic binary translator. In general, binaries should come with some sort of section headers describing which execution model they assume so that Revgen can configure the translator properly.

Third, the input binary may not have self-modifying code. Removing this restriction would certainly be possible given adequate runtime support, but in that case we would pretty much end up re-implementing QEMU. QEMU handles self-modifying by detecting writes to code sections and re-translating modified code on the fly.

Running translated binaries

In the previous section, we have seen how Revgen translates machine code to LLVM. We will now see how to run it. This requires linking the translated bitcode file with a run time that sets up the initial CPU state and provides emulation helpers that resolve memory accesses and translate system calls.

Initializing the CPU state. The runtime must first initialize the emulated CPU state, in particular the stack pointer register. The translated code retains all the assumptions of the original binary about the stack layout. In particular, it assumes that local variables, return addresses, and parameters are located at precise memory addresses when doing stack pointer arithmetic. The runtime library preserves the original stack layout by using a dual-stack architecture. There is one native stack used by the LLVM program and one implicit stack, whose pointer is stored in the CPU state structure (e.g., env->regs[R_ESP] for x86) and which is manipulated by the translated code. The runtime allocates the implicit stack and sets the implicit stack pointer before calling the main entry point of the program.

Resolving pointer arithmetic. Revgen embeds a copy of the original binary in the translated binary in order to resolve accesses to its sections at runtime. Revgen stores each section of the binary in a separate LLVM array. For example, if a program contains a hard-coded load from address 0x801234 that is actually a load from offset 0x1234 of the data section, the runtime will remap the access to the appropriate array. Revgen does not make any sophisticated attempt at lifting global variables and therefore resorts to this kind of runtime patching.

Translating system calls. For a program to be useful, it has to generally interact with its environment, which is done through system calls. Depending on the system call flavor (interrupt, syscall, sysenter…), the translator generates a call to a specific helper function. The runtime needs to implement that helper function so that it can translate the system call of the original platform to that of the target platform (e.g., Decree/CGC to vanilla Linux). This is very similar to what user emulation mode in QEMU does.

Evaluation

Revgen produces a translated binary that is about 15-30x bigger than the original. This overhead is high for several reasons:

Revgen does not attempt to perform any sort of optimizations. All it does is a straightforward translation of the binary to LLVM using QEMU’s translator. It does not attempt to reconstruct variables, simplify stack accesses, recover function parameters, lift exceptions, etc.
Revgen wraps each memory access into a function call. This makes it simpler to run the translated binary, but results in a larger overhead because of the extra call instruction and possible inlining of that call.
Revgen embeds the entire original binary in the translated binary. This allows Revgen to correctly translate data memory accesses at run time, but doubles the size of the binary.

Ideally, all the above should be done statically during translation. This would make a standalone LLVM binary that can be linked with the standard library and ran directly on any architecture. This requires much more complex lifting, similar to what McSema does. An non-exhaustive list of needed tasks would be to lift local and global variables, recover library function calls, support multi-threading, signals, exceptions, long jumps, and many more. Proper lifting would massively cut overhead because it would enable the LLVM toolchain to perform code optimization. In its current state, the code generated by Revgen cannot be optimized by the compiler.

In the remainder of this section, you will find results for CGC binaries. You can generate the evaluation data using the revgen-gen-stats.sh script. It computes the size of various output files as well as the time it takes to generate them. This data is useful to benchmark Revgen. Here is a sample output of the script:

BinaryName      InputBinSize    RevgenBinSize   RevgenBcSize    CfgSize IdaTimeMs       RevgenTimeMs    CompileTimeMs
CADET_00001     4352    93124   370160  10616   1606    3508    5527
CADET_00003     4352    93124   370160  10616   2292    2348    5463
CROMU_00001     14616   195040  645784  52420   5275    2929    13887
CROMU_00002     18756   363416  782796  58706   5716    4138    19192
...

You can find the complete data in the cgc-binaries.stats in the documentation repository. The bar chart below shows the size overhead of binaries produced by Revgen by comparing the original and translated binary size. You can generate the chart using this script (written in R).