========================================
Translating binaries to LLVM with Revgen
========================================

Revgen is a tool that takes a binary and turns ("lifts") it into LLVM bitcode. It does so in four steps:

* Disassemble the binary using IDA Pro
* Recover the control flow graph (CFG) using `McSema `__
* Translate each basic block in the CFG into a chunk of LLVM bitcode by using QEMU's translator
* Stitch together translated basic blocks into LLVM functions

Revgen relies on QEMU's translator to extract the semantics of machine instructions into a simple intermediate representation (IR) and on S2E's LLVM translator to turn this IR into LLVM bitcode. This brings two main advantages over competing approaches:

* **Better precision.** QEMU developers have spent 15+ years building translators that are precise enough to emulate actual operating systems. Even a seemingly simple x86 instruction such as ``call`` needs more than ten pages of pseudocode in the Intel instruction set manual. Revgen captures the complex behavior of all machine instructions, including system instructions not normally used in user programs.
* **Better reliability.** Implementing a translator from scratch brings bugs and incompleteness. Revgen, however, can handle pretty much every instruction that QEMU supports, even the most exotic ones.

In the rest of this document, we will show how to use Revgen, explain how it works under the hood, and give a brief overview of its performance. As we will see, the main drawback of relying on QEMU's translator is the high overhead of the generated code (15-30x larger binaries). Revgen would need to reuse various optimization passes in order to bring this overhead down.

Using Revgen
============

In this section, we show how to use Revgen on DARPA CGC binaries. We cleaned up and stabilized our initial `prototype `__ of Revgen in order to analyze CGC binaries, so these are the binaries it supports best.
Other types of binaries will be supported in the future.

.. warning::

    Revgen is no longer supported and has been removed from the master branch of S2E. Use `Remill `__ instead if you want to translate binaries to LLVM.

1. Prerequisites
----------------

Before starting, make sure that you have a functional S2E environment and a working IDA Pro setup. Check that:

* IDA can disassemble CGC binaries. You will need to install the CGC plugins, which you can find `here `__. If you cannot find pre-compiled plugins for your IDA version, you will need to compile the plugin yourself using the IDA Pro SDK.
* IDA has a working Python environment with the ``protobuf`` library installed. Running ``pip install protobuf`` should work in most cases.
* The ``IDA_ROOT`` variable is set to the IDA Pro path, e.g., ``export IDA_ROOT=/opt/ida-6.8``.

Revgen uses `McSema `__ to recover the CFG of the binary, and that requires IDA Pro. Note that this is an older version of the McSema scripts (from 2016). `McSema2 `__ has been released in the meantime and should offer much better CFG recovery. We plan to port Revgen to the latest version.

2. Build the CGC binaries
-------------------------

S2E comes with a Docker image and `instructions `__ on how to build all DARPA CyberGrandChallenge binaries. After you have completed this step, you should have a ``samples`` folder that contains ~280 binaries:

.. code-block:: bash

    ls -1 $S2E_ENV/source/decree/samples
    CADET_00001
    CADET_00003
    CROMU_00001
    CROMU_00002
    CROMU_00003
    CROMU_00004
    CROMU_00005
    CROMU_00006
    ...

3. Translating a binary
-----------------------

Let us first translate the ``CADET_00001`` binary. For this, set the ``S2E_PREFIX`` and ``IDA_ROOT`` variables, then run ``revgen.sh``. This tutorial assumes that your S2E environment is in ``$S2E_ENV``.

.. code-block:: bash

    export IDA_ROOT=/opt/ida-6.8
    export S2E_PREFIX=$S2E_ENV/install
    $S2E_PREFIX/bin/revgen.sh $S2E_ENV/source/decree/samples/CADET_00001

You should get the following console output:

.. code-block:: console

    stat: cannot stat '$S2E_ENV/source/decree/samples/CADET_00001.pbcfg': No such file or directory
    [IDA ] Writing CFG to $S2E_ENV/source/decree/samples/CADET_00001.pbcfg...
    [REVGEN ] Translating $S2E_ENV/source/decree/samples/CADET_00001 to $S2E_ENV/source/decree/samples/CADET_00001.bc...
    warning: Linking two modules of different data layouts: '$S2E_ENV/install/lib/X86BitcodeLibrary.bc' is 'e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128' whereas 'tcg-llvm' is 'e-m:e-i64:64-f80:128-n8:16:32:64-S128'
    [4] RevGen:generateFunctionCall - >Function 0x804860c does not exist
    [LLVMDIS] Generating LLVM disassembly to $S2E_ENV/source/decree/samples/CADET_00001.ll...
    [CLANG ] Compiling LLVM bitcode of CGC binary to native binary $S2E_ENV/source/decree/samples/CADET_00001.rev...

You will find the following output files in the ``$S2E_ENV/source/decree/samples`` folder. The meaning of each file is explained in the ``revgen.sh`` script, but here is an explanation of the most important ones:

* ``CADET_00001``: the original binary
* ``CADET_00001.pbcfg``: the CFG extracted by IDA Pro / McSema
* ``CADET_00001.bc``: the LLVM bitcode file created by Revgen
* ``CADET_00001.rev``: the LLVM bitcode file compiled to an ELF binary that you can run on your Linux host

4. Running a translated CGC binary
----------------------------------

Revgen comes with a runtime library that translates Decree system calls to their Linux counterparts. This allows you to run the translated Decree binaries on your Linux host. For example, you can run ``CADET_00001`` as follows:

.. code-block:: console

    user@ubuntu:~$ $S2E_ENV/source/decree/samples/CADET_00001.rev
    Welcome to Palindrome Finder
    Please enter a possible palindrome: sdf
    Nope, that's not a palindrome
    Please enter a possible palindrome: aaa
    Yes, that's a palindrome!
    Please enter a possible palindrome:

.. warning::

    Revgen currently supports only CGC binaries. It may or may not be able to generate a bitcode file for other kinds of binaries (e.g., Linux or Windows) and cannot run non-CGC binaries. Some CGC binaries may fail to translate because of various limitations of the (old) McSema script that Revgen uses.

Design and implementation
=========================

Revgen's design is straightforward: it takes a list of basic blocks, calls a translator to turn them into equivalent pieces of LLVM bitcode, then stitches these pieces of bitcode together in order to reconstruct the original functions.

At a high level, the translator takes a block of machine code (e.g., x86) and turns it into a QEMU-specific intermediate representation (IR). The translator then transforms this IR to the desired target instruction set (in Revgen's case, LLVM). The translator is composed of the `CPU emulation library (libcpu) `__, which generates the IR, and of the `Tiny Code Generator library (libtcg) `__, which handles the IR to LLVM conversion. We extracted ``libcpu`` and ``libtcg`` from QEMU and made both available as standalone libraries. We added LLVM translation capabilities to ``libtcg``, which you can find `here `__.

In the rest of this section, we will explain in more detail how the translator works and how Revgen uses it to build an LLVM version of an entire binary. We will also see what it takes to run such binaries and discuss the assumptions that Revgen makes about them.

Translating basic blocks to LLVM
--------------------------------

Revgen takes the binary file and the CFG recovered by McSema, and turns every basic block in that CFG into a piece of LLVM code.
Revgen stops when it has translated all basic blocks in the CFG. The result is a set of independent LLVM functions, one for each basic block.

Revgen's translator handles basic blocks in three steps: (1) it turns a basic block into a sequence of micro-operations, (2) it converts them to LLVM instructions, and (3) it packages the result into an LLVM function. We look at this process in more detail next.

First, the translator converts machine instructions into an equivalent sequence of micro-operations. For example, the translator decomposes the x86 instruction ``inc [eax]`` into a load to a temporary register, an increment of that register, and a memory store. This implements the effects of incrementing the memory location whose address is stored in the ``eax`` register. The resulting sequence of micro-operations forms a *translation block*.

Second, the translator maps each micro-operation to LLVM instructions, using a code dictionary. The dictionary associates each micro-operation with a sequence of LLVM instructions that implement the operation. Most conversions are one-to-one mappings between micro-operations and LLVM instructions (e.g., arithmetic, shift, load/store operations).

The translator also handles instructions that manipulate system state. Revgen accurately translates instructions like ``fsave`` or ``mov cr0, eax`` to LLVM. The former saves the state of the floating point unit, while the latter sets the control register (e.g., to enable 32-bit protected mode, which changes the behavior of many instructions). For this, the translator uses *emulation helpers*. An emulation helper is a piece of C code that emulates complex machine instructions that do not have equivalent micro-operations. Revgen compiles emulation helpers to LLVM and adds them to the code dictionary, transparently enabling the support of machine instructions that manipulate system state. Helpers are implemented in ``libcpu`` and you can find them `here `__.

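
As a rough illustration of the first two steps, the following Python sketch decomposes ``inc [eax]`` into micro-operations and then maps each one to LLVM-like text through a dictionary. The micro-operation names (``ld_i32``, ``add_i32``, ``st_i32``), the ``decompose_inc_mem`` and ``to_llvm`` functions, and the emitted strings are all hypothetical. They only mirror the idea described above and are not the actual IR or API of ``libtcg``.

.. code-block:: python

    from typing import Callable, Dict, List, Tuple

    # A micro-operation is (opcode, destination, operands). All names are hypothetical.
    MicroOp = Tuple[str, str, Tuple[str, ...]]


    def decompose_inc_mem(addr_reg: str) -> List[MicroOp]:
        """Step 1: split `inc [addr_reg]` into a load, an increment, and a store."""
        return [
            ("ld_i32", "tmp0", (addr_reg,)),     # tmp0 = memory[addr_reg]
            ("add_i32", "tmp1", ("tmp0", "1")),  # tmp1 = tmp0 + 1
            ("st_i32", "", ("tmp1", addr_reg)),  # memory[addr_reg] = tmp1
        ]


    # Step 2: the code dictionary maps each opcode to an LLVM-like text template.
    # The memory templates imitate helper calls of the form __ldl_mmu / __stl_mmu.
    CODE_DICTIONARY: Dict[str, Callable[[str, Tuple[str, ...]], str]] = {
        "ld_i32": lambda dst, ops: f"%{dst} = call i32 @__ldl_mmu(i32 %{ops[0]}, i32 1)",
        "add_i32": lambda dst, ops: f"%{dst} = add i32 %{ops[0]}, {ops[1]}",
        "st_i32": lambda dst, ops: f"call void @__stl_mmu(i32 %{ops[1]}, i32 %{ops[0]}, i32 1)",
    }


    def to_llvm(block: List[MicroOp]) -> List[str]:
        """Translate a translation block by looking up every micro-operation."""
        return [CODE_DICTIONARY[opcode](dst, ops) for opcode, dst, ops in block]


    if __name__ == "__main__":
        for line in to_llvm(decompose_inc_mem("eax")):
            print(line)

An emulation helper would fit the same scheme as one more dictionary entry whose template is a call to the compiled helper function.
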
Third, the translator packages the sequence of LLVM instructions into an LLVM function that is *equivalent* to the original basic block taken from the binary. More precisely, given the same register and memory input, the translated code produces the same output as what the original binary does if executed on a real processor.

To illustrate this process, let us consider the following function. This function invokes the exit system call with a status code passed as a parameter on the stack. The function is composed of two basic blocks: one starting at address ``0x804860C`` and another one at ``0x8048618``.

.. code-block:: asm

    .text:0804860C ; int __cdecl sub_804860C(int status)
    .text:0804860C sub_804860C     proc near
    .text:0804860C
    .text:0804860C
    .text:0804860C status          = dword ptr  4
    .text:0804860C
    .text:0804860C                 mov     eax, 1
    .text:08048611                 push    ebx
    .text:08048612                 mov     ebx, [esp+4+status] ; status
    .text:08048616                 int     80h             ; LINUX - sys_exit
    .text:08048616 sub_804860C     endp
    .text:08048616
    .text:08048618 ; ---------------------------------------------------------------------------
    .text:08048618                 pop     ebx
    .text:08048619                 retn

Revgen turns these two blocks into two LLVM functions that look like this:

.. code-block:: llvm

    define i64 @tcg-llvm-tb-804860c-c-a3-0-4000b7(%struct.CPUX86State* nocapture) local_unnamed_addr #17 {
      %2 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 5

      ; mov eax, 1
      %3 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 0, i64 0
      store i32 1, i32* %3, align 4

      ; push ebx
      %4 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 0, i64 3
      %ebx = load i32, i32* %4, align 4, !s2e.pc !377
      %5 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 0, i64 4
      %esp = load i32, i32* %5, align 4, !s2e.pc !377
      %6 = add i32 %esp, -4, !s2e.pc !378
      tail call void @__stl_mmu(i32 %6, i32 %ebx, i32 1), !s2e.pc !377

      ; mov ebx, [esp+4+status]
      store i32 %6, i32* %5, align 4
      %7 = add i32 %esp, 4, !s2e.pc !378
      %8 = tail call i32 @__ldl_mmu(i32 %7, i32 1), !s2e.pc !378
      store i32 %8, i32* %4, align 4

      ; int 0x80
      store i32 134514198, i32* %2, align 4
      tail call void @helper_raise_interrupt(i32 128, i32 2)
      ret i64 0
    }

    define i64 @tcg-llvm-tb-8048618-2-99-0-4000b7(%struct.CPUX86State* nocapture) local_unnamed_addr #17 {
      ; pop ebx
      %2 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 5
      %3 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 0, i64 4
      %esp = load i32, i32* %3, align 4, !s2e.pc !379
      %4 = tail call i32 @__ldl_mmu(i32 %esp, i32 1), !s2e.pc !379
      %5 = add i32 %esp, 4, !s2e.pc !380
      store i32 %5, i32* %3, align 4

      ; retn
      %6 = getelementptr %struct.CPUX86State, %struct.CPUX86State* %0, i64 0, i32 0, i64 3
      store i32 %4, i32* %6, align 4
      %7 = tail call i32 @__ldl_mmu(i32 %5, i32 1), !s2e.pc !380
      %8 = add i32 %esp, 8, !s2e.pc !380
      store i32 %8, i32* %3, align 4
      store i32 %7, i32* %2, align 4
      ret i64 0
    }

Each function takes a pointer to a ``CPUX86State`` structure. This structure models the CPU's register file. All machine instructions are translated into LLVM instructions that operate on this CPU state structure.

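
To make the role of the CPU state structure more concrete, here is a minimal Python model of the second block (``pop ebx`` followed by ``retn``). The ``CPUState`` class, the ``ldl`` function, and the dictionary-based memory are hypothetical simplifications made for illustration. The real ``CPUX86State`` structure contains far more than general-purpose registers and a program counter.

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import Dict, List

    EAX, ECX, EDX, EBX, ESP = 0, 1, 2, 3, 4  # standard x86 register numbering
    MASK32 = 0xFFFFFFFF


    @dataclass
    class CPUState:
        """Hypothetical, heavily simplified stand-in for CPUX86State."""
        regs: List[int] = field(default_factory=lambda: [0] * 8)
        eip: int = 0


    def ldl(memory: Dict[int, int], addr: int) -> int:
        """Toy 32-bit load that stands in for a memory helper call."""
        return memory[addr]


    def block_8048618(env: CPUState, memory: Dict[int, int]) -> None:
        """Model of the block at 0x8048618: pop ebx; retn."""
        esp = env.regs[ESP]
        env.regs[EBX] = ldl(memory, esp)               # pop ebx
        env.eip = ldl(memory, (esp + 4) & MASK32)      # retn: load the return address
        env.regs[ESP] = (esp + 8) & MASK32             # two 4-byte pops in total


    if __name__ == "__main__":
        env = CPUState()
        env.regs[ESP] = 0x1000
        block_8048618(env, {0x1000: 0x2A, 0x1004: 0x8048700})
        print(hex(env.regs[EBX]), hex(env.eip), hex(env.regs[ESP]))

As in the LLVM listing, the block does not return a value of interest: all of its effects are visible as updates to the state object passed in.
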
To handle memory accesses, the translator emits calls to ``__stX_mmu`` and ``__ldX_mmu`` helpers. We will explain later why the translator generates these instead of native LLVM load/store instructions. The ``int 0x80`` instruction is complex and the translator calls the ``helper_raise_interrupt`` function to handle it.

Stitching basic blocks into functions
-------------------------------------

Now that Revgen has created a set of LLVM functions that represent individual basic blocks of the binary, it needs to assemble them into a larger function that represents the original function of the binary. This is straightforward: Revgen creates a new LLVM function and fills it with calls to the translated basic blocks. The example above would look like this:

.. code-block:: llvm

    define i64 @__revgen_sub_804860c_804860c() local_unnamed_addr #0 {
      %1 = getelementptr %struct.CPUX86State, %struct.CPUX86State* @myenv, i64 0
      br label %2

    ;