Verifying the integrity of a compiler

Introduction: the catch-22 of compiler integrity.

In a seminal paper, Ken Thompson, for his Turing Award lecture in 1984 (!) pointed out a fundamental difficulty of verifying the executable code of a compiler. Indeed, there is a chicken-and-egg problem with a compiler. Even though one may have written a compiler in source code, to actually make the compiler executable, one... needs a compiler executable. Thompson, in his lecture, demonstrated that even though the source code of a compiler may be perfectly "clean", a corrupted binary executable compiler can compile this code into yet another corrupted binary like itself. The corruption can consist of any back door or other Trojan horse that will be propagated into any code the Trojan was targeted for, amongst which, of course, all variants of the compiler itself. Thompson pointed out that no matter what scrutiny one exercises towards the source code of the entire system, this kind of back door or other trojan horse attack is hard (he meant probably "impossible") to discover, apart from studying the binary code of the compiler itself. But modern compilers are very complex systems of which the complete binary code study is out of the question. A well-hidden Trojan horse will be impossible to detect.

Countering the attack: Diverse Double compilation

However, a strategy to test such a corrupt compiler binary has been exposed by David A. Wheeler in his PhD. paper. The idea is the following: a "good" binary executable does exactly what its source code (and the language definition) prescribes, no matter how this binary is constructed. As a compiler is a particular program that takes as input "source code" and has as output a binary executable constructed form the input exactly as its own source code describes. The relationship between input source code, and output binary, is hence perfectly described by the source code of the compiler. A given compiler (source code) describes what binary file should be computed on the basis of a given input source code.

The following relation:

(source code A) -> [ compiler binary X] -> (binary executable XA)

is perfectly described by the compiler source code which describes what the compiler binary should do.

If we have another compiler, then of course the same source code will be compiled in another binary executable.

(source code A) -> [ compiler binary Y ] -> (binary executable YA)

There is no reason why the binary executable YA would be identical to the binary executable XA. There are different ways of making a binary executable. However, in as much as compilers X and Y are both a correct implementation of the language L, binary executables XA and YA should behave in exactly the way the source code A (and the language L) describes.

Two different compilers A and B can generate different binary executables when presented with the code

#include <stdio.h>
 
int main()
{
  printf("Hello world\n");
  return 0;
}

But both of these executables should print "Hello world" on the standard output if both compilers are a correct implementation of the C language. One shouldn't add a carriage return, while the other doesn't. One shouldn't write "Hello World". The output of these two programs should be byte-by-byte identical.

Now, a compiler is a program like any other, which behaviour is described by its source code. As such, we have:

(source code of X) -> [compiler binary X] -> (binary executable XX) which is itself a compiler.

(source code of X) -> [compiler binary Y} -> (binary executable YX) which is itself a compiler.

The compiler executables XX and YX are different binaries, but they are both executables that correspond to the source code X. As such, if we compile any source code A with XX or with YX, we should obtain the same binary executable, because this is exactly described by source code of X, it is like the output text "Hello world" of the smaller program:

(source code of A) -> [compiler binary XX] -> (binary XXA)

(source code of A) -> [compiler binary YX] -> (binary YXA)

both should be identical.

If we apply this now to the source code of X itself:

(source code of X) -> [compiler binary XX] -> (binary XXX)

(source code of X) -> [compiler binary YX] -. (binary YXX)

should be identical.

How can this detect a corrupted Trojan horse binary ? A Trojan horse binary is such that it detects that it is compiling a compiler source code, and makes a different compiler (otherwise, the Trojan doesn't propagate).

As such, suppose that we have a corrupted compiler binary H that pretends to be the compiler X.

We must have:

(source code X) -> [compiler H] -> (executable H)

Let us now take the situation where compiler Y is correct.

(source code X) -> [compiler Y] -> (binary executable YX)

Applying this again to the source code of X, we will have:

(source code X) -> [ compiler binary H} -> executable H

(source code X) -> [compiler binary YX] -> executable YXX

Of course, executable YXX and H are NOT the identical because H contains the binary code of the Trojan horse. So if we find that these binaries are not identical, that is in any case an indication that compiler X or Y are not an identical implementation of the language L.

One could consider that the Trojan horse compiler H is in fact an implementation of a different language LH, a language that is such that it is defined to have the behaviour of the Trojan horse in those particular cases where the Trojan horse is supposed to be active (say, opening a back door).

The verification that two binary files are identical, is much easier than the verification of whether two binary executables are functionally equivalent. This is Wheeler's essential contribution.

But how do we know that compiler Y is correct ? If we knew from the start that compiler Y was correct, there was no problem: we would compile X with Y and that's it. Wheeler's point is that compiler Y may just as well contain a Trojan horse, from the moment that it is not exactly the same Trojan horse.

Indeed, if compiler Y is a Trojan horse containing compiler G, then we would have:

(source code of Y) -> [compiler binary G] -> executable G

However, compiler binary G presented with the source code of X may:

not recognize that X is a compiler and hence not insert the necessary compiling instructions to put a Trojan horse compiler in it
even if it sees that X is a compiler, not know how to modify it so that the Trojan horse is correctly implemented in this totally different compiler

Indeed, the Trojan horse writer should anticipate how another (maybe not yet existing) compiler is going to work in order to be able to modify the binary code it will generate in such a way that it will do exactly as G.

Hence, we have:

(source code of X) -> [ compiler binary G] -> executable G' X with G' the absent, or erroneously implemented horse in another compiler code the Trojan horse writer didn't know about.

As such,

(source code of X) -> [compiler binary H] -> (compiler binary H]

(source code of X) -> [compiler G' X] -> (executable G" X)

In as much as G' is non-functional or absent, we may have a correct executable X, or we may have still more deviating G-Horse code in G".

But in any case, the chances that (compiler binary H) and (executable G" X) are identical, are very slim.

As such, a difference will be produced, even if both compiler binaries were corrupted, as long as their corruption is independent or non-compatible.

Demonstration with the Tiny-C compiler.

We can use the virtual machine of ubuntu 13.10 that we used with the Shell Shock vulnerability, or we can install, following a very similar procedure, another (server) version. In fact, I did the test on many different systems, and the first ubuntu system on which everything works well is Dapper Drake (from 2006). One can download and install Dapper Drake from here. The idea is to install the server version (in order to have a very light virtual machine). The reason to use an old installation is that it is less likely that a Trojan horse existed in the binary of the C compiler in 2006 that knew about tcc as it was written in 2009, than in 2013. One can follow the tutorial from the Shell Shock demonstration until one has modified the /etc/apt/sources.list file (changing the archive into old-releases).

It might be necessary to change the Virtual Box storage from SATA into IDE (deleting the SATA drives and adding the virtual drive to the IDE connection), if during installation, the partitioning doesn't work.

Of course, this time we don't have to configure a web server. One only needs to install the gcc compiler suite, and the texinfo package:

sudo apt-get install gcc

sudo apt-get install build-essential

sudo apt-get install texinfo

sudo apt-get install wget

should now work.

One can make a tcc.d directory to work in. In this directory, one fetches the tcc compiler source code as follows:

wget http://download.savannah.gnu.org/releases/tinycc/tcc-0.9.26.tar.bz2

In order to unwrap it, one has to give the following commands:

bzip2 -kd tcc-0.9.26.tar.bz2

tar xvf tcc-0.9.26.tar

Going into the directory, one can now compile the tcc compiler source code:

./configure

make

This compiles the tcc source code with the gcc compiler.

The resulting tcc executable has a length of 385632 bytes on my 32-bit installation of Dapper.

We can verify that the compiler tcc works:

make test

and finally

make install

to put the executable and the libtcc1.a library in the system (as well as the documentation).

We can now clean out our directory:

make clean

deletes all the generated files so that we can start over with a clean slate (except that tcc has now been system-installed).

We will now make tcc, but with tcc:

./configure --cc=tcc

make

will generate, this time, the tcc executable with tcc. This executable has 489020 bytes.

make test

will verify whether this executable works well.

If we now install this executable system-wide:

sudo make install

we have now another version of the tcc compiler in the system.

We can go for another cycle:

make clean

./configure --cc=tcc

make

and... LO AND BEHOLD !

This time the tcc executable has only 486564 bytes !

it turns out that Wheeler's DDC test failed !

Indeed, we are now comparing the output of the tcc source code compiled with (tcc compiled with gcc) with the output of the tcc source code compiled with (tcc compiled with tcc). This was exactly Wheeler's test, and it failed ! Howdy, the gcc compiler binary was hence corrupt, in 2006 !

Not so fast.

It turns out that there is a problem with the way tcc works as a loader.

tcc can work as a compiler but also as a loader (just as gcc). It turns out that when tcc works as a loader, it inserts a copy of binary code from the installed libtcc1.a into the executable he's building.

But that mean that when we compiled tcc with tcc by gcc, the installed libtcc1.a library in the system was the one compiled with gcc. This piece of gcc binary code is copied in the output: the famous tcc compiler with length 489020 bytes. However, during this compilation, a new libtcc1.a library is also prepared, this time with object code generated with tcc. At this install, the installed gcc-code libtcc1.a library is now overwritten with a tcc compiled library. So if we compile again, this time, the tcc compiler will insert pieces of code coming from this tcc-compiled library.

So one shouldn't, in fact, be surprised that the binary code of tcc, containing still pieces of gcc-compiled code, is different from the binary code of tcc, where everything came from tcc.

The assumption of Wheeler was of course that the program wasn't going to copy pieces of his own binary files into the output.

The DDC test, revisited

So how do we solve this ?

First, we go back to the gcc compiled code:

make clean

./configure

make

sudo make install

Again, the gcc-compiled tcc is installed in the system.

With this we compile tcc a first time:

make clean

./configure --cc=tcc

make

Note that the tcc compiler executable has length 489020 bytes.

The trick is now to only copy the libtcc1.a library (29670 bytes long) into the system, replacing the libtcc1.a library of 40020 bytes long that had been compiled by gcc, but by leaving the gcc-build tcc executable installed in the system:

sudo cp libtcc1.a /usr/local/lib/tcc

We will rebuild tcc again with tcc. Note that the system-installed tcc is still the gcc compiled version.

We compile the source code again:

make clean

./configure --cc=tcc

make

Note that this time, the resulting tcc has length 486564 bytes (and the libtcc1.a library is 29670 bytes long).

We can put this tcc aside:

cp tcc tccB

install this tcc version (this time installing the tcc-compiled tcc in the system)

sudo make install

and go through another cycle

make clean

./configure --cc=tcc

make

We can verify Wheeler's DDC this time:

diff tcc tccB

and this time, we see that the two binaries are identical.

The binary that has been compiled with the tcc that was constructed with gcc, is identical with the binary that was compiled with tcc constructed with tcc itself.

Conclusion

What can we conclude from this ?

The only "bootstrapping" compiler we used in this story is the gcc binary. We only imported the tcc source code, and never installed "unknown" tcc binary code. So the test pertains only to the a priori binary of gcc. However, we didn't compile gcc itself with tcc (that would be a good idea, but the source code of gcc is very very large and complex). So we didn't do the full Wheeler test on gcc. But we did do the test on tcc with gcc, and we can hence conclude that there are three possibilities;

gcc containing a Trojan horse knew perfectly well how to systematically install a working and self-copying Trojan horse in the functioning of another compiler (tcc), in such a way that this corrupted tcc binary generates exactly the same Trojan horse when it compiles itself.
gcc didn't realize that tcc was a compiler, and the Trojan horse was deactivated, so that gcc worked correctly on this code (the Trojan in gcc remained dormant).
gcc was clean.

The first option seems extremely unlikely with a gcc from 2006, and a tcc from 2009. To really be sure, one should use an even older compiler. So we can probably safely exclude the first option.

To eliminate the second option, one should actually do the test on gcc, and not on tcc (with gcc source code). We didn't do it but it should be doable. However, whether it is option 2 or 3, doesn't matter, in the following sense: our tcc binary is in any case a clean compiler.

This was just an exercise, to make the reader aware of the potential danger that resides in corrupted compilers. The corrupted compiler attack is very powerful because it can put dormant code (back doors) in a lot of applications from very trusted origin if they happen to use a corrupted compiler binary. Although, as we've argued, inspecting source code is not sufficient to counter this attack, access to the source code is essential to use the Diverse Double Compiling test. Open source is no guarantee of security. Absence of open source is guarantee of vulnerability.