Optimizing for Size

Many modern C compilers offer the option of optimizing the code you are compiling. That's great. You write your code and the compiler takes care of making it run fast for you. But sometimes you don't need your compiler to produce a program that runs as fast as possible, but to generate the smallest binary size. Compression of some sort, if you apply that term liberally.

Compressing with cc?

Crazy idea: If the compiler is good at making code smaller, can it also do it with arbitrary data? For that we have to make compileable code out of that data first.

Let's write a script to do that!

#!/usr/bin/env python3
import sys

with open(sys.argv[1], 'rb') as f:
    data = f.read()

print(f'''
#include <stdio.h>

int
main(void)
{{
    static char buf[{len(data)}];
''')

for i, c in enumerate(data):
    print(f'    buf[{i}] = {c};')

print('''
fwrite(buf, 1, sizeof(buf), stdout);
return 0;
}
''')

It generates code like this:

#include <stdio.h>

int
main(void)
{
    static char buf[431888];

    buf[0] = 239;
    buf[1] = 187;
    buf[2] = 191;
    buf[3] = 84;

    [...]

    fwrite(buf, 1, sizeof(buf), stdout);
    return 0;
}

Let's try it

Let's try this out on a book I downloaded from Project Gutenberg.

Both clang and gcc offer an option to optimize for size (-Os, -Oz) and I already have them installed on my machine, so that's what I'm going to use here.

gcc

$ gcc --version
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ time gcc -Os -o a.out.gcc pg2162.c

real    5m59.957s
user    5m57.199s
sys 0m0.939s

clang

$ clang --version
clang version 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ time clang -Os -o a.out.clang.-Os pg2162,c
pg2162.c:9:14: warning: implicit conversion from 'int' to 'char' changes value from 239 to -17 [-Wconstant-conversion]
    buf[0] = 239;
           ~ ^~~
pg2162.c:10:14: warning: implicit conversion from 'int' to 'char' changes value from 187 to -69 [-Wconstant-conversion]
    buf[1] = 187;
           ~ ^~~
pg2162.c:11:14: warning: implicit conversion from 'int' to 'char' changes value from 191 to -65 [-Wconstant-conversion]
    buf[2] = 191;
           ~ ^~~
3 warnings generated.

real    4m17.739s
user    4m15.006s
sys 0m1.274s

Interestingly clang emits warnings, even though no options have been provided. I think that's valuable, as it allows to find problems with your code early on. Also, chars are signed. Sometimes.

Conclusion

Misusing a compiler for compression does not seem to be viable. A lot of effort has gone into making compilers generate optimal code, but those optimizations are designed to work in the context of source/machine code, not as a general compression algorithm.

808K    a.out.clang.Os          # clang -Os
808K    a.out.clang.Oz          # clang -OZ
908K    a.out.gcc               # gcc -Os
424K    pg2162.txt              # The original text file
164K    pg2162.txt.gz           # gzip compressed (default options)

Pity for us, but not entirely surprising.


Posted on Dec 19, 2019 by Martin Natano