Profile-guided dead code elimination

% du -h hello hello_trimmed
168K    hello
120K    hello_trimmed
% du -h hono hono_trimmed
668K    hono
296K    hono_trimmed
% du -h elysia elysia_trimmed
5.9M    elysia
1.3M    elysia_trimmed
% du -h h3 h3_trimmed
508K    h3
296K    h3_trimmed

We can shrink Porffor binary sizes by ~1.5-5x. How? Profile-guided dead code elimination.

Dead code elimination is hard in dynamic languages

import hugeDependency from 'huge-dependency';
function foo(arg) {
  if (arg == null) return true;
  return hugeDependency.foo(arg);
}

const anArgument = // something extremely hard to infer
foo(anArgument);

Statically, we have to presume hugeDependency.foo is used, since we cannot know the type of anArgument at compile time. This might mean pulling in, say, 1 million lines of JS, and as the compiler we cannot do anything about it. This sucks. Unless...

Profile-guided DCE

As we cannot know statically, what if we just find out at runtime? Assuming a spherical-cow deterministic program, we can run it once with profiling to record every function call, delete any functions that were never called, and compile again. This is similar to profile-guided optimization, but used purely for dead code elimination.
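A minimal sketch of the idea (purely illustrative, not Porffor's actual implementation): instrument each function to record whether it was ever called during the profiling run, then treat anything absent from the profile as dead.

```javascript
// Hypothetical sketch of profile-guided DCE: instrument, run, prune.
// All names here are made up for illustration.

const callProfile = new Set();

// Wrap a function so each call records its name in the profile.
function instrument(name, fn) {
  return (...args) => {
    callProfile.add(name);
    return fn(...args);
  };
}

// A tiny "program" with two functions, only one of which is ever called.
const used = instrument('used', (x) => x + 1);
const unused = instrument('unused', (x) => x * 2);

used(41); // the profiling run

// After the run, any function absent from the profile is a candidate
// for elimination before the second, final compilation.
const allFunctions = ['used', 'unused'];
const dead = allFunctions.filter((name) => !callProfile.has(name));
console.log(dead); // prints [ 'unused' ]
```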

Collecting the profile is by far the hardest part here: not the collection itself, but deciding how and what to collect. For a web server, for example, you would have to request every route with every possible signature (parameters, etc.), presuming no differences due to state, a database, or ...
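To make that concrete, the profiling run for a server might just be a harness that exercises every route with representative parameters. A hypothetical sketch (the route list, names, and URLs are all invented for illustration):

```javascript
// Hypothetical profiling harness: hit every route (with representative
// parameters) so the profiling run covers all the code paths we care about.
const routes = [
  { method: 'GET', path: '/' },
  { method: 'GET', path: '/users/123' }, // representative parameter
  { method: 'POST', path: '/users' },
];

// fetchFn is injectable so the harness can run without a live server.
async function exercise(baseUrl, fetchFn = fetch) {
  for (const { method, path } of routes) {
    await fetchFn(baseUrl + path, { method });
  }
}
```

The hard part the sketch glosses over is exactly the point above: enumerating "every route and parameter" is easy for a toy, and much harder once state or a database is involved.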

But for relatively deterministic programs, this method of simply eliminating every uncalled function works great. There is also another method that works for all programs and still brings a benefit.

Profile-guided selective optimization

We can do "DCE-lite" instead: rather than eliminating functions that were never called in our profile, we just optimize them differently. Functions known to be called are optimized with -O3, while unknown functions are optimized with -Oz. This sounds eerily similar to just-in-time compilation!
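In other words, the profile becomes a per-function optimization policy. A rough sketch (the function names and the 'O3'/'Oz' labels are illustrative stand-ins, not Porffor's actual flags or internals):

```javascript
// Hypothetical sketch: map each function to an optimization level
// based on whether it appeared in the collected profile.
// 'O3' = optimize for speed, 'Oz' = optimize for size.
function chooseOptLevels(allFunctions, profile) {
  const levels = new Map();
  for (const name of allFunctions) {
    // Seen during profiling → speed; never seen → size.
    levels.set(name, profile.has(name) ? 'O3' : 'Oz');
  }
  return levels;
}

const profile = new Set(['handleRequest', 'parseQuery']);
const levels = chooseOptLevels(
  ['handleRequest', 'parseQuery', 'rareErrorPath'],
  profile
);
console.log(levels.get('handleRequest')); // prints O3
console.log(levels.get('rareErrorPath')); // prints Oz
```

Unlike strict elimination, a wrongly-unprofiled function here still exists and still runs correctly; it is merely slower and smaller, which is why this variant is safe for all programs.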

Using this method, we get roughly half the gain of elimination:

% du -h hono hono_trimmed_safe hono_trimmed
668K    hono
476K    hono_trimmed_safe
296K    hono_trimmed
% du -h elysia elysia_trimmed_safe elysia_trimmed
5.9M    elysia
3.3M    elysia_trimmed_safe
1.3M    elysia_trimmed

Benefits

Not only does this help binary size, it slightly helps memory usage and performance too!

./hello_trimmed
  Latency             #1    median 0.177 ms, p90 0.330 ms, p99 0.394 ms
  Throughput (req/s)  #1    overall 249357.4, p10 232781.0, p1 101905.8

  Memory usage        #1    peak 2.16 MB, median 2.16 MB
  CPU time / req      #1    median 3.98 µs, p90 4.00 µs, p99 4.00 µs
  Binary size         #1    117.4 KB
  Startup             #1    7.526 ms


./hello
  Latency             #2    median 0.179 ms, p90 0.333 ms, p99 0.410 ms
  Throughput (req/s)  #2    overall 245919.9, p10 231727.5, p1 115515.0

  Memory usage        #2    peak 2.19 MB, median 2.17 MB
  CPU time / req      #2    median 4.02 µs, p90 4.10 µs, p99 4.15 µs
  Binary size         #2    165.8 KB
  Startup             #2    7.602 ms

The only downside, aside from potential crashes in strict (elimination) mode and potential slowdown in optimization mode, is compilation time. Although, in strict mode, post-elimination compilation time is greatly improved, as there is much less to compile. This could possibly be cached in future, resulting in much faster compilation than without.