SSAPRE stands for "SSA based Partial Redundancy Elimination".

The algorithm is explained in this paper:

Partial Redundancy Elimination in SSA Form (1999)
Robert Kennedy, Sun Chan, Shin-Ming Liu, Raymond Lo, Peng Tu, Fred Chow
ACM Transactions on Programming Languages and Systems

http://citeseer.ist.psu.edu/kennedy99partial.html

In this document I give a gentle introduction to the concept of "partial"
redundancies, and I explain the basic design decisions I took in implementing
SSAPRE, but the paper is essential to understand the code.

Partial Redundancy Elimination (or PRE) is an optimization that (guess what?)
tries to remove redundant computations.
It achieves this by saving the result of "not redundant" evaluations of
expressions into specially created temporary variables, so that "redundant"
evaluations can be replaced by a load from the appropriate variable.

Of course, on register starved architectures (x86) a temporary could cost more
than the evaluation itself... PRE guarantees that the live range of the
introduced variables is the minimum possible, but the added pressure on the
register allocator can be an issue.

The nice thing about PRE is that it not only removes "full" redundancies, but
also "partial" ones.
A full redundancy is easy to spot, and straightforward to handle, like in the
following example (in every example here, the "expression" is "a + b"):

int FullRedundancy1 (int a, int b) {
        int v1 = a + b;
        int v2 = a + b;
        return v1 + v2;
}

PRE would transform it like this:

int FullRedundancy1 (int a, int b) {
        int t = a + b;
        int v1 = t;
        int v2 = t;
        return v1 + v2;
}

Of course, either a copy propagation pass or a register allocator smart enough
to remove unneeded variables would be necessary afterwards.

Another example of full redundancy is the following:

int FullRedundancy2 (int a, int b) {
        int v1;

        if (a >= 0) {
                v1 = a + b; // BB1
        } else {
                a = -a; // BB2
                v1 = a + b;
        }

        int v2 = a + b; // BB3
        return v1 + v2;
}

Here the two expressions in BB1 and BB2 are *not* the same thing (a is
modified in BB2), but both are redundant with the expression in BB3, so the
code can be transformed like this:

int FullRedundancy2 (int a, int b) {
        int v1;
        int t;

        if (a >= 0) {
                t = a + b; // BB1
                v1 = t;
        } else {
                a = -a; // BB2
                t = a + b;
                v1 = t;
        }

        int v2 = t; // BB3
        return v1 + v2;
}

Note that there are still two occurrences of the expression, while it can
easily be seen that one (at the beginning of BB3) would suffice.
This, however, is not a redundancy for PRE, because there is no path in the
CFG where the expression is evaluated twice.
Maybe this other kind of redundancy (which affects code size, not the
computations that are actually performed) could be eliminated by code hoisting,
but I should check that; in any case, it is not a PRE related thing.

An example of partial redundancy, on the other hand, is the following:

int PartialRedundancy (int a, int b) {
        int v1;

        if (a >= 0) {
                v1 = a + b; // BB1
        } else {
                v1 = 0; // BB2
        }

        int v2 = a + b; // BB3
        return v1 + v2;
}

The redundancy is partial because the expression is computed more than once
along some paths in the CFG, but not along all of them.
In fact, on the path BB1 - BB3 the expression is computed twice, but on the
path BB2 - BB3 it is computed only once.
In this case, PRE must insert new occurrences of the expression in order to
obtain a full redundancy, and then use temporary variables as before.
Adding a computation in BB2 would do the job, as shown below.

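This is, roughly, the resulting code; note that the temporary is now
initialized on every path, so the load in BB3 is always correct:

int PartialRedundancy (int a, int b) {
        int v1;
        int t;

        if (a >= 0) {
                t = a + b; // BB1
                v1 = t;
        } else {
                v1 = 0; // BB2
                t = a + b; // inserted computation
        }

        int v2 = t; // BB3
        return v1 + v2;
}
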
One nice thing about PRE is that loop invariants can be seen as partial
redundancies.
The idea is that you can get into the loop from two locations: from before the
loop (at the 1st iteration), and from inside the loop itself (at any other
iteration). If there is a computation inside the loop that is in fact a loop
invariant, PRE will spot this, and will handle the BB before the loop as a
place where to insert a new computation to get a full redundancy.
At this point, the computation inside the loop would be replaced by a use of
the temporary stored before the loop, effectively performing "loop invariant
code motion".

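As a small sketch of my own (not an example taken from the paper), consider a
loop where "a + b" is invariant:

int LoopInvariant (int a, int b, int n) {
        int sum = 0;
        int i;

        for (i = 0; i < n; i++) {
                // "a + b" is recomputed at every iteration, but its
                // operands never change inside the loop
                sum = sum + (a + b);
        }

        return sum;
}

The occurrence is redundant along the back edge path but not along the loop
entry path, so PRE inserts the computation in the BB before the loop and
replaces the occurrence inside it:

int LoopInvariant (int a, int b, int n) {
        int sum = 0;
        int i;
        int t = a + b; // inserted before the loop

        for (i = 0; i < n; i++) {
                sum = sum + t;
        }

        return sum;
}
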
Now, this is what PRE does to the code.

But how does it work?

In "classic" solutions, PRE is formulated as a data flow analysis problem.
|
|
The Muchnick provides a detailed description of the algorithm in this way (it
|
|
is by far the most complex problem of this kind in the whole book).
|
|
The point is that this algorithm does not exploit the SSA form.
|
|
In fact, it has to perform all that amount of data flow analysis exactly
|
|
because it does not take advantage of the work already done if you have put
|
|
the program into SSA form.
|
|
|
|
The SSAPRE algorithm, on the other hand, is designed with SSA in mind.
It fully exploits the properties of SSA variables, it explicitly reuses some
data structures that must have been computed when building the SSA form, and
it takes great care to emit its output already in that form (which means that
the temporaries it introduces are already "versioned", with all the phi
variables correctly placed).

The main concept used in this algorithm is the "Factored Redundancy Graph" (or
FRG for short). Basically, for each given expression, this graph shows all its
occurrences in the program, and redundant ones are linked to their
"representative occurrence" (the 1st occurrence met in a CFG traversal).
The central observation is that the FRG is "factored" because each expression
occurrence has exactly one representative occurrence, in the same way as the
SSA form is a "factored" use-definition graph (each use is related to exactly
one definition).
And in fact building the FRG is much like building an SSA representation, with
PHI nodes and all.
By the way, I use "PHI" for "PHI expression occurrences", and "phi" for the
usual phi definitions in SSA, because the paper uses an uppercase Greek phi
letter for PHI occurrences (while the lowercase one is standard in SSA).

The really interesting point is that the FRG for a given expression has exactly
the same "shape" as the use-definition graph of the temporary variable that
must be introduced to remove the redundancy, and this is why SSAPRE can easily
emit its output code in correct SSA form.

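To make this concrete, here is an illustration of mine (pseudo code, with
explicit variable versions and phi functions written out) of the
PartialRedundancy example in SSA form; the phi of the temporary t sits
exactly where the FRG has its PHI node, at the join point in BB3:

int PartialRedundancy (int a, int b) {
        if (a >= 0) {
                t1 = a + b;        // BB1: real occurrence
                v1_1 = t1;
        } else {
                v1_2 = 0;          // BB2
                t2 = a + b;        // inserted occurrence
        }

        // BB3: the join point where the FRG placed its PHI
        t3 = phi (t1, t2);
        v1_3 = phi (v1_1, v1_2);
        v2 = t3;
        return v1_3 + v2;
}
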
One particular characteristic of the SSAPRE algorithm is that it is "sparse",
in the sense that it operates on each expression individually, looking only at
the specific nodes it needs. This is in contrast to the classical way of
solving data flow analysis problems, which is to operate globally, using data
structures that work "in parallel" on all the entities involved (in practice
bit sets, with, for instance, one bit per variable, or one bit per expression
occurrence, you get the idea). This is handy (it exploits the parallelism of
bit operations), but in general setting up those data structures is time
consuming, and if the number of things represented by the bits is not fixed
in advance the code can get quite messy (the bit sets must become growable).
Moreover, applying special handling to individual expressions becomes very
tricky.

SSAPRE, on the other hand, looks at the whole program (method, in our case)
only once, when it scans the code to collect (and classify) all expression
occurrences.
From here on, it operates on one expression at a time, looking only at its
specific data structures (which, for each expression, are much smaller than
the whole program, so the operations are fast).

This approach has another advantage: the data structures used to operate on
one expression can be recycled when operating on other expressions, lowering
the memory usage of the compiler and (also important) avoiding time wasted in
memory allocations.
This reflects directly on the design of those data structures.
These advantages are easier to see by following which data structures are
used during the application of SSAPRE to a method.

The steps that are performed are the following:

* Initialization: scan the whole program, collect all the occurrences, and
  build a worklist of expressions to be processed (each worklist entry
  describes all the occurrences of the given expression).
  Here the data structures are the following:
  - One struct (the working area), containing the worklist and other
    "global" data. The worklist itself contains an entry for each expression,
    which in turn has an entry for each occurrence.
  - One "info" struct for each BB, containing interesting dominance and CFG
    related properties of the BB.

Then, for each entry in the worklist, these operations are performed:

* PHI placement: find where the PHI nodes of the FRG must be placed.
* Renaming: assign appropriate "redundancy classes" to all occurrences (it
  is like assigning variable versions when building an SSA form).
* Analyze: compute various flags on the PHI nodes (which are the only places
  that define where additional computations may be added).
  Conceptually this is composed of two data flow analysis passes, which in
  practice only scan the PHI nodes in the FRG, not the whole code, so they
  are not that heavy.
* Finalize: make the FRG exactly like the use-def graph of the temporary
  that will be introduced (it generally must be reshaped a bit according to
  the flags computed in the previous step).
  This is also made of two steps, but more for implementation reasons than
  for conceptual ones.
* Code motion: actually update the code using the FRG.
  Here, what's needed is the following:
  - PHI occurrences (and some flags about them)
  - PHI argument occurrences (and some flags about them)
  - The renaming stack

In practice, one can observe that each BB can have at most one PHI (we work
on one expression at a time), and also at most one PHI argument (which we
consider as occurring at the end of the BB). Therefore, we do not have
separate structures for these, but store them directly in the BB infos
(which are kept for the whole SSAPRE invocation), as sketched below.
The renaming stack is managed directly as a list of occurrences, with
special handling for PHI nodes (which, being represented directly by their
BB, are not "occurrences").

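As a rough sketch of these BB infos (the names here are made up for
illustration; they are not the identifiers used in the actual code):

typedef struct BBInfo {
        /* Dominance and CFG related properties of the BB */
        int dfs_index;
        struct BBInfo *idominator;

        /* The (at most one) PHI of the current expression in this BB */
        int has_phi;
        int phi_redundancy_class;  /* assigned during renaming */
        int phi_flags;             /* computed by the "analyze" step */

        /* The (at most one) PHI argument, considered as occurring
         * at the end of the BB */
        int has_phi_argument;
        int phi_argument_redundancy_class;
} BBInfo;
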
So far, the only two missing things (with respect to SSAPRE as described in
the paper) are the elimination of unneeded PHIs and the handling of
"composite" expressions.
Otherwise, the implementation is complete.

Other interesting issues are:
- SSAPRE assumes that:
  - each SSA variable is related to one "original" (not SSA) variable, and
  - no more than one version of each original variable is live at the same
    time in the CFG.
  It would be better to relax these assumptions.
- SSAPRE operates on "syntactic" redundancies, not on "values".
  GVNPRE (or some other way of merging GVN into PRE) would be a nice
  alternative, see
  "http://www.cs.purdue.edu/homes/vandrutj/papers/thesis.pdf".