a575963da9
Former-commit-id: da6be194a6b1221998fc28233f2503bd61dd9d14
209 lines
8.8 KiB
Plaintext
209 lines
8.8 KiB
Plaintext
|
|
* Proposal for the local register allocator
|
|
|
|
The local register allocator deals with allocating registers
|
|
for temporaries inside a single basic block, while the global
|
|
register allocator is concerned with method-wide allocation of
|
|
variables.
|
|
The global register allocator uses callee-saved register for it's
|
|
purpouse so that there is no need to save and restore these registers
|
|
at call sites.
|
|
|
|
There are a number of issues the local allocator needs to deal with:
|
|
*) some instructions expect operands in specific registers (for example
|
|
the shl instruction on x86, or the call instruction with thiscall
|
|
convention, or the equivalent call instructions on other architectures,
|
|
such as the need to put output registers in %oX on sparc)
|
|
*) some instructions deliver results only in specific registers (for example
|
|
the div instruction on x86, or the call instructionson on almost all
|
|
the architectures).
|
|
*) it needs to know what registers may be clobbered by an instruction
|
|
(such as in a method call)
|
|
*) it should avoid excessive reloads or stores to improve performance
|
|
|
|
While which specific instructions have limitations is architecture-dependent,
|
|
the problem shold be solved in an arch-independent way to reduce code duplication.
|
|
The register allocator will be 'driven' by the arch-dependent code, but it's
|
|
implementation should be arch-independent.
|
|
|
|
To improve the current local register allocator, we need to
|
|
keep more state in it than the current setup that only keeps busy/free info.
|
|
|
|
Possible state information is:
|
|
|
|
free: the resgister is free to use and it doesn't contain useful info
|
|
freeable: the register contains data loaded from a local (there is
|
|
also info about _which_ local it contains) as a result from previous
|
|
instructions (like, there was a store from the register to the local)
|
|
moveable: it contains live data that is needed in a following instruction, but
|
|
the contents may be moved to a different register
|
|
busy: the register contains live data and it is placed there because
|
|
the following instructions need it exactly in that register
|
|
allocated: the register is used by the global allocator
|
|
|
|
The local register allocator will have the following interfaces:
|
|
|
|
int get_register ();
|
|
Searches for a register in the free state. If it doesn't find it,
|
|
searches for a freeable register. Sets the status to moveable.
|
|
Looking for a 'free' register before a freeable one should allow for
|
|
removing a few redundant loads (though I'm still unsure if such
|
|
things should be delegated entirely to the peephole pass).
|
|
|
|
int get_register_force (int reg);
|
|
Returns 'reg' if it is free or freeable. If it is moveable, it moves it
|
|
to another free or freeable register.
|
|
Sets the status of 'reg' to busy.
|
|
|
|
void set_register_freeable (int reg);
|
|
Sets the status of 'reg' to freeable.
|
|
|
|
void set_register_free (int reg);
|
|
Sets the status of 'reg' to free.
|
|
|
|
void will_clobber (int reg);
|
|
Spills the register to the stack. Sets the status to freeable.
|
|
After the clobbering has occurred, set the status to free.
|
|
|
|
void register_unspill (int reg);
|
|
Un-spills register reg and sets the status to moveable.
|
|
|
|
FIXME: how is the 'local' information represented? Maybe a MonoInst* pointer.
|
|
|
|
Note: the register allocator will insert instructions in the basic block
|
|
during it's operation.
|
|
|
|
* Examples
|
|
|
|
Given the tree (on x86 the right argument to shl needs to be in ecx):
|
|
|
|
store (local1, shl (local1, call (some_arg)))
|
|
|
|
At the start of the basic block, the registers are set to the free state.
|
|
The sequence of instructions may be:
|
|
instruction register status -> [%eax %ecx %edx]
|
|
start free free free
|
|
eax = load local1 mov free free
|
|
/* call clobbers eax, ecx, edx */
|
|
spill eax free free free
|
|
call mov free free
|
|
/* now eax contains the right operand of the shl */
|
|
mov %eax -> %ecx free busy free
|
|
un-spill mov busy free
|
|
shl %cl, %eax mov free free
|
|
|
|
The resulting x86 code is:
|
|
mov $fffc(%ebp), %eax
|
|
mov %eax, $fff0(%ebp)
|
|
push some_arg
|
|
call func
|
|
mov %eax, %ecx
|
|
mov $fff0(%ebp), %eax
|
|
shl %cl, %eax
|
|
|
|
Note that since shl could operate directly on memory, we could have:
|
|
|
|
push some_arg
|
|
call func
|
|
mov %eax, %ecx
|
|
shl %cl, $fffc(%ebp)
|
|
|
|
The above example with loading the operand in a register is just to complicate
|
|
the example and show that the algorithm should be able to handle it.
|
|
|
|
Let's take another example with the this-call call convention (the first argument
|
|
is passed in %ecx).
|
|
In this case, will_clobber() will be called only on %eax and %edx, while %ecx
|
|
will be allocated with get_register_force ().
|
|
Note: when a register is allocated with get_register_force(), it should be set
|
|
to a different state as soon as possible.
|
|
|
|
store (local1, shl (local1, this-call (local1)))
|
|
|
|
instruction register status -> [%eax %ecx %edx]
|
|
start free free free
|
|
eax = load local1 mov free free
|
|
/* force load in %ecx */
|
|
ecx = load local1 mov busy free
|
|
spill eax free busy free
|
|
call mov free free
|
|
/* now eax contains the right operand of the shl */
|
|
mov %eax -> %ecx free busy free
|
|
un-spill mov busy free
|
|
shl %cl, %eax mov free free
|
|
|
|
What happens when a register that we need to allocate with get_register_force ()
|
|
contains an operand for the next instruction?
|
|
|
|
instruction register status -> [%eax %ecx %edx]
|
|
eax = load local0 mov free free
|
|
ecx = load local1 mov mov free
|
|
get_register_force (ecx) here.
|
|
We have two options:
|
|
mov %ecx, %edx
|
|
or:
|
|
spill %ecx
|
|
The first option is way better (and allows the peephole pass to
|
|
just load the value in %edx directly, instead of loading first to %ecx).
|
|
This doesn't work, though, if the instruction clobbers the %edx register
|
|
(like in a this-call). So, we first need to clobber the registers
|
|
(so the state of %ecx changes to freebale and there is no issue
|
|
with get_register_force ()).
|
|
What if an instruction both clobbers a register and requires it as
|
|
an operand? Lets' take the x86 idiv instruction as an example: it
|
|
requires the dividend in edx:eax and returns the result in eax,
|
|
with the modulus in edx.
|
|
|
|
store (local1, div (local1, local2))
|
|
|
|
instruction register status -> [%eax %ecx %edx]
|
|
eax = load local0 mov free free
|
|
will_clobber eax, edx free mov free
|
|
force mov %ecx, %eax busy free free
|
|
set %edx busy free busy
|
|
idiv mov free free
|
|
|
|
Note: edx is set to free after idiv, because the modulus is not needed
|
|
(if it was a rem, eax would have been freed).
|
|
If we load the divisor before will_clobber(), we'll have to spill
|
|
eax and reload it later. If we load it just after the idiv, there is no issue.
|
|
In any case, the algorithm should give the correct results and allow the operation.
|
|
|
|
Working recursively on the isntructions there shouldn't be huge issues
|
|
with this algorithm (though, of course, it's not optimal and it may
|
|
introduce excessive spills or register moves). The advantage over the current
|
|
local reg allocator is that:
|
|
1) the number of spills/moves would be smaller anyway
|
|
2) a separate peephole pass could be able to eliminate reg moves
|
|
3) we'll be able to remove the 'forced' spills we currently do with
|
|
the return value of method calls
|
|
|
|
* Issues
|
|
|
|
How to best integrate such a reg allocator with the burg stuff.
|
|
|
|
Think about a call os sparc with two arguments: they got into %o0 and %o1
|
|
and each of them sets the register as busy. But what if the values to put there
|
|
are themselves the result of a call? %o0 is no problem, but for all the
|
|
next argument n the above algorithm would spill all the 0...n-1 registers...
|
|
|
|
* Papers
|
|
|
|
More complex solutions to the local register allocator problem:
|
|
http://dimacs.rutgers.edu/TechnicalReports/abstracts/1997/97-33.html
|
|
|
|
Combining register allocation and instruction scheduling:
|
|
http://citeseer.nj.nec.com/motwani95combining.html
|
|
|
|
More on LRA euristics:
|
|
http://citeseer.nj.nec.com/liberatore97hardness.html
|
|
|
|
Linear-time optimal code scheduling for delayedload architectures
|
|
http://www.cs.wisc.edu/~fischer/cs701.f01/inst.sched.ps.gz
|
|
|
|
Precise Register Allocation for Irregular Architectures
|
|
http://citeseer.nj.nec.com/kong98precise.html
|
|
|
|
Allocate registers first to subtrees that need more of them.
|
|
http://www.upb.de/cs/ag-kastens/compii/folien/comment401-409.2.pdf
|