Author: Dietmar Maurer (dietmar@ximian.com)
(C) 2001 Ximian, Inc.
(C) 2007 Novell, Inc.

[ 2007 extensions based on posts from Paolo Molaro ]

How to trigger JIT compilation
==============================

The JIT translates CIL code to native code on a per-method basis. For example,
if you have this simple program:

	public class Test {
		public static void Main () {
			System.Console.WriteLine ("Hello");
		}
	}

the JIT first compiles the Main function. Unfortunately, Main() contains a
reference to another method, System.Console.WriteLine(), so the JIT also needs
the address of WriteLine() to generate a call instruction.

The simplest solution would be to JIT compile System.Console.WriteLine()
to obtain that address. But that would mean JIT compiling half of our
class library at once, since WriteLine() uses many other classes and functions,
and we would have to call the JIT for each of them. Even worse, cyclic
references are possible, and we would end up in an endless loop.

Thus we need some kind of trampoline function for JIT compilation. Such a
trampoline first calls the JIT compiler to create native code, and then jumps
directly into that code. Whenever the JIT needs the address of a function (to
emit a call instruction), it uses the address of that function's trampoline.

One drawback of this approach is that it requires an additional indirection:
we always call the trampoline. Inside the trampoline we need to check whether
the method is already compiled, and if it is not, we start JIT compilation.
After that we call the code. Paying for this check and indirection on every
call is time consuming and performs very badly.

The solution is to add some logic to the trampoline function to detect where
it was called from. The JIT can then patch the call instruction in the caller,
so that it calls the JIT compiled code directly next time.

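The patching scheme described above can be modeled in portable C with
function pointers. This is only a sketch under loud assumptions: a call site
is represented as a struct whose `target` field stands in for the operand of
the call instruction, "compiling" just publishes a pre-written C function,
and every name here (method, call_site, jit_trampoline, call_through) is
hypothetical and not taken from the Mono sources.

```c
#include <assert.h>
#include <stddef.h>

typedef struct call_site call_site;
typedef int (*entry_fn)(call_site *site, int arg);

/* Hypothetical method record; not Mono's MonoMethod. */
typedef struct {
    entry_fn native_code;   /* NULL until the method has been "compiled" */
    entry_fn body;          /* stand-in for the CIL the JIT would translate */
    int compile_count;      /* how many times the fake JIT really ran */
} method;

struct call_site {
    entry_fn target;        /* models the operand of the call instruction */
    method *callee;
};

static int trampoline_hits = 0;

/* Sample "method body": ignores the call site and squares its argument. */
static int square_body(call_site *site, int arg)
{
    (void) site;
    return arg * arg;
}

/* Stand-in for the JIT: compiles at most once per method. */
static entry_fn jit_compile(method *m)
{
    if (m->native_code == NULL) {
        m->native_code = m->body;
        m->compile_count++;
    }
    return m->native_code;
}

/* Trampoline: compile on demand, patch the caller, then run the code. */
static int jit_trampoline(call_site *site, int arg)
{
    entry_fn code;

    trampoline_hits++;
    code = jit_compile(site->callee);
    site->target = code;    /* patch: the next call from this site is direct */
    return code(site, arg);
}

/* A freshly emitted call site always points at the trampoline. */
static void init_call_site(call_site *site, method *callee)
{
    site->callee = callee;
    site->target = jit_trampoline;
}

/* Models executing the call instruction: one unconditional indirect call. */
static int call_through(call_site *site, int arg)
{
    return site->target(site, arg);
}
```

After the first call through a site, `call_through` reaches the compiled
code without entering the trampoline again. A second call site for the same
method still goes through the trampoline once, but the method itself is not
compiled a second time.
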
Implementation Details
======================

Mono 1.2.6 has quite a few improvements in this area compared to Mono
1.2.5, which was released just a few weeks ago. I'll try to detail the
major changes below.

The first change is related to how the memory for the specific
trampolines is allocated: this is executable memory, so it is not
allocated with malloc, but with a custom allocator called the Mono Code
Manager. Since the code manager is used primarily for methods, it
allocates chunks of memory that are aligned to multiples of 8 or 16
bytes depending on the architecture: this allows the CPU to fetch the
instructions faster. But the specific trampolines are not performance
critical (we'll spend lots of time JITting the method anyway), so they
can tolerate a smaller alignment. Considering that most trampolines are
allocated one after the other and that on most architectures they are
10 or 12 bytes long, this change alone saved about 25% of the memory
used (they used to be aligned up to 16 bytes).

To give a rough idea of how many trampolines are generated, here are a
few examples:

* MonoDevelop startup creates about 21 thousand trampolines
* IronPython 2.0 running a benchmark creates about 17 thousand trampolines
* a "hello, world" style program creates about 800

In the first case this change saved more than 80 KB of memory (plus
about the same again, because reviewing the code also allowed me to fix
a related overallocation issue).

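The figures above can be checked with a little arithmetic. The sketch below
assumes 12-byte trampolines that used to be padded to 16 bytes and are now
packed at 4-byte alignment; the new alignment is my assumption, since the
text only says "smaller". Under that assumption each trampoline wastes 4 of
16 bytes (25%), and 21000 trampolines waste 84000 bytes, about 82 KB, which
agrees with "more than 80 KB". The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align (a power of two). */
static size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Padding wasted when 'count' blocks of 'size' bytes are each aligned. */
static size_t padding_bytes(size_t count, size_t size, size_t align)
{
    return count * (align_up(size, align) - size);
}
```

For the MonoDevelop case, `padding_bytes (21000, 12, 16)` gives the padding
before the change and `padding_bytes (21000, 12, 4)` the padding after it.
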
So reducing the size of the trampolines is great, but it's really not
possible to reduce them much further in size, if at all. The next step
is simply to avoid creating them.

There are two primary ways a trampoline is generated: a direct call to
the method is made, or a virtual table slot is filled with a trampoline
for the case when the method is invoked using a virtual call. I'll
note here that in both cases, after compiling the method, the magic
trampoline makes the changes needed so that the trampoline is not
executed again and execution goes directly to the newly compiled
code. In the direct call case the callsite is changed so that the
branch or call instruction transfers control to the new address. In the
virtual call case the magic trampoline changes the virtual table slot
directly.

The sequence of instructions used by the JIT to implement a virtual
call is well known, and the magic trampoline (inspecting the registers
and the code sequence) can easily find the virtual table slot that was
used for the invocation. The idea here then is: if we know the virtual
table slot, we also know the method that is supposed to be compiled and
executed, since each vtable slot is assigned a unique method by the
class loader. This simple fact allows us to use a completely generic
trampoline in the virtual table slots, avoiding the creation of many
method-specific trampolines.

In the cases above, the number of generated trampolines goes from
21000 to 7700 for MonoDevelop (saving 160 KB of memory), from 17000 to
5400 for the IronPython case, and from 800 to 150 for the hello world
case.

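The generic vtable trampoline idea can be sketched in C as follows. This is
a model, not Mono's code: the real magic trampoline recovers the slot by
inspecting registers and the call sequence, whereas here the slot index is
simply passed as an argument to stand in for that step. All names (vtable,
method, generic_vtable_trampoline, virtual_call) and the table size are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VTABLE_SLOTS 4   /* hypothetical vtable size for the sketch */

typedef struct vtable vtable;
typedef int (*slot_fn)(vtable *vt, size_t slot, int arg);

typedef struct {
    slot_fn body;        /* stand-in for the code the JIT would produce */
    int compiled;
} method;

struct vtable {
    slot_fn slots[VTABLE_SLOTS];     /* what a virtual call jumps through */
    method *methods[VTABLE_SLOTS];   /* slot -> method, fixed at load time */
};

static int generic_hits = 0;

/* Two sample virtual methods. */
static int add_one(vtable *vt, size_t slot, int arg)
{
    (void) vt; (void) slot;
    return arg + 1;
}

static int twice(vtable *vt, size_t slot, int arg)
{
    (void) vt; (void) slot;
    return arg * 2;
}

/* One trampoline shared by every slot: the slot alone identifies the method. */
static int generic_vtable_trampoline(vtable *vt, size_t slot, int arg)
{
    method *m = vt->methods[slot];

    generic_hits++;
    m->compiled = 1;              /* pretend the JIT ran here */
    vt->slots[slot] = m->body;    /* patch the vtable slot, not the caller */
    return m->body(vt, slot, arg);
}

/* Fill the used slots with the shared trampoline; no per-method stubs. */
static void init_vtable(vtable *vt, method *methods[], size_t count)
{
    size_t i;

    assert(count <= VTABLE_SLOTS);
    for (i = 0; i < VTABLE_SLOTS; i++) {
        vt->methods[i] = i < count ? methods[i] : NULL;
        vt->slots[i] = i < count ? generic_vtable_trampoline : NULL;
    }
}

/* Models "call *offset(%register)": an indirect call through the slot. */
static int virtual_call(vtable *vt, size_t slot, int arg)
{
    return vt->slots[slot](vt, slot, arg);
}
```

Every slot starts out holding the same function pointer, which is the point
of the optimization: no method-specific trampoline has to be allocated for
methods that are only ever reached through the vtable.
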
Kinds of Trampolines and Thunks
===============================

This is a list of the trampolines and thunks used in Mono:

- create_fnptr
- load_aot_method
- imt thunk

	IMT stands for Interface Method Table; this thunk is used to
	dispatch calls to interface methods.

- jump table
- debugger code
- exception call filter
- trampoline (various types)
- throw corlib exception
- restore context
- throw exception by name
- handle stack overflow
- throw exception
- delegate invoke implementation
- cpuid code

Implementation for x86/x86-64
=============================

Usually code looks like this:

	mov <some address>, %r11
	call *0xfffffffc(%rax)

First, the call instruction can go either directly to the compiled
address or to a trampoline.

If it goes to a trampoline, on amd64 it looks like the one above (on x86
it is different). Currently the trampoline is not modified, but it will
be in the future. On x86 the trampoline looks like:

	push constant
	jmp generic_trampoline

Note that the constant can be a MonoMethod*, but it's not necessarily so
(this is one of the recent changes: the constant can also be -1 or -2,
the first for interface calls, the second for virtual calls).

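As an illustration of the x86 sequence above, the following sketch writes
the two instructions into a byte buffer using the standard 32-bit encodings
(0x68 for push imm32, 0xE9 for jmp rel32). It comes to 10 bytes, which is
consistent with the trampoline sizes quoted earlier. This is not Mono's
emitter: the function names are hypothetical, addresses are assumed to be
32-bit, and the buffer is ordinary memory, not executable memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { X86_SPECIFIC_TRAMPOLINE_SIZE = 10 };

/* Store a 32-bit value in little-endian byte order. */
static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t) (v & 0xff);
    p[1] = (uint8_t) ((v >> 8) & 0xff);
    p[2] = (uint8_t) ((v >> 16) & 0xff);
    p[3] = (uint8_t) ((v >> 24) & 0xff);
}

/*
 * Write "push constant; jmp generic_trampoline" into buf.
 * buf_addr is the address at which buf would execute and generic_addr
 * the address of the generic trampoline (both hypothetical inputs).
 * Returns the number of bytes written.
 */
static size_t emit_specific_trampoline(uint8_t *buf, uint32_t buf_addr,
                                       uint32_t constant,
                                       uint32_t generic_addr)
{
    buf[0] = 0x68;                      /* push imm32 */
    store_le32(buf + 1, constant);
    buf[5] = 0xE9;                      /* jmp rel32 */
    /* rel32 is relative to the address of the instruction after the jmp;
       unsigned wraparound yields the two's complement displacement. */
    store_le32(buf + 6,
               generic_addr - (buf_addr + X86_SPECIFIC_TRAMPOLINE_SIZE));
    return X86_SPECIFIC_TRAMPOLINE_SIZE;
}
```

Passing 0xFFFFFFFF or 0xFFFFFFFE as the constant corresponds to the -1 and
-2 markers mentioned above; any other value would be read by the generic
trampoline as a MonoMethod*.
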
Other architectures are similar in semantics, but differ in the
details.

The above is what happens for virtual calls.

For interface calls, three things can happen:

1) the call goes directly to the method address.

2) it goes to a trampoline as described above.

3) it goes into an IMT collision resolution stub: this is a
   chunk of code that, based on the constant put inside the
   imt_register above, will perform a jump to the correct
   vtable slot for the interface method.

Note that the vtable slot itself could then contain a trampoline.

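The third case can be sketched in C. The model below is an assumption-heavy
illustration, not Mono's implementation: the IMT is a small fixed-size
table, interface methods are hashed into it by a made-up function, and the
"collision resolution stub" is a linear search that compares the constant
from the call site against each candidate and yields a vtable slot. The
table size and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IMT_SIZE 4   /* hypothetical table size, chosen to force collisions */

typedef int (*method_fn)(int);

/* One candidate inside a collision stub. */
typedef struct {
    uintptr_t key;        /* constant the call site loads into the IMT register */
    size_t vtable_slot;   /* where the implementation lives in the vtable */
} imt_entry;

/* One IMT slot: the list of interface methods that hash to it. */
typedef struct {
    const imt_entry *entries;
    size_t count;
} imt_bucket;

/* Three sample implementations. */
static int negate(int x)  { return -x; }
static int add_ten(int x) { return x + 10; }
static int triple(int x)  { return x * 3; }

/* Made-up hash from the call site constant to an IMT slot. */
static size_t imt_slot_for(uintptr_t key)
{
    return (size_t) (key % IMT_SIZE);
}

/* The collision stub: compare the constant against each candidate. */
static size_t imt_resolve(const imt_bucket *bucket, uintptr_t key)
{
    size_t i;

    for (i = 0; i < bucket->count; i++)
        if (bucket->entries[i].key == key)
            return bucket->entries[i].vtable_slot;
    return (size_t) -1;   /* not found */
}

/* Models "mov constant, %imt_register; call *negative_offset(%register)". */
static int interface_call(const imt_bucket imt[IMT_SIZE],
                          const method_fn *vtable, uintptr_t key, int arg)
{
    size_t slot = imt_resolve(&imt[imt_slot_for(key)], key);

    assert(slot != (size_t) -1);
    return vtable[slot](arg);   /* this slot could itself hold a trampoline */
}
```

When only one interface method hashes to an IMT slot, no comparison is
needed and the call can go straight to the method or its trampoline, which
corresponds to cases 1 and 2 above.
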
Some functions that are used here:

emit-x86.c (arch_create_jit_trampoline): return the JIT trampoline function

emit-x86.c (x86_magic_trampoline): contains the code to detect the caller and
patch the call instruction.

emit-x86.c (arch_compile_method): JIT compile a method

Call Sites
==========

There are three basic kinds of call sites:

1) normal calls:
	call relative_displacement

2) virtual calls:
	call *positive_offset(%register)

3) interface calls:
	mov constant, %imt_register
	call *negative_offset(%register)

The above is what happens on x86 and amd64, with a different
%imt_register on each architecture (this register is fixed for a given
build, but it could change between Mono builds; it should probably be
one of the constants the runtime communicates to the debugger).
%register can change from callsite to callsite depending on the
register allocator's choices. Note that the constant for interface
calls won't necessarily be a MonoMethod address: this could change in
the future to a simple number.

In all three cases the JIT trampolines need to inspect the call site,
but only in the first case is the call site changed.

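To make the inspection step concrete, here is a hedged sketch that looks at
the bytes just before a return address and distinguishes the three shapes
above. It handles only two encodings: E8 rel32 for normal calls, and FF /2
with an 8-bit displacement and no SIB byte for indirect calls, using the
sign of the displacement to tell virtual from interface calls. Decoding x86
backwards is inherently ambiguous, addresses are assumed to be 32-bit, and
none of these names come from the Mono sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    CALL_UNKNOWN, CALL_NORMAL, CALL_VIRTUAL, CALL_INTERFACE
} call_kind;

/*
 * 'ret' points just past the call instruction (the return address).
 * At least 5 readable bytes must precede it.
 */
static call_kind classify_call(const uint8_t *ret)
{
    /* call *disp8(%reg): FF, ModRM = 01 010 rrr, disp8 (rrr != 100) */
    if (ret[-3] == 0xFF && (ret[-2] & 0xF8) == 0x50 && (ret[-2] & 7) != 4) {
        int disp = ret[-1] >= 0x80 ? (int) ret[-1] - 0x100 : (int) ret[-1];
        /* non-negative offsets index the vtable, negative ones the IMT */
        return disp >= 0 ? CALL_VIRTUAL : CALL_INTERFACE;
    }
    /* call rel32: E8 followed by a 4-byte displacement */
    if (ret[-5] == 0xE8)
        return CALL_NORMAL;
    return CALL_UNKNOWN;
}

/* Store a 32-bit value in little-endian byte order. */
static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t) (v & 0xff);
    p[1] = (uint8_t) ((v >> 8) & 0xff);
    p[2] = (uint8_t) ((v >> 16) & 0xff);
    p[3] = (uint8_t) ((v >> 24) & 0xff);
}

/*
 * Patch a normal call so it targets target_addr. ret_addr is the address
 * that 'ret' would have at run time; rel32 is relative to that address.
 */
static void patch_call_rel32(uint8_t *ret, uint32_t ret_addr,
                             uint32_t target_addr)
{
    assert(ret[-5] == 0xE8);
    store_le32(ret - 4, target_addr - ret_addr);
}
```

Only the normal call is rewritten in place. For the other two kinds the
code stays untouched, and it is the vtable slot reached through the
register that gets updated, as described earlier.
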