A new JIT compiler for the Mono Project

Miguel de Icaza (miguel@{ximian.com,gnome.org}),
Paolo Molaro (lupus@{ximian.com,debian.org})

This document describes the overall design of the Mono JIT up
to version 2.0. After Mono 2.0 the JIT engine was upgraded
from a tree-based intermediate representation to a linear
intermediate representation.

The Linear IL is documented here:

        http://www.mono-project.com/Linear_IL

* Abstract

Mini is a new compilation engine for the Mono runtime. The
new engine is designed to bring new code generation
optimizations, portability and pre-compilation.

In this document we describe the design decisions and the
architecture of the new compilation engine.

* Introduction

Mono is an Open Source implementation of the .NET Framework:
it is made up of a runtime engine that implements the ECMA
Common Language Infrastructure (CLI), a set of compilers that
target the CLI, and a large collection of class libraries.

This article discusses the new code generation facilities that
have been added to the Mono runtime.

First we discuss the overall architecture of the Mono runtime,
and how code generation fits into it; then we discuss the
development and basic architecture of our first JIT compiler
for the ECMA CIL framework. The next section covers the
objectives for the work on the new JIT compiler, then we
discuss the new features available in the new JIT compiler,
and finally we give a technical description of the new code
generation engine.

* Architecture of the Mono Runtime

The Mono runtime is an implementation of the ECMA Common
Language Infrastructure (CLI), whose aim is to be a common
platform for executing code in multiple languages.

Languages that target the CLI generate images that contain
code in a high-level intermediate representation called the
"Common Intermediate Language". This intermediate language is
rich enough to allow for programs and pre-compiled libraries
to be reflected. The execution model is object oriented, with
single inheritance and multiple interface implementations.

This runtime provides a number of services for programs that
are targeted to it: Just-in-Time compilation of CIL code into
native code, garbage collection, thread management, I/O
routines, single, double and decimal floating point,
asynchronous method invocation, application domains, and a
framework for building arbitrary RPC systems (remoting) and
integration with system libraries through the Platform Invoke
functionality.

The focus of this document is on the services provided by the
Mono runtime to transform CIL bytecodes into code that is
native to the underlying architecture.

The code generation interface is a set of macros that allow a
C programmer to generate code on the fly; these macros are
found in the mono/jit/arch/ directory and are used by the JIT
compiler to generate native code.
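
As a small illustration of this style, the sketch below emits
a function that returns a constant using the x86 emission
macros (the macro names follow x86-codegen.h; the wrapper
function and the omitted buffer management are ours):

        #include "x86-codegen.h"

        /* Emit "mov $42, %eax; ret" at `code'. Each macro writes
           bytes at the pointer and advances it. */
        static unsigned char *
        emit_return_42 (unsigned char *code)
        {
                x86_mov_reg_imm (code, X86_EAX, 42); /* eax = 42 */
                x86_ret (code);                      /* return   */
                return code;   /* updated emission pointer */
        }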

The platform invocation code is interesting, as it generates
CIL code on the fly to marshal parameters; this code is in
turn processed by the JIT engine.

Mono has now gone through three different JIT engines:

    * The original JIT engine: 2002; hard to port, hard to
      implement optimizations in.

    * The second JIT engine, used up until Mono 2.0: very
      portable, many new optimizations.

    * The third JIT engine: replaced the code generation
      layer, moving from a tree-based representation to a
      linear representation.

For more information on the code generation changes, see our
web site for the details on the Linear IL:

        http://www.mono-project.com/Linear_IL

* Previous Experiences

Mono has built a JIT engine, which has been used to bootstrap
Mono since January 2002. This JIT engine has reasonable
performance, and uses a tree-pattern-matching instruction
selector based on the BURS technology. This JIT compiler was
designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.

The existing JIT compiler has three phases:

    * Re-creation of the semantic tree from CIL
      byte-codes.

    * Instruction selection, with a cost-driven
      engine.

    * Code generation and register allocation.

It is also hooked into the rest of the runtime to provide
services like marshaling, just-in-time compilation and
invocation of "internal calls".

This engine constructed a collection of trees, which we
referred to as the "forest of trees"; this forest is created
by "hydrating" the CIL instruction stream.

The first step was to identify the basic blocks in the method
and to compute the control flow graph (cfg) for it. Once this
information was computed, a stack analysis on each basic block
was performed to create a forest of trees for each one of
them.
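
For illustration, a minimal leader-marking pass of the kind
this step performs might look as follows (the types and field
names here are invented for the example, not the engine's
own):

        /* Mark the instructions that start a basic block: the
           method entry, every branch target, and every
           fall-through successor of a branch. */
        typedef struct {
                int is_branch; /* instruction transfers control     */
                int target;    /* branch target index, if is_branch */
        } CilInsn;

        static void
        mark_leaders (const CilInsn *code, int len, unsigned char *leader)
        {
                leader [0] = 1;                 /* method entry */
                for (int i = 0; i < len; ++i) {
                        if (!code [i].is_branch)
                                continue;
                        leader [code [i].target] = 1;  /* branch target */
                        if (i + 1 < len)
                                leader [i + 1] = 1;    /* fall-through */
                }
        }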

So for example, the following statement:

        int a, b;
        ...
        b = a + 1;

would be represented in CIL as:

        ldloc.0
        ldc.i4.1
        add
        stloc.1

The stack analysis would then create the following tree:

        (STIND_I4 ADDR_L[EBX|2] (
                  ADD (LDIND_I4 ADDR_L[ESI|1])
                       CONST_I4[1]))

This tree contains information from the stack analysis: for
instance, notice that the operations explicitly encode the
data types they are operating on. There is no longer any
ambiguity about the types, because this information has been
inferred.

At this point the JIT would pass the constructed forest of
trees to the architecture-dependent JIT compiler.

The architecture-dependent code then performed register
allocation (optionally using linear scan allocation for
variables, based on liveness analysis).

Once variables had been assigned, tree pattern matching with
dynamic programming was used (the tree pattern matcher is
custom built for each architecture, using a code generator:
monoburg). The instruction selector used cost functions to
select the best instruction patterns.

The instruction selector is able to produce instructions that
take advantage of the x86 indexed addressing modes, for
example.

One problem though is that the code emitter and the register
allocator did not have any visibility outside the current
tree, which meant that some redundant instructions were
generated. A peephole optimizer was hard to write with this
architecture, given the tree-based representation that was
used.

This JIT was functional, but it did not provide a good
architecture to base future optimizations on. Also, the
line between architecture-neutral and architecture-specific
code and optimizations was hard to draw.

The JIT engine supported two code generation modes for
applications that host multiple application domains: generate
code that will be shared across application domains, or
generate code that will not be shared across application
domains.

* Second Generation JIT engine.

We wanted to support a number of features that were missing:

    * Ahead-of-time compilation.

      The idea is to allow developers to pre-compile their
      code to native code to reduce startup time, and to
      reduce the working set that is used at runtime by the
      just-in-time compiler.

      Although in Mono this has not been a visible problem, we
      wanted to address it pro-actively.

      When an assembly (a Mono/.NET executable) is installed
      in the system, it would then be possible to pre-compile
      the code, and have the JIT compiler tune the generated
      code to the particular CPU on which the software is
      installed.

      This is done in the Microsoft.NET world with a tool
      called ngen.exe.

    * Have a good platform for doing code optimizations.

      The design called for a good architecture that would
      enable various levels of optimizations: some
      optimizations are better performed on high-level
      intermediate representations, some on medium-level and
      some on low-level representations.

      Also, it should be possible to conditionally turn these
      on or off. Some optimizations are too expensive to be
      used in just-in-time compilation scenarios, but these
      expensive optimizations can be turned on for
      ahead-of-time compilations or when using profile-guided
      optimizations on a subset of the executed methods.

    * Reduce the effort required to port the Mono code
      generator to new architectures.

      For Mono to gain wide adoption in the Unix world, it is
      necessary that the JIT engine works on most of today's
      commercial hardware platforms.

* Features of the Second JIT engine.

The new JIT engine was architected by Dietmar Maurer and Paolo
Molaro, based on the new objectives.

Mono provides a number of services to applications running
with the new JIT compiler:

    * Just-in-Time compilation of CLI code into native code.

    * Ahead-of-Time compilation of CLI code, to reduce
      startup time of applications.

A number of software development features are also available
(see the usage examples after this list):

    * Execution time profiling (--profile)

      Generates a report of the time consumed by each routine,
      the number of times each was invoked, and the callers.

    * Memory usage profiling (--profile)

      Generates a report of the memory usage by a program
      that is run under the Mono JIT.

    * Code coverage (--coverage)

    * Execution tracing.
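
For instance, these facilities are switched on from the
command line (flag spellings as in the mono(1) manual page of
the era; the trace filter shown is just one example):

        mono --profile program.exe
        mono --trace=N:System.Text program.exe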

People who are interested in developing and improving the Mini
JIT compiler will also find a few useful facilities:

    * Compilation times.

      This is used to measure the time the JIT takes to
      compile a routine.

    * Control Flow Graph and Dominator Tree drawing.

      These are visual aids for the JIT developer: they
      render representations of the Control Flow graph, and
      for the more advanced optimizations, they draw the
      dominator tree graph.

      This requires Dot (from the graphviz package) and
      Ghostview.

    * Code generator regression tests.

      The engine contains support for running regression
      tests on the virtual machine, which is very helpful to
      developers interested in improving the engine.

    * Optimization benchmark framework.

      The JIT engine will generate graphs that compare
      various benchmarks embedded in an assembly, and run the
      various tests with different optimization flags.

      This requires Perl and GD::Graph.

* Flexibility

This is probably the most important component of the new code
generation engine. The internals are relatively easy to
replace and update; even large passes can be replaced and
implemented differently.

* New code generator

Compiling a method begins with the `mini_method_to_ir' routine
that converts the CIL representation into a medium
intermediate representation.

The mini_method_to_ir routine performs a number of operations
(a sketch of the opcode mapping follows this list):

    * Flow analysis and control flow graph computation.

      Unlike the previous version, stack analysis and control
      flow graphs are computed in a single pass in the
      mini_method_to_ir function; this is done for performance
      reasons: although the complexity increases, the benefit
      for a JIT compiler is that there is more time available
      for performing other optimizations.

    * Basic block computation.

      mini_method_to_ir populates the MonoCompile structure
      with an array of basic blocks, each of which contains a
      forest of trees made up of MonoInst structures.

    * Inlining.

      Inlining is no longer restricted to methods containing
      one single basic block; instead it is possible to inline
      arbitrarily complex methods.

      The heuristics used to choose what to inline are likely
      to be tuned in the future.

    * Method to opcode conversion.

      Some method call invocations like `call Math.Sin' are
      transformed into an opcode: this turns the call into a
      semantically rich node, which is later inlined into an
      FPU instruction.

      Various Array method invocations are turned into
      opcodes as well (the Get, Set and Address methods).

    * Tail recursion elimination.
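
A hedged sketch of that opcode mapping (the recognition logic
is simplified and the function is illustrative; the real code
lives in the arch-specific intrinsic hooks):

        #include <string.h>

        /* Map recognized method calls to semantically rich opcodes;
           returns -1 when the call is not an intrinsic. */
        static int
        opcode_for_method (const char *klass, const char *method)
        {
                if (strcmp (klass, "Math") == 0) {
                        if (strcmp (method, "Sin") == 0)
                                return OP_SIN;   /* later an FPU fsin */
                        if (strcmp (method, "Cos") == 0)
                                return OP_COS;   /* later an FPU fcos */
                }
                return -1;   /* emit a normal call instead */
        }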

** Basic blocks

The MonoInst structure holds the actual decoded instruction,
with the semantic information from the stack analysis.
MonoInst is interesting because initially it is part of a tree
structure. Here is a sample of the same tree with the new JIT
engine:

        (stind.i4 regoffset[0xffffffd4(%ebp)]
             (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
                  iconst[1]))

This is a medium-level intermediate representation (MIR).

Some complex opcodes are decomposed at this stage into a
collection of simpler opcodes. Not every complex opcode is
decomposed at this stage, as we need to preserve the semantic
information during the various optimization phases.

For example, a NEWARR opcode carries the length and the type
of the array, which can be used later to avoid type checks or
array bounds checks.

There are a number of operations supported on this
representation:

    * Branch optimizations.

    * Variable liveness.

    * Loop optimizations: the dominator trees are
      computed, loops are detected, and their nesting
      level computed.

    * Conversion of the method into static single assignment
      form (SSA form).

    * Dead code elimination.

    * Constant propagation.

    * Copy propagation.

    * Constant folding.

Once the above optimizations have (optionally) been performed,
a decomposition phase is used to turn some complex opcodes
into internal method calls. In the initial version of the JIT
engine, various operations on longs are emulated instead of
being inlined. Also, the newarr invocation is turned into a
call to the runtime.
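
For example, on a 32-bit architecture a 64-bit addition can be
decomposed into an add of the low words followed by an
add-with-carry of the high words, while an operation with no
short instruction sequence, such as a 64-bit division, becomes
a call to a runtime helper function.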

At this point, after computing variable liveness, it is
possible to use the linear scan algorithm for allocating
variables to registers. The linear scan pass uses the
information that was previously gathered by the loop nesting
and loop structure computation to favor variables in inner
loops. This process updates the basic block `nesting' field,
which is later used during liveness analysis.

Stack space is then reserved for the local variables and any
temporary variables generated during the various
optimizations.

** Instruction selection: Only used up until Mono 2.0

At this point, the BURS instruction selector is invoked to
transform the tree-based representation into a list of
instructions. This is done using a tree pattern matcher that
is generated for the architecture using the `monoburg' tool.

Monoburg takes as input a file that describes tree patterns,
which are matched against the trees that were produced by the
engine in the previous stages.

The pattern matching might produce more than one match for a
particular tree. In this case, the match selected is the one
with the smallest cost. A cost can be attached to each rule;
if no cost is provided, the implicit cost is one. Rules with
smaller costs are selected over rules with higher costs.

The cost function can be used to select particular blocks of
code for a given architecture, or, by using a prohibitively
high cost, to avoid having a rule match.

The various rules that our JIT engine uses transform a tree of
MonoInsts into a list of MonoInsts:

        +-----------------------------------------------------------+
        |   Tree                                          List      |
        |   of      ===> Instruction selection ===>       of        |
        |   MonoInst                                      MonoInst  |
        +-----------------------------------------------------------+

During this process various "types" of MonoInst kinds
disappear and are turned into lower-level representations. The
JIT compiler just happens to reuse the same structure (this is
done to reduce memory usage and improve memory locality).

The instruction selection rules are split across a number of
files, each one with a particular purpose:

        inssel.brg
                Contains the generic instruction selection
                patterns.

        inssel-x86.brg
                Contains x86-specific rules.

        inssel-ppc.brg
                Contains PowerPC-specific rules.

        inssel-long32.brg
                Burg file for 64-bit instructions on 32-bit
                architectures.

        inssel-long.brg
                Burg file for 64-bit architectures.

        inssel-float.brg
                Burg file for floating point instructions.

For a given build, a set of those files is included. For
example, the build of Mono on the x86 uses the following set:

        inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg

** Native method generation

The native method generation has a number of steps:

    * Architecture-specific register allocation.

      The information about loop nesting that was
      previously gathered is used here to hint the
      register allocator.

    * Generating the method prolog/epilog.

    * Optionally generating code to introduce tracing
      facilities.

    * Hooking into the debugger.

    * Performing any pending fixups.

    * Code generation.

*** Code Generation

The actual code generation is contained in the architecture
specific portion of the compiler. The input to the code
generator is each one of the basic blocks with its list of
instructions that were produced in the instruction selection
phase.

During the instruction selection phase, virtual registers are
assigned. Just before the peephole optimization is performed,
physical registers are assigned.

A simple peephole and algebraic optimizer is run at this
stage.

The peephole optimizer removes some redundant operations at
this point. This is possible because the code generation at
this point has visibility into the basic block that spans the
original trees.

The algebraic optimizer performs some simple algebraic
optimizations that replace expensive operations with cheaper
operations if possible.
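
A minimal sketch of one such algebraic rewrite, turning a
multiplication by a power of two into a shift (the opcode and
field names follow Mono's style, but the pass itself is
illustrative, not the engine's actual code):

        /* Assumes mini's MonoInst with an immediate operand. */
        static void
        strength_reduce (MonoInst *ins)
        {
                /* x * 2^k  ==>  x << k, if the constant is a power of 2 */
                if (ins->opcode == OP_MUL_IMM && ins->inst_imm > 0 &&
                    (ins->inst_imm & (ins->inst_imm - 1)) == 0) {
                        int shift = 0;
                        while ((1 << shift) < ins->inst_imm)
                                shift++;
                        ins->opcode = OP_SHL_IMM;
                        ins->inst_imm = shift;
                }
        }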

The rest of the code generation is fairly simple: a switch
statement is used to generate code for each of the MonoInsts,
in the mono/mini/mini-ARCH.c files; the method is called
"mono_arch_output_basic_block".

We always try to allocate code in sequence, instead of just
using malloc. This way we increase spatial locality, which
gives a massive speedup on most architectures.
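
A bump-pointer scheme of the following shape captures the idea
(illustrative only; the real code manager also handles
permissions, alignment and chunk chaining):

        #include <stddef.h>   /* size_t */

        typedef struct {
                unsigned char *base;   /* start of the chunk      */
                size_t         size;   /* total chunk size        */
                size_t         used;   /* bytes handed out so far */
        } CodeChunk;

        static unsigned char *
        code_chunk_alloc (CodeChunk *c, size_t len)
        {
                if (c->used + len > c->size)
                        return NULL;          /* caller maps a new chunk */
                unsigned char *p = c->base + c->used;
                c->used += len;               /* methods end up adjacent */
                return p;
        }
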
*** Ahead of Time compilation

Ahead-of-Time compilation is a new feature of our new
compilation engine. The compilation engine is shared by the
Just-in-Time (JIT) compiler and the Ahead-of-Time (AOT)
compiler.

The difference is in the set of optimizations that are turned
on for each mode: Just-in-Time compilation should be as fast
as possible, while Ahead-of-Time compilation can take as long
as required, because it does not happen at a time-critical
point.

With AOT compilation, we can afford to turn on all of the
computationally expensive optimizations.

After the code generation phase is done, the code and any
required fixup information is saved into a file that is
readable by "as" (the native assembler available on all
systems). This assembly file is then passed to the native
assembler, which generates a loadable module.
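
For example, pre-compilation is driven by the --aot flag
(output naming varies by platform; on Linux the result is a
shared object placed next to the assembly):

        mono --aot program.exe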

At execution time, when an assembly is loaded from the disk,
the runtime engine will probe for the existence of a
pre-compiled image. If the pre-compiled image exists, then it
is loaded, and the method invocations are resolved to the code
contained in the loaded module.

The code generated under the AOT scenario is slightly
different from the JIT scenario. It generates code that is
application-domain relative and that can be shared among
multiple threads.

This is the same code generation that is used when the runtime
is instructed to maximize code sharing on a multi-application
domain scenario.

* SSA-based optimizations

SSA form simplifies many optimizations because each variable
has exactly one definition site. This means that each
variable is only initialized once.

For example, code like this:

        a = 1
        ..
        a = 2
        call (a)

is internally turned into:

        a1 = 1
        ..
        a2 = 2
        call (a2)

In the presence of branches, like:

        if (x)
             a = 1
        else
             a = 2

        call (a)

the code is turned into:

        if (x)
             a1 = 1;
        else
             a2 = 2;
        a3 = phi (a1, a2)
        call (a3)

All uses of a variable are "dominated" by its definition.

This representation is useful as it simplifies the
implementation of a number of optimizations like conditional
constant propagation, array bounds check removal and dead code
elimination.

* Register allocation.

Global register allocation is performed on the medium
intermediate representation just before instruction selection
is performed on the method. Local register allocation is
later performed at the basic-block level, on the low-level
representation.

Global register allocation uses the following input:

    1) The set of register-sized variables that can be
       allocated to a register (this is an architecture
       specific setting; for x86 these are the callee-saved
       registers ESI, EDI and EBX).

    2) Liveness information for the variables.

    3) (Optionally) loop information, to favor variables that
       are used in inner loops.

During the instruction selection phase, symbolic registers are
assigned to temporary values in expressions.

Local register allocation assigns hard registers to the
symbolic registers; it is performed just before the code is
actually emitted, and works at the basic block level. A CPU
description file describes the input registers, output
registers, fixed registers and the registers clobbered by each
operation.
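
A compact linear-scan sketch over live intervals, for
illustration only (Mono's allocator also deals with spill
slots, fixed registers and calling conventions; intervals are
assumed sorted by start):

        #include <limits.h>

        #define NREGS 3   /* e.g. ESI, EDI, EBX on x86 */

        typedef struct {
                int start, end;   /* live range, in instruction indices */
                int reg;          /* assigned register, or -1 (spilled) */
        } Interval;

        static void
        linear_scan (Interval *iv, int n)
        {
                int reg_free_at [NREGS];
                for (int r = 0; r < NREGS; ++r)
                        reg_free_at [r] = INT_MIN;   /* all free */

                for (int i = 0; i < n; ++i) {
                        iv [i].reg = -1;
                        for (int r = 0; r < NREGS; ++r) {
                                /* free once the occupying interval ends */
                                if (reg_free_at [r] <= iv [i].start) {
                                        iv [i].reg = r;
                                        reg_free_at [r] = iv [i].end;
                                        break;
                                }
                        }
                        /* reg == -1: the variable stays in memory */
                }
        }
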
* BURG Code Generator Generator: Only used up to Mono 2.0

monoburg was written by Dietmar Maurer. It is based on the
papers by Christopher W. Fraser, Robert R. Henry and Todd
A. Proebsting: "BURG - Fast Optimal Instruction Selection and
Tree Parsing" and "Engineering a Simple, Efficient Code
Generator Generator".

The original BURG implementation is unable to work on DAGs;
only trees are allowed. Our monoburg implementation is able
to generate a tree matcher that works on DAGs, and we use this
feature in the new JIT. This simplifies the code because we
can directly pass DAGs and don't need to convert them to
trees.

* Adding IL opcodes: an exercise (from a post by Paolo Molaro)

mini.c is the file that reads the IL code stream and decides
how each IL instruction is implemented (the
mono_method_to_ir () function), so you always have to add an
entry to the big switch inside the function: there are plenty
of examples in that file.

An IL opcode can be implemented in a number of ways, depending
on what it does and how it needs to do it.

Some opcodes are implemented using a helper function: one of
the simpler examples is the CEE_STELEM_REF implementation.

In this case the opcode implementation is written in a C
function. You will need to register the function with the JIT
before you can use it (mono_register_jit_call), and you need
to emit the call to the helper using the
mono_emit_jit_icall () function.

This is the simplest way to add a new opcode, and it doesn't
require any arch-specific changes (though it's limited to what
you can do in C code, and the performance may be limited by
the function call).
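
A hedged sketch of this helper route (the helper body is
ordinary C; the registration and emission calls are the ones
named above, with signatures that vary across Mono versions,
so treat them as assumptions):

        /* Hypothetical helper implementing an array store. */
        static void
        helper_stelem_ref (MonoArray *array, int index, MonoObject *val)
        {
                /* a real helper performs bounds and covariance checks */
                mono_array_setref (array, index, val);
        }

        /* At JIT start-up (signature assumed):
         *     mono_register_jit_call (helper_stelem_ref,
         *             "helper_stelem_ref", helper_sig, TRUE);
         * In the mono_method_to_ir () switch, for the IL opcode:
         *     mono_emit_jit_icall (cfg, helper_stelem_ref, sp);
         */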

Other opcodes can be implemented with one or more of the
already implemented low-level instructions.

An example is the OP_STRLEN opcode, which implements
String.Length using a simple load from memory. In this case
you need to add a rule to the appropriate burg file,
describing what the arguments of the opcode are and what, if
any, its 'return' value is.

The OP_STRLEN case is:

        reg: OP_STRLEN (reg) {
                MONO_EMIT_LOAD_MEMBASE_OP (s, tree, OP_LOADI4_MEMBASE, state->reg1,
                        state->left->reg1, G_STRUCT_OFFSET (MonoString, length));
        }

The above means: OP_STRLEN takes a register as an argument
and returns its value in a register. The implementation of
this is included in the braces.

The opcode returns a value in an integer register
(state->reg1) by performing an int32 load of the length field
of the MonoString represented by the input register
(state->left->reg1): before the burg rules are applied, the
internal representation is based on trees, so you get the
left/right pointers (state->left and state->right
respectively; the result is stored in state->reg1).

This instruction implementation doesn't require arch-specific
changes (it uses the MONO_EMIT_LOAD_MEMBASE_OP macro, which is
available on all platforms), and usually the produced code is
fast.

Next we have opcodes that must be implemented with new
low-level architecture-specific instructions (either because
of performance considerations or because the functionality
can't be implemented in other ways).

You need a burg rule in this case, too. For example, consider
the OP_CHECK_THIS opcode (used to raise an exception if the
`this' pointer is null). The burg rule simply reads:

        stmt: OP_CHECK_THIS (reg) {
                mono_bblock_add_inst (s->cbb, tree);
        }

Note that this opcode does not return a value (hence the
"stmt") and it takes a register as input.

mono_bblock_add_inst (s->cbb, tree) just adds the instruction
(the tree variable) to the current basic block (s->cbb). In
mini this is the place where the internal representation
switches from the tree format to the low-level format (the
list of simple instructions).

In this case the actual opcode implementation is delegated to
the arch-specific code. A low-level opcode needs an entry in
the machine description (the *.md files in mini/). This entry
describes what kinds of registers, if any, are used by the
instruction, as well as other details such as constraints or
other hints to the low-level engine, which are architecture
specific.

cpu-pentium.md, for example, has the following entry:

        checkthis: src1:b len:3

This means the instruction uses an integer register as a base
pointer (basically, a load or store is done on it) and it
takes 3 bytes of native code to implement.

Now you just need to provide the low-level implementation for
the opcode in one of the mini-$arch.c files, in the
mono_arch_output_basic_block () function. There is a big
switch here too. The x86 implementation is:

        case OP_CHECK_THIS:
                /* ensure ins->sreg1 is not NULL */
                x86_alu_membase_imm (code, X86_CMP, ins->sreg1, 0, 0);
                break;

If the $arch-codegen.h header file doesn't have the code to
emit the low-level native code, you'll need to write that as
well.

Complex opcodes with register constraints may require other
changes to the local register allocator, but usually they are
not needed.

* Future

Profile-based optimization is something that we are very
interested in supporting. There are two possible usage
scenarios:

    * Based on the profile information gathered during
      the execution of a program, hot methods can be compiled
      with the highest level of optimizations, while bootstrap
      code and cold methods can be compiled with the least set
      of optimizations and placed in a discardable list.

    * Code reordering: this profile-based optimization would
      only make sense for pre-compiled code. The profile
      information is used to re-order the assembly code on
      disk so that the code is placed on the disk in a way
      that increases locality.

      This is the same principle under which SGI's cord
      program works.

The nature of the CIL makes the above optimizations easy to
implement and deploy. Since we define the universe in which
these things live, no interactions with system tools are
required, nor are upgrades to the underlying infrastructure.

Instruction scheduling is important for certain kinds of
processors, and some of the framework to cope with it exists
today in our register allocator and the instruction selector,
but it has not been finished. The instruction scheduling
would happen at the same time as local register allocation.