|   | 
 | ||
|  | 	       A new JIT compiler for the Mono Project | ||
|  | 
 | ||
|  | 	   Miguel de Icaza (miguel@{ximian.com,gnome.org}), | ||
|  | 	   Paolo Molaro (lupus@{ximian.com,debian.org}) | ||
|  | 
 | ||
|  | 	This document describes the overall design of the Mono JIT up to version | ||
|  | 	2.0.   After Mono 2.0 the JIT engine was upgraded from | ||
|  | 	a tree-based intermediate representation to a linear | ||
|  | 	intermediate representation. | ||
|  | 
 | ||
|  | 	The Linear IL is documented here: | ||
|  | 
 | ||
|  | 	    http://www.mono-project.com/Linear_IL | ||
|  | 
 | ||
|  | * Abstract | ||
|  | 
 | ||
|  | 	Mini is a new compilation engine for the Mono runtime.  The | ||
|  | 	new engine is designed to bring new code generation | ||
|  | 	optimizations, portability and pre-compilation.  | ||
|  | 
 | ||
|  | 	In this document we describe the design decisions and the | ||
|  | 	architecture of the new compilation engine.  | ||
|  | 
 | ||
|  | * Introduction | ||
|  | 
 | ||
|  | 	Mono is an Open Source implementation of the .NET Framework: it | ||
|  | 	is made up of a runtime engine that implements the ECMA Common | ||
|  | 	Language Infrastructure (CLI), a set of compilers that target | ||
|  | 	the CLI and a large collection of class libraries. | ||
|  | 
 | ||
|  | 	This article discusses the new code generation facilities that | ||
|  | 	have been added to the Mono runtime.   | ||
|  | 
 | ||
|  | 	First we discuss the overall architecture of the Mono runtime, | ||
|  | 	and how code generation fits into it; then we discuss the | ||
|  | 	development and basic architecture of our first JIT compiler | ||
|  | 	for the ECMA CIL framework.  The next section covers the | ||
|  | 	objectives for the work on the new JIT compiler, then we | ||
|  | 	discuss the new features available in the new JIT compiler, | ||
|  | 	and finally a technical description of the new code generation | ||
|  | 	engine. | ||
|  | 
 | ||
|  | * Architecture of the Mono Runtime | ||
|  | 
 | ||
|  | 	The Mono runtime is an implementation of the ECMA Common | ||
|  | 	Language Infrastructure (CLI), whose aim is to be a common | ||
|  | 	platform for executing code in multiple languages. | ||
|  | 
 | ||
|  | 	Languages that target the CLI generate images that contain | ||
|  | 	code in a high-level intermediate representation called the | ||
|  | 	"Common Intermediate Language".  This intermediate language is | ||
|  | 	rich enough to allow for programs and pre-compiled libraries | ||
|  | 	to be reflected.  The execution environment is object | ||
|  | 	oriented, with single inheritance and multiple interface | ||
|  | 	implementations. | ||
|  | 
 | ||
|  | 	This runtime provides a number of services for programs that | ||
|  | 	are targeted to it: Just-in-Time compilation of CIL code into | ||
|  | 	native code, garbage collection, thread management, I/O | ||
|  | 	routines, single, double and decimal floating point, | ||
|  | 	asynchronous method invocation, application domains, and a | ||
|  | 	framework for building arbitrary RPC systems (remoting) and | ||
|  | 	integration with system libraries through the Platform Invoke | ||
|  | 	functionality. | ||
|  | 
 | ||
|  | 	The focus of this document is on the services provided by the | ||
|  | 	Mono runtime to transform CIL bytecodes into code that is | ||
|  | 	native to the underlying architecture. | ||
|  | 
 | ||
|  | 	The code generation interface is a set of macros, found in the | ||
|  | 	mono/jit/arch/ directory, that allow a C programmer to | ||
|  | 	generate code on the fly.  These macros are used by the JIT | ||
|  | 	compiler to generate native code. | ||
|  | 
 | ||
|  | 	The platform invocation code is interesting, as it generates | ||
|  | 	CIL code on the fly to marshal parameters, and then this | ||
|  | 	code is in turn processed by the JIT engine. | ||
|  | 
 | ||
|  | 	Mono has now gone through three different JIT engines: | ||
|  | 
 | ||
|  | 	* Original JIT engine: 2002, hard to port, hard to | ||
|  | 	  implement optimizations. | ||
|  | 
 | ||
|  | 	* Second JIT engine, used up until Mono 2.0, very | ||
|  |           portable, many new optimizations. | ||
|  | 
 | ||
|  | 	* Third JIT engine, which changed the code generation layer | ||
|  | 	  from a tree-based representation to a linear | ||
|  | 	  representation. | ||
|  | 
 | ||
|  |         For more information on the code generation changes, see our | ||
|  | 	web site for the details on the Linear IL: | ||
|  | 
 | ||
|  | 	    http://www.mono-project.com/Linear_IL | ||
|  | 
 | ||
|  | * Previous Experiences | ||
|  | 
 | ||
|  | 	Mono has built a JIT engine, which has been used to bootstrap | ||
|  | 	Mono since January, 2002.  This JIT engine has reasonable | ||
|  | 	performance, and uses a tree pattern matching instruction | ||
|  | 	selector based on the BURS technology.  This JIT compiler was | ||
|  | 	designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza. | ||
|  | 
 | ||
|  | 	The existing JIT compiler has three phases: | ||
|  | 
 | ||
|  | 		* Re-creation of the semantic tree from CIL | ||
|  | 		  byte-codes. | ||
|  | 
 | ||
|  | 		* Instruction selection, with a cost-driven | ||
|  | 		  engine.  | ||
|  | 
 | ||
|  | 		* Code generation and register allocation. | ||
|  | 
 | ||
|  | 	It is also hooked into the rest of the runtime to provide | ||
|  | 	services like marshaling, just-in-time compilation and | ||
|  | 	invocation of "internal calls".  | ||
|  | 
 | ||
|  | 	This engine constructed a collection of trees, which we | ||
|  | 	referred to as the "forest of trees".  This forest was created by | ||
|  | 	"hydrating" the CIL instruction stream. | ||
|  | 
 | ||
|  | 	The first step was to identify the basic blocks in the method | ||
|  | 	and compute the control flow graph (cfg) for it.  Once this | ||
|  | 	information was computed, a stack analysis on each basic block | ||
|  | 	was performed to create a forest of trees for each one of | ||
|  | 	them.  | ||
|  | 
 | ||
|  | 	So for example, the following statement: | ||
|  | 
 | ||
|  | 	       int a, b; | ||
|  | 	       ... | ||
|  | 	       b = a + 1; | ||
|  | 
 | ||
|  | 	is represented in CIL as: | ||
|  | 
 | ||
|  | 			ldloc.0  | ||
|  | 			ldc.i4.1  | ||
|  | 			add  | ||
|  | 			stloc.1  | ||
|  | 
 | ||
|  | 	After the stack analysis, this produces the following tree: | ||
|  | 
 | ||
|  |                (STIND_I4 ADDR_L[EBX|2] ( | ||
|  | 			 ADD (LDIND_I4 ADDR_L[ESI|1])  | ||
|  | 			 CONST_I4[1])) | ||
|  | 
 | ||
|  |         This tree contains information from the stack analysis: for | ||
|  |         instance, notice that the operations explicitly encode the | ||
|  |         data types they are operating on, there is no longer an | ||
|  |         ambiguity on the types, because this information has been | ||
|  |         inferred.  | ||
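
	The stack analysis described above can be sketched in C as
	follows.  This is only an illustration under simplifying
	assumptions, not the code of the original JIT: the type and
	function names (CilOp, Node, build_forest, render) are
	hypothetical, only four opcodes are handled, and leaves carry a
	plain local index rather than the register|index pair shown in
	the tree above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical opcode and node types, for illustration only. */
typedef enum { OP_LDLOC, OP_LDC_I4, OP_ADD, OP_STLOC } OpKind;
typedef struct { OpKind kind; int arg; } CilOp;

typedef struct Node {
    const char *name;           /* typed node name, e.g. "ADD" */
    int has_value, value;       /* leaves carry a local index or constant */
    struct Node *left, *right;
} Node;

/* Nodes are never freed: this is a sketch, not production code. */
Node *new_node(const char *name, int has_value, int value,
               Node *left, Node *right) {
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->name = name; n->has_value = has_value; n->value = value;
    n->left = left; n->right = right;
    return n;
}

/* Simulate the evaluation stack with tree nodes.  A store pops the
 * stack and completes one tree of the forest.  Returns the number of
 * trees written to forest[]. */
int build_forest(const CilOp *ops, int n_ops, Node **forest, int max_trees) {
    Node *stack[16];
    int sp = 0, n_trees = 0;
    for (int i = 0; i < n_ops; i++) {
        switch (ops[i].kind) {
        case OP_LDLOC:
            assert(sp < 16);
            stack[sp++] = new_node("LDIND_I4", 0, 0,
                new_node("ADDR_L", 1, ops[i].arg, NULL, NULL), NULL);
            break;
        case OP_LDC_I4:
            assert(sp < 16);
            stack[sp++] = new_node("CONST_I4", 1, ops[i].arg, NULL, NULL);
            break;
        case OP_ADD: {
            assert(sp >= 2);
            Node *r = stack[--sp], *l = stack[--sp];
            stack[sp++] = new_node("ADD", 0, 0, l, r);
            break;
        }
        case OP_STLOC:
            assert(sp >= 1 && n_trees < max_trees);
            forest[n_trees++] = new_node("STIND_I4", 0, 0,
                new_node("ADDR_L", 1, ops[i].arg, NULL, NULL), stack[--sp]);
            break;
        }
    }
    return n_trees;
}

/* Append an s-expression rendering of the tree to buf. */
void render(const Node *n, char *buf, size_t size) {
    size_t len = strlen(buf);
    if (n->has_value) {                       /* leaf: NAME[value] */
        snprintf(buf + len, size - len, "%s[%d]", n->name, n->value);
        return;
    }
    snprintf(buf + len, size - len, "(%s", n->name);
    if (n->left)  { strncat(buf, " ", size - strlen(buf) - 1); render(n->left, buf, size); }
    if (n->right) { strncat(buf, " ", size - strlen(buf) - 1); render(n->right, buf, size); }
    strncat(buf, ")", size - strlen(buf) - 1);
}
```

	Feeding the four opcodes of the example (ldloc.0, ldc.i4.1,
	add, stloc.1) to build_forest yields a single tree with the
	same shape as the one shown above.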
|  | 
 | ||
|  | 	At this point the JIT would pass the constructed forest of | ||
|  | 	trees to the architecture-dependent JIT compiler.   | ||
|  | 
 | ||
|  | 	The architecture dependent code then performed register | ||
|  | 	allocation (optionally using linear scan allocation for | ||
|  | 	variables, based on liveness analysis). | ||
|  | 
 | ||
|  | 	Once variables had been assigned, a tree pattern matcher with | ||
|  | 	dynamic programming was used (the tree pattern matcher is | ||
|  | 	custom built for each architecture, using a code | ||
|  | 	generator: monoburg).  The instruction selector used cost | ||
|  | 	functions to select the best instruction patterns. | ||
|  | 
 | ||
|  | 	The instruction selector is able to produce instructions that | ||
|  | 	take advantage of, for example, the x86 indexed addressing | ||
|  | 	modes. | ||
|  | 
 | ||
|  | 	One problem though is that the code emitter and the register | ||
|  | 	allocator did not have any visibility outside the current | ||
|  | 	tree, which meant that some redundant instructions were | ||
|  | 	generated.  A peephole optimizer with this architecture was | ||
|  | 	hard to write, given the tree-based representation that is | ||
|  | 	used. | ||
|  | 
 | ||
|  | 	This JIT was functional, but it did not provide a good | ||
|  | 	architecture to base future optimizations on.  Also the | ||
|  | 	line between architecture neutral and architecture | ||
|  | 	specific code and optimizations was hard to draw. | ||
|  | 
 | ||
|  | 	The JIT engine supported two code generation modes to support | ||
|  | 	the two optimization modes for applications that host multiple | ||
|  | 	application domains: generate code that will be shared across | ||
|  | 	application domains, or generate code that will not be shared | ||
|  | 	across application domains. | ||
|  | 
 | ||
|  | * Second Generation JIT engine. | ||
|  | 
 | ||
|  | 	We wanted to support a number of features that were missing: | ||
|  | 
 | ||
|  | 	   * Ahead-of-time compilation.   | ||
|  | 
 | ||
|  | 	     The idea is to allow developers to pre-compile their code | ||
|  | 	     to native code to reduce startup time, and the working | ||
|  | 	     set that is used at runtime in the just-in-time compiler. | ||
|  | 
 | ||
|  | 	     Although in Mono this has not been a visible problem, we | ||
|  | 	     wanted to proactively address it. | ||
|  | 
 | ||
|  | 	     When an assembly (a Mono/.NET executable) is installed in | ||
|  | 	     the system, it would then be possible to pre-compile the | ||
|  | 	     code, and have the JIT compiler tune the generated code | ||
|  | 	     to the particular CPU on which the software is | ||
|  | 	     installed.  | ||
|  | 
 | ||
|  | 	     This is done in the Microsoft .NET world with a tool | ||
|  | 	     called ngen.exe. | ||
|  | 
 | ||
|  | 	   * Have a good platform for doing code optimizations.  | ||
|  | 
 | ||
|  | 	     The design called for a good architecture that would | ||
|  | 	     enable various levels of optimizations: some | ||
|  | 	     optimizations are better performed on high-level | ||
|  | 	     intermediate representations, some on medium-level and | ||
|  | 	     some at low-level representations. | ||
|  | 
 | ||
|  | 	     Also it should be possible to conditionally turn these on | ||
|  | 	     or off.  Some optimizations are too expensive to be used | ||
|  | 	     in just-in-time compilation scenarios, but these | ||
|  | 	     expensive optimizations can be turned on for | ||
|  | 	     ahead-of-time compilations or when using profile-guided | ||
|  | 	     optimizations on a subset of the executed methods. | ||
|  | 
 | ||
|  | 	   * Reduce the effort required to port the Mono code | ||
|  |              generator to new architectures. | ||
|  | 
 | ||
|  | 	     For Mono to gain wide adoption in the Unix world, it is | ||
|  | 	     necessary that the JIT engine works in most of today's | ||
|  | 	     commercial hardware platforms.  | ||
|  | 
 | ||
|  | * Features of the Second JIT engine. | ||
|  | 
 | ||
|  | 	The new JIT engine was architected by Dietmar Maurer and Paolo | ||
|  | 	Molaro, based on the new objectives. | ||
|  | 
 | ||
|  | 	Mono provides a number of services to applications running | ||
|  | 	with the new JIT compiler: | ||
|  | 
 | ||
|  | 	     * Just-in-Time compilation of CLI code into native code. | ||
|  | 
 | ||
|  | 	     * Ahead-of-Time compilation of CLI code, to reduce | ||
|  |                startup time of applications.  | ||
|  | 
 | ||
|  | 	A number of software development features are also available: | ||
|  | 
 | ||
|  | 	     * Execution time profiling (--profile) | ||
|  | 
 | ||
|  | 	       Generates a report of the time consumed by routines, | ||
|  | 	       as well as the number of invocations and the | ||
|  | 	       callers. | ||
|  | 
 | ||
|  | 	     * Memory usage profiling (--profile) | ||
|  | 
 | ||
|  | 	       Generates a report of the memory usage by a program | ||
|  | 	       that is run under the Mono JIT. | ||
|  | 
 | ||
|  | 	     * Code coverage (--coverage) | ||
|  | 
 | ||
|  | 	     * Execution tracing. | ||
|  | 
 | ||
|  |         People who are interested in developing and improving the Mini | ||
|  |         JIT compiler will also find a few useful routines: | ||
|  | 
 | ||
|  | 	     * Compilation times | ||
|  | 
 | ||
|  | 	       This measures the time the JIT takes to compile a | ||
|  | 	       routine. | ||
|  | 
 | ||
|  | 	     * Control Flow Graph and Dominator Tree drawing. | ||
|  | 
 | ||
|  | 	       These are visual aids for the JIT developer: they | ||
|  | 	       render representations of the Control Flow graph, and | ||
|  | 	       for the more advanced optimizations, they draw the | ||
|  | 	       dominator tree graph.  | ||
|  | 
 | ||
|  | 	       This requires Dot (from the graphviz package) and Ghostview. | ||
|  | 
 | ||
|  | 	     * Code generator regression tests.   | ||
|  | 
 | ||
|  | 	       The engine contains support for running regression | ||
|  | 	       tests on the virtual machine, which is very helpful to | ||
|  | 	       developers interested in improving the engine. | ||
|  | 
 | ||
|  | 	     * Optimization benchmark framework. | ||
|  | 
 | ||
|  | 	       The JIT engine will generate graphs that compare | ||
|  | 	       various benchmarks embedded in an assembly, and run the | ||
|  | 	       various tests with different optimization flags.   | ||
|  | 
 | ||
|  | 	       This requires Perl, GD::Graph. | ||
|  | 
 | ||
|  | * Flexibility | ||
|  | 
 | ||
|  | 	This is probably the most important component of the new code | ||
|  | 	generation engine.  The internals are relatively easy to | ||
|  | 	replace and update; even large passes can be replaced and | ||
|  | 	implemented differently. | ||
|  | 
 | ||
|  | * New code generator | ||
|  | 
 | ||
|  | 	Compiling a method begins with the `mono_method_to_ir' routine | ||
|  | 	that converts the CIL representation into a medium | ||
|  | 	intermediate representation. | ||
|  | 
 | ||
|  | 	The mini_method_to_ir routine performs a number of operations: | ||
|  | 
 | ||
|  | 	    * Flow analysis and control flow graph computation. | ||
|  | 
 | ||
|  | 	      Unlike the previous version, stack analysis and control | ||
|  | 	      flow graphs are computed in a single pass in the | ||
|  | 	      mono_method_to_ir function.  This is done for performance | ||
|  | 	      reasons: although the complexity increases, the benefit | ||
|  | 	      for a JIT compiler is that there is more time available | ||
|  | 	      for performing other optimizations. | ||
|  | 
 | ||
|  | 	    * Basic block computation. | ||
|  | 
 | ||
|  | 	      mono_method_to_ir populates the MonoCompile structure | ||
|  | 	      with an array of basic blocks, each of which contains a | ||
|  | 	      forest of trees made up of MonoInst structures. | ||
|  | 
 | ||
|  | 	    * Inlining | ||
|  | 
 | ||
|  | 	      Inlining is no longer restricted to methods containing | ||
|  | 	      one single basic block; instead it is possible to inline | ||
|  | 	      arbitrarily complex methods. | ||
|  | 
 | ||
|  | 	      The heuristics to choose what to inline are likely going | ||
|  | 	      to be tuned in the future. | ||
|  | 
 | ||
|  | 	    * Method to opcode conversion. | ||
|  | 
 | ||
|  | 	      Some method call invocations like `call Math.Sin' are | ||
|  | 	      transformed into an opcode: this transforms the call | ||
|  | 	      into a semantically rich node, which is later inlined | ||
|  | 	      as an FPU instruction. | ||
|  | 
 | ||
|  | 	      Various Array method invocations are turned into | ||
|  | 	      opcodes as well (the Get, Set and Address methods). | ||
|  | 
 | ||
|  | 	    * Tail recursion elimination | ||
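
	As an illustration of the last item, the effect of tail
	recursion elimination can be shown at the source level.  The
	sketch below is written in C rather than CIL, and the function
	names are hypothetical; it shows the transformation in
	general, not the JIT's own implementation.

```c
#include <assert.h>

/* Tail-recursive form: the recursive call is the last action. */
long long sum_to_recursive(long long n, long long acc) {
    assert(n >= 0);
    if (n == 0)
        return acc;
    return sum_to_recursive(n - 1, acc + n);
}

/* After tail recursion elimination: the call becomes an update of
 * the arguments followed by a jump back to the method entry, so no
 * new stack frame is created. */
long long sum_to_eliminated(long long n, long long acc) {
    assert(n >= 0);
entry:
    if (n == 0)
        return acc;
    acc = acc + n;
    n = n - 1;
    goto entry;
}
```

	Both functions compute the same result, but the second one
	runs in constant stack space regardless of n.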
|  | 
 | ||
|  | ** Basic blocks | ||
|  | 
 | ||
|  | 	The MonoInst structure holds the actual decoded instruction, | ||
|  | 	with the semantic information from the stack analysis. | ||
|  | 	MonoInst is interesting because initially it is part of a tree | ||
|  | 	structure.  Here is a sample of the same tree with the new JIT | ||
|  | 	engine: | ||
|  | 
 | ||
|  | 		 (stind.i4 regoffset[0xffffffd4(%ebp)]  | ||
|  | 			   (add (ldind.i4 regoffset[0xffffffd8(%ebp)]) | ||
|  | 			         iconst[1])) | ||
|  | 
 | ||
|  | 	This is a medium-level intermediate representation (MIR).  | ||
|  | 
 | ||
|  | 	Some complex opcodes are decomposed at this stage into a | ||
|  | 	collection of simpler opcodes.  Not every complex opcode is | ||
|  | 	decomposed at this stage, as we need to preserve the semantic | ||
|  | 	information during various optimization phases.   | ||
|  | 
 | ||
|  | 	For example a NEWARR opcode carries the length and the type of | ||
|  | 	the array that could be used later to avoid type checking or | ||
|  | 	array bounds checks. | ||
|  | 
 | ||
|  |         There are a number of operations supported on this | ||
|  | 	representation: | ||
|  | 
 | ||
|  | 		* Branch optimizations. | ||
|  | 
 | ||
|  | 		* Variable liveness. | ||
|  | 
 | ||
|  | 		* Loop optimizations: the dominator trees are | ||
|  | 		  computed, loops are detected, and their nesting | ||
|  | 		  level computed. | ||
|  | 
 | ||
|  | 		* Conversion of the method into static single assignment | ||
|  |                   form (SSA form). | ||
|  | 
 | ||
|  | 	        * Dead code elimination. | ||
|  | 
 | ||
|  | 		* Constant propagation. | ||
|  | 
 | ||
|  | 		* Copy propagation. | ||
|  | 
 | ||
|  | 		* Constant folding. | ||
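
	To make one of the passes listed above concrete, here is a
	minimal sketch of constant folding over a tree of
	instructions.  The Expr type and its node kinds are
	hypothetical stand-ins for MonoInst, and only addition and
	multiplication are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds, for illustration only. */
typedef enum { N_CONST, N_LOCAL, N_ADD, N_MUL } NodeKind;

typedef struct Expr {
    NodeKind kind;
    int value;                   /* constant value or local index */
    struct Expr *left, *right;
} Expr;

/* Fold constant subtrees bottom-up, rewriting nodes in place.
 * Returns 1 if the node is a constant after folding, 0 otherwise. */
int fold_constants(Expr *e) {
    assert(e != NULL);
    if (e->kind == N_CONST) return 1;
    if (e->kind == N_LOCAL) return 0;

    /* Visit both children even if the first one is not constant. */
    int l = fold_constants(e->left);
    int r = fold_constants(e->right);
    if (!(l && r)) return 0;

    /* Compute in unsigned arithmetic so that overflow wraps around
     * instead of being undefined behaviour in C. */
    unsigned a = (unsigned)e->left->value, b = (unsigned)e->right->value;
    e->value = (int)(e->kind == N_ADD ? a + b : a * b);
    e->kind = N_CONST;
    e->left = e->right = NULL;
    return 1;
}
```

	For a tree such as (add local[0] (mul const[2] const[3])) the
	inner multiplication is replaced by const[6], while the outer
	addition is kept because one operand is not a constant.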
|  | 
 | ||
|  | 	Once the above optimizations are optionally performed, a | ||
|  | 	decomposition phase is used to turn some complex opcodes into | ||
|  | 	internal method calls.  In the initial version of the JIT | ||
|  | 	engine, various operations on longs are emulated instead of | ||
|  | 	being inlined.  Also the newarr invocation is turned into a | ||
|  | 	call to the runtime. | ||
|  | 
 | ||
|  | 	At this point, after computing variable liveness, it is | ||
|  | 	possible to use the linear scan algorithm for allocating | ||
|  | 	variables to registers.  The linear scan pass uses the | ||
|  | 	information that was previously gathered by the loop nesting | ||
|  | 	and loop structure computation to favor variables in inner | ||
|  | 	loops.   This process updates the basic block `nesting' field | ||
|  | 	which is later used during liveness analysis. | ||
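
	The idea of a linear scan that favors variables in inner loops
	can be sketched as follows.  This is a simplified illustration
	under stated assumptions, not Mono's allocator: the Interval
	type and linear_scan function are hypothetical, and the loop
	nesting depth is used directly as the spill weight (the
	textbook algorithm spills the interval that ends last).

```c
#include <assert.h>
#include <stdlib.h>

#define NO_REG   (-1)
#define MAX_REGS 32

typedef struct {
    int start, end;    /* live range, as instruction positions */
    int loop_depth;    /* nesting level of the innermost loop using it */
    int reg;           /* output: register number, or NO_REG (stack slot) */
} Interval;

static int by_start(const void *a, const void *b) {
    int x = ((const Interval *)a)->start, y = ((const Interval *)b)->start;
    return (x > y) - (x < y);
}

/* Sorts the intervals by start position (in place) and assigns
 * registers 0..n_regs-1.  Returns the number of spilled intervals. */
int linear_scan(Interval *iv, int n, int n_regs) {
    int active[MAX_REGS];            /* intervals currently holding a register */
    char reg_used[MAX_REGS] = {0};
    int n_active = 0, spilled = 0;

    assert(n_regs >= 0 && n_regs <= MAX_REGS);
    qsort(iv, (size_t)n, sizeof *iv, by_start);
    for (int i = 0; i < n; i++) {
        int kept = 0, reg = NO_REG, victim = -1;

        /* Expire intervals that ended before this one starts. */
        for (int j = 0; j < n_active; j++) {
            if (iv[active[j]].end < iv[i].start)
                reg_used[iv[active[j]].reg] = 0;
            else
                active[kept++] = active[j];
        }
        n_active = kept;

        /* Take a free register if there is one. */
        for (int r = 0; r < n_regs && reg == NO_REG; r++)
            if (!reg_used[r]) reg = r;

        if (reg == NO_REG) {
            /* Spill whichever is used in the shallowest loop: the new
             * interval or the least deeply nested active one. */
            for (int j = 0; j < n_active; j++)
                if (victim < 0 ||
                    iv[active[j]].loop_depth < iv[active[victim]].loop_depth)
                    victim = j;
            if (victim >= 0 && iv[active[victim]].loop_depth < iv[i].loop_depth) {
                reg = iv[active[victim]].reg;
                iv[active[victim]].reg = NO_REG;
                active[victim] = active[--n_active];
            }
            spilled++;
        }
        iv[i].reg = reg;
        if (reg != NO_REG) {
            reg_used[reg] = 1;
            active[n_active++] = i;
        }
    }
    return spilled;
}
```

	With a single register, a variable used inside a loop takes
	the register away from an overlapping variable that is only
	used outside of loops; the latter is given a stack slot.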
|  | 
 | ||
|  | 	Stack space is then reserved for the local variables and any | ||
|  | 	temporary variables generated during the various | ||
|  | 	optimizations. | ||
|  | 
 | ||
|  | ** Instruction selection: Only used up until Mono 2.0 | ||
|  | 
 | ||
|  | 	At this point, the BURS instruction selector is invoked to | ||
|  | 	transform the tree-based representation into a list of | ||
|  | 	instructions.  This is done using a tree pattern matcher that | ||
|  | 	is generated for the architecture using the `monoburg' tool.  | ||
|  | 
 | ||
|  | 	Monoburg takes as input a file that describes tree patterns, | ||
|  | 	which are matched against the trees that were produced by the | ||
|  | 	engine in the previous stages. | ||
|  | 
 | ||
|  | 	The pattern matching might have more than one match for a | ||
|  | 	particular tree.  In this case, the match selected is the one | ||
|  | 	whose cost is the smallest.  A cost can be attached to each | ||
|  | 	rule, and if no cost is provided, the implicit cost is one. | ||
|  | 	Smaller costs are selected over higher costs. | ||
|  | 
 | ||
|  | 	The cost function can be used to select particular blocks of | ||
|  | 	code for a given architecture, or, by using a prohibitively | ||
|  | 	high number, to keep a rule from matching. | ||
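
	The cost-based choice between competing rules can be
	illustrated with a small hand-written labeler.  This is a
	sketch under stated assumptions: monoburg generates the real
	matcher from the .brg files, and the three rules, the Tree
	type and the label_tree function below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds and rule identifiers. */
typedef enum { T_REG, T_CONST, T_ADD } TreeOp;
typedef enum { R_NONE, R_REG, R_CONST, R_ADD_REG_REG, R_ADD_REG_CONST } RuleId;

typedef struct Tree {
    TreeOp op;
    struct Tree *left, *right;
    int cost;        /* output: cheapest cost to get the value in a register */
    RuleId rule;     /* output: rule selected for this node */
} Tree;

/* Label the tree bottom-up, keeping the cheapest matching rule. */
int label_tree(Tree *t) {
    assert(t != NULL);
    switch (t->op) {
    case T_REG:                   /* value already lives in a register */
        t->cost = 0; t->rule = R_REG;
        break;
    case T_CONST:                 /* reg: CONST, implicit cost of one */
        t->cost = 1; t->rule = R_CONST;
        break;
    case T_ADD: {
        int l = label_tree(t->left), r = label_tree(t->right);
        /* reg: ADD (reg, reg): both operands must be in registers. */
        t->cost = l + r + 1; t->rule = R_ADD_REG_REG;
        /* reg: ADD (reg, CONST): the constant is folded into the
         * instruction, so its own cost is not paid. */
        if (t->right->op == T_CONST && l + 1 < t->cost) {
            t->cost = l + 1; t->rule = R_ADD_REG_CONST;
        }
        break;
    }
    }
    return t->cost;
}
```

	Both ADD rules match a tree of the form (ADD reg CONST); the
	second one wins because its total cost is smaller.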
|  | 
 | ||
|  | 	The various rules that our JIT engine uses transform a tree of | ||
|  | 	MonoInsts into a list of MonoInsts: | ||
|  | 
 | ||
|  | 	+-----------------------------------------------------------+ | ||
|  | 	| Tree                                           List       | | ||
|  | 	| of           ===> Instruction selection ===>   of         | | ||
|  | 	| MonoInst                                       MonoInst.  | | ||
|  |         +-----------------------------------------------------------+ | ||
|  | 
 | ||
|  | 	During this process various "types" of MonoInst kinds | ||
|  | 	disappear and are turned into lower-level representations.  The | ||
|  | 	JIT compiler just happens to reuse the same structure (this is | ||
|  | 	done to reduce memory usage and improve memory locality). | ||
|  | 
 | ||
|  | 	The instruction selection rules are split in a number of | ||
|  | 	files, each one with a particular purpose: | ||
|  | 
 | ||
|  | 	        inssel.brg | ||
|  | 			Contains the generic instruction selection | ||
|  | 			patterns. | ||
|  | 
 | ||
|  | 		inssel-x86.brg | ||
|  | 			Contains x86 specific rules. | ||
|  | 
 | ||
|  | 		inssel-ppc.brg | ||
|  | 			Contains PowerPC specific rules. | ||
|  | 
 | ||
|  | 		inssel-long32.brg | ||
|  | 			burg file for 64-bit instructions on 32-bit architectures. | ||
|  | 
 | ||
|  | 		inssel-long.brg | ||
|  | 			burg file for 64-bit architectures. | ||
|  | 
 | ||
|  | 		inssel-float.brg | ||
|  | 			burg file for floating point instructions | ||
|  | 		 | ||
|  | 	For a given build, a set of those files would be included. | ||
|  | 	For example, for the build of Mono on the x86, the following | ||
|  | 	set is used: | ||
|  | 
 | ||
|  | 	    inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg | ||
|  | 
 | ||
|  | ** Native method generation | ||
|  | 
 | ||
|  | 	The native method generation has a number of steps: | ||
|  | 
 | ||
|  | 		* Architecture specific register allocation. | ||
|  | 
 | ||
|  | 		  The information about loop nesting that was | ||
|  | 		  previously gathered is used here to hint the | ||
|  | 		  register allocator.  | ||
|  | 
 | ||
|  | 		* Generating the method prolog/epilog. | ||
|  | 
 | ||
|  | 		* Optionally generate code to introduce tracing facilities. | ||
|  | 
 | ||
|  | 		* Hooking into the debugger. | ||
|  | 
 | ||
|  | 		* Performing any pending fixups.  | ||
|  | 
 | ||
|  | 		* Code generation. | ||
|  | 
 | ||
|  | *** Code Generation | ||
|  | 
 | ||
|  | 	The actual code generation is contained in the architecture | ||
|  | 	specific portion of the compiler.  The input to the code | ||
|  | 	generator is each one of the basic blocks with its list of | ||
|  | 	instructions that were produced in the instruction selection | ||
|  | 	phase. | ||
|  | 
 | ||
|  | 	During the instruction selection phase, virtual registers are | ||
|  | 	assigned.  Just before the peephole optimization is performed, | ||
|  | 	physical registers are assigned. | ||
|  | 
 | ||
|  | 	A simple peephole and algebraic optimizer is run at this | ||
|  | 	stage.   | ||
|  | 
 | ||
|  | 	The peephole optimizer removes some redundant operations at | ||
|  | 	this point.  This is possible because the code generation at | ||
|  | 	this point has visibility into the basic block that spans the | ||
|  | 	original trees.   | ||
|  | 
 | ||
|  | 	The algebraic optimizer performs some simple algebraic | ||
|  | 	optimizations that replace expensive operations with cheaper | ||
|  | 	operations if possible. | ||
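
	A minimal sketch of such a pass over the instruction list of
	one basic block is shown below.  The instruction type and
	opcode names are hypothetical; the two rewrites (dropping a
	move of a register onto itself, and turning a multiplication
	by a power of two into a shift) are common examples and are
	not taken from the Mono sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical low-level instruction, for illustration only. */
typedef enum { I_MOVE, I_MUL_IMM, I_SHL_IMM, I_NOP } InsOp;
typedef struct { InsOp op; int dreg, sreg, imm; } Ins;

/* Returns log2(v) if v is a power of two greater than one, else -1. */
int power_of_two_shift(int v) {
    int shift = 0;
    if (v <= 1 || (v & (v - 1)) != 0)
        return -1;
    while ((v >>= 1) != 0)
        shift++;
    return shift;
}

/* One pass over a basic block.  Returns the number of rewrites. */
int optimize_block(Ins *ins, int n) {
    int changed = 0;
    assert(ins != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        if (ins[i].op == I_MOVE && ins[i].dreg == ins[i].sreg) {
            /* Peephole: moving a register onto itself does nothing. */
            ins[i].op = I_NOP;
            changed++;
        } else if (ins[i].op == I_MUL_IMM) {
            /* Algebraic: x * 2^k is the same as x << k. */
            int shift = power_of_two_shift(ins[i].imm);
            if (shift >= 0) {
                ins[i].op = I_SHL_IMM;
                ins[i].imm = shift;
                changed++;
            }
        }
    }
    return changed;
}
```

	Working on a flat list makes this kind of pass a simple loop,
	which is what was hard to achieve on the tree representation.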
|  | 
 | ||
|  | 	The rest of the code generation is fairly simple: a switch | ||
|  | 	statement is used to generate code for each of the MonoInsts. | ||
|  | 	This is done in the mono/mini/mini-ARCH.c files, in the | ||
|  | 	function called "mono_arch_output_basic_block". | ||
|  | 
 | ||
|  | 	We always try to allocate code in sequence, instead of just using | ||
|  | 	malloc. This way we increase spatial locality which gives a massive | ||
|  | 	speedup on most architectures. | ||
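
	Allocating code in sequence amounts to a bump allocator over
	one large region.  The sketch below only illustrates the idea:
	CodeChunk and code_reserve are hypothetical names, and a real
	JIT would obtain the region from the operating system as
	executable memory instead of using an ordinary buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;     /* start of one large, previously reserved region */
    size_t size;       /* total bytes in the region */
    size_t used;       /* bytes handed out so far */
} CodeChunk;

/* Reserve `len` bytes right after the previously allocated method,
 * rounded up to `align` (a power of two).  Returns NULL if the chunk
 * cannot hold the request. */
uint8_t *code_reserve(CodeChunk *c, size_t len, size_t align) {
    assert(c != NULL && align != 0 && (align & (align - 1)) == 0);
    size_t start = (c->used + align - 1) & ~(align - 1);
    if (start > c->size || len > c->size - start)
        return NULL;
    c->used = start + len;
    return c->base + start;
}
```

	Methods compiled one after the other end up next to each other
	in memory, which is the locality property described above.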
|  | 
 | ||
|  | *** Ahead of Time compilation | ||
|  | 
 | ||
|  | 	Ahead-of-Time compilation is a new feature of our new | ||
|  | 	compilation engine.  The compilation engine is shared by the | ||
|  | 	Just-in-Time (JIT) compiler and the Ahead-of-Time compiler | ||
|  | 	(AOT). | ||
|  | 
 | ||
|  | 	The difference is in the set of optimizations that are turned | ||
|  | 	on for each mode: Just-in-Time compilation should be as fast | ||
|  | 	as possible, while Ahead-of-Time compilation can take as long | ||
|  | 	as required, because it is not done at a time-critical | ||
|  | 	moment. | ||
|  | 
 | ||
|  | 	With AOT compilation, we can afford to turn all of the | ||
|  | 	computationally expensive optimizations on. | ||
|  | 
 | ||
|  | 	After the code generation phase is done, the code and any | ||
|  | 	required fixup information is saved into a file that is | ||
|  | 	readable by "as" (the native assembler available on all | ||
|  | 	systems). This assembly file is then passed to the native | ||
|  | 	assembler, which generates a loadable module. | ||
|  | 
 | ||
|  | 	At execution time, when an assembly is loaded from the disk, | ||
|  | 	the runtime engine will probe for the existence of a | ||
|  | 	pre-compiled image.  If the pre-compiled image exists, then it | ||
|  | 	is loaded, and the method invocations are resolved to the code | ||
|  | 	contained in the loaded module. | ||
|  | 
 | ||
|  | 	The code generated under the AOT scenario is slightly | ||
|  | 	different from the JIT scenario.  It generates code that is | ||
|  | 	application-domain relative and that can be shared among | ||
|  | 	multiple threads. | ||
|  | 
 | ||
|  | 	This is the same code generation that is used when the runtime | ||
|  | 	is instructed to maximize code sharing on a multi-application | ||
|  | 	domain scenario. | ||
|  | 
 | ||
|  | * SSA-based optimizations | ||
|  | 
 | ||
|  | 	SSA form simplifies many optimizations because each variable | ||
|  | 	has exactly one definition site.  This means that each | ||
|  | 	variable is only initialized once.   | ||
|  | 
 | ||
|  | 	For example, code like this: | ||
|  | 
 | ||
|  | 	    a = 1 | ||
|  | 	    .. | ||
|  | 	    a = 2 | ||
|  | 	    call (a) | ||
|  | 
 | ||
|  | 	is internally turned into: | ||
|  | 
 | ||
|  | 	    a1 = 1 | ||
|  | 	    .. | ||
|  | 	    a2 = 2 | ||
|  | 	    call (a2) | ||
|  | 
 | ||
|  | 	In the presence of branches, like: | ||
|  | 
 | ||
|  | 	    if (x) | ||
|  | 	         a = 1 | ||
|  | 	    else | ||
|  | 		 a = 2 | ||
|  | 
 | ||
|  |             call (a) | ||
|  | 
 | ||
|  | 	The code is turned into: | ||
|  | 
 | ||
|  | 	    if (x) | ||
|  | 	         a1 = 1; | ||
|  | 	    else | ||
|  | 	         a2 = 2; | ||
|  | 	    a3 = phi (a1, a2) | ||
|  | 	    call (a3) | ||
|  | 
 | ||
|  | 	All uses of a variable are "dominated" by its definition. | ||
|  | 
 | ||
|  | 	This representation is useful as it simplifies the | ||
|  | 	implementation of a number of optimizations like conditional | ||
|  | 	constant propagation, array bounds check removal and dead code | ||
|  | 	elimination.  | ||
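
	The renaming step for straight-line code can be sketched as
	follows.  This is a deliberately small illustration with
	hypothetical names (Stmt, ssa_rename): variables are single
	letters, each statement reads at most one variable, and the
	placement of phi functions at branch joins is not covered.

```c
#include <assert.h>

typedef struct {
    char dest;          /* variable assigned, 'a'..'z' */
    char use;           /* variable read on the right-hand side, 0 if none */
    int dest_version;   /* output: SSA version of the definition */
    int use_version;    /* output: SSA version that is read, 0 if none */
} Stmt;

/* Give every definition a fresh version number and make each use
 * refer to the most recent definition of its variable. */
void ssa_rename(Stmt *s, int n) {
    int version[26] = {0};
    for (int i = 0; i < n; i++) {
        assert(s[i].dest >= 'a' && s[i].dest <= 'z');
        s[i].use_version = 0;
        if (s[i].use != 0) {
            assert(s[i].use >= 'a' && s[i].use <= 'z');
            /* The use is resolved before the new definition is counted. */
            s[i].use_version = version[s[i].use - 'a'];
        }
        s[i].dest_version = ++version[s[i].dest - 'a'];
    }
}
```

	For the sequence a = 1; b = a; a = 2; c = a this produces
	a1 = 1; b1 = a1; a2 = 2; c1 = a2, matching the first example
	above.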
|  | 
 | ||
|  | * Register allocation. | ||
|  | 
 | ||
|  | 	Global register allocation is performed on the medium | ||
|  | 	intermediate representation just before instruction selection | ||
|  | 	is performed on the method.  Local register allocation is | ||
|  | 	later performed at the basic-block level. | ||
|  | 
 | ||
|  | 	Global register allocation uses the following input: | ||
|  | 
 | ||
|  |         1) set of register-sized variables that can be allocated to a | ||
|  |         register (the registers available are an architecture specific | ||
|  |         setting; for x86 they are the callee-saved registers ESI, EDI | ||
|  |         and EBX). | ||
|  | 
 | ||
|  |         2) liveness information for the variables | ||
|  | 
 | ||
|  |         3) (optionally) loop info to favor variables that are used in | ||
|  |         inner loops. | ||
|  | 
 | ||
|  | 	During the instruction selection phase, symbolic registers are | ||
|  | 	assigned to temporary values in expressions. | ||
|  | 
 | ||
|  | 	Local register allocation assigns hard registers to the | ||
|  | 	symbolic registers.  It is performed at the basic block level, | ||
|  | 	just before the code is actually emitted. | ||
|  | 	A CPU description file describes the input registers, output | ||
|  | 	registers, fixed registers and registers clobbered by each | ||
|  | 	operation. | ||
|  | 
 | ||
|  | * BURG Code Generator Generator: Only used up to Mono 2.0 | ||
|  | 
 | ||
|  |        monoburg was written by Dietmar Maurer. It is based on the | ||
|  |        papers from Christopher W. Fraser, Robert R. Henry and Todd | ||
|  |        A. Proebsting: "BURG - Fast Optimal Instruction Selection and | ||
|  |        Tree Parsing" and "Engineering a Simple, Efficient Code | ||
|  |        Generator Generator". | ||
|  | 
 | ||
|  |        The original BURG implementation is unable to work on DAGs; only | ||
|  |        trees are allowed. Our monoburg implementation is able to generate a tree | ||
|  |        matcher which works on DAGs, and we use this feature in the new | ||
|  |        JIT. This simplifies the code because we can directly pass DAGs and | ||
|  |        don't need to convert them to trees. | ||
|  | 
 | ||
|  | * Adding IL opcodes: an exercise (from a post by Paolo Molaro) | ||
|  | 
 | ||
|  | 	mini.c is the file that reads the IL code stream and decides | ||
|  | 	how any single IL instruction is implemented | ||
|  | 	(mono_method_to_ir () func), so you always have to add an | ||
|  | 	entry to the big switch inside the function: there are plenty | ||
|  | 	of examples in that file. | ||
|  | 
 | ||
|  | 	An IL opcode can be implemented in a number of ways, depending | ||
|  | 	on what it does and how it needs to do it. | ||
|  | 	 | ||
|  | 	Some opcodes are implemented using a helper function: one of | ||
|  | 	the simpler examples is the CEE_STELEM_REF implementation. | ||
|  | 
 | ||
|  | 	In this case the opcode implementation is written in a C | ||
|  | 	function.  You will need to register the function with the jit | ||
|  | 	before you can use it (mono_register_jit_call) and you need to | ||
|  | 	emit the call to the helper using the mono_emit_jit_icall() | ||
|  | 	function.   | ||
|  | 
 | ||
|  | 	This is the simplest way to add a new opcode and it doesn't | ||
|  | 	require any arch-specific change (though it's limited to what | ||
|  | 	you can do in C code and the performance may be limited by the | ||
|  | 	function call). | ||
|  | 	 | ||
|  | 	Other opcodes can be implemented with one or more of the already | ||
|  | 	implemented low-level instructions.  | ||
|  | 
 | ||
|  | 	An example is the OP_STRLEN opcode which implements | ||
|  | 	String.Length using a simple load from memory.  In this case | ||
|  | 	you need to add a rule to the appropriate burg file, | ||
|  | 	describing what the arguments of the opcode are and what, | ||
|  | 	if any, its 'return' value is. | ||
|  | 
 | ||
|  | 	The OP_STRLEN case is: | ||
|  | 	 | ||
|  | 	reg: OP_STRLEN (reg) {   | ||
|  | 		MONO_EMIT_LOAD_MEMBASE_OP (s, tree, OP_LOADI4_MEMBASE, state->reg1,  | ||
|  | 			state->left->reg1, G_STRUCT_OFFSET (MonoString, length)); | ||
|  | 	} | ||
|  | 
 | ||
|  | 	The above means: the OP_STRLEN takes a register as an argument | ||
|  | 	and returns its value in a register.  The implementation | ||
|  | 	is included in the braces. | ||
|  | 	 | ||
|  | 	The opcode returns a value in an integer register | ||
|  | 	(state->reg1) by performing an int32 load of the length field | ||
|  | 	of the MonoString represented by the input register | ||
|  | 	(state->left->reg1).  Before the burg rules are applied, the | ||
|  | 	internal representation is based on trees, so the operands | ||
|  | 	are reached through the left/right pointers (state->left and | ||
|  | 	state->right respectively), while the result is stored in | ||
|  | 	state->reg1. | ||
|  | 
 | ||
|  | 	This instruction implementation doesn't require arch-specific | ||
|  | 	changes (it uses MONO_EMIT_LOAD_MEMBASE_OP, which is | ||
|  | 	available on all platforms), and the code it produces is | ||
|  | 	usually fast. | ||
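
	To make the "simple load from memory" concrete, here is a
	minimal sketch in plain C of what the rule amounts to: a
	4-byte load at a fixed offset from a base pointer.  ToyString
	is a hypothetical stand-in (the real MonoString layout
	differs), and offsetof plays the role that G_STRUCT_OFFSET
	plays in the rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for MonoString; the real layout differs. */
typedef struct {
	void    *vtable;
	void    *synchronisation;
	int32_t  length;
} ToyString;

/* What a LOADI4_MEMBASE-style load does: read 4 bytes at base + offset. */
static int32_t
toy_load_i4_membase (const void *base, size_t offset)
{
	int32_t value;

	assert (base != NULL);
	memcpy (&value, (const char *) base + offset, sizeof (value));
	return value;
}

/* Equivalent of the OP_STRLEN rule: one load from the length field. */
static int32_t
toy_strlen (const ToyString *str)
{
	return toy_load_i4_membase (str, offsetof (ToyString, length));
}
```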
|  | 	 | ||
|  | 	Next we have opcodes that must be implemented with new low-level | ||
|  | 	architecture-specific instructions (either for performance | ||
|  | 	reasons or because the functionality can't be implemented in | ||
|  | 	any other way).   | ||
|  | 
 | ||
|  | 	You need a burg rule in this case, too.  For example, | ||
|  | 	consider the OP_CHECK_THIS opcode (used to raise an exception | ||
|  | 	if the this pointer is null).  The burg rule simply reads: | ||
|  | 	 | ||
|  | 	stmt: OP_CHECK_THIS (reg) { | ||
|  | 		mono_bblock_add_inst (s->cbb, tree); | ||
|  | 	} | ||
|  | 	 | ||
|  | 	Note that this opcode does not return a value (hence the | ||
|  | 	"stmt") and it takes a register as input. | ||
|  | 
 | ||
|  | 	mono_bblock_add_inst (s->cbb, tree) just adds the instruction | ||
|  | 	(the tree variable) to the current basic block (s->cbb). In | ||
|  | 	mini this is the place where the internal representation | ||
|  | 	switches from the tree format to the low-level format (the | ||
|  | 	list of simple instructions). | ||
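
	Appending an instruction to the current basic block can be
	pictured as a plain linked-list append.  The sketch below is
	an assumption about the general shape only: ToyInst, ToyBB and
	toy_bblock_add_inst are hypothetical names, not the runtime's
	own structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical low-level instruction: an opcode plus a link to the next one. */
typedef struct ToyInst {
	int             opcode;
	struct ToyInst *next;
} ToyInst;

/* Hypothetical basic block: head and tail of its instruction list. */
typedef struct {
	ToyInst *code;
	ToyInst *last_ins;
} ToyBB;

/* Append inst at the end of the block, keeping emission order. */
static void
toy_bblock_add_inst (ToyBB *bb, ToyInst *inst)
{
	assert (bb != NULL && inst != NULL);
	inst->next = NULL;
	if (bb->last_ins)
		bb->last_ins->next = inst;
	else
		bb->code = inst;
	bb->last_ins = inst;
}
```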
|  | 
 | ||
|  | 	In this case the actual opcode implementation is delegated to | ||
|  | 	the arch-specific code.  A low-level opcode needs an entry in | ||
|  | 	the machine description (the *.md files in mini/).  This entry | ||
|  | 	describes what kind of registers, if any, are used by the | ||
|  | 	instruction, as well as other architecture-specific details | ||
|  | 	such as constraints or other hints to the low-level engine. | ||
|  | 
 | ||
|  | 	cpu-pentium.md, for example, has the following entry: | ||
|  | 	 | ||
|  | 	checkthis: src1:b len:3 | ||
|  | 	 | ||
|  | 	This means the instruction uses an integer register as a base | ||
|  | 	pointer (basically a load or store is done on it) and it takes | ||
|  | 	3 bytes of native code to implement it. | ||
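
	As an illustration of how such an entry breaks down, the
	following sketch parses a line of the form shown above into a
	small record.  It is not the tool the build uses: ToyInsDesc
	and toy_parse_md_line are hypothetical, the dest and src2 keys
	are assumed by analogy with src1, and any other key is simply
	rejected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one machine description entry. */
typedef struct {
	char name [32];
	char dest;	/* register kind, 0 when absent */
	char src1;
	char src2;
	int  len;	/* maximum native code length in bytes */
} ToyInsDesc;

/* Parse "name: key:value ...".  Returns 0 on success, -1 on error. */
static int
toy_parse_md_line (const char *line, ToyInsDesc *desc)
{
	char buf [128];
	char *tok;
	const char *colon;

	assert (line != NULL && desc != NULL);
	memset (desc, 0, sizeof (*desc));

	/* The opcode name is everything before the first colon. */
	colon = strchr (line, ':');
	if (!colon || colon == line ||
	    (size_t) (colon - line) >= sizeof (desc->name))
		return -1;
	memcpy (desc->name, line, (size_t) (colon - line));

	if (strlen (colon + 1) >= sizeof (buf))
		return -1;
	strcpy (buf, colon + 1);

	/* The rest is a whitespace-separated list of key:value pairs. */
	for (tok = strtok (buf, " \t"); tok; tok = strtok (NULL, " \t")) {
		if (strncmp (tok, "dest:", 5) == 0 && tok [5]) {
			desc->dest = tok [5];
		} else if (strncmp (tok, "src1:", 5) == 0 && tok [5]) {
			desc->src1 = tok [5];
		} else if (strncmp (tok, "src2:", 5) == 0 && tok [5]) {
			desc->src2 = tok [5];
		} else if (strncmp (tok, "len:", 4) == 0) {
			if (sscanf (tok + 4, "%d", &desc->len) != 1)
				return -1;
		} else {
			return -1;
		}
	}
	return 0;
}
```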
|  | 
 | ||
|  | 	Now you just need to provide the low-level implementation for | ||
|  | 	the opcode in one of the mini-$arch.c files, in the | ||
|  | 	mono_arch_output_basic_block() function. There is a big switch | ||
|  | 	here too. The x86 implementation is: | ||
|  | 
 | ||
|  | 		case OP_CHECK_THIS: | ||
|  | 			/* ensure ins->sreg1 is not NULL */ | ||
|  | 			x86_alu_membase_imm (code, X86_CMP, ins->sreg1, 0, 0); | ||
|  | 			break; | ||
|  | 	 | ||
|  | 	If the $arch-codegen.h header file doesn't have the code to | ||
|  | 	emit the low-level native code, you'll need to write that as | ||
|  | 	well.   | ||
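
	To give an idea of what such an emitter looks like, here is a
	simplified sketch that writes the three bytes of
	"cmp dword ptr [base], 0" into a buffer, which matches the
	len:3 in the machine description.  It is not the code from
	the codegen header: toy_emit_check_this and the TOY_ register
	constants are hypothetical, and the sketch deliberately leaves
	out ESP (which needs a SIB byte) and EBP (which needs an
	explicit displacement).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x86 register numbers as used in the ModRM byte. */
enum {
	TOY_EAX = 0, TOY_ECX = 1, TOY_EDX = 2, TOY_EBX = 3,
	TOY_ESP = 4, TOY_EBP = 5, TOY_ESI = 6, TOY_EDI = 7
};

/* Emit "cmp dword ptr [base], 0"; returns the position after the bytes. */
static uint8_t *
toy_emit_check_this (uint8_t *code, int base)
{
	assert (code != NULL);
	/* ESP and EBP need extra encoding bytes; not handled in this sketch. */
	assert (base >= 0 && base <= 7 && base != TOY_ESP && base != TOY_EBP);

	*code++ = 0x83;					/* ALU op on r/m32 with an imm8 */
	*code++ = (uint8_t) ((7 << 3) | base);		/* ModRM: mod=00, /7 = CMP, rm=base */
	*code++ = 0x00;					/* imm8 = 0 */
	return code;
}
```

	The result of the comparison is not used afterwards; the
	instruction is there for its memory access, which presumably
	faults when the register holds a null pointer.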
|  | 
 | ||
|  | 	Complex opcodes with register constraints may require other | ||
|  | 	changes to the local register allocator, but usually they are | ||
|  | 	not needed. | ||
|  | 		 | ||
|  | * Future | ||
|  | 
 | ||
|  |         Profile-based optimization is something that we are very | ||
|  |         interested in supporting.  There are two possible usage | ||
|  |         scenarios:  | ||
|  | 
 | ||
|  | 	   * Based on the profile information gathered during | ||
|  |              the execution of a program, hot methods can be compiled | ||
|  |              with the highest level of optimizations, while bootstrap | ||
|  |              code and cold methods can be compiled with a minimal set | ||
|  |              of optimizations and placed in a discardable list. | ||
|  | 
 | ||
|  | 	   * Code reordering: this profile-based optimization would | ||
|  |              only make sense for pre-compiled code.  The profile | ||
|  |              information is used to re-order the assembly code on disk | ||
|  |              so that the code is laid out in a way that increases | ||
|  |              locality.   | ||
|  | 
 | ||
|  | 	     This is the same principle under which SGI's cord program | ||
|  | 	     works.   | ||
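
	As a rough sketch of the reordering idea, the profile can be
	reduced to a call count per method and the methods sorted so
	that the hottest ones end up next to each other.  This only
	illustrates the principle: ToyMethodProfile and
	toy_order_by_hotness are hypothetical, and a real layout pass
	would also have to consider which methods call each other.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-method profile record. */
typedef struct {
	const char    *name;
	unsigned long  calls;
} ToyMethodProfile;

/* qsort comparator: methods with more calls sort first. */
static int
toy_compare_hotness (const void *a, const void *b)
{
	const ToyMethodProfile *ma = a;
	const ToyMethodProfile *mb = b;

	if (ma->calls > mb->calls)
		return -1;
	if (ma->calls < mb->calls)
		return 1;
	return 0;
}

/* Reorder the array in place so hot methods become neighbours. */
static void
toy_order_by_hotness (ToyMethodProfile *methods, size_t count)
{
	assert (methods != NULL);
	qsort (methods, count, sizeof (methods [0]), toy_compare_hotness);
}
```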
|  | 
 | ||
|  | 	The nature of the CIL makes the above optimizations easy to | ||
|  | 	implement and deploy.  Since we define and control our own | ||
|  | 	universe for these things, no interaction with system tools | ||
|  | 	is required, nor any upgrade of the underlying | ||
|  | 	infrastructure. | ||
|  | 
 | ||
|  | 	Instruction scheduling is important for certain kinds of | ||
|  | 	processors.  Some of the framework needed to support it exists | ||
|  | 	today in our register allocator and instruction selector, but | ||
|  | 	it has not been finished.  The instruction selection | ||
|  | 	would happen at the same time as local register allocation. < |