You've already forked linux-packaging-mono
							
							
		
			
	
	
		
			248 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
		
		
			
		
	
	
			248 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
|   | ==================
 | ||
|  | Vectorization Plan
 | ||
|  | ==================
 | ||
|  | 
 | ||
|  | .. contents::
 | ||
|  |    :local: | ||
|  | 
 | ||
|  | Abstract
 | ||
|  | ========
 | ||
|  | The vectorization transformation can be rather complicated, involving several
 | ||
|  | potential alternatives, especially for outer-loops [1]_ but also possibly for
 | ||
|  | innermost loops. These alternatives may have significant performance impact,
 | ||
|  | both positive and negative. A cost model is therefore employed to identify the
 | ||
|  | best alternative, including the alternative of avoiding any transformation
 | ||
|  | altogether.
 | ||
|  | 
 | ||
|  | The Vectorization Plan is an explicit model for describing vectorization
 | ||
|  | candidates. It serves for both optimizing candidates including estimating their
 | ||
|  | cost reliably, and for performing their final translation into IR. This
 | ||
|  | facilitates dealing with multiple vectorization candidates.
 | ||
|  | 
 | ||
|  | High-level Design
 | ||
|  | =================
 | ||
|  | 
 | ||
|  | Vectorization Workflow
 | ||
|  | ----------------------
 | ||
|  | VPlan-based vectorization involves three major steps, taking a "scenario-based
 | ||
|  | approach" to vectorization planning:
 | ||
|  | 
 | ||
|  | 1. Legal Step: check if a loop can be legally vectorized; encode constraints and
 | ||
|  |    artifacts if so.
 | ||
|  | 2. Plan Step:
 | ||
|  | 
 | ||
|  |    a. Build initial VPlans following the constraints and decisions taken by
 | ||
|  |       Legal Step 1, and compute their cost.
 | ||
|  |    b. Apply optimizations to the VPlans, possibly forking additional VPlans.
 | ||
|  |       Prune sub-optimal VPlans having relatively high cost.
 | ||
|  | 3. Execute Step: materialize the best VPlan. Note that this is the only step
 | ||
|  |    that modifies the IR.
 | ||
|  | 
 | ||
|  | Design Guidelines
 | ||
|  | -----------------
 | ||
|  | In what follows, the term "input IR" refers to code that is fed into the
 | ||
|  | vectorizer whereas the term "output IR" refers to code that is generated by the
 | ||
|  | vectorizer. The output IR contains code that has been vectorized or "widened"
 | ||
|  | according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed
 | ||
|  | according to an Unroll Factor (UF).
 | ||
|  | The design of VPlan follows several high-level guidelines:
 | ||
|  | 
 | ||
|  | 1. Analysis-like: building and manipulating VPlans must not modify the input IR.
 | ||
|  |    In particular, if the best option is not to vectorize at all, the
 | ||
|  |    vectorization process terminates before reaching Step 3, and compilation
 | ||
|  |    should proceed as if VPlans had not been built.
 | ||
|  | 
 | ||
|  | 2. Align Cost & Execute: each VPlan must support both estimating the cost and
 | ||
|  |    generating the output IR code, such that the cost estimation evaluates the
 | ||
|  |    to-be-generated code reliably.
 | ||
|  | 
 | ||
|  | 3. Support vectorizing additional constructs:
 | ||
|  | 
 | ||
|  |    a. Outer-loop vectorization. In particular, VPlan must be able to model the
 | ||
|  |       control-flow of the output IR which may include multiple basic-blocks and
 | ||
|  |       nested loops.
 | ||
|  |    b. SLP vectorization.
 | ||
|  |    c. Combinations of the above, including nested vectorization: vectorizing
 | ||
|  |       both an inner loop and an outer-loop at the same time (each with its own
 | ||
|  |       VF and UF), mixed vectorization: vectorizing a loop with SLP patterns
 | ||
|  |       inside [4]_, (re)vectorizing input IR containing vector code.
 | ||
|  |    d. Function vectorization [2]_.
 | ||
|  | 
 | ||
|  | 4. Support multiple candidates efficiently. In particular, similar candidates
 | ||
|  |    related to a range of possible VF's and UF's must be represented efficiently.
 | ||
|  |    Potential versioning needs to be supported efficiently.
 | ||
|  | 
 | ||
|  | 5. Support vectorizing idioms, such as interleaved groups of strided loads or
 | ||
|  |    stores. This is achieved by modeling a sequence of output instructions using
 | ||
|  |    a "Recipe", which is responsible for computing its cost and generating its
 | ||
|  |    code.
 | ||
|  | 
 | ||
|  | 6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization
 | ||
|  |    such regions may need to be, for example, predicated and linearized, or
 | ||
|  |    replicated VF*UF times to handle scalarized and predicated instructions.
 | ||
|  |    Innerloops are also modelled as SESE regions.
 | ||
|  | 
 | ||
|  | 7. Support instruction-level analysis and transformation, as part of Planning
 | ||
|  |    Step 2.b: During vectorization instructions may need to be traversed, moved,
 | ||
|  |    replaced by other instructions or be created. For example, vector idiom
 | ||
|  |    detection and formation involves searching for and optimizing instruction
 | ||
|  |    patterns.
 | ||
|  | 
 | ||
|  | Definitions
 | ||
|  | ===========
 | ||
|  | The low-level design of VPlan comprises of the following classes.
 | ||
|  | 
 | ||
|  | :LoopVectorizationPlanner: | ||
|  |   A LoopVectorizationPlanner is designed to handle the vectorization of a loop
 | ||
|  |   or a loop nest. It can construct, optimize and discard one or more VPlans,
 | ||
|  |   each VPlan modelling a distinct way to vectorize the loop or the loop nest.
 | ||
|  |   Once the best VPlan is determined, including the best VF and UF, this VPlan
 | ||
|  |   drives the generation of output IR.
 | ||
|  | 
 | ||
|  | :VPlan: | ||
|  |   A model of a vectorized candidate for a given input IR loop or loop nest. This
 | ||
|  |   candidate is represented using a Hierarchical CFG. VPlan supports estimating
 | ||
|  |   the cost and driving the generation of the output IR code it represents.
 | ||
|  | 
 | ||
|  | :Hierarchical CFG:
 | ||
|  |   A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The
 | ||
|  |   Hierarchical CFG data structure is similar to the Tile Tree [5]_, where
 | ||
|  |   cross-Tile edges are lifted to connect Tiles instead of the original
 | ||
|  |   basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms
 | ||
|  |   Region and Block are used rather than Tile [5]_ to avoid confusion with loop
 | ||
|  |   tiling.
 | ||
|  | 
 | ||
|  | :VPBlockBase: | ||
|  |   The building block of the Hierarchical CFG. A pure-virtual base-class of
 | ||
|  |   VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical
 | ||
|  |   control-flow relations with other VPBlocks. Note that in contrast to the IR
 | ||
|  |   BasicBlock, a VPBlockBase models its control-flow successors and predecessors
 | ||
|  |   directly, rather than through a Terminator branch or through predecessor
 | ||
|  |   branches that "use" the VPBlockBase.
 | ||
|  | 
 | ||
|  | :VPBasicBlock: | ||
|  |   VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the
 | ||
|  |   Hierarchical CFG. It represents a sequence of output IR instructions that will
 | ||
|  |   appear consecutively in an output IR basic-block. The instructions of this
 | ||
|  |   basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a
 | ||
|  |   sequence of zero or more VPRecipes that model the cost and generation of the
 | ||
|  |   output IR instructions.
 | ||
|  | 
 | ||
|  | :VPRegionBlock: | ||
|  |   VPRegionBlock is a subclass of VPBlockBase. It models a collection of
 | ||
|  |   VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR
 | ||
|  |   CFG. A VPRegionBlock may indicate that its contents are to be replicated a
 | ||
|  |   constant number of times when output IR is generated, effectively representing
 | ||
|  |   a loop with constant trip-count that will be completely unrolled. This is used
 | ||
|  |   to support scalarized and predicated instructions with a single model for
 | ||
|  |   multiple candidate VF's and UF's.
 | ||
|  | 
 | ||
|  | :VPRecipeBase: | ||
|  |   A pure-virtual base class modeling a sequence of one or more output IR
 | ||
|  |   instructions, possibly based on one or more input IR instructions. These
 | ||
|  |   input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe
 | ||
|  |   may specify how its ingredients are to be transformed to produce the output IR
 | ||
|  |   instructions; e.g., cloned once, replicated multiple times or widened
 | ||
|  |   according to selected VF.
 | ||
|  | 
 | ||
|  | :VPValue: | ||
|  |   The base of VPlan's def-use relations class hierarchy. When instantiated, it
 | ||
|  |   models a constant or a live-in Value in VPlan. It has users, which are of type
 | ||
|  |   VPUser, but no operands.
 | ||
|  | 
 | ||
|  | :VPUser: | ||
|  |   A VPValue representing a general vertex in the def-use graph of VPlan. It has
 | ||
|  |   operands which are of type VPValue. When instantiated, it represents a
 | ||
|  |   live-out Instruction that exists outside VPlan. VPUser is similar in some
 | ||
|  |   aspects to LLVM's User class.
 | ||
|  | 
 | ||
|  | :VPInstruction: | ||
|  |   A VPInstruction is both a VPRecipe and a VPUser. It models a single
 | ||
|  |   VPlan-level instruction to be generated if the VPlan is executed, including
 | ||
|  |   its opcode and possibly additional characteristics. It is the basis for
 | ||
|  |   writing instruction-level analyses and optimizations in VPlan as creating,
 | ||
|  |   replacing or moving VPInstructions record both def-use and scheduling
 | ||
|  |   decisions. VPInstructions also extend LLVM IR's opcodes with idiomatic
 | ||
|  |   operations that enrich the Vectorizer's semantics.
 | ||
|  | 
 | ||
|  | :VPTransformState: | ||
|  |   Stores information used for generating output IR, passed from
 | ||
|  |   LoopVectorizationPlanner to its selected VPlan for execution, and used to pass
 | ||
|  |   additional information down to VPBlocks and VPRecipes.
 | ||
|  | 
 | ||
|  | The Planning Process and VPlan Roadmap
 | ||
|  | ======================================
 | ||
|  | 
 | ||
|  | Transforming the Loop Vectorizer to use VPlan follows a staged approach. First,
 | ||
|  | VPlan is used to record the final vectorization decisions, and to execute them:
 | ||
|  | the Hierarchical CFG models the planned control-flow, and Recipes capture
 | ||
|  | decisions taken inside basic-blocks. Next, VPlan will be used also as the basis
 | ||
|  | for taking these decisions, effectively turning them into a series of
 | ||
|  | VPlan-to-VPlan algorithms. Finally, VPlan will support the planning process
 | ||
|  | itself including cost-based analyses for making these decisions, to fully
 | ||
|  | support compositional and iterative decision making.
 | ||
|  | 
 | ||
|  | Some decisions are local to an instruction in the loop, such as whether to widen
 | ||
|  | it into a vector instruction or replicate it, keeping the generated instructions
 | ||
|  | in place. Other decisions, however, involve moving instructions, replacing them
 | ||
|  | with other instructions, and/or introducing new instructions. For example, a
 | ||
|  | cast may sink past a later instruction and be widened to handle first-order
 | ||
|  | recurrence; an interleave group of strided gathers or scatters may effectively
 | ||
|  | move to one place where they are replaced with shuffles and a common wide vector
 | ||
|  | load or store; new instructions may be introduced to compute masks, shuffle the
 | ||
|  | elements of vectors, and pack scalar values into vectors or vice-versa.
 | ||
|  | 
 | ||
|  | In order for VPlan to support making instruction-level decisions and analyses,
 | ||
|  | it needs to model the relevant instructions along with their def/use relations.
 | ||
|  | This too follows a staged approach: first, the new instructions that compute
 | ||
|  | masks are modeled as VPInstructions, along with their induced def/use subgraph.
 | ||
|  | This effectively models masks in VPlan, facilitating VPlan-based predication.
 | ||
|  | Next, the logic embedded within each Recipe for generating its instructions at
 | ||
|  | VPlan execution time, will instead take part in the planning process by modeling
 | ||
|  | them as VPInstructions. Finally, only logic that applies to instructions as a
 | ||
|  | group will remain in Recipes, such as interleave groups and potentially other
 | ||
|  | idiom groups having synergistic cost.
 | ||
|  | 
 | ||
|  | Related LLVM components
 | ||
|  | -----------------------
 | ||
|  | 1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
 | ||
|  |    tree, where TSLP [3]_ adds Plan Step 2.b.
 | ||
|  | 
 | ||
|  | 2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by
 | ||
|  |    Polly [7]_.
 | ||
|  | 
 | ||
|  | 3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of
 | ||
|  |    the Loop Vectorizer and extend it to handle outer loops [8,9]_.
 | ||
|  | 
 | ||
|  | References
 | ||
|  | ----------
 | ||
|  | .. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
 | ||
|  |     Nuzman and Ayal Zaks, PACT 2008.
 | ||
|  | 
 | ||
|  | .. [2] "Proposal for function vectorization and loop vectorization with function
 | ||
|  |     calls", Xinmin Tian, [`cfe-dev
 | ||
|  |     <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_].,
 | ||
|  |     March 2, 2016.
 | ||
|  |     See also `review <https://reviews.llvm.org/D22792>`_.
 | ||
|  | 
 | ||
|  | .. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
 | ||
|  |     Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.
 | ||
|  | 
 | ||
|  | .. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization
 | ||
|  |     overhead", Hao Zhou and Jingling Xue, CGO 2016.
 | ||
|  | 
 | ||
|  | .. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and
 | ||
|  |     Brian Koblenz, PLDI 1991
 | ||
|  | 
 | ||
|  | .. [6] "Structural analysis: A new approach to flow analysis in optimizing
 | ||
|  |     compilers", M. Sharir, Journal of Computer Languages, Jan. 1980
 | ||
|  | 
 | ||
|  | .. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma
 | ||
|  |     thesis, 2011.
 | ||
|  | 
 | ||
|  | .. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
 | ||
|  |     European LLVM Developers' Meeting 2017.
 | ||
|  | 
 | ||
|  | .. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop
 | ||
|  |     Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016.
 |