llvm-capstone/mlir/docs/RationaleSimplifiedPolyhedralForm.md
Jacques Pienaar 430bba2a0f [mlir] Make code blocks more consistent
Use the same form specification for the same type of code.
2019-12-31 09:54:16 -08:00

17 KiB

MLIR: The case for a simplified polyhedral form

MLIR embraces polyhedral compiler techniques for their many advantages representing and transforming dense numerical kernels, but it uses a form that differs significantly from other polyhedral frameworks.

Disclaimer / Warning

This document is a very early design proposal (which has since been accepted) that explored the tradeoffs of using this simplified form vs the traditional polyhedral schedule list form. At some point, this document could be dusted off and written as a proper academic paper, but until now, it is better to included it in this crafty form than not to. Beware that this document uses archaic syntax and should not be considered a canonical reference to modern MLIR.

Introduction

This document discusses general goals of the project, introduces context and the two alternatives, then talks about the tradeoffs of these designs. Written by Chris Lattner.

General goals of an IR, and goals of mlfunc's specifically

Our currently planned representation for MLIR consists of two kinds of functions: an LLVM-like "CFG Function" and an "ML Function": a function represented in multidimensional loop form. The idea is that a CFG function is capable of full generality for expressing arbitrary computation, but is awkward for loop transformations. In contrast, mlfunc's are limited (e.g. to control flow involving loop nests over affine spaces) but these limitations make it much easier to transform and analyze, particularly for the set of computations in a machine learning kernel.

The design of an intermediate representations is an optimization problem, which makes intentional tradeoffs that aim to make certain kinds of compiler transformations simple. After all, it is "possible" to do almost any transformation on any IR: we could theoretically do loop transformations on assembly language. OTOH, such transformations would take too long to write, would be fragile due to irrelevant changes, would be difficult to maintain, and difficult to make target independent. Performing transformations on the "right level" of IR makes it much easier to do analysis and transformation of code, and can make them faster by reducing the size of the IR, and eliminating possibilities that would have otherwise have to be considered.

This is the reason we're interested in adding polyhedral techniques to an IR in the first place: though our base "CFG function" representation is fully capable of expressing any computation, it is "too" expressive. The limitations imposed by polyhedral techniques (e.g. on affine loop bounds and array subscripts) define a closed algebra that can represent an interesting range of transformations and their compositions, and because of their simplicity, we can perform (e.g.) dependence analysis more efficiently and more reliably.

This raises an important question that this document examines: given we are introducing a redundant and limited way to express code and transformations, exactly what form is best to perform the analyses and transformations we want?

We explore two different design points that are capable of expressing the same class of affine loop computations, but which use different representational forms. These forms trade off verbosity, ease of transformation, and ease of analysis in interesting ways.

Context: Traditional Polyhedral Form

We started by discussing a representation that uses the traditional polyhedral schedule set + domain representation, e.g. consider C-like code like:

  void simple_example(...) {
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
         float tmp = X[i,j]    // S1
         A[i,j] = tmp + 1      // S2
         B[i,j] = tmp * 42     // S3
       }
    }
  }

The polyhedral representation doesn't care about the actual computation, so we will abstract them into S1/S2/S3 in the discussion below. Originally, we planned to represent this with a classical form like (syntax details are not important and probably slightly incorrect below):

  mlfunc @simple_example(... %N) {
    %tmp = call @S1(%X, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 0)

    call @S2(%tmp, %A, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 1)

    call @S3(%tmp, %B, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 2)
  }

In this design, an mlfunc is an unordered bag of instructions whose execution order is fully controlled by their schedule.

However, we recently agreed that a more explicit schedule tree representation is a better fit for our needs, because it exposes important structure that will make analyses and optimizations more efficient, and also makes the scoping of SSA values more explicit. This leads us to a representation along the lines of:

  mlfunc @simple_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0), S3(d0) {
      for S1(d1), S2(d1), S3(d1) {

        %tmp = call @S1(%X, d0, d1)      ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)      ;; S2
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S3(%tmp, %B, d0, d1)      ;; S3
          domain: (0 <= d0 < %N), (0 <= d1 < %N)
      }
    }
  }

This change makes the nesting structure of the loops an explicit part of the representation, and makes lexical ordering within a loop significant (eliminating the constant 0/1/2 of schedules).

It isn't obvious in the example above, but the representation allows for some interesting features, including the ability for instructions within a loop nest to have non-equal domains, like this - the second instruction ignores the outer 10 points inside the loop:

  mlfunc @reduced_domain_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0) {
      for S1(d1), S2(d1) {
        %tmp = call @S1(%X, d0, d1)    ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)      ;; S2
          domain: (10 <= d0 < %N-10), (10 <= d1 < %N-10)
      }
    }
  }

It also allows schedule remapping within the instruction, like this example that introduces a diagonal skew through a simple change to the schedules of the two instructions:

  mlfunc @skewed_domain_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0+d1) {
      for S1(d0+d1), S2(d1) {
        %tmp = call @S1(%X, d0, d1)    ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)      ;; S2
          domain: (0 <= d0 < %N), (0 <= d1 < %N)
      }
    }
  }

This form has great power, and the polyhedral code generator (which lowers from an mlfunc to a cfgfunc representation) handles this power so things that introduce loop transformations don't have to explicitly manipulate the looping structure.

Proposal: Simplified Polyhedral Form

This document proposes and explores the idea of going one step further, moving all of the domain and schedule information into the "schedule tree". In this form, we would have a representation where all instructions inside of a given for-loop are known to have the same domain, which is maintained by the loop. In the simplified form, we also have an "if" instruction that takes an affine condition.

Our simple example above would be represented as:

  mlfunc @simple_example(... %N) {
    affine.for %i = 0 ... %N step 1 {
      affine.for %j = 0 ... %N step 1 {
        // identity noop in this case, but can exist in general.
        %0,%1 = affine.apply #57(%i, %j)

        %tmp = call @S1(%X, %0, %1)

        call @S2(%tmp, %A, %0, %1)

        call @S3(%tmp, %B, %0, %1)
      }
    }
  }

The example with the reduced domain would be represented with an if instruction:

  mlfunc @reduced_domain_example(... %N) {
    affine.for %i = 0 ... %N step 1 {
      affine.for %j = 0 ... %N step 1 {
        // identity noop in this case, but can exist in general.
        %0,%1 = affinecall #57(%i, %j)

        %tmp = call @S1(%X, %0, %1)

        if (10 <= %i < %N-10), (10 <= %j < %N-10) {

          %2,%3 = affine.apply(%i, %j)    // identity noop in this case

          call @S2(%tmp, %A, %2, %3)
        }
      }
    }
  }

These IRs represent exactly the same information, and use a similar information density. The 'traditional' form introduces an extra level of abstraction (schedules and domains) that make it easy to transform instructions at the expense of making it difficult to reason about how those instructions will come out after code generation. With the simplified form, transformations have to do parts of code generation inline with their transformation: instead of simply changing a schedule to (i+j, j) to get skewing, you'd have to generate this code explicitly (potentially implemented by making polyhedral codegen a library that transformations call into):

mlfunc @skewed_domain_example(... %N) {
  affine.for %t1 = 0 ... 2*N-2 step 1 {
    affine.for %t2 = max(0, t1-N+1) ... min(N, t1) step 1 {
      (%i, %j) = (%t1-%t2, %t2)
      ...
    }
  }
}

Evaluation

Both of these forms are capable of expressing the same class of computation: multidimensional loop nests with affine loop bounds and affine memory references. That said, they pose very different tradeoffs in other ways.

Commonality: can express same computation

Both of these can express the same sorts of computation, e.g. kernels written in one form are representable in the other form in all cases.

Commonality: dependence analysis

These representations both use affine functions for data layout mapping and access subscripts, and dependence analysis works the same way.

Commonality: difficulty of determining optimal transformation series

One major challenge in performance of optimization of this sort of code is choosing the ordering and behavior of various loop transformations that get applied. There are non-local effects of every decision, and neither representation helps solve this inherently hard problem.

Commonality: compactness of IR

In the cases that are most relevant to us (hyper rectangular spaces) these forms are directly equivalent: a traditional instruction with a limited domain (e.g. the "reduced_domain_example" above) ends up having one level of ML 'if' inside its loops. The simplified form pays for this by eliminating schedules and domains from the IR. Both forms allow code duplication to reduce dynamic branches in the IR: the traditional approach allows instruction splitting, the simplified form supports instruction duplication.

It is important to point out that the traditional form wins on compactness in the extreme cases: e.g. the loop skewing case. These cases will be rare in practice for our workloads, and are exactly the cases that downstream transformations want to be explicit about what they are doing.

Simplicity of code generation

A key final stage of an mlfunc is its conversion to a CFG function, which is required as part of lowering to the target machine. The simplified form has a clear advantage here: the IR has a direct correspondence to the structure of the generated code.

In contrast, the traditional form has significant complexity in the lowering process to a CFG function, because the verbosity not imbued in the IR needs to come out during code generation. Code generation from ISL shows that it is possible to do this, but it is a non-trivial transformation.

Ease of transformation

An advantage for the traditional form is that it is easier to perform certain transformations on it: skewing and tiling are just transformations on the schedule of the instructions in question, it doesn't require changing the loop structure.

In practice, the simplified form requires moving the complexity of code generation into the transformations themselves - this is sometimes trivial, sometimes involved. The author believes that this should be possible by making the code generation algorithms themselves be library functions that transformations call into, instead of an opaque block that happens at the end of the mlfunc processing.

Also, the sorts of transformations performed today by XLA (including tiling, padding, unrolling, and other rectangular transformations) should be easy enough to implement on either representation. The only cases that are a challenge are more advanced cases like skewing, e.g. for DMA data movement generation.

Ease of analysis: Cost models

The simplified form is much easier for analyses and transformations to build cost models for (e.g. answering the question of "how much code bloat will be caused by unrolling a loop at this level?"), because it is easier to predict what target code will be generated. With the traditional form, these analyses will have to anticipate what polyhedral codegen will do to a set of instructions under consideration: something that is non-trivial in the interesting cases in question (see "Cost of code generation").

Cost of code generation

State of the art polyhedral code generation is expensive and complicated, sometimes exponential time complexity. We expect that most machine learning workloads will be hyper-rectangular, and thus it should be easy to specialize in important cases. That said, the traditional polyhedral representation makes it very easy to introduce complicated and expensive schedules, and provides no way to understand and project a cost model for using them. All downstream clients of the IR need to be prepared to handle the full generality of IR that may come to them.

The simplified form defines this away: the concepts in the IR remain simple, and the code much more directly reflects the cost model for lowering to CFG functions and machine code. This is expected to be very important in the late stages of a code generator for an accelerator.

SSA in ML Functions

We agree already that values defined in an mlfunc can include scalar values and they are defined based on traditional dominance. In the simplified form, this is very simple: arguments and induction variables defined in for-loops are live inside their lexical body, and linear series of instructions have the same "top down" dominance relation that a basic block does.

In the traditional form though, this is not the case: it seems that a lot of knowledge about how codegen will emit the code is necessary to determine if SSA form is correct or not. For example, this is invalid code:

  %tmp = call @S1(%X, %0, %1)
    domain: (10 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j)

  call @S2(%tmp, %A, %0, %1)
    domain: (0 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j)

Because %tmp isn't defined on some iterations of the %i loop.

This matters because it makes the verifier more complicated, but more significantly, it means that load promotion and other optimizations that will produce SSA form will need to be aware of this and be able to model what codegen does.

An emergent property of this that we discussed recently is that PHI nodes in mlfunc's (if we support them) will also have to have domains.

Lack of redundancy in IR

The traditional form has multiple encodings for the same sorts of behavior: you end up having bits on affine.for loops to specify whether codegen should use "atomic/separate" policies, unroll loops, etc. Instructions can be split or can generate multiple copies of their instruction because of overlapping domains, etc.

This is a problem for analyses and cost models, because they each have to reason about these additional forms in the IR.

Suitability to purpose: lowering to machine code

One of the main drivers for this work is lowering to low-level accelerator code, including two-dimensional vectorization, insertion of DMAs, and other utilization of the matrix accelerator units. In the author's opinion, the extra compactness of the traditional form is a negative for this purpose: reasoning about the generated machine code will require understanding the mapping from mlfunc to lowered code, which means that it must understand what code generation will do.

In the simplified form, the effect of "code generation" is always obvious from the IR itself, which should make it easier to perform vectorization to target instructions and other analyses we need to perform.

Third Alternative: two different levels of mlfunc

One hybrid alternative is to support both the traditional and simplified forms of mlfunc in our IR.

The stages could look like this, for example:

  1. Early performance transformations could be done on the traditional form.
  2. Partial code generation lowers to the simplified form
  3. Target specific lowering phases for tiling, and vectorization and other 2D transforms that don't benefit much from the traditional form could be run.
  4. Final codegen to a cfg func can be done when all of the instructions are replaced with ones valid on the target.

While this is possible, it isn't clear what would justify the complexity of this approach. Unless there is a super compelling reason for this, it would be nice to not do this. Update: we discussed this as a design team and agreed that this wouldn't be a good way to go.