helion.language.tile

helion.language.tile(begin_or_end, end_or_none=None, /, block_size=None)

Break up an iteration space defined by a size or sequence of sizes into tiles.

The generated tiles can flatten the iteration space into the product of the sizes, perform multidimensional tiling, swizzle the indices for cache locality, reorder dimensions, etc. The only invariant is that every index in the range of the given sizes is covered exactly once.

The exact tiling strategy is determined by a Config object, typically created through autotuning.

If used at the top level of a function, this becomes the grid of the kernel. Otherwise, it becomes a loop in the output kernel.

The key difference from grid() is that tile gives you Tile objects that load a slice of elements, while grid gives you scalar integer indices. It is recommended to use tile in most cases, since it allows more choices in autotuning.

Parameters:
  • begin_or_end (int | Tensor | Sequence[int | Tensor]) – If two or more positional arguments are provided, the start of the iteration space; otherwise, the end of the iteration space.

  • end_or_none (int | Tensor | Sequence[int | Tensor] | None) – If two or more positional arguments are provided, the end of the iteration space.

  • block_size (object) – Fixed block size (overrides autotuning), or None for an autotuned size.

Returns:

Iterator over tile objects

Return type:

Iterator[Tile] or Iterator[Sequence[Tile]]

Examples

One dimensional tiling:

@helion.kernel
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)

    for tile in hl.tile(x.size(0)):
        # tile processes multiple elements at once
        result[tile] = x[tile] + y[tile]

    return result

Multi-dimensional tiling:

@helion.kernel()
def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    k, n = y.size()
    out = torch.empty([m, n], dtype=x.dtype, device=x.device)

    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
        out[tile_m, tile_n] = acc

    return out

Fixed block size:

@helion.kernel
def process_with_fixed_block(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)

    for tile in hl.tile(x.size(0), block_size=64):
        # Process with fixed block size of 64
        result[tile] = x[tile] * 2

    return result

Using tile properties:

@helion.kernel
def tile_info_example(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros([x.size(0)], dtype=x.dtype, device=x.device)

    for tile in hl.tile(x.size(0)):
        # Access tile properties
        start = tile.begin
        end = tile.end
        size = tile.block_size
        indices = tile.index  # [start, start+1, ..., end-1]

        # Use in computation
        result[tile] = x[tile] + indices

    return result

See also

grid() – iterate with scalar integer indices instead of Tile slices.

Note

Similar to range(), tile() supports multiple call forms (a sketch of the begin/end form follows the list):

  • tile(end) iterates 0 to end-1, autotuned block_size

  • tile(begin, end) iterates begin to end-1, autotuned block_size

  • tile(begin, end, block_size) iterates begin to end-1, fixed block_size

  • tile(end, block_size=block_size) iterates 0 to end-1, fixed block_size
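
As a minimal sketch of the begin/end form (the kernel name and the start argument are illustrative; helion and helion.language as hl are assumed to be imported as in the examples above):

@helion.kernel
def add_from(x: torch.Tensor, y: torch.Tensor, start: int) -> torch.Tensor:
    result = torch.zeros_like(x)

    # tile(begin, end): cover indices start .. x.size(0)-1, block size autotuned
    # tile(start, x.size(0), block_size=32) would pin the block size instead
    for tile in hl.tile(start, x.size(0)):
        result[tile] = x[tile] + y[tile]

    return result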

Block sizes can be registered for autotuning explicitly with register_block_size() and passed as the block_size argument when two loops need to share the same block size. Passing block_size=None is equivalent to calling register_block_size().
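
As a hedged sketch of sharing one block size across two loops (the two-pass row normalization is illustrative and assumes the torch operations used here are supported inside the device loops):

@helion.kernel
def normalize_rows(x: torch.Tensor) -> torch.Tensor:
    m, n = x.size()
    out = torch.empty_like(x)
    # Register one block size for the n dimension so both inner loops share it
    block_n = hl.register_block_size(n)

    for tile_m in hl.tile(m):
        # First pass: accumulate per-row sums
        row_sum = hl.zeros([tile_m], dtype=torch.float32)
        for tile_n in hl.tile(n, block_size=block_n):
            row_sum = row_sum + x[tile_m, tile_n].to(torch.float32).sum(-1)
        # Second pass over the same range, tiled identically
        for tile_n in hl.tile(n, block_size=block_n):
            out[tile_m, tile_n] = (x[tile_m, tile_n] / row_sum[:, None]).to(x.dtype)

    return out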

Use tile in most cases. Use grid when you need explicit control over the launch grid.
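
For comparison, a minimal sketch of the first example rewritten with grid(), which yields scalar indices (the kernel name is illustrative):

@helion.kernel
def add_with_grid(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)

    # grid() yields one scalar index per iteration instead of a Tile,
    # so each iteration handles a single element
    for i in hl.grid(x.size(0)):
        result[i] = x[i] + y[i]

    return result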