helion.language.tile
- helion.language.tile(begin_or_end, end_or_none=None, /, block_size=None)[source]
Break up an iteration space defined by a size or sequence of sizes into tiles.
The generated tiles can flatten the iteration space into the product of the sizes, perform multidimensional tiling, swizzle the indices for cache locality, reorder dimensions, etc. The only invariant is that every index in the range of the given sizes is covered exactly once.
The exact tiling strategy is determined by a Config object, typically created through autotuning.
If used at the top level of a function, this becomes the grid of the kernel. Otherwise, it becomes a loop in the output kernel.
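For illustration, a configuration can also be pinned ahead of time instead of autotuned. The sketch below is an assumption about the Config API (the exact fields, such as block_sizes, and whether kernel() accepts a config= argument may vary by version):

```python
import torch
import helion
import helion.language as hl

# Hedged sketch: pin a Config up front instead of autotuning.
# The Config fields used here (block_sizes) and the config= kwarg are
# assumptions and may differ between helion versions.
@helion.kernel(config=helion.Config(block_sizes=[64]))
def double(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(x.size(0)):  # tiling strategy comes from the Config
        out[tile] = x[tile] * 2
    return out
```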
The key difference from grid() is that tile gives you Tile objects that load a slice of elements, while grid gives you scalar integer indices. It is recommended to use tile in most cases, since it allows more choices in autotuning.
- Parameters:
  - begin_or_end (int | Tensor | Sequence[int | Tensor]) – If 2+ positional args provided, the start of the iteration space. Otherwise, the end of the iteration space.
  - end_or_none (int | Tensor | Sequence[int | Tensor] | None) – If 2+ positional args provided, the end of the iteration space.
  - block_size (object) – Fixed block size (overrides autotuning) or None for an autotuned size.
- Returns:
Iterator over tile objects
- Return type:
Examples
One-dimensional tiling:

```python
import torch
import helion
import helion.language as hl

@helion.kernel
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for tile in hl.tile(x.size(0)):
        # tile processes multiple elements at once
        result[tile] = x[tile] + y[tile]
    return result
```
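For contrast with grid(), here is a rough sketch of the same elementwise add written over a 2-D tensor, where the outer hl.grid() loop yields scalar row indices and the inner hl.tile() loop yields Tile objects (the kernel body is illustrative, not part of the original examples):

```python
# Sketch contrasting hl.grid() with hl.tile(): grid yields scalar
# integer indices, tile yields Tile objects covering a block of columns.
# (imports as above: torch, helion, helion.language as hl)
@helion.kernel
def add_rows(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for i in hl.grid(x.size(0)):          # i is a plain integer row index
        for tile in hl.tile(x.size(1)):   # tile covers a slice of columns
            result[i, tile] = x[i, tile] + y[i, tile]
    return result
```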
Multi-dimensional tiling:
```python
@helion.kernel()
def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    k, n = y.size()
    out = torch.empty([m, n], dtype=x.dtype, device=x.device)
    # The outer tile loop becomes the launch grid; the inner loop runs
    # inside the generated kernel.
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
        out[tile_m, tile_n] = acc
    return out
```
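A helion kernel is then called like a regular function; the shapes and device below are arbitrary illustrations:

```python
# Illustrative usage of the matmul kernel defined above.
x = torch.randn(512, 256, device="cuda")
y = torch.randn(256, 128, device="cuda")
out = matmul(x, y)  # the first call may trigger autotuning unless a config was given
```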
Fixed block size:
```python
@helion.kernel
def process_with_fixed_block(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for tile in hl.tile(x.size(0), block_size=64):
        # Process with a fixed block size of 64
        result[tile] = x[tile] * 2
    return result
```
Using tile properties:
```python
@helion.kernel
def tile_info_example(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros([x.size(0)], dtype=x.dtype, device=x.device)
    for tile in hl.tile(x.size(0)):
        # Access tile properties
        start = tile.begin
        end = tile.end
        size = tile.block_size
        indices = tile.index  # [start, start+1, ..., end-1]
        # Use in computation
        result[tile] = x[tile] + indices
    return result
```
See also
- grid(): For explicit control over the launch grid
- tile_index(): For getting tile indices
- register_block_size(): For registering block sizes
Note
Similar to range() with multiple forms:
- tile(end) iterates 0 to end-1, autotuned block_size
- tile(begin, end) iterates begin to end-1, autotuned block_size
- tile(begin, end, block_size) iterates begin to end-1, fixed block_size
- tile(end, block_size=block_size) iterates 0 to end-1, fixed block_size
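A minimal sketch of the tile(begin, end) form; the offset of 1 is purely illustrative:

```python
# Sketch of tile(begin, end): iterate indices 1 .. n-1, leaving index 0
# at zero. (imports as above: torch, helion, helion.language as hl)
@helion.kernel
def scale_tail(x: torch.Tensor) -> torch.Tensor:
    out = torch.zeros_like(x)
    for tile in hl.tile(1, x.size(0)):
        out[tile] = x[tile] * 2
    return out
```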
Block sizes can be registered for autotuning explicitly with register_block_size() and passed as the block_size argument if one needs two loops to use the same block size (see the sketch after this note). Passing block_size=None is equivalent to calling register_block_size().
Use tile in most cases. Use grid when you need explicit control over the launch grid.
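As a rough sketch of sharing one block size across two loops, assuming register_block_size(size) returns a handle that tile() accepts via block_size= (as the note above describes; the exact signature is an assumption):

```python
# Sketch: two loops constrained to the same (autotuned) block size.
# Assumes hl.register_block_size(n) returns a block-size handle usable
# as the block_size argument. (imports as above)
@helion.kernel
def two_pass(x: torch.Tensor) -> torch.Tensor:
    out = torch.zeros_like(x)
    block = hl.register_block_size(x.size(0))
    for tile in hl.tile(x.size(0), block_size=block):
        out[tile] = x[tile] * 2
    for tile in hl.tile(x.size(0), block_size=block):
        out[tile] = out[tile] + 1
    return out
```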