Config

The Config class represents kernel optimization parameters that control how Helion kernels are compiled and executed.

class helion.Config(*, block_sizes=None, loop_orders=None, flatten_loops=None, l2_groupings=None, reduction_loops=None, range_unroll_factors=None, range_warp_specializes=None, range_num_stages=None, range_multi_buffers=None, range_flattens=None, static_ranges=None, num_warps=None, num_stages=None, pid_type=None, indexing=None, **kwargs)[source]

Bases: Mapping[str, object]

__init__(*, block_sizes=None, loop_orders=None, flatten_loops=None, l2_groupings=None, reduction_loops=None, range_unroll_factors=None, range_warp_specializes=None, range_num_stages=None, range_multi_buffers=None, range_flattens=None, static_ranges=None, num_warps=None, num_stages=None, pid_type=None, indexing=None, **kwargs)[source]

Initialize a Config object.

Parameters:
  • block_sizes (list[int] | None) – Controls tile sizes for hl.tile invocations.

  • loop_orders (list[list[int]] | None) – Permutes iteration order of tiles.

  • l2_groupings (list[int] | None) – Reorders program IDs for L2 cache locality.

  • reduction_loops (list[int | None] | None) – Configures reduction loop behavior.

  • range_unroll_factors (list[int] | None) – Loop unroll factors for tl.range calls.

  • range_warp_specializes (list[bool | None] | None) – Warp specialization for tl.range calls.

  • range_num_stages (list[int] | None) – Number of stages for tl.range calls.

  • range_multi_buffers (list[bool | None] | None) – Controls disallow_acc_multi_buffer for tl.range calls.

  • range_flattens (list[bool | None] | None) – Controls flatten parameter for tl.range calls.

  • static_ranges (list[bool] | None) – Whether to use tl.static_range instead of tl.range.

  • num_warps (int | None) – Number of warps per block.

  • num_stages (int | None) – Number of stages for software pipelining.

  • pid_type (Optional[Literal['flat', 'xyz', 'persistent_blocked', 'persistent_interleaved']]) – Program ID type strategy (“flat”, “xyz”, “persistent_blocked”, “persistent_interleaved”).

  • indexing (Optional[Literal['pointer', 'tensor_descriptor', 'block_ptr']]) – Indexing strategy (“pointer”, “tensor_descriptor”, “block_ptr”).

  • flatten_loops (list[bool] | None) – Whether to flatten nested loops for each hl.tile invocation.

  • **kwargs (object) – Additional user-defined configuration parameters.

config: dict[str, object]

to_json()[source]

Convert the config to a JSON string.

Return type:

str

classmethod from_json(json_str)[source]

Create a Config object from a JSON string.

Parameters:

json_str (str)

Return type:

Config

save(path)[source]

Save the config to a JSON file.

Parameters:

path (str | Path)

Return type:

None

classmethod load(path)[source]

Load a config from a JSON file.

Parameters:

path (str | Path)

Return type:

Config

property block_sizes: list[int]
property loop_orders: list[list[int]]
property flatten_loops: list[bool]
property reduction_loops: list[int | None]
property num_warps: int
property num_stages: int
property l2_groupings: list[int]
property pid_type: Literal['flat', 'xyz', 'persistent_blocked', 'persistent_interleaved']
property range_unroll_factors: list[int]
property range_warp_specializes: list[bool | None]
property range_num_stages: list[int]
property range_multi_buffers: list[bool | None]
property range_flattens: list[bool | None]
property static_ranges: list[bool]
property indexing: Literal['pointer', 'tensor_descriptor', 'block_ptr']

Overview

Config objects specify optimization parameters that control how Helion kernels run on the hardware.

Key Characteristics

  • Performance-focused: Control GPU resource allocation, memory access patterns, and execution strategies

  • Autotuned: The autotuner searches through different Config combinations to find optimal performance

  • Kernel-specific: Each kernel can have different optimal Config parameters based on its computation pattern

  • Hardware-dependent: Optimal configs vary based on GPU architecture and problem size

Config vs Settings

Aspect       | Config                            | Settings
-------------|-----------------------------------|-------------------------------------------
Purpose      | Control execution performance     | Control compilation behavior
Autotuning   | ✅ Automatically optimized         | ❌ Never autotuned
Examples     | block_sizes, num_warps, indexing  | print_output_code, use_default_config
When to use  | Performance optimization          | Development, debugging, environment setup

Configs are typically discovered automatically through autotuning, but can also be manually specified for more control.

Configuration Parameters

Block Sizes and Resources

Config.block_sizes

List of tile sizes for hl.tile() loops. Each value controls the number of elements processed per GPU thread block for the corresponding tile dimension.

Config.reduction_loops

Configuration for reduction operations within loops.

Config.num_warps

Number of warps (groups of 32 threads) per thread block. Higher values increase parallelism but may reduce occupancy.

Config.num_stages

Number of pipeline stages for software pipelining. Higher values can improve memory bandwidth utilization.
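A minimal sketch of how these resource parameters combine in a single Config; the values are illustrative only, not recommendations, and would normally be found by the autotuner for a given kernel and GPU:

```python
import helion

# Illustrative resource configuration (hypothetical values):
config = helion.Config(
    block_sizes=[128, 64],   # tile sizes for two hl.tile dimensions
    reduction_loops=[None],  # None keeps the default reduction strategy
    num_warps=4,             # 4 warps = 128 threads per block
    num_stages=3,            # 3-stage software pipeline
)
```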

Loop Optimizations

Config.loop_orders

Permutation of loop iteration order for each hl.tile() loop. Used to optimize memory access patterns.

Config.flatten_loops

Whether to flatten nested loops for each hl.tile() invocation.

Config.range_unroll_factors

Unroll factors for tl.range loops in generated Triton code.

Config.range_warp_specializes

Whether to enable warp specialization for tl.range loops.

Config.range_num_stages

Number of pipeline stages for tl.range loops.

Config.range_multi_buffers

Controls disallow_acc_multi_buffer parameter for tl.range loops.

Config.range_flattens

Controls flatten parameter for tl.range loops.

Config.static_ranges

Whether to use tl.static_range instead of tl.range.
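The range_* and static_ranges options are per-loop lists. A hedged sketch, assuming the generated Triton code contains two tl.range loops (entry order matching loop order; values illustrative):

```python
import helion

# Hypothetical per-loop tuning: each list entry corresponds to one
# tl.range loop in the generated Triton code, in order.
config = helion.Config(
    block_sizes=[64, 64],
    range_unroll_factors=[2, 1],        # unroll the first loop by 2
    range_num_stages=[3, 0],            # pipeline only the first loop
    range_multi_buffers=[None, False],  # None leaves the Triton default
    static_ranges=[False, False],       # keep ordinary tl.range loops
)
```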

Execution and Indexing

Config.pid_type

Program ID layout strategy:

  • "flat": Standard linear program ID assignment

  • "xyz": 3D program ID layout

  • "persistent_blocked": Persistent kernels with blocked work distribution

  • "persistent_interleaved": Persistent kernels with interleaved distribution

Config.l2_groupings

Controls reordering of program IDs to improve L2 cache locality.

Config.indexing

Memory indexing strategy:

  • "pointer": Pointer-based indexing

  • "tensor_descriptor": Tensor descriptor indexing

  • "block_ptr": Block pointer indexing

Usage Examples

Manual Config Creation

import torch
import helion
import helion.language as hl

# Create a specific configuration
config = helion.Config(
    block_sizes=[64, 32],      # 64 elements per tile in dim 0, 32 in dim 1
    num_warps=8,               # Use 8 warps (256 threads) per block
    num_stages=4,              # 4-stage pipeline
    pid_type="xyz"             # Use 3D program ID layout
)

# Use with kernel
@helion.kernel(config=config)
def my_kernel(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    for i, j in hl.tile(x.shape):
        result[i, j] = x[i, j] * 2
    return result

Config Serialization

# Save config to file
config.save("my_config.json")

# Load config from file
loaded_config = helion.Config.load("my_config.json")

# JSON string round-trip (to_json returns a str)
config_json = config.to_json()
restored_config = helion.Config.from_json(config_json)

Autotuning with Restricted Configs

# Restrict autotuning to specific configurations
configs = [
    helion.Config(block_sizes=[32, 32], num_warps=4),
    helion.Config(block_sizes=[64, 16], num_warps=8),
    helion.Config(block_sizes=[16, 64], num_warps=4),
]

@helion.kernel(configs=configs)
def matrix_multiply(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    m, k = a.size()
    k2, n = b.size()
    assert k == k2, f"size mismatch {k} != {k2}"
    out = torch.empty([m, n], dtype=a.dtype, device=a.device)

    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            acc = torch.addmm(acc, a[tile_m, tile_k], b[tile_k, tile_n])
        out[tile_m, tile_n] = acc

    return out

See Also