Kernel
The Kernel class is the main entry point for executing Helion GPU kernels.
- class helion.Kernel(fn, *, configs=None, settings)[source]
Bases: Generic[_R]
- __init__(fn, *, configs=None, settings)[source]
Initialize the Kernel object. This is typically called from the @helion.kernel decorator.
- specialization_key(args)[source]
Generate a specialization key for the given arguments. The key is derived from the argument types via the corresponding extractor functions defined in _specialization_extractors.
- normalize_args(*args, **kwargs)[source]
Normalize the given arguments and keyword arguments according to the function signature.
- autotune(args, *, force=False, **options)[source]
Perform autotuning to find the optimal configuration for the kernel. This uses the default autotuner settings; call helion.autotune.* directly for more customization.
If config= or configs= was provided to helion.kernel(), the search is restricted to those configs. Pass force=True to ignore them.
Mutates (the bound version of) self so that __call__ will run the best config found.
- Returns:
The best configuration found during autotuning.
- Return type:
Config
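As a quick illustration, here is a minimal sketch of calling these methods directly on the vector_add kernel defined under Basic Kernel Creation below (the tensor sizes are illustrative, not prescribed):
x = torch.randn(1024, device='cuda')
y = torch.randn(1024, device='cuda')
# Inspect the cache key derived from these argument types
key = vector_add.specialization_key((x, y))
# Autotune explicitly; force=True ignores any configs passed to helion.kernel()
best = vector_add.autotune((x, y), force=True)
# Subsequent calls reuse the tuned configuration
out = vector_add(x, y)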
Overview
A Kernel object is typically created via the @helion.kernel decorator. It manages:
- Compilation of Python functions to GPU code
- Autotuning to find optimal configurations
- Caching of compiled kernels
- Execution with automatic argument binding
The kernel compilation process converts Python functions using helion.language constructs into optimized Triton GPU kernels.
Creation and Usage
Basic Kernel Creation
import torch
import helion
import helion.language as hl
@helion.kernel
def vector_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(a)
    for i in hl.grid(a.size(0)):
        result[i] = a[i] + b[i]
    return result
# Usage
a = torch.randn(1000, device='cuda')
b = torch.randn(1000, device='cuda')
c = vector_add(a, b) # Automatically compiles and executes
With Custom Settings
@helion.kernel(
    use_default_config=True,  # Skip autotuning
    print_output_code=True    # Debug generated code
)
def my_kernel(x: torch.Tensor) -> torch.Tensor:
    # Implementation
    pass
With Restricted Configurations
@helion.kernel(configs=[
    helion.Config(block_sizes=[32], num_warps=4),
    helion.Config(block_sizes=[64], num_warps=8)
])
def optimized_kernel(x: torch.Tensor) -> torch.Tensor:
    # Implementation
    pass
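With restricted configs, the first call benchmarks only the listed candidates. A minimal usage sketch (the input size is an assumption):
x = torch.randn(4096, device='cuda')
# First call autotunes over the two configs above, then caches the winner
out = optimized_kernel(x)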
BoundKernel
When you call kernel.bind(args), you get a BoundKernel that’s specialized for specific argument types and (optionally) shapes:
# Bind once, execute many times
bound = my_kernel.bind((example_tensor,))
result1 = bound(tensor1) # Compatible tensor (same dtype/device)
result2 = bound(tensor2) # Compatible tensor (same dtype/device)
# With static_shapes=True, tensors must have exactly the same shapes/strides
@helion.kernel(static_shapes=True)
def shape_specialized_kernel(x: torch.Tensor) -> torch.Tensor:
    # Implementation
    pass

bound_static = shape_specialized_kernel.bind((torch.randn(100, 50, device='cuda'),))
result = bound_static(torch.randn(100, 50, device='cuda'))  # Must be exactly [100, 50]
BoundKernel Methods
The returned BoundKernel has these methods:
- __call__(*args) - Execute with bound arguments
- autotune(args, **kwargs) - Autotune this specific binding
- set_config(config) - Set and compile a specific configuration
- to_triton_code(config) - Generate Triton source code
- compile_config(config) - Compile for a specific configuration
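A minimal sketch exercising the binding-specific methods (example_tensor and the config values are placeholders):
bound = my_kernel.bind((example_tensor,))

# Autotune just this binding; __call__ then runs the best config found
bound.autotune((example_tensor,))
result = bound(example_tensor)

# Or skip autotuning and compile a chosen config directly
bound.compile_config(helion.Config(block_sizes=[64], num_warps=8))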
Advanced Usage
Manual Autotuning
# Separate autotuning from execution
kernel = my_kernel
# Find best config
config = kernel.autotune(example_args, num_iterations=100)
# Later, use the optimized config
result = kernel(*actual_args)  # Uses the cached best config
Config Management
bound = kernel.bind(args)
# Set a specific configuration
config = helion.Config(block_sizes=[64], num_warps=8)
bound.set_config(config)

# Generate Triton code for the bound kernel
triton_code = bound.to_triton_code(config)
print(triton_code)
Caching and Specialization
Kernels are automatically cached based on:
- Argument types (dtype, device)
- Tensor shapes (when using static_shapes=True)

By default (static_shapes=False), kernels only specialize on basic shape categories (0, 1, or ≥2 per dimension) rather than exact shapes, allowing the same compiled kernel to handle different tensor sizes efficiently.
# These create separate cache entries
tensor_float = torch.randn(100, dtype=torch.float32, device='cuda')
tensor_int = torch.randint(0, 10, (100,), dtype=torch.int32, device='cuda')
result1 = my_kernel(tensor_float) # Compiles for float32
result2 = my_kernel(tensor_int) # Compiles for int32 (separate cache)
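Shape bucketing works the same way: under the default static_shapes=False, sizes that fall into the same category (≥2 here) share one compiled kernel, so the sketch below implies a single compilation:
a = torch.randn(100, device='cuda')
b = torch.randn(500, device='cuda')  # Different size, same shape category (>=2)

result1 = my_kernel(a)  # Compiles once
result2 = my_kernel(b)  # Reuses the same compiled kernel (no recompile)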
Settings vs Config in Kernel Creation
When creating kernels, you’ll work with two distinct types of parameters:
Settings: Compilation Control
Settings control how the kernel is compiled and the development environment:
import logging

@helion.kernel(
    # Settings parameters
    use_default_config=True,          # Skip autotuning for development
    print_output_code=True,           # Debug: show generated Triton code
    static_shapes=True,               # Compilation optimization strategy
    autotune_log_level=logging.DEBUG  # Verbose autotuning output
)
def debug_kernel(x: torch.Tensor) -> torch.Tensor:
    # Implementation
    pass
Config: Execution Control
Config parameters control how the kernel executes on GPU hardware:
@helion.kernel(
    # Config parameters
    config=helion.Config(
        block_sizes=[64, 128],  # GPU tile sizes
        num_warps=8,            # Thread parallelism
        num_stages=4,           # Pipeline stages
        indexing='block_ptr'    # Memory access strategy
    )
)
def production_kernel(x: torch.Tensor) -> torch.Tensor:
    # Implementation
    pass
Combined Usage
You can specify both Settings and Config together:
@helion.kernel(
    # Settings: control compilation
    print_output_code=False,  # No debug output
    static_shapes=True,       # Shape specialization
    # Config: control execution
    config=helion.Config(
        block_sizes=[32, 32],  # Execution parameters
        num_warps=4
    )
)
def optimized_kernel(x: torch.Tensor) -> torch.Tensor:
    # Implementation
    pass
For more details, see Settings (compilation control) and Config (execution control).
See Also
Settings - Compilation behavior and debugging options (controls how kernels are compiled)
Config - GPU execution parameters and optimization strategies (controls how kernels execute)
Exceptions - Exception handling and error diagnostics
Language Module - Helion language constructs for kernel authoring
Autotuner Module - Autotuning configuration and search strategies