This repository was archived by the owner on Feb 24, 2026. It is now read-only.

[Feature Request] LayoutInference pass should be enhanced to analyze the vectorization factor across indices #266

@LeiWang1999

Description

Currently, when we write a set of nested loops to ensure 16-byte vectorized access, the code might look like this:

for i in range(1):
    for v_3 in T.vectorized(16):
        B_shared[tx // 16, tx % 16 // 8, tx % 8 * 2 + v_3 // 8, v_3 % 8] = B[bx * 8 + tx // 16, ko * 2 + tx % 16 // 8, tx % 8 * 2 + v_3 // 8, v_3 % 8]

However, our current legalization pass transforms this into the following form:

for i, v_3 in T.grid(1, 2):
    for vec in T.vectorized(8):
        B_shared[tx // 16, tx % 16 // 8, tx % 8 * 2 + (v_3 * 8 + vec) // 8, (v_3 * 8 + vec) % 8] = B[bx * 8 + tx // 16, ko * 2 + tx % 16 // 8, tx % 8 * 2 + (v_3 * 8 + vec) // 8, (v_3 * 8 + vec) % 8]

While this transformation achieves functional correctness, it introduces additional complexity in the indexing expressions and splits the vectorized loop into smaller chunks (e.g., breaking the 16-element vectorized access into two 8-element accesses). This reduces the efficiency of vectorized memory operations and complicates the generated code.
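
To make the equivalence concrete, here is a minimal plain-Python sketch (not TileLang code) that enumerates the index tuples touched by both loop forms and checks that they match. The helper names and the thread range of 128 are assumptions for illustration only; the issue does not state the block size.

```python
# Plain-Python model of the two loop nests above; no TVM/TileLang required.
# NUM_THREADS = 128 is an assumption (tx // 16 then covers 0..7).
NUM_THREADS = 128


def original_indices(tx):
    """Index tuples written by the single T.vectorized(16) loop."""
    return [
        (tx // 16, tx % 16 // 8, tx % 8 * 2 + v_3 // 8, v_3 % 8)
        for v_3 in range(16)
    ]


def legalized_indices(tx):
    """Index tuples written by the split T.grid(1, 2) x T.vectorized(8) form."""
    return [
        (
            tx // 16,
            tx % 16 // 8,
            tx % 8 * 2 + (v_3 * 8 + vec) // 8,
            (v_3 * 8 + vec) % 8,
        )
        for v_3 in range(2)
        for vec in range(8)
    ]


def forms_agree():
    """True if both forms visit the same elements in the same order."""
    return all(
        original_indices(tx) == legalized_indices(tx)
        for tx in range(NUM_THREADS)
    )
```

Since `(v_3 * 8 + vec) // 8 == v_3` and `(v_3 * 8 + vec) % 8 == vec` for `vec` in `range(8)`, the split form addresses exactly the same elements and only adds arithmetic to the index expressions.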

Proposed Enhancement:
To address this, the legalization pass should be enhanced to maintain the original vectorized structure and ensure that the indexing expressions remain as simple as possible. Specifically:
1. Preserve Single-Level Vectorization: Instead of breaking the 16-element vectorized loop into smaller subloops (e.g., two 8-element loops), the pass should retain the original T.vectorized(16) loop where possible.
2. Simplify Index Calculations: The pass should avoid introducing complex expressions like (v_3 * 8 + vec) for computing indices. Instead, it should aim to directly map the v_3 indices to the original structure (e.g., v_3 // 8 and v_3 % 8).
3. Optimize Performance: By preserving the larger vectorized loop and avoiding unnecessary transformations, the pass can generate more efficient, hardware-friendly code that takes better advantage of vectorized memory access.
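
The following plain-Python sketch illustrates why analyzing the vectorization factor across indices, rather than on the last index alone, could preserve the 16-wide loop. It is not the pass implementation. The buffer shape `(8, 2, 16, 8)`, the row-major layout, and all helper names are assumptions made for illustration.

```python
# Hypothetical shape of B_shared, assumed compact and row-major.
SHAPE = (8, 2, 16, 8)


def b_shared_indices(tx, v):
    """Index tuple of B_shared for thread tx and vector lane v."""
    return (tx // 16, tx % 16 // 8, tx % 8 * 2 + v // 8, v % 8)


def flat_offset(indices, shape):
    """Row-major linear offset of a multi-dimensional index."""
    offset = 0
    for idx, extent in zip(indices, shape):
        offset = offset * extent + idx
    return offset


def last_dim_factor(index_fn, tx, extent):
    """Largest power-of-two factor judged from the last index only:
    within each group, the last index must advance by 1 per lane and
    all other indices must stay fixed."""
    factor = extent
    while factor > 1:
        ok = all(
            index_fn(tx, base + k)[-1] == index_fn(tx, base)[-1] + k
            and index_fn(tx, base + k)[:-1] == index_fn(tx, base)[:-1]
            for base in range(0, extent, factor)
            for k in range(factor)
        )
        if ok:
            return factor
        factor //= 2
    return 1


def cross_index_factor(index_fn, shape, tx, extent):
    """Largest power-of-two factor judged on the flattened offset:
    within each group, the linear offset must advance by 1 per lane."""
    factor = extent
    while factor > 1:
        ok = all(
            flat_offset(index_fn(tx, base + k), shape)
            == flat_offset(index_fn(tx, base), shape) + k
            for base in range(0, extent, factor)
            for k in range(factor)
        )
        if ok:
            return factor
        factor //= 2
    return 1
```

Under these assumptions, `last_dim_factor(b_shared_indices, tx, 16)` yields 8, because `v_3 % 8` wraps after eight lanes, while `cross_index_factor(b_shared_indices, SHAPE, tx, 16)` yields 16, because the carry into `v_3 // 8` lands on the next row of a last dimension of extent 8 and the flattened offset stays contiguous.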
