Currently, when we write a set of nested loops to ensure 16-byte vectorized access, the code might look like this:
for i in range(1):
    for v_3 in T.vectorized(16):
        B_shared[tx // 16, tx % 16 // 8, tx % 8 * 2 + v_3 // 8, v_3 % 8] = B[bx * 8 + tx // 16, ko * 2 + tx % 16 // 8, tx % 8 * 2 + v_3 // 8, v_3 % 8]
However, our current legalization pass transforms this into the following form:
for i, v_3 in T.grid(1, 2):
    for vec in T.vectorized(8):
        B_shared[tx // 16, tx % 16 // 8, tx % 8 * 2 + (v_3 * 8 + vec) // 8, (v_3 * 8 + vec) % 8] = B[bx * 8 + tx // 16, ko * 2 + tx % 16 // 8, tx % 8 * 2 + (v_3 * 8 + vec) // 8, (v_3 * 8 + vec) % 8]
This transformation preserves functional correctness, but it complicates the indexing expressions and splits the vectorized loop into smaller chunks: here, the 16-element vectorized access becomes two 8-element accesses. This reduces the efficiency of vectorized memory operations and complicates the generated code.
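One way to check the equivalence claim is to enumerate the indices that both loop nests touch. The sketch below is plain Python, not TVM code. The helper names and the assumed range of tx (0 to 127, inferred from the tx // 16 term selecting one of 8 rows) are illustrative assumptions and are not taken from the pass itself.

```python
# Last two index dimensions of the buffer access, as written in the original loop.
def original_indices(tx):
    return [(tx % 8 * 2 + v_3 // 8, v_3 % 8) for v_3 in range(16)]


# The same two dimensions after legalization: outer v_3 in [0, 2), inner vec in [0, 8).
def legalized_indices(tx):
    return [
        (tx % 8 * 2 + (v_3 * 8 + vec) // 8, (v_3 * 8 + vec) % 8)
        for v_3 in range(2)
        for vec in range(8)
    ]


# Assumed thread range: tx // 16 selects one of 8 rows, so tx is taken to span [0, 128).
THREADS = 128

# Compare both the set of indices and their order for every assumed thread.
same_everywhere = all(
    original_indices(tx) == legalized_indices(tx) for tx in range(THREADS)
)
print(same_everywhere)
```

Both forms visit the same elements in the same order, so the difference between them lies only in loop structure and expression complexity, not in what is copied.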
Proposed Enhancement:
To address this, the legalization pass should be enhanced to maintain the original vectorized structure and ensure that the indexing expressions remain as simple as possible. Specifically:
1. Preserve Single-Level Vectorization: Instead of breaking the 16-element vectorized loop into smaller subloops (e.g., two 8-element loops), the pass should retain the original T.vectorized(16) loop where possible.
2. Simplify Index Calculations: The pass should avoid introducing complex expressions like (v_3 * 8 + vec) for computing indices. Instead, it should aim to directly map the v_3 indices to the original structure (e.g., v_3 // 8 and v_3 % 8).
3. Optimize Performance: By preserving the larger vectorized loop and avoiding unnecessary transformations, the pass can generate more efficient, hardware-friendly code that takes better advantage of vectorized memory access.
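For the index-simplification point, it may help to note that even where a split cannot be avoided, the split expressions still fold to simple terms: since 0 <= vec < 8, (v_3 * 8 + vec) // 8 equals v_3 and (v_3 * 8 + vec) % 8 equals vec. The plain-Python sketch below checks that identity exhaustively. The function name is hypothetical, and the code does not use or model any TVM API.

```python
def split_index_folds(factor, outer_extent):
    """Check that (outer * factor + inner) // factor == outer
    and (outer * factor + inner) % factor == inner for all loop values."""
    for outer in range(outer_extent):
        for inner in range(factor):
            fused = outer * factor + inner
            if fused // factor != outer or fused % factor != inner:
                return False
    return True


# The case from the example above: v_3 in [0, 2), vec in [0, 8).
print(split_index_folds(factor=8, outer_extent=2))
```

Under this identity, the last two indices of the legalized access could be written as tx % 8 * 2 + v_3 and vec, which is no more complex than the original form.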