List of fused kernels in the transformer, written in Triton
Performance: 7% improvement over the torch kernel
The difference between the black line and the red line is a change in the block size of the GPU kernel
like this part in attention
- ffn2: working
- ffn2 + residual + norm
- linear + softmax
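
As a rough illustration of what the "ffn2 + residual + norm" fusion computes, here is a NumPy reference sketch. It is not the Triton kernel itself. It assumes that ffn2 is the second linear layer of the feed-forward block and that the norm is a layer norm; all function and parameter names (`unfused`, `fused_blocked`, `block_rows`) are hypothetical. The fused version handles one block of rows at a time and does all three steps on that block before moving on, which is roughly how a block size parameter splits the work in a GPU kernel.

```python
import numpy as np


def unfused(h, w2, b2, residual, gamma, beta, eps=1e-5):
    """Three separate steps, each producing a full intermediate array."""
    y = h @ w2 + b2                      # ffn2 (assumed: second FFN linear layer)
    z = y + residual                     # residual add
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps) * gamma + beta  # layer norm (assumed)


def fused_blocked(h, w2, b2, residual, gamma, beta, block_rows=16, eps=1e-5):
    """Same math, done block by block without full-size intermediates.

    block_rows is a hypothetical stand-in for the kernel block size:
    it changes how the work is split, not the result.
    """
    out = np.empty_like(residual)
    for start in range(0, h.shape[0], block_rows):
        rows = slice(start, start + block_rows)
        z = h[rows] @ w2 + b2 + residual[rows]          # linear + residual
        mean = z.mean(axis=-1, keepdims=True)
        var = ((z - mean) ** 2).mean(axis=-1, keepdims=True)
        out[rows] = (z - mean) / np.sqrt(var + eps) * gamma + beta
    return out
```

Changing `block_rows` should leave the output unchanged up to floating-point rounding. On a real GPU the block size affects only speed, which would be consistent with the two lines in the performance plot differing only in block size.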

