
[Backend] Improve dot support to target FMA #4516

Open
wants to merge 13 commits into base: main
Conversation

binarman (Contributor) commented on Aug 14, 2024

This PR:

  • Refactors FMA dot implementation
  • Supports dot3d in FMA path
  • Fixes several issues in operand offset computation
  • Enables small dot operands

This PR is part of a PR series. The final goal is to improve the efficiency of small dot operations and bypass as many shared memory accesses as possible.
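To make the "small dot operands" item concrete, below is a minimal sketch (not taken from this PR or its test suite) of the kind of kernel the FMA path is meant to serve: a `tl.dot` whose K dimension is below the usual tensor-core minimum. The shapes, kernel name, and the assumption that such a dot is lowered through the FMA path are illustrative only; on Triton builds without this change the frontend may reject such a small K.

```python
# Illustrative sketch of a "small dot" workload (assumed shapes/thresholds,
# not values taken from the Triton source or this PR).
import torch
import triton
import triton.language as tl


@triton.jit
def small_dot_kernel(a_ptr, b_ptr, c_ptr,
                     M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    # Load full (small) tiles; with a tiny K the backend may lower tl.dot
    # through FMA instructions instead of MMA/WMMA.
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)
    a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])  # (M, K), row-major
    b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])  # (K, N), row-major
    c = tl.dot(a, b)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)


def run():
    M, N, K = 16, 16, 4  # small K: the operand shape this PR aims to support
    a = torch.randn((M, K), device="cuda", dtype=torch.float32)
    b = torch.randn((K, N), device="cuda", dtype=torch.float32)
    c = torch.empty((M, N), device="cuda", dtype=torch.float32)
    small_dot_kernel[(1,)](a, b, c, M=M, N=N, K=K)
    torch.testing.assert_close(c, a @ b, rtol=1e-4, atol=1e-4)


if __name__ == "__main__":
    run()
```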

Rough list of PRs:

antiagainst (Collaborator) left a comment


First batch of comments; I still need to review SharedToDotOperandFMA.cpp more carefully.

Review threads (outdated, resolved):

  • include/triton/Dialect/TritonGPU/IR/Dialect.h
  • lib/Dialect/TritonGPU/IR/Dialect.cpp
  • third_party/nvidia/backend/compiler.py
  • third_party/amd/backend/compiler.py
  • lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/FMA.cpp (two threads)
  • python/test/unit/language/test_core.py
antiagainst (Collaborator) commented on Sep 23, 2024

When addressing comments, please make sure to add new commits rather than squashing into existing ones. Otherwise it's hard to re-review.

antiagainst changed the title from "Relax dot operand constrains with FMA based dot" to "[Backend] Improve dot support to target FMA" on Sep 23, 2024
binarman (Contributor, Author) commented on Oct 2, 2024

> When addressing comments, please make sure to add new commits rather than squashing into existing ones. Otherwise it's hard to re-review.

Hmm, let me rework this with merge commits; I did not notice this comment in time.

Typically I rebase my changes on top of the main branch when I have conflicts, because that way the history looks clean and is easier to review. But I see that you prefer merge updates, so I'll continue doing it that way.

antiagainst (Collaborator) left a comment


Thanks for improving the implementation! I still have a few nits.

The change is quite extensive and I'm not super familiar with shared layout indexing and such. @zhanglx13, please take another look.

Review threads (outdated, resolved):

  • include/triton/Conversion/TritonGPUToLLVM/Utility.h
  • third_party/nvidia/backend/compiler.py
  • python/test/unit/language/test_core.py
antiagainst (Collaborator) left a comment


This now looks good to me. But please wait for @zhanglx13 to double check the logic there given it's fairly involved. I'll approve after Lixun approves.

antiagainst marked this pull request as ready for review on October 5, 2024, 04:00