CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

(arxiv.org)

36 points | by matt_d 2 hours ago

2 comments

rahen 34 minutes ago

Strictly speaking, this is very domain-specific and doesn't enable any performance that Triton couldn't already achieve (eliminating global memory round-trips via epilogue fusion is nothing new). The real takeaway is the design shift: targeting LLM-driven codegen rather than handcrafted kernels.
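For anyone unfamiliar with epilogue fusion, here is a toy pure-Python sketch of the idea. It is not CODA's or Triton's API, and every name in it is hypothetical. Lists stand in for global memory and `acc` stands in for a register accumulator: the unfused path materializes the whole GEMM result and re-reads it for bias + ReLU, while the fused path applies the epilogue to each element before its only store.

```python
# Toy model only: lists play the role of global memory, `acc` the role of a register.
# All names here are hypothetical illustrations, not any library's API.

def relu(x):
    return x if x > 0.0 else 0.0

def gemm_unfused(a, b, bias):
    m, k, n = len(a), len(b), len(b[0])
    # Pass 1: plain GEMM, full result written out to an intermediate buffer.
    tmp = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
           for i in range(m)]
    # Pass 2: a separate elementwise step re-reads that buffer.
    return [[relu(tmp[i][j] + bias[j]) for j in range(n)] for i in range(m)]

def gemm_fused(a, b, bias, epilogue):
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Epilogue runs on the accumulator, so no intermediate buffer exists.
            out[i][j] = epilogue(acc, j)
    return out

def bias_relu(bias):
    # Returns an epilogue taking (accumulator value, output column index).
    return lambda acc, col: relu(acc + bias[col])

A = [[1.0, -2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
BIAS = [0.5, -10.0]
```

Both paths compute the same values; the difference is only whether the intermediate matrix is ever written out, which is the round-trip a real fused kernel avoids.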

LLMs are still bad at low-level hardware optimizations, but really good at high-level composition. Designing compiler abstractions with a restricted, composable API so an LLM can easily glue expert-written blocks together is a smart move. I suspect this will eventually become the norm for code generators as we move to agentic development.
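A rough sketch of what "restricted, composable API" could look like in practice, assuming a made-up vocabulary of three epilogue ops (the registry, op names, and `compose_epilogue` are all hypothetical, not taken from the paper). The model only emits a list of op names; anything outside the expert-written menu is rejected up front.

```python
import math

# Hypothetical fixed vocabulary of expert-written epilogue blocks.
EPILOGUE_OPS = {
    "scale_half": lambda x: 0.5 * x,
    "relu": lambda x: max(x, 0.0),
    "tanh": math.tanh,
}

def compose_epilogue(names):
    """Build one epilogue from a list of op names; reject anything off-menu."""
    unknown = [n for n in names if n not in EPILOGUE_OPS]
    if unknown:
        raise ValueError(f"unsupported epilogue ops: {unknown}")
    ops = [EPILOGUE_OPS[n] for n in names]

    def epilogue(x):
        # Apply the ops left to right on a single accumulator value.
        for op in ops:
            x = op(x)
        return x

    return epilogue

# What a code-generating model would emit: a short program over the vocabulary.
program = ["scale_half", "relu"]
epi = compose_epilogue(program)
```

The point of the design is that the search space is tiny and every composition is valid by construction, so the model never has to get the low-level details right.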

sroussey 5 minutes ago

I imagine this is already how AI-assisted hardware layout is done.