Attention Residuals

(arxiv.org)

2 points | by djhemath 8 hours ago

1 comment

  • djhemath 8 hours ago

    This paper by the Kimi team describes a way to add more depth to a model without losing information/context. Although it improves efficiency by just over 1%, at scale the total savings could reach millions. Alternatively, it would let us build models with more layers for the same cost as today.
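
    The paper's actual mechanism isn't described in this thread, but the general idea of residual connections preserving the input signal as depth grows can be sketched in a few lines. This is a generic illustration, not the paper's method; `toy_attention` and all weight shapes are invented for the example:

    ```python
    import numpy as np

    def toy_attention(x, w):
        # Toy single-head self-attention; stands in for any
        # sublayer that transforms the sequence.
        scores = x @ x.T / np.sqrt(x.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return (weights @ x) @ w

    def block_with_residual(x, w):
        # Residual connection: the input is carried forward unchanged,
        # so stacking many blocks cannot erase the original signal.
        return x + toy_attention(x, w)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))

    # With zero sublayer weights, a residual block reduces to the
    # identity, so information survives no matter how deep the stack:
    w_zero = np.zeros((8, 8))
    out = x
    for _ in range(50):
        out = block_with_residual(out, w_zero)
    assert np.allclose(out, x)
    ```

    Without the `x +` term, the same 50-layer stack would collapse the input entirely, which is why residual-style connections are what make very deep stacks trainable at all.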