
  • el_dockerr 2 months ago

    Hi HN, author here.

    I've been working on optimizing perception pipelines for SWaP-constrained FPGAs (like in satellite or drone payloads). I realized that we often run out of DSP slices even for simple 3x3 convolutions.

    I implemented a method to approximate these convolutions by learning coefficients that map strictly to Powers-of-Two (PoT). This allows replacing the constant multipliers with bit-shifts and adders (LUTs).
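    To make the idea concrete, here is a minimal C sketch (my own illustration, not the author's implementation; the `pot_coeff` struct and `pot_dot3` names are hypothetical) of a 3-tap dot product whose coefficients have been snapped to signed powers of two, so each constant multiply becomes a bit-shift plus an add/subtract:

    ```c
    #include <stdio.h>
    #include <stdint.h>

    /* A coefficient restricted to +/- 2^shift: the sign picks add vs.
     * subtract, the shift replaces the multiplier (no DSP slice needed). */
    typedef struct {
        int8_t  sign;   /* +1 or -1 */
        uint8_t shift;  /* coefficient magnitude is 2^shift */
    } pot_coeff;

    static int32_t pot_dot3(const int16_t x[3], const pot_coeff w[3])
    {
        int32_t acc = 0;
        for (int i = 0; i < 3; i++) {
            int32_t term = (int32_t)x[i] << w[i].shift; /* shift, not multiply */
            acc += (w[i].sign > 0) ? term : -term;
        }
        return acc;
    }

    int main(void)
    {
        /* e.g. the weight vector {+4, -2, +1} expressed as powers of two */
        const pot_coeff w[3] = { {+1, 2}, {-1, 1}, {+1, 0} };
        const int16_t   x[3] = { 10, 20, 30 };
        printf("%d\n", pot_dot3(x, w)); /* 10*4 - 20*2 + 30*1 = 30 */
        return 0;
    }
    ```

    In hardware the loop unrolls into a small shift-and-add tree, which is what lets the synthesizer map it to LUTs instead of DSP slices.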

    The results:

    Reduces DSP usage by 33% (2 multipliers instead of 3 per atomic dot-product).

    Achieves >99% SSIM on correlated images.

    The error manifests as a global DC-offset, which Batch Norm layers in CNNs can typically absorb.
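    Why Batch Norm absorbs a DC offset: BN subtracts the batch mean before scaling, so a constant added to every input cancels exactly. A small numerical check (my own sketch, assuming standard mean/variance normalization; not code from the post):

    ```c
    #include <stdio.h>
    #include <math.h>

    /* Plain batch normalization over one small batch:
     * y = (x - mean) / sqrt(var + eps) */
    static void batchnorm(const double *x, double *y, int n)
    {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < n; i++) mean += x[i];
        mean /= n;
        for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
        var /= n;
        double inv_std = 1.0 / sqrt(var + 1e-5);
        for (int i = 0; i < n; i++) y[i] = (x[i] - mean) * inv_std;
    }

    int main(void)
    {
        const double x[4] = { 1.0, 2.0, 3.0, 4.0 };
        const double dc   = 0.75;  /* stand-in for the global PoT error */
        double x_off[4], y0[4], y1[4];
        for (int i = 0; i < 4; i++) x_off[i] = x[i] + dc;

        batchnorm(x, y0, 4);
        batchnorm(x_off, y1, 4);

        double max_diff = 0.0;
        for (int i = 0; i < 4; i++) {
            double d = fabs(y0[i] - y1[i]);
            if (d > max_diff) max_diff = d;
        }
        /* The offset shifts the mean by exactly dc, so it cancels. */
        printf(max_diff < 1e-9 ? "absorbed\n" : "not absorbed\n");
        return 0;
    }
    ```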

    I wrote a blog post detailing the math and the hardware implementation. The full C-benchmark and PoC code is on GitHub (linked in the post).

    I'd love to hear from the FPGA folks here: Is this trade-off (accuracy vs. resources) something you'd use in production payloads?

    Other sources:

    [blog] https://www.dockerr.blog/blog/lowrank-hardware-approximation

    [git] https://github.com/el-dockerr/Low-Rank_Hardware_Approximatio...

    [LinkedIn] https://www.linkedin.com/in/swen-kalski-062b64299/