
  • el_dockerr 2 months ago

    Hi HN, author here.

    I've been working on optimizing perception pipelines for SWaP-constrained FPGAs (like in satellite or drone payloads). I realized that we often run out of DSP slices even for simple 3x3 convolutions.

    I implemented a method to approximate these convolutions by learning coefficients that map strictly to Powers-of-Two (PoT). This allows replacing the constant multipliers with bit-shifts and adders (LUTs).
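    To make the idea concrete, here is a minimal C sketch (my own illustration, not the author's implementation; the `pot_coeff` struct and `pot_dot3` names are hypothetical) of a 3-tap dot product whose coefficients have been snapped to signed powers of two, so each constant multiply becomes a bit-shift plus an add/subtract:

    ```c
    #include <stdio.h>
    #include <stdint.h>

    /* A coefficient restricted to +/- 2^shift: the sign picks add vs.
     * subtract, the shift replaces the multiplier (no DSP slice needed). */
    typedef struct {
        int8_t  sign;   /* +1 or -1 */
        uint8_t shift;  /* coefficient magnitude is 2^shift */
    } pot_coeff;

    static int32_t pot_dot3(const int16_t x[3], const pot_coeff w[3])
    {
        int32_t acc = 0;
        for (int i = 0; i < 3; i++) {
            int32_t term = (int32_t)x[i] << w[i].shift; /* shift, not multiply */
            acc += (w[i].sign > 0) ? term : -term;
        }
        return acc;
    }

    int main(void)
    {
        /* e.g. the weight vector {+4, -2, +1} expressed as powers of two */
        const pot_coeff w[3] = { {+1, 2}, {-1, 1}, {+1, 0} };
        const int16_t   x[3] = { 10, 20, 30 };
        printf("%d\n", pot_dot3(x, w)); /* 10*4 - 20*2 + 30*1 = 30 */
        return 0;
    }
    ```

    In hardware the loop unrolls into a small shift-and-add tree, which is what lets the synthesizer map it to LUTs instead of DSP slices.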

    The results:

    Reduces DSP usage by 33% (2 multipliers instead of 3 per atomic dot-product).

    Achieves >99% SSIM on correlated images.

    The error manifests as a global DC-offset, which Batch Norm layers in CNNs can typically absorb.
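    Why Batch Norm absorbs a DC offset: BN subtracts the batch mean before scaling, so a constant added to every input cancels exactly. A small numerical check (my own sketch, assuming standard mean/variance normalization; not code from the post):

    ```c
    #include <stdio.h>
    #include <math.h>

    /* Plain batch normalization over one small batch:
     * y = (x - mean) / sqrt(var + eps) */
    static void batchnorm(const double *x, double *y, int n)
    {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < n; i++) mean += x[i];
        mean /= n;
        for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
        var /= n;
        double inv_std = 1.0 / sqrt(var + 1e-5);
        for (int i = 0; i < n; i++) y[i] = (x[i] - mean) * inv_std;
    }

    int main(void)
    {
        const double x[4] = { 1.0, 2.0, 3.0, 4.0 };
        const double dc   = 0.75;  /* stand-in for the global PoT error */
        double x_off[4], y0[4], y1[4];
        for (int i = 0; i < 4; i++) x_off[i] = x[i] + dc;

        batchnorm(x, y0, 4);
        batchnorm(x_off, y1, 4);

        double max_diff = 0.0;
        for (int i = 0; i < 4; i++) {
            double d = fabs(y0[i] - y1[i]);
            if (d > max_diff) max_diff = d;
        }
        /* The offset shifts the mean by exactly dc, so it cancels. */
        printf(max_diff < 1e-9 ? "absorbed\n" : "not absorbed\n");
        return 0;
    }
    ```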

    I wrote a blog post detailing the math and the hardware implementation. The full C-benchmark and PoC code is on GitHub (linked in the post).

    I'd love to hear from the FPGA folks here: Is this trade-off (accuracy vs. resources) something you'd use in production payloads?

    Other sources:

    [blog] https://www.dockerr.blog/blog/lowrank-hardware-approximation

    [git] https://github.com/el-dockerr/Low-Rank_Hardware_Approximatio...

    [LinkedIn] https://www.linkedin.com/in/swen-kalski-062b64299/