I misread this as if "there is no non-linearity". There is still non-linearity; it is just renamed and reshuffled into new operators. Basically, it's renaming apples into oranges.
Well, it's more like fruits and vegetables. The author proposed a normalized inner product as a replacement for the standard inner product.
It's not an activation function, because it combines the learnable weights of a linear projection (matrix-vector multiplication) and the clamping properties of an activation function all in one.
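The exact formula isn't given in this thread, so purely to illustrate the idea, here is a minimal NumPy sketch of a unit whose output is a distance-normalized inner product: learnable weight rows like a linear layer, with the squashing built into the operator instead of a separate activation. The name yat_neuron and the specific normalization are my assumptions, not necessarily the paper's definition.

    import numpy as np

    def yat_neuron(x, W, eps=1e-6):
        # Hypothetical "normalized inner product" unit:
        # each output is the squared dot product of the input with a weight
        # row, scaled down by their squared Euclidean distance. The weights
        # stay learnable (like a linear projection) while the normalization
        # bounds the response (like an activation), so the non-linearity
        # lives inside the operator itself.
        dots = W @ x                                # <w_i, x> for every row w_i
        dists = np.sum((W - x) ** 2, axis=1) + eps  # ||w_i - x||^2 + eps
        return dots ** 2 / dists                    # non-linear in x, no separate activation

    # Toy usage: 4 input features, 3 output units.
    rng = np.random.default_rng(0)
    x = rng.normal(size=4)
    W = rng.normal(size=(3, 4))
    print(yat_neuron(x, W))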
My personal issue with the proposal is that it essentially doubles the amount of memory needed on-chip.
A Yat-product GEMV kernel now needs to keep running totals of both the inner product and the norm of the input vectors. That's a big cost increase for something that might not improve performance all that much.
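To make that cost concrete, here's a rough Python sketch of one output row (my own illustration, not the actual kernel): a plain GEMV row carries a single running accumulator, whereas a fused normalized-inner-product row has to keep a second accumulator for the norm term alive at the same time, roughly doubling the per-row on-chip state.

    def gemv_row_standard(w_row, x):
        # Plain GEMV: one running accumulator per output row.
        acc = 0.0
        for wi, xi in zip(w_row, x):
            acc += wi * xi
        return acc

    def gemv_row_normalized(w_row, x, eps=1e-6):
        # Fused normalized-inner-product row (illustrative only): two
        # accumulators must be live simultaneously, one for the dot
        # product and one for the norm term.
        acc_dot = 0.0
        acc_norm = 0.0
        for wi, xi in zip(w_row, x):
            acc_dot += wi * xi
            acc_norm += (wi - xi) ** 2
        return acc_dot ** 2 / (acc_norm + eps)

    x = [0.5, -1.0, 2.0]
    w = [1.0, 0.0, -0.5]
    print(gemv_row_standard(w, x), gemv_row_normalized(w, x))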
Hello everyone, I'm Taha.
I was able to create a new kernel that lets you learn non-linearity without using activation functions, making the models white-box and avoiding any information loss.
MiniGPT with huggingface datasets streaming: https://www.kaggle.com/code/skywolfmo/yat-nnx-minigpt-finewe...
Why would one be motivated not to use activation functions?
To my knowledge they’re a negligible portion of the total compute during training or inference and work well to provide non-linearity.
Very open to learning more.
Less information loss -> fewer params? Please correct me if I got this wrong. The intro claims:
"The dot product itself is a geometrically impoverished measure, primarily capturing alignment while conflating magnitude with direction and often obscuring more complex structural and spatial relationships [10, 11, 4, 61, 17]. Furthermore, the way current activation functions achieve non-linearity can exacerbate this issue. For instance, ReLU (f (x) = max(0, x)) maps all negative pre-activations, which can signify a spectrum of relationships from weak dissimilarity to strong anti-alignment, to a single zero output. This thresholding, while promoting sparsity, means the network treats diverse inputs as uniformly orthogonal or linearly independent for onward signal propagation. Such a coarse-graining of geometric relationships leads to a tangible loss of information regarding the degree and nature of anti-alignment or other neg- ative linear dependencies. This information loss, coupled with the inherent limitations of the dot product, highlights a fundamental challenge."
One reason might be expressing the constructs in a different domain, e.g. homomorphically encrypted evaluators.