11 comments

  • Lerc 2 hours ago

    There's a two-hour video on YouTube whose second half talks about creating embeddings using some pre-transforms followed by SVD, with some distance shenanigans:

    https://www.youtube.com/watch?v=Z6s7PrfJlQ0&t=3084s

    It's 4 years old and seems to be a bit of a hidden gem. Someone even pipes up at 1:26 to say "This is really cool. Is this written up somewhere?"

    [snapshot of the code shown]

        %%time
        # Imports assumed by the snippet (tokenized_news is the
        # pre-tokenized corpus from the video).
        import numpy as np
        import scipy.sparse
        import scipy.sparse.linalg
        import sklearn.preprocessing
        import vectorizers

        # Count co-occurrences in a 20-token window after each token,
        # down-weighting distant tokens with a harmonic kernel.
        cooc = vectorizers.TokenCooccurrenceVectorizer(
            window_orientation="after",
            kernel_function="harmonic",
            min_document_occurrences=5,
            window_radius=20,
        ).fit(tokenized_news)

        context_after_matrix = cooc.transform(tokenized_news)
        context_before_matrix = context_after_matrix.transpose()

        # Stack "before" and "after" contexts, then normalize and damp the counts.
        cooc_matrix = scipy.sparse.hstack([context_before_matrix, context_after_matrix])
        cooc_matrix = sklearn.preprocessing.normalize(cooc_matrix, norm="max", axis=0)
        cooc_matrix = sklearn.preprocessing.normalize(cooc_matrix, norm="l1", axis=1)
        cooc_matrix.data = np.power(cooc_matrix.data, 0.25)

        # Truncated SVD; scale the left singular vectors by the square root
        # of the singular values to get 160-dimensional word embeddings.
        u, s, v = scipy.sparse.linalg.svds(cooc_matrix, k=160)
        word_vectors = u @ scipy.sparse.diags(np.sqrt(s))
    
    
    CPU times: user 3min 5s, sys: 20.2 s, total: 3min 25s

    Wall time: 1min 26s
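
    A quick way to sanity-check the result is a cosine-similarity lookup over the rows of word_vectors. A minimal sketch, assuming the fitted vectorizer exposes a token_label_dictionary_ attribute mapping tokens to row indices (as in the vectorizers package docs); most_similar is just an illustrative helper and "market" a placeholder query:

        import numpy as np

        def most_similar(word, vectors, vocab, n=10):
            """Tokens whose vectors have the highest cosine similarity to `word`."""
            inv_vocab = {i: tok for tok, i in vocab.items()}
            vecs = np.asarray(vectors)  # u @ diags(...) may come back as np.matrix
            vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
            sims = vecs @ vecs[vocab[word]]
            best = np.argsort(-sims)[1 : n + 1]  # index 0 is the query word itself
            return [(inv_vocab[i], float(sims[i])) for i in best]

        print(most_similar("market", word_vectors, cooc.token_label_dictionary_))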

    • nighthawk454 2 hours ago

      That’s Leland McInnes, author of UMAP, the widely used dimensionality-reduction tool.

      • Lerc 2 hours ago

        I know; I mentioned his name in a post last week and figured doing so again might seem a bit fanboy-ish. I am kind of a fan, but mostly a fan of good explanations. He's just self-selecting into that group.

  • chaps 9 hours ago

    To the authors: Please expand your acronyms at least once! I had to stop reading to figure out what "KSVD" stands for.

    Learning what it stands for* wasn't particularly helpful in this case, but defining the term would've kept me on your page.

    *K-Singular Value Decomposition

    • jmount 7 hours ago

      Strongly agree. I even searched the page to make sure I wasn't missing it. Sure, "SVD" is likely singular value decomposition, but in this context you have other acronyms bouncing around your head (like support vector machine; you just need to get rid of the m).

    • JSteph22 7 hours ago

      I'm surprised the authors completely abandoned the standard convention of expanding an acronym on first use.

  • sdenton4 2 hours ago

    This is great, and very relevant to some problems I've been sketching out on whiteboards lately. Exceptionally well timed.

  • snovv_crash 7 hours ago

    Basically find the primary eigenvectors.

    • sdenton4 2 hours ago

      It's not, though...

      In sparse coding, you're generally using an over-complete set of vectors that decomposes the data into sparse activations.

      So, if you have a dataset of hundred-dimensional vectors, you want to find a set of "basis" vectors such that each data vector is well described as a combination of ~4 of them.
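
      To make that concrete, here's a minimal sketch using scikit-learn's MiniBatchDictionaryLearning (an off-the-shelf sparse coder, not the article's KSVD, though both learn a dictionary); all the sizes are made up for illustration:

          import numpy as np
          from sklearn.decomposition import MiniBatchDictionaryLearning

          rng = np.random.default_rng(0)

          # Synthetic data: 2000 hundred-dimensional vectors, each generated
          # from ~4 atoms of a hidden dictionary, so a sparse code exists.
          true_dict = rng.standard_normal((400, 100))
          codes = np.zeros((2000, 400))
          for row in codes:
              row[rng.choice(400, size=4, replace=False)] = rng.standard_normal(4)
          X = codes @ true_dict

          # Learn an over-complete dictionary (400 atoms >> 100 dims) and code
          # each sample with at most 4 non-zero coefficients.
          coder = MiniBatchDictionaryLearning(
              n_components=400,
              transform_algorithm="omp",    # orthogonal matching pursuit
              transform_n_nonzero_coefs=4,
              random_state=0,
          )
          sparse_codes = coder.fit(X).transform(X)

          print(sparse_codes.shape)                      # (2000, 400)
          print((sparse_codes != 0).sum(axis=1).mean())  # ~4 active atoms per row

      Contrast with finding eigenvectors: PCA would give you at most 100 orthogonal directions with dense coefficients, while here the dictionary is larger than the ambient dimension and each sample activates only a handful of atoms.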