11 comments

  • Lerc 2 hours ago

    There's a two-hour video on YouTube whose second half talks about creating embeddings using some pre-transforms followed by SVD, with some distance shenanigans:

    https://www.youtube.com/watch?v=Z6s7PrfJlQ0&t=3084s

    It's 4 years old and seems to be a bit of a hidden gem. Someone even pipes up at 1:26 to say "This is really cool. Is this written up somewhere?"

    [snapshot of the code shown]

        %%time
        # Imports assumed by the snippet (tokenized_news is the
        # pre-tokenized corpus from the video).
        import numpy as np
        import scipy.sparse
        import scipy.sparse.linalg
        import sklearn.preprocessing
        import vectorizers

        # Count co-occurrences in a 20-token window after each token,
        # down-weighting distant tokens with a harmonic kernel.
        cooc = vectorizers.TokenCooccurrenceVectorizer(
            window_orientation="after",
            kernel_function="harmonic",
            min_document_occurrences=5,
            window_radius=20,
        ).fit(tokenized_news)

        context_after_matrix = cooc.transform(tokenized_news)
        context_before_matrix = context_after_matrix.transpose()

        # Stack "before" and "after" contexts, then normalize and damp the counts.
        cooc_matrix = scipy.sparse.hstack([context_before_matrix, context_after_matrix])
        cooc_matrix = sklearn.preprocessing.normalize(cooc_matrix, norm="max", axis=0)
        cooc_matrix = sklearn.preprocessing.normalize(cooc_matrix, norm="l1", axis=1)
        cooc_matrix.data = np.power(cooc_matrix.data, 0.25)

        # Truncated SVD; scale the left singular vectors by the square root
        # of the singular values to get 160-dimensional word embeddings.
        u, s, v = scipy.sparse.linalg.svds(cooc_matrix, k=160)
        word_vectors = u @ scipy.sparse.diags(np.sqrt(s))
    
    
    CPU times: user 3min 5s, sys: 20.2 s, total: 3min 25s

    Wall time: 1min 26s
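
    A quick way to sanity-check the result is a cosine-similarity lookup over the rows of word_vectors. A minimal sketch, assuming the fitted vectorizer exposes a token_label_dictionary_ attribute mapping tokens to row indices (as in the vectorizers package docs); most_similar is just an illustrative helper and "market" a placeholder query:

        import numpy as np

        def most_similar(word, vectors, vocab, n=10):
            """Tokens whose vectors have the highest cosine similarity to `word`."""
            inv_vocab = {i: tok for tok, i in vocab.items()}
            vecs = np.asarray(vectors)  # u @ diags(...) may come back as np.matrix
            vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
            sims = vecs @ vecs[vocab[word]]
            best = np.argsort(-sims)[1 : n + 1]  # index 0 is the query word itself
            return [(inv_vocab[i], float(sims[i])) for i in best]

        print(most_similar("market", word_vectors, cooc.token_label_dictionary_))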

    • nighthawk454 2 hours ago

      That’s Leland McInnes, author of UMAP, the widely used dimensionality-reduction tool.

      • Lerc 2 hours ago

        I know; I mentioned his name in a post last week and figured doing so again might seem a bit fanboy-ish. I am kind of a fan, but mostly a fan of good explanations. He's just self-selecting into that group.

  • chaps 9 hours ago

    To the authors: Please expand your acronyms at least once! I had to stop reading to figure out what "KSVD" stands for.

    Learning what it stands for* wasn't particularly helpful in this case, but defining the term would've kept me on your page.

    *K-Singular Value Decomposition

    • jmount 7 hours ago

      Strongly agree. I even searched the page to make sure I wasn't missing it. Sure, "SVD" is likely singular value decomposition, but in this context you have other acronyms bouncing around your head (like support vector machine; you just need to get rid of the m).

    • JSteph22 7 hours ago

      I'm surprised the authors completely abandoned the standard convention of expanding an acronym on first use.

  • sdenton4 2 hours ago

    This is great, and very relevant to some problems I've been sketching out on whiteboards lately. Exceptionally well timed.

  • snovv_crash 7 hours ago

    Basically find the primary eigenvectors.

    • sdenton4 2 hours ago

      It's not, though...

      In sparse coding, you're generally using an over-complete set of vectors that decomposes the data into sparse activations.

      So, if you have a dataset of hundred-dimensional vectors, you want to find a set of "basis" vectors such that each data vector is well described as a combination of ~4 of them.
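
      To make that concrete, here's a minimal sketch using scikit-learn's MiniBatchDictionaryLearning (an off-the-shelf sparse coder, not the article's KSVD, though both learn a dictionary); all the sizes are made up for illustration:

          import numpy as np
          from sklearn.decomposition import MiniBatchDictionaryLearning

          rng = np.random.default_rng(0)

          # Synthetic data: 2000 hundred-dimensional vectors, each generated
          # from ~4 atoms of a hidden dictionary, so a sparse code exists.
          true_dict = rng.standard_normal((400, 100))
          codes = np.zeros((2000, 400))
          for row in codes:
              row[rng.choice(400, size=4, replace=False)] = rng.standard_normal(4)
          X = codes @ true_dict

          # Learn an over-complete dictionary (400 atoms >> 100 dims) and code
          # each sample with at most 4 non-zero coefficients.
          coder = MiniBatchDictionaryLearning(
              n_components=400,
              transform_algorithm="omp",    # orthogonal matching pursuit
              transform_n_nonzero_coefs=4,
              random_state=0,
          )
          sparse_codes = coder.fit(X).transform(X)

          print(sparse_codes.shape)                      # (2000, 400)
          print((sparse_codes != 0).sum(axis=1).mean())  # ~4 active atoms per row

      Contrast with finding eigenvectors: PCA would give you at most 100 orthogonal directions with dense coefficients, while here the dictionary is larger than the ambient dimension and each sample activates only a handful of atoms.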