Don't Force Your LLM to Write Terse [Q/Kdb] Code: An Information Theory Argument

(medium.com)

53 points | by gabiteodoru 7 days ago ago

25 comments

This approach of solving a problem by building a low-perplexity path towards the solution reminds me of Grothendieck's approach towards solving complex mathematical problems - you gradually build a theory which eventually makes the problem obvious.

https://ncatlab.org/nlab/show/The+Rising+Sea

[-]

nextaccountic 16 minutes ago

> you gradually build a theory which eventually makes the problem obvious.

Which incidentally is how programming in Haskell feels like

benjaminwootton 5 hours ago

The bigger issue is that LLMs haven’t had much training on Q as there’s little publically available code. I recently had to try and hack some together and LLMs couldn’t string simple pieces of code together.

It’s a bizarre language.

[-]

haolez 4 hours ago

I don't think that's the biggest problem. I think it's the tokenizer: it probably does a poor job with array languages.

[-]

quotemstr 3 hours ago

Perhaps for array languages LLMs would do a better job running on a q/APL parse tree (produced using tree-sitter?) with the output compressed into the traditional array-language line noise just before display, outside the agentic workflow.

Veedrac 3 hours ago

> Let’s start with an example: (2#x)#1,x#0 is code from the official q phrasebook for constructing an x-by-x identity matrix.

Is this... just to be clever? Why not

    (!x)=/:!x

aka. the identity matrix is defined as having ones on the diagonal? Bonus points AI will understand the code better.

[-]

sannysanoff 2 hours ago

while both versions are O(N^2), your version is slower because comparison operation, which affects execution speed. This is suboptimal.

  q)x:1000
  q)\t:1000 sum (til x)=/:(til x)
  889
  q)\t:1000 sum (til x)=/:(til x)
  871
  q)\t:1000 sum (2#x)#1,x#0
  602
  q)\t:1000 sum (2#x)#1,x#0
  599

upd: in ngn/k, situation is opposite ;-o

dzaima 3 hours ago

Unless the interpreter is capable of pattern-recognizing that whole pattern, that will be less efficient, e.g. having to work with 16-bit integers for x in the range 128..32767, whereas the direct version can construct the array directly (i.e. one byte or bit per element depending on whether kdb has bit booleans). Can't provide timings for kdb for licensing reasons, but here's Dyalog APL and CBQN doing the same thing, showing the fast version at 3.7x and 10.5x faster respectively: https://dzaima.github.io/paste/#0U1bmUlaOVncM8FGP5VIAg0e9cxV...

quotemstr 3 hours ago

The vibe I get from q/kdb in general is that its concision has passed the point of increasing clarity through brevity and is now about some kind of weird hazing or machismo thing. I've never seen even numpy's verbosity be an actual impediment to understanding an algorithm, so we're left speculation about social and psychological explanations for why someone would write (2#x)#1,x#0 and think it beautiful.

Some brief notations make sense. Consider, say, einsum: "ij->ji" elegantly expresses a whole operation in a way that exposes the underlying symmetry of the domain. I don't think q's line noise style (or APL for that matter) is similarly exposing any deeper structure.

sanjayjc 5 days ago

> I think the aesthetic preference for terseness should give way to the preference for LLM accuracy, which may mean more verbose code

From what I understand, the terseness of array languages (Q builds on K) serves a practical purpose: all the code is visible at once, without the reader having to scroll or jump around. When reviewing an LLM's output, this is a quality I'd appreciate.

[-]

gabiteodoru 5 days ago

I agree with you, though in the q world people tend to take it to the extreme, like packing a whole function into a single line rather than a single screen. Here's a ticker plant standard script from KX themselves; I personally find this density makes it harder to read, and when reading it I put it into my text editor and split semicolon-separated statements onto different lines: https://github.com/KxSystems/kdb-tick/blob/master/tick.q E.g. one challenge I've had was generating a magic square on a single line; for odd-size only, I wrote: ms:{{[(m;r;c);i]((.[m;(r;c);:;i],:),$[m[s:(r-1)mod n;d:(c+1) mod n:#:[m]];((r+1)mod n;c);(s;d)])}/[((x;x)#0;0;x div 2);1+!:[x*x]]0}; / but I don't think that's helping anyone

[-]

rak1507 3 hours ago

There's a difference between one line and short/terse/elegant.

  {m:(x,x)#til x*x; r:til[x]-x div 2; 2(flip r rotate')/m}

generates magic squares of odd size, and the method is much clearer. This isn't even golfed as the variables have been left.

krackers 6 hours ago

When Q folks try to write C: https://github.com/kparc/ksimple

[-]

quotemstr 3 hours ago

Representative example:

  //!malloc
  f(a,y(x+2,WS+=x;c*s=malloc(y);*s++=0;*s++=x;s))     //!< (a)llocate x bytes of memory for a vector of length x plus two extra bytes for preamble, set refcount to 0
                                                      //!< and vector length to x in the preamble, and return pointer to the 0'th element of a new vector \see a.h type system
  f(_a,WS-=nx;free(sx-2);0)                           //!< release memory allocated for vector x.
  G(m,(u)memcpy((c*)x,(c*)y,f))                       //!< (m)ove: x and y are pointers to source and destination, f is number of bytes to be copied from x to y.
                                                      //!< \note memcpy(3) assumes that x/y don't overlap in ram, which in k/simple they can't, but \see memmove(3)
  //!memory management
  f(r_,ax?x:(++rx,x))                                 //!< increment refcount: if x is an atom, return x. if x is a vector, increment its refcount and return x.
  f(_r,ax?x                                           //!< decrement refcount: if x is an atom, return x.
         :rx?(--rx,x)                                 //!<   if x is a vector and its refcount is greater than 0, decrement it and return x.
            :_a(x))                                   //!<   if refcount is 0, release memory occupied by x and return 0.

Reminds me a bit of both the IOCCC and 70s Unix C from before anyone knew how to write C in a comprehensible way. But the above is ostensibly production code and the file was last updated six months ago.

Is there some kind of brain surgery you have to undergo when you accept the q license that damages the part of the brain that perceives beauty?

dapperdrake 4 hours ago

When EAX and RAX take too long to type.

lynx97 4 hours ago

Hey, another language with smileys! Like haskell, which has (x :) (partial application of a binary operator)

dapperdrake 4 hours ago

Perl and line noise also share these properties. Don’t particularly want to read straight binary zip files in a hex editor, though.

Human language has roughly, say, 36% encoding redundancy on purpose. (Or by Darwinian selection so ruthless we might as well call it "purpose".)

[-]

1718627440 3 hours ago

Language is often consciously changed and learned, so it is sometimes quite designed.

chmod775 3 hours ago

LLMs were created to use the same interface as humans (language/code).

Asking humans to change for the sake of LLMs is an utterly indefensible position. If humans want terse code, your LLM better cope or go home.

[-]

afc 2 hours ago

Disagree. If some small adjustments to your workflow or expectations enable you to use LLMs to produce good, working, high-quality code much faster than you could otherwise, at some point you should absolutely welcome this, not stubbornly refuse change.

[-]

chmod775 5 minutes ago

Somehow I don't think writing verbose English to communicate with an LLM is ever going to beat a language purpose-built for its particular niche. Being terse is the point and what makes it so useful. If people wanted to use python with their LLM instead, they have that option.

mikkupikku an hour ago

Do you swing a nailgun?

Use the tool according to how it works, not according to how you think it should work.

[-]

chmod775 40 minutes ago

Chances are hell is going to freeze over before people start writing verbose q code. Q being less verbose than alternatives is the whole point. Nobody is feeling any pressure to bend over backwards to accommodate the guy who struggles to get by when his LLM can't explain a piece of code to him.

To use your nailgun analogy as an example: Waddling in with your LLM and demanding the q community change is like walking into a clockmaker's workshop with your nailgun and demand they accommodate your chosen tool.

"But I can't fit my nailgun into these tiny spaces you're making, you should build larger clocks with plenty of space for my nailgun to get a good angle!"

No, we're not going to build larger clocks, but you're free to come back with a tiny automatic screwdriver instead. Alternatively you and your nailgun might feel more at home with the construction company across the street.

icsa 7 days ago

I think that there are a few critical issues that are not being considered:

* LLMs don't understand the syntax of q (or any other programming language).

* LLMs don't understand the semantics of q (or any other programming language).

* Limited training data, as compared to kanguages like Python or javascript.

All of the above contribute to the failure modes when applying LLMs to the generation or "understanding" of source code in any programming language.

[-]

chewxy 6 hours ago

> Limited training data, as compared to kanguages like Python or javascript.

I use my own APL to build neural networks. This is probably the correct answer, and inline with my experience as well.

I changed the semantics and definition of a bunch of functions and none of the coding LLMs out there can even approach writing semidecent APL.