Isn't it crazy how the image where every other pixel is black (labeled "Without interpolating, we can clearly see we only have half the pixels") sort of looks to have higher fidelity than the one after it, where the black pixels have been removed, which now looks pixelated?
And yet both images contain the exact same amount of information, because all pixels that have been removed are simply black.
The effect is so pronounced that I wonder whether there wasn't some additional processing between the two images. Compare the window with the sky reflection between the two: in the image without black pixels it looks distorted and aliased, while in the one with them it looks pristine.
If only the black pixels were actually removed (and the result nearest-neighbor scaled), then I think the black-pixel grid is a form of dithering, although neither random nor the error-diffusion kind one usually thinks of when hearing "dithering". There is no error being diffused here, and all added pixels are constant black.
Maybe the black pixels allow the mind to fill in the gaps (essentially shifting the interpolation that was also removed, prior to the black-pixel image, onto our brain). It is known that our brain interpolates and even outright makes up highly non-linear stuff as part of our vision system. A famous example is how our blind spot, the point where the optic nerve exits the retina, is "hidden" from us that way.
The aliasing would "disappear" because we sort of have twice the number of samples (pixels), leading to twice the Nyquist frequency, except that half of the samples have been made up by our vision system through interpolation. (This is a way simplified and crude way to look at it. Pun unintended.)
But before jumping to such lofty conclusions, I still wonder whether nothing more happened between the two images...
If I understand correctly, the picture without black pixels is half the resolution, at 320x240. That's small enough that your browser might be upscaling it.
If I have the "Without interpolating, we can clearly see we only have half the pixels." image entirely on screen, using Chrome, KDE with X11 on Ubuntu 24.04, then it makes my whole screen change colour. Everything becomes slightly darker or something. Very odd. I will try it on another computer.
I wonder if there's not some adaptive backlight automatic gain control cueing off of the moire image's black pixels, since you describe things as slightly darker.
Without having tried it, maybe there's some HDR content on the page, triggering the display's HDR mode?
I kind of enjoy seeing these posts from time to time on HN. I thought it was my age (I remember this hardware) but I think a lot of engineers are enjoying practicing their craft in a more pure environment with so few (or no) layers of abstraction underneath.
Refreshing at times, isn't it?
For me, it's less about the abstraction and more about having hard limits on things like memory. Modern software has nearly limitless memory compared to the less than 1MB typical of these projects. It was definitely a lesson I had to learn.
As far as the abstraction, when does it get to a point the compiler can't undo the abstraction? At what point do you hit a wall where something simply cannot be done?
On 6502-based systems the available memory was often less than 64 KiB, the maximum addressable by the processor directly. Still, you could squeeze a lot into that small amount if you were clever. For example, Steve Wozniak wrote in BYTE magazine about computing e to over 100K places on an Apple II:
> I first calculated e to 47 K bytes of precision in January 1978. The program ran for 4.5 days, and the binary result was saved on cassette tape. Because I had no way of detecting lost-bit errors on the Apple (16 K-byte dynamic memory circuits were new items back then), a second result, matching the first, was required. Only then would I have enough confidence in the binary result to print it in decimal. Before I could rerun the 4.5 day program successfully, other projects at Apple, principally the floppy-disk controller, forced me to deposit the project in the bottom drawer. This article, already begun, was postponed along with it. Two years later, in March 1980, I pulled the e project out of the drawer and reran it, obtaining the same results. As usual (for some of us), writing the magazine article consumed more time than that spent meeting the technical challenges.
See page 392 of https://archive.org/details/byte-magazine-1981-06.
> As far as the abstraction, when does it get to a point the compiler can't undo the abstraction?
The early 8-bit systems were so constrained in memory, registers, instruction set, and clock speed that using a high-level language wasn't an option if you were trying to optimize performance or squeeze a lot of functionality into available memory. An 8-bit system would have a 64KB address space, but maybe only 16-32KB of RAM, with the rest used by the "OS" and mapped to the display, etc.
The 6502 was especially impoverished, having only three 8-bit registers and a very minimalistic instruction set. Writing performant software for it depended heavily on using "zero-page" memory (a special addressing mode for the first 256 bytes of memory) to hold variables, rather than passing stack-based parameters to functions, etc. (small example below). It was a completely different style of programming and mindset from using a high-level language - not about language abstraction, but about a constant, painful awareness of the bare metal you were running on.
I may sound bitter in describing this, but I started to notice about 15 years ago that the then-current crop of developers, trained exclusively on GC'd languages, seemed to have no idea what a memory constraint would look like, and thought that a hidden memory allocation was free as long as it occurred a few layers beneath them.
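As a rough illustration of the zero-page point above (a generic sketch with made-up labels and addresses, not code from the article): the same operation is both shorter and faster when its operand lives in page zero.

; Hypothetical 6502 snippet: zero-page vs. absolute addressing
counter_zp = $80      ; a zero-page byte (made-up address)
counter_ab = $0300    ; the same counter placed outside page zero

      INC counter_zp  ; 2 bytes, 5 cycles
      INC counter_ab  ; 3 bytes, 6 cycles

      LDA counter_zp  ; 2 bytes, 3 cycles
      STA counter_zp  ; 2 bytes, 3 cycles
      LDA counter_ab  ; 3 bytes, 4 cycles
      STA counter_ab  ; 3 bytes, 4 cycles

Stack-based parameter passing hurts on top of this because the hardware stack is a single 256-byte page and there is no stack-relative addressing mode, so keeping hot variables in zero page is usually the pragmatic choice.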
It is quite surreal to me that this was not the road taken. Optimizing software doesn't have the same potential as optimizing hardware, but I'd say 1/70 is significant. If thousands of people worked on this indefinitely, the time would drop to seconds. That code would also be completely incomprehensible. The argument that people should just buy a faster computer could just as easily have worked out the other way around: just write faster software. Going the hardware way gave us really, really readable code, which is great. The other direction, however, would have given us really, really cheap devices.
Receiving and sending a signal for [say] a chat application requires very little stuff. It would be next to impossible to add images, word suggestions or spell checkers. We could still bake mature applications onto dedicated chips, but until now those efforts went pretty much nowhere(?) I imagine one could quite easily bake a mail client or server, or a torrent client, IRC, perhaps even a GUI for windowed applications. Maybe an error console?
In the ~30 years I've used computers, they've become ~1,000,000 times faster. My daily experience with computers doesn't show it. There's someone out there who took the time to measure UI latency and has shown that, not only isn't it faster, it has actually slowed down. And yet, our hardware is 1,000,000 times faster...
Edit: this is the latency project I was thinking about https://danluu.com/input-lag/
> If thousands of people would work on this indefinitely the time would drop to seconds.
Even the small Apple II screen takes 7.5 kilobytes of RAM. Reading https://imapenguin.com/2022/06/how-fast-can-a-6502-transfer-..., just writing all of that to the screen takes a tenth of a second, for just over 50,000 pixels, and that’s ignoring the idiosyncratic video memory layout of the Apple II.
Going below ten seconds for decompressing that would mean you must produce 5 output pixels every millisecond, which means you have about 200 CPU cycles per pixel. On a 6502, that’s less than 100 instructions.
That makes me doubt it can get under 10 seconds.
⇒ if you want to get down to seconds, I think you’ll have to drop even more image data than this does, and do that fast.
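For a rough sanity check on that tenth-of-a-second figure, here are my own back-of-the-envelope numbers, assuming a stock 1.023 MHz 6502 and the classic indexed copy loop, ignoring page-crossing penalties and the small per-page outer-loop overhead (SRC and DST are made-up labels; $2000 is hi-res page 1 on the Apple II):

SRC = $6000           ; some source buffer (made up)
DST = $2000           ; Apple II hi-res page 1

      LDX #0          ; copy one 256-byte page per pass
LOOP  LDA SRC,X       ; 4 cycles
      STA DST,X       ; 5 cycles
      INX             ; 2 cycles
      BNE LOOP        ; 3 cycles when taken
                      ; ~14 cycles/byte; 7,680 screen bytes ~ 108,000 cycles,
                      ; i.e. roughly 0.1 s before any decoding work at all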
Oh yes, the dithering takes 10 seconds all by itself. The QuickTake 100 format is much simpler (4-bit nibbles) and it still needs 22 seconds to decode 640x480.
(Decoding and dithering are done in two passes, for memory reasons and floppy-disk space reasons, but it also brings auto-levelling.)
It's about 450 cycles per pixel for decoding QT100 (1200 for QT150), and 230 cycles per pixel for dithering.
When we had this tiny memory and a slow CPU to work with, everyone was into optimization. It quite regularly happened that someone would find a new way to do something way beyond what everyone else thought possible. You would quite literally sit in front of the screen stuck in a loop saying "What?" then "How?" with 5-second pauses in between. I'm not bragging about my amazing skills here, more the opposite. I've just adjusted my estimates accordingly. If enough people spend enough time, the solution will turn out much weirder than the Quake inverse square root.
> I’ll be using the revised version as I think it’s a well-established example of doing a real-world block transfer. Sure there may be faster ways, but this is a realistic way, which is what we’re going for.
     LDX #NUMBER
NEXT LDA BASE,X
     STA DEST,X
     DEX
     BNE NEXT

Something unrolled like:

$1000 LDA $2000
$1003 STA $3000
$1006 LDA $2001
$1009 STA $3001
etc...

isn't even the fastest solution, but 3 out of 5 instructions are gone and I think the two remaining are faster. The transfer in the book is of course really practical, while this one is already almost unworkable. You can do worse, though:

$1000 LDA #$12
$1002 STA $30
$1004 LDA #$34
$1006 STA $31
etc...

For the truly insane solution you would have to consider whether you even have to read the data. Maybe you can present the image as-is and modify it... or worse... turn it into code...

But it still should happen. Every project (of a certain size) should have at least 1-2 programmers whose job it is to make the code faster and smaller. Frankly, this could well be the best use of AI on code: not to generate it, but to use its potential speed to chew through existing code bases, outputting squashed-down, streamlined code which is tested to return the same results as the original (think the dream of Gentoo, but every program on your system optimized for your particular hardware).
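To put rough numbers on "I think the two remaining are faster" (my own cycle counts for a stock 6502, page crossings ignored; not figures from the article or the book):

; Indexed loop:       LDA BASE,X + STA DEST,X + DEX + BNE = 4+5+2+3 = 14 cycles/byte
; Unrolled absolutes: LDA $2000  + STA $3000              = 4+4     =  8 cycles/byte
; Immediate to ZP:    LDA #$12   + STA $30                = 2+3     =  5 cycles/byte

The catch is code size: the unrolled and immediate variants cost 6 and 4 bytes of code per byte moved, which is why they only make sense for small or generated transfers - exactly the "turn it into code" idea above.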
Good reminder to do less, rather than the same thing but optimized.
Yeah. Amazing idea and execution (counting the number of instructions per module).