My Browser WASM't Prepared for This. Using DuckDB, Apache Arrow and Web Workers

(motifanalytics.medium.com)

126 points | by jjp a year ago

36 comments

mcraiha a year ago

Tip for all blog authors: do NOT post code as an image. Especially do not add a fake editor UI and drop shadow to the image.

In this case 25 lines of code is 50 kB of image binary.

Also, it cannot be found via a search engine, nor can it be read with a screen reader.

doubled112 a year ago

Pro tip for everybody: do not post any text as images.

Never should I receive a Java exception hundreds of lines long as a cut off JPEG file.

Or a screenshot of a Google Sheet missing the information you’re talking to me about.

shepmaster a year ago

We made this to be used as a reply when pictures are misused where text would be better:

https://fewer.pics/

stingraycharles a year ago

I’m disappointed the rendered picture isn’t advanced CSS. I honestly expected that to be the case.

natebc a year ago

I hope someone accepts this challenge!

brulard a year ago

It should have been animated.

ralferoo a year ago

It should have been a .swf and required a plugin to be installed to see it.

lgas a year ago

Why ruin the message by writing half of it upside down?

shepmaster a year ago

Yep, as the sibling said, it’s to prove the point in a snarky manner. The yellow-on-white color, the wobbling, upside-down path of text, the font choice, the unrelated image: all of these serve to make the concept actively hard to process.

Hopefully this sparks a little “wow I wonder if my image of text was also hard to read like I just experienced” moment.

doubled112 a year ago

It actually enhances the message by being harder to read.

goda90 a year ago

This guide can help if you still want the code to be pretty: https://www.taniarascia.com/adding-syntax-highlighting-to-co...

ciupicri a year ago

Or use vim to convert it to HTML https://vi.stackexchange.com/questions/792/how-to-convert-a-...

fragmede a year ago

OCR's come pretty far these days and I can select text off an image with my iPhone processing it locally with a fair bit of success.

markerz a year ago

It’s also a terrible copy-pasting and CMD+F experience.

drtgh a year ago

As long as that 50 kB image doesn’t get replaced with 700 kB of JavaScript bundles for coloring code.

I mean, if such coloring is going to be done, it should be done with HTML/CSS.

As for the OP article and screen readers, perhaps you are suggesting that people use the alt attribute or similar.
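
A minimal sketch of what pre-rendered HTML/CSS coloring could look like, so that the page ships plain markup rather than a highlighting bundle. The function names `escapeHtml` and `highlightKeywords` and the `kw` class are made up for illustration, and the keyword matching is deliberately naive; a real highlighter would tokenize the code properly.

```typescript
// Escape the characters that would otherwise be parsed as HTML.
function escapeHtml(source: string): string {
  return source
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap whole-word keyword matches in a span that CSS can color.
// Assumes keywords are plain words (no regex metacharacters) and do not
// collide with the entity names produced above ("amp", "lt", "gt").
function highlightKeywords(source: string, keywords: string[]): string {
  const escaped = escapeHtml(source);
  const pattern = new RegExp("\\b(" + keywords.join("|") + ")\\b", "g");
  const colored = escaped.replace(pattern, '<span class="kw">$1</span>');
  return "<pre><code>" + colored + "</code></pre>";
}

// Run once ahead of time (for example in a build step) and paste the result
// into the post; a CSS rule such as `.kw { color: #a0f; }` does the coloring.
const snippetHtml = highlightKeywords("const x = a < b;", ["const", "return"]);
```

The output stays selectable, searchable text, and a screen reader sees the code itself rather than an image.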

paulsutter a year ago

[flagged]

jasmcole a year ago

We use the WASM build of DuckDB quite extensively at Count (https://count.co - 2-3m queries per month). There are a couple of bugs we've noticed, but given that it's pretty much maintained by a single person, it seems impressively reliable!

jillyboel a year ago

Looking at your insane pricing page I have to assume that you are sponsoring that single person?

Spivak a year ago

I'm confused, nothing about their pricing looks that weird. Businesses don't typically have large BI teams, so you can ride that $199/mo ($2400/year) for a long time, which is so small that most SMBs can probably expense it without approval.

jillyboel a year ago

You're focussing on the wrong part

jillyboel a year ago

Gotta love being downvoted for daring to suggest a company should sponsor the sole open source dev making their whole product possible.

pmm a year ago

Author here. Thank you all for the comments. I take full responsibility for stupidly using an image for posting the code snippet. Sorry for that! Also, the article was originally posted almost 2 years ago (and "resurrected" with the recent migration to Medium). This is why a fairly old DuckDB version is referenced there. Some of the issues I observed are now gone too.

Obviously, many things have changed since then. We've experimented extensively and moved back and forth with using DuckDB for our internal cloud processing architecture. We eventually settled on just using it for reading the data and then handling everything else in custom workers. Even using TypeScript, we achieved close to 1M events/s per worker overall with very high scalability. However, our use-case is quite distinct. We use a custom query engine (for sequence processing), which has driven many design decisions.

Overall, I think DuckDB (both the vanilla and WASM versions) is absolutely phenomenal. It has also matured since my original blog post. I believe we'll only see more and more projects using it as their backbone. For example, MotherDuck is doing some amazing things with it (e.g., https://duckdb.org/2023/03/12/duckdb-ui) but there are also many more exciting initiatives.

azakai a year ago

> [wasm] is executed in a stack-based virtual machine rather than as a native library code.

Wasm's binary format is indeed a stack-based virtual machine, but that is not how it is executed. Optimizing VMs convert it to SSA form, basic blocks, and finally machine code, much the same as clang or gcc compile native library code.

It is true that wasm has some overhead, but that is due to portability and sandboxing, not the stack-based binary format.

> On top of the above, memory available to WASM is limited by the browser (in case of Chrome, the limit is currently set at 4GB per tab).

wasm64 solves this, by allowing 64-bit pointers and a lot more than 4GB of memory.

The feature is already supported in Chrome and Firefox, but not everywhere else yet.
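
A rough sketch of the arithmetic behind that 4GB figure, assuming the standard 64 KiB WebAssembly page size and a linear memory addressed with 32-bit pointers. The constant and function names are illustrative only, and actual engines may enforce lower limits than the theoretical ceiling computed here.

```typescript
// WebAssembly linear memory grows in fixed-size pages of 64 KiB.
const WASM_PAGE_BYTES = 64 * 1024;
const GIB = 1024 * 1024 * 1024;

// Number of distinct byte addresses a pointer of the given width can express.
function addressableBytes(pointerBits: number): number {
  return Math.pow(2, pointerBits);
}

// Largest page count that such a pointer can still address.
function maxPages(pointerBits: number): number {
  return addressableBytes(pointerBits) / WASM_PAGE_BYTES;
}

const wasm32Bytes = addressableBytes(32);
const wasm32Pages = maxPages(32);

// 32-bit pointers: 65536 pages of 64 KiB, i.e. 4 GiB in total.
console.log(wasm32Bytes / GIB, wasm32Pages);
```

With 64-bit pointers the same calculation no longer produces a 4 GiB ceiling, which is why wasm64 can go beyond it.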

geokon a year ago

The more I read about WASM, the more it sounds like the JVM.

I'm still not clear on what, at its core, it does differently (in a way that couldn't be bolted on to a subset of the JVM).

azakai a year ago

The JVM is designed around Java. That's really the main difference, and it brings some downsides for the goals of wasm, which include running native code - think C++ or Rust. The JVM is great at Java, which relies on runtime inlining etc., but not at C++, which assumes ahead-of-time inlining and other optimizations.

geokon a year ago

I don't understand how the virtual machine would preclude you from inlining ahead of time...? That's done when you're compiling.

What is WASM doing to facilitate recompiling native code that isn't practical to do on the JVM?

tobilg a year ago

Not sure why the comparisons were made with pretty outdated versions, to be honest.

I’m using an (older) v1.29.1 dev version with https://sql-workbench.com w/o any bigger issues.

__mp a year ago

I’ve been toying with the idea of implementing a distributed analytics engine on top of Cloudflare workers and DuckDB.

I’m not sure if this goes against the Cloudflare TOS though (last time I checked they had some provisions against processing images).

judge2020 a year ago

In the past few years there was this blog post[0] that clarified this. It moved the restriction on serving a "disproportionate percentage of pictures, audio files, or other large files" to another part of the TOS dedicated specifically to the CDN part[1] and clarified that, if you're using Cloudflare add-on services Stream, R2 (their S3), or Cloudflare Images, then you won't be at risk of termination.

0: https://blog.cloudflare.com/updated-tos/

1: The restriction still exists at https://www.cloudflare.com/service-specific-terms-applicatio... under "Content Delivery Network (Free, Pro, or Business)".

httgp a year ago

BoilingData does something very similar, except with Lambdas on AWS.

bobnamob a year ago

They've already got one: https://developers.cloudflare.com/analytics/analytics-engine...

tobilg a year ago

IMO that’s not really possible because of the size limits of Cloudflare Workers. Neither the WASM nor the Node version is small enough.

I’m running it on AWS Lambda functions with some success.

canadiantim a year ago

How persistent can you make this data?

curtisszmania a year ago

[dead]