Anthropic recently released something that looks more polished but follows the chat paradigm. [1]
As a builder of something like that [2], I believe the future is a mix, where you have chat (because it's easy to go deep and refine) AND generate UIs that are still configurable manually. It's interesting to see that you also use plotly for rendering charts. I found it non-trivial to make these highly configurable via a UI (so far).
Thank you for open sourcing so we can all learn from it.
[1] https://news.ycombinator.com/item?id=41885231 [2] https://getdot.ai
Here is the link to one of the prompts. It seems like all the LLM tasks are in the agents directory: https://github.com/microsoft/data-formulator/blob/main/py-sr...
Some of these "agents" are used for surprising things like sorting: https://github.com/microsoft/data-formulator/blob/main/py-sr... [this seems a bit lazy, but I guess it works :D]
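It's not as lazy as it sounds: an LLM can produce "natural" orderings (Mon..Sun, Jan..Dec, S/M/L) that a plain lexicographic sort can't. A minimal sketch of what such an agent call might look like, with a deterministic fallback. The prompt wording and the `call_llm` hook are invented for illustration, not Data Formulator's actual code:

```python
import json

def build_sort_prompt(values: list[str]) -> str:
    # Ask the model for a semantically natural ordering, returned as JSON.
    return (
        "Sort the following category labels in a natural order "
        "(e.g. Mon..Sun, Jan..Dec, small..large) and return only a JSON list:\n"
        + json.dumps(values)
    )

def sort_values(values: list[str], call_llm=None) -> list[str]:
    if call_llm is not None:
        try:
            return json.loads(call_llm(build_sort_prompt(values)))
        except (ValueError, TypeError):
            pass  # fall back if the model returns malformed JSON
    return sorted(values)  # deterministic lexicographic fallback

# Without an LLM hooked up, we only get the lexicographic order.
print(sort_values(["Wed", "Mon", "Fri"]))
```

The fallback matters: since the model's output is unconstrained text, anything it returns has to be validated before it's trusted as a sort result.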
Since Data Formulator performs data transformation on your behalf to get the desired visualization, how can we verify those transformations are not contaminated by LLM hallucinations, and ultimately, the validity of the visualization?
We can't. Without the driver, this car runs on probability, and that's all. A capable operator is still needed in the loop.
Definitely looks like something that could save me, and others, a lot of time. Thanks for sharing!
After giving it a whirl I'm a little underwhelmed, but maybe I'm using it wrong. I'm getting less consistent results than if I prompted GPT-4o for a Vega graph after providing it with the documentation.
Way cool! I hope to take it for a spin tomorrow!
Q: Does your team see potential value in a DSL for succinctly describing visualizations to an LLM as Hex did with their DSL for Vega-lite specs [1]?
[1]: https://hex.tech/blog/making-ai-charts-go-brrrr/
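The appeal of such a DSL is that a terse, constrained grammar is easier for an LLM to emit correctly than raw Vega-Lite JSON. A hypothetical sketch of the idea: a one-line chart description expanding into a full Vega-Lite spec. The grammar here (`mark x=field:type y=field:type`) is invented for illustration and is not Hex's actual DSL:

```python
# Map single-letter type codes to Vega-Lite encoding types.
TYPE_CODES = {"Q": "quantitative", "N": "nominal", "T": "temporal", "O": "ordinal"}

def dsl_to_vegalite(dsl: str, data_url: str) -> dict:
    """Expand e.g. 'bar x=month:N y=sales:Q' into a Vega-Lite spec dict."""
    mark, *channels = dsl.split()
    encoding = {}
    for ch in channels:
        channel, _, rest = ch.partition("=")
        field, _, vtype = rest.partition(":")
        encoding[channel] = {
            "field": field,
            "type": TYPE_CODES.get(vtype, "nominal"),
        }
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"url": data_url},
        "mark": mark,
        "encoding": encoding,
    }

spec = dsl_to_vegalite("bar x=month:N y=sales:Q", "data/sales.csv")
```

The LLM only has to get the one-liner right; the boilerplate (`$schema`, the full type names, the encoding structure) is reconstructed deterministically, which shrinks the surface area for hallucinated JSON.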
Wow, that's pretty cool! I think there's potential -- current LLMs are not that good at Vega-Lite when I ask them to edit the script :)
Thanks for sharing and providing an open source version! This is great!
I rather like this idea. Apologies in advance for my cynicism, however: I suspect it'll die due to human concerns. I've seen many reports recently, in dashboards written by vendors and internally, that are just plain and utterly wrong. The veracity of the results mostly depends on the human driving the tool and validating the methodology, and the competent ones are apparently rather rare. This just hands the job to humans who are even worse at it than the current ones.