Scientific datasets are riddled with copy-paste errors

(sciencedetective.org)

66 points | by jruohonen 10 hours ago

10 comments

This is legitimately so challenging to avoid, because loads of scientific processes are, to some degree or other, bespoke and difficult to fully streamline or to cover with efficient, well-structured, comprehensive QA.

A LOT of labour goes into making it work. Most scientists I know and work with are very diligent people who care a lot about the outputs being as correct as possible, but wow, their workflows aren't great.

My job is to try to address this in whatever ways are practical for the data and the people doing the science, and it's kind of like SaaS in that you think it should be easy enough to spot problems, solve them, and carry on/become a billionaire, but... the world is much more complicated than that, and it's easier to fail in this endeavour than it is to break even.

The classic "Dropbox is just rsync" or "I could build Airbnb in a weekend" sentiments have their counterparts in science, and the reality is similarly defeating and punishing on both sides. Making science go faster while maintaining correctness is exceedingly difficult. There are so many moving parts. So many disparate participants who are wildly technical and capable, or brilliant at studying bacteria in starfish yet terrified to run a command in a terminal. Your user base has virtually nothing in common in terms of ability and willingness to do anything other than get their own work done. It's brutal.

So, I sympathize with the authors of these papers and I hope readers don't assume they're bad at what they do or that it's done in bad faith. It's genuinely difficult.

An anecdote: I created a tool for validating biodiversity data against a specification called Darwin Core. Initially our published data failed validation so often that I thought I'd made the tool wrong. Rather, the spec is so complex and vast that the people I work with were unable to get valid data into the public repositories. And yet! They were able to publish, because the public repositories' own validation is... invalid. That's the state of things.
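To make the anecdote concrete, here is a minimal sketch of what record-level validation can look like. It is not the tool described above. The term names (occurrenceID, basisOfRecord, scientificName, eventDate, decimalLatitude, decimalLongitude) are real Darwin Core terms, but treating those four as required is an assumption made for illustration, and validate_record is a hypothetical helper.

```python
# Hypothetical minimal validator. REQUIRED_TERMS is an assumption made for
# this sketch, not the Darwin Core standard or any repository's actual rules.
REQUIRED_TERMS = ("occurrenceID", "basisOfRecord", "scientificName", "eventDate")

# Coordinate terms and the absolute limit each value must stay within.
COORDINATE_LIMITS = (("decimalLatitude", 90.0), ("decimalLongitude", 180.0))


def validate_record(record):
    """Return a list of human-readable problems found in one record (a dict)."""
    problems = []

    # Required terms must be present and non-blank.
    for term in REQUIRED_TERMS:
        value = record.get(term)
        if value is None or not str(value).strip():
            problems.append(f"missing {term}")

    # Coordinates are optional here, but must be numeric and in range if given.
    for term, limit in COORDINATE_LIMITS:
        raw = record.get(term)
        if raw is None or raw == "":
            continue
        try:
            value = float(raw)
        except (TypeError, ValueError):
            problems.append(f"{term} is not a number: {raw!r}")
            continue
        if not -limit <= value <= limit:
            problems.append(f"{term} out of range: {value}")

    return problems
```

Running this over every row of an export and counting the problems per term is usually enough to show whether the failures are a handful of typos or a systematic mapping issue.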

Granted, the data is still correct enough to be useful, and the errors don't cause the results to indicate anything that they shouldn't. It's more like minor metadata issues, failures to maintain referential integrity across different datasets, etc. But it's a very real, very difficult problem.
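A referential integrity failure of the kind mentioned here can be checked with very little code. The sketch below assumes two tables loaded as lists of dicts and linked by a shared key. The eventID key in the usage example is a real Darwin Core term, but find_orphans and the sample rows are hypothetical.

```python
def find_orphans(child_rows, parent_rows, key):
    """Return the child rows whose `key` value has no match in parent_rows."""
    # Collect every non-empty key value that the parent table defines.
    known = {row[key] for row in parent_rows if row.get(key)}
    # A child row is an orphan if its key is missing or unknown to the parent.
    return [row for row in child_rows if row.get(key) not in known]


# Made-up sample data: two sampling events and three occurrence records.
events = [{"eventID": "ev-1"}, {"eventID": "ev-2"}]
occurrences = [
    {"occurrenceID": "occ-1", "eventID": "ev-1"},
    {"occurrenceID": "occ-2", "eventID": "ev-3"},  # points at a missing event
    {"occurrenceID": "occ-3"},                     # has no eventID at all
]

orphans = find_orphans(occurrences, events, "eventID")
```

Here orphans holds the occ-2 and occ-3 records. Neither would change a result on its own, which matches the point above: the data stays usable, but the links between datasets quietly rot.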

Science isn't easy at all. So many hoops to jump through, so much rigor, so much data. Mistakes are inevitable.

nippoo an hour ago

It's hard to avoid, but there are steps we can take towards fixing it. I spent years in academia building open-source data processing pipelines for neuroscience data and helping other researchers do the same. Most quantitative research goes through "lossy" steps between raw data and final results involving Excel spreadsheets, one-off MATLAB commands, copy-pasting the results, etc.

In a lot of cases (where data is being collected by humans with a tape measure, say) there is room for error. But one of the things that's getting traction in some fields is open-source publication of both raw datasets and the evaluation/processing methods (in a Jupyter Notebook, say) in a way that lets other people run their analysis on your data, your analysis on their data, or at least re-run your start-to-finish pipeline and look for errors!

As is often the case, the holdups are mostly political: methods papers are less prestigious than the "real science" ones, and it takes journals/funders to mandate these things and provide funding/hosting for datasets for 10+ years, etc. Researchers are a time-poor bunch and often won't do things unless there's an incentive to!

SubiculumCode 3 hours ago

One-offs. A lot of research results in one-off code. You may not go back to this dataset or these ideas again. When you do, sometimes years later, you go, oh shit, this is hard to work with. So then you begin to build better structures, do the extra work it takes to make things easy to apply to new purposes or to accept new (but slightly different) datasets. It takes time and effort, and money. And that is where it all breaks down. Most scientists have to be jacks of many trades to get by.

TheTaytay 3 hours ago

Yes…mistakes are inevitable, and I get not expecting or demanding perfection. But the subtext here is that this is unlikely to be a mistake, and much more likely to be fraud.

There are incentives for these spreadsheets having the values that they do, and also there is no conceivable way that the values are correct, and on top of that, the most likely ways to get these values are to copy and paste large amounts of numbers, and even perturb some of them manually.

If you see this in accounting (where there are also mistakes), it’s definitely fraud. (Awww man - we accidentally inflated our revenue and profit to meet expectations by accidentally duplicating numerous revenue lines and no one internally caught it! Dang interns!) If you see it in science, you ask the authors about it and they shrug and mumble a semi-plausible explanation if you’re lucky? I can totally imagine a lab tech or grad student making a large copy-paste mistake. I can’t imagine them making a series of them in such a way that it bolsters or proves the author’s claim AND goes completely undetected by everyone involved.
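For readers wondering how copied blocks of numbers get spotted in the first place, here is a minimal sketch of one way to do it: slide a fixed-size window over a column and report any run of values that appears more than once. This is not the method used in the linked article. repeated_runs and the default min_length of 4 are assumptions for illustration, and a hit is only a candidate for human review, not evidence of anything by itself.

```python
from collections import defaultdict


def repeated_runs(values, min_length=4):
    """Map each run of `min_length` consecutive values that occurs more than
    once in `values` to the list of indexes where that run starts."""
    positions = defaultdict(list)
    # Record the start index of every window of `min_length` values.
    for start in range(len(values) - min_length + 1):
        window = tuple(values[start:start + min_length])
        positions[window].append(start)
    # Keep only the windows seen at two or more positions.
    return {run: starts for run, starts in positions.items() if len(starts) > 1}


# Made-up column in which rows 6 to 9 repeat rows 0 to 3.
column = [1.2, 3.4, 5.6, 7.8, 9.0, 2.2, 1.2, 3.4, 5.6, 7.8, 4.1]
hits = repeated_runs(column)
```

In this example hits maps the run (1.2, 3.4, 5.6, 7.8) to the start indexes 0 and 6. Windows are allowed to overlap, so a long constant stretch (for example a column of zeros) will also be reported, and low-precision or heavily rounded data will produce coincidental matches. Manually perturbed copies, as described above, would need a fuzzier comparison than exact equality.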

SubiculumCode 3 hours ago

Well, in that case, it's bad. Obviously.

dataviz1000 2 hours ago

> their workflows aren't great

Sounds like a startup idea.

analog31 2 hours ago

Spend a few years working in the target environment. It will disabuse you of the idea that science research can be regularized with technology.

adampunk 2 hours ago

You'll want to sit down when I tell you the budget these folks have for workflow solutions. Ain't gonna take long but might be shocking if you've got big startup hopes. ;)

steve_adams_86 2 minutes ago

For sure. These are often people who want better equipment to do their research, not software subscriptions that promise to force them to work in unfamiliar and uncompelling ways. You'd need a fantastic, game-changing idea to get meaningful traction.

One example of this might be systems like S3 and distributed computing in AWS. Like, huge ideas that take massive initiatives to implement, but make science meaningfully easier. I can't think of many other modern technologies we use that the team doesn't mostly resent (like Slack or Google Drive). They're largely interested in just doing the science; the rest eats into funding (which is increasingly sparse these days).

cyanydeez 5 hours ago

Just imagine what you'd find if you scanned private industry. This is a generic problem that LLMs won't solve with their generative capabilities.