Part of the goal with releasing the dataset is to highlight how hard PDF parsing can be. Reducto models are SOTA, but they aren't perfect.
We constantly see alternatives show one ideal table to claim they're accurate. Being able to parse some tables is not hard.
What happens when a table has merged cells, dense text, rotations, or no gridlines? Will your table outputs be the same when a user uploads a document twice?
Our team is relentlessly focused on solving for the true range of scenarios so our customers don't have to. Excited to share more about our next gen models soon.
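That second question, consistency across repeated uploads, is easy to check yourself. Here's a minimal sketch, assuming a hypothetical `parse_fn` that takes a file path and returns table cells as `(row, col, rowspan, colspan, text)` tuples (the function name and cell format are illustrative, not any particular vendor's API):

```python
import hashlib

def table_fingerprint(cells):
    """Canonical hash of a parsed table: sorted row-major cells with spans."""
    canon = "|".join(
        f"{r},{c},{rs},{cs}:{text.strip()}"
        for r, c, rs, cs, text in sorted(cells)
    )
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

def is_deterministic(parse_fn, pdf_path, runs=5):
    """Parse the same document several times; all fingerprints should match."""
    fingerprints = {table_fingerprint(parse_fn(pdf_path)) for _ in range(runs)}
    return len(fingerprints) == 1
```

If a parser (or an LLM behind it) is sampling nondeterministically, this surfaces it immediately, which matters a lot when downstream pipelines diff or cache extraction results.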
This is great, but are there datasets for this already? I know pubtables is like 1M labeled data points. Also how important are table schemas as a % of overall unstructured documents?
Love the PubTables work! It's a really useful dataset. Their data comes from existing annotations from scientific papers, so in our experience it doesn't include many of the hardest cases that methods still fail on today. The annotations are computer generated rather than manually labeled, so you don't get things like scanned and rotated images or much diversity in languages.
I'd encourage you to take a look at some of our data points to compare for yourself! Link: huggingface.co/spaces/reducto/rd_table_bench
In terms of the overall importance of table extraction, we've found it to be a key bottleneck for folks looking to do document parsing. It's up there amongst the hardest problems in the space alongside complex form region parsing. I don't have the exact statistics handy, but I'd estimate that ~25% of the pages we parse have some hairy tables in them!
I have real-world bank statements that no PDF/AI extractor I've found can do a good job on.
(To summarize, the core challenge appears to be recognizing nested columnar layout formats combined with odd line wrapping within those columns.)
Is there anyone I can submit an example few pages to for consideration in some benchmark?
Happy to add examples to future iterations of this dataset if you want to send some over!
Not surprising to see Reducto at the top, it's by far the best option we've tried