A Java library for extracting tables from Text-Based PDFs and scanned PDFs

(github.com)

1 point | by mehulimukherjee 6 hours ago

1 comment

Hi HN,

Over the past year I’ve been working on ExtractPDF4J, an open-source Java library for extracting tables from real-world PDFs.

Many document processing pipelines rely on PDFs like bank statements, financial reports, or invoices. In practice these files are inconsistent: some are text-based, others are scanned images, and many contain irregular layouts or multi-page tables.

Many popular tools in this space are used from Python (like Camelot or tabula-py). In JVM-heavy environments this often means running a separate Python service or building a hybrid stack.

ExtractPDF4J was designed to solve this problem directly in Java.

Key ideas behind the project:

• Hybrid parsing strategies (stream + lattice detection)
• OCR fallback for scanned documents
• CLI and service modules for production workflows
• Maven Central distribution for easy integration
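To make the first two ideas concrete, here is a rough sketch of how a hybrid pipeline might pick a strategy per page. This is not ExtractPDF4J's actual API: the class `StrategySketch`, the methods `choose` and `splitStreamRow`, and the threshold of 4 ruled lines are all hypothetical names and values made up for illustration. The assumption is that pages without a text layer go to OCR, pages with enough drawn ruling lines are treated as lattice tables, and everything else is parsed stream-style by splitting on whitespace gaps.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from ExtractPDF4J.
class StrategySketch {

    enum Strategy { STREAM, LATTICE, OCR }

    // Assumed heuristic: no text layer means a scanned page, so use OCR.
    // Otherwise, a page with several ruled lines probably has a drawn grid
    // (lattice); the threshold of 4 is an arbitrary value for this sketch.
    static Strategy choose(boolean hasTextLayer, int ruledLineCount) {
        if (!hasTextLayer) {
            return Strategy.OCR;
        }
        return ruledLineCount >= 4 ? Strategy.LATTICE : Strategy.STREAM;
    }

    // Stream-style row splitting: treat runs of two or more whitespace
    // characters as column gaps, so single spaces stay inside a cell.
    static List<String> splitStreamRow(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split("\\s{2,}")) {
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, 0));
        // prints STREAM
        System.out.println(splitStreamRow("2024-01-05   ACME Corp     -42.10"));
        // prints [2024-01-05, ACME Corp, -42.10]
    }
}
```

A real implementation would work from glyph coordinates and detected line segments rather than plain strings, but the branching idea is the same.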

The latest release also introduced a BOM module to simplify dependency management and a full documentation site.

Project: https://github.com/ExtractPDF4J/ExtractPDF4J

Docs: https://extractpdf4j.github.io/ExtractPDF4J/

I’d really appreciate feedback from people who have dealt with messy PDF extraction problems. Suggestions and contributions are welcome. If you find the project useful, starring the repo helps it reach more of the Java community. Thank you!