  • mehulimukherjee 6 hours ago

    Hi HN,

    Over the past year I’ve been working on ExtractPDF4J, an open-source Java library for extracting tables from real-world PDFs.

    Many document processing pipelines rely on PDFs like bank statements, financial reports, or invoices. In practice these files are inconsistent: some are text-based, others are scanned images, and many contain irregular layouts or multi-page tables.

    Many popular tools in this space are Python-based or are commonly used through Python wrappers (like Camelot or tabula-py). In JVM-heavy environments this often means running a separate Python service or building a hybrid stack.

    ExtractPDF4J was designed to solve this problem directly in Java.

    Key ideas behind the project:

    • Hybrid parsing strategies (stream + lattice detection)
    • OCR fallback for scanned documents
    • CLI and service modules for production workflows
    • Maven Central distribution for easy integration
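
    To make the hybrid idea concrete, here is a minimal, self-contained Java sketch. It is not ExtractPDF4J's actual API: the class `HybridTableSketch`, its methods, and the thresholds are all hypothetical names chosen for illustration. It assumes a page can be summarized by whether it has a text layer and how many ruling lines it contains, and it shows one common stream-style heuristic, which is to find column separators from horizontal gaps between word boxes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; none of these names come from ExtractPDF4J.
final class HybridTableSketch {

    enum Strategy { LATTICE, STREAM, OCR }

    /** Horizontal extent of one word on a page (hypothetical type). */
    static final class Span {
        final double left;
        final double right;

        Span(double left, double right) {
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Assumed heuristic: no text layer means OCR, enough ruling lines
     * means lattice, otherwise fall back to stream.
     */
    static Strategy choose(boolean hasTextLayer, int rulingLineCount, int minRulingLines) {
        if (!hasTextLayer) {
            return Strategy.OCR;
        }
        return rulingLineCount >= minRulingLines ? Strategy.LATTICE : Strategy.STREAM;
    }

    /**
     * Stream-style column detection: sweep word spans from left to right and
     * treat every uncovered horizontal gap wider than minGap as a column
     * separator. Returns the x-coordinate of the middle of each such gap.
     */
    static List<Double> columnSeparators(List<Span> words, double minGap) {
        List<Span> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble(s -> s.left));

        List<Double> separators = new ArrayList<>();
        if (sorted.isEmpty()) {
            return separators;
        }

        // Right edge of everything seen so far; overlapping words extend it.
        double coveredUntil = sorted.get(0).right;
        for (Span s : sorted.subList(1, sorted.size())) {
            if (s.left - coveredUntil > minGap) {
                separators.add((coveredUntil + s.left) / 2.0);
            }
            coveredUntil = Math.max(coveredUntil, s.right);
        }
        return separators;
    }
}
```

    For example, `HybridTableSketch.choose(true, 0, 4)` selects `STREAM` for a text-based page with no ruling lines, and `columnSeparators` can then split that page's words into columns. A real implementation would work from far richer page features than these two inputs.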

    The latest release also introduced a BOM module to simplify dependency management and a full documentation site.

    Project: https://github.com/ExtractPDF4J/ExtractPDF4J

    Docs: https://extractpdf4j.github.io/ExtractPDF4J/

    I’d really appreciate feedback from people who have dealt with messy PDF extraction problems. Suggestions and contributions are welcome. If you find the project useful, starring the repo helps it reach more of the Java community. Thank you!