Show HN: POC to scrape and structure HTML into JSON for RAG

(structured.pages.dev)

8 points | by nirvanist 2 days ago

6 comments

  • mahi_novice a day ago

    Do you mind sharing more about the implementation details? Do you have any safeguards for the URLs?

    • nirvanist a day ago

      Basically, I use headless Chromium with Puppeteer to render the page. Then, some logic extracts and cleans the HTML content. Finally, I use Gemini with a specific schema to return a JSON response.
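
For readers wondering what the middle "extract and clean" step might look like, here is a minimal sketch in Python using only the standard library. It is not nirvanist's code: the thread does not say how the cleaning is done, so the skipped-tag list, the `TextExtractor` class, and the `clean_html` function are all hypothetical. It assumes the page has already been rendered to an HTML string (for example by Puppeteer), and it stops before any model call.

```python
from html.parser import HTMLParser

# Tags whose contents are dropped entirely.
# This list is an assumption for illustration, not taken from the thread.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # cleaned text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def clean_html(rendered_html: str) -> str:
    """Return the visible text of an already-rendered HTML string."""
    parser = TextExtractor()
    parser.feed(rendered_html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><h1>Title</h1><p>Hello   <b>world</b></p></body></html>"
    )
    print(clean_html(sample))
```

The cleaned text would then be passed to the model together with the JSON schema. Stripping scripts and styles first keeps the prompt small and avoids feeding the model markup it has no use for.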

  • mahi_novice a day ago

    How do you plan to use it?

    • nirvanist a day ago

      At the moment, I use it in client projects to build agents for their chat systems by adding RAG to the models.