Ask HN: What is the (open-source) way of converting HTML to PDF nowadays?

62 points | by hhthrowaway1230 3 days ago

80 comments

pabs3 2 days ago

Just print to PDF in a browser, or automate that with a browser automation tool. For a non-browser-based open-source solution, try WeasyPrint:

https://weasyprint.org/

For a proprietary solution, try Prince XML:

https://www.princexml.com/

rossdavidh 3 hours ago

+1 to WeasyPrint. I have used it with a Django production system for a few years now, and it works well enough that I never have to think about it. I'm not doing anything fancy, though.

grounder 2 hours ago

WeasyPrint works really well for me. It can support all of the languages and fonts I need. I run it on AWS Lambda and in Docker as a web service.

I previously used wkhtmltopdf, but it hasn't been maintained for years and doesn't support recent CSS. It does support JS if you need it, but for JS I'd probably look at headless Chromium or another solution.

Edit: Previous post with some good discussion: https://news.ycombinator.com/item?id=26578826
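
For reference, a minimal sketch of the WeasyPrint route described above, in Python. It assumes WeasyPrint is installed (pip install weasyprint) and that the input is a local HTML file; pdf_target and render_with_weasyprint are hypothetical helper names, not part of WeasyPrint.

```python
from pathlib import Path


def pdf_target(html_file: Path, out_dir: Path) -> Path:
    """Map e.g. pages/report.html to <out_dir>/report.pdf."""
    return out_dir / (html_file.stem + ".pdf")


def render_with_weasyprint(html_file: Path, out_dir: Path) -> Path:
    """Render one local HTML file to PDF and return the path written."""
    # Imported lazily so pdf_target() works even where WeasyPrint is missing.
    from weasyprint import HTML

    out_dir.mkdir(parents=True, exist_ok=True)
    target = pdf_target(html_file, out_dir)
    # Passing filename= lets WeasyPrint resolve relative CSS and image links.
    HTML(filename=str(html_file)).write_pdf(str(target))
    return target
```

Calling render_with_weasyprint(Path("report.html"), Path("out")) should leave out/report.pdf behind. Font coverage depends on what is installed on the host or in the container.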

jiehong 30 minutes ago

Most websites do not have a print stylesheet, so they don’t print that nicely to PDF.

But, I upvote weasyprint for that instead.

sureglymop 2 hours ago

Prince XML looks nice but what about creating a PDF directly from a website? This often adds some problems, for example links still pointing to other pages on the web. But in my experience printing to PDF is often not good enough.

chinathrow 29 minutes ago

Yes, I did that for a recent small program. The @media print media query is powerful enough for most of the stuff I wanted to format nicely. Even page breaks are possible.

jmyeet 2 hours ago

I’ve had excellent experience with Prince XML and poor experience with everything else I’ve tried. Prince is fast, robust and reliable.

Yes it costs money. So does developer time.

angst_ridden 8 minutes ago

Agreed. Prince also has a lot of good features for headers, footers, page numbering, etc, that make it very powerful.

kappadi3 3 days ago

Puppeteer and Playwright are the main open-source options nowadays, both solid for HTML → PDF once your print CSS is sorted. Don’t forget proper page breaks (break-before, break-after, break-inside): for example, break-after: page works in Chromium, while break-after: always doesn’t. For trickier pagination you can look at Paged.js, and I’d test layouts in Chrome/Edge before automating.

Shameless plug: I run yakpdf.com, a hosted Puppeteer-based service if you want to avoid self-hosting. https://rapidapi.com/yakpdf-yakpdf/api/yakpdf
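
A minimal sketch of the print-CSS advice above, written as a Python helper so its output can be fed to any of the renderers mentioned in this thread. The selectors (section, figure, tr, nav, .no-print) and the A4 page size are assumptions about the page being printed, and with_print_css is a made-up name.

```python
import html

# Print rules only; screen rendering is left untouched.
PRINT_CSS = """
@media print {
  /* Start every section after the first on a fresh page. */
  section + section { break-before: page; }
  /* Avoid splitting figures and table rows across pages. */
  figure, tr { break-inside: avoid; }
  /* Hide screen-only elements such as navigation bars. */
  nav, .no-print { display: none; }
}
@page { size: A4; margin: 20mm; }
"""


def with_print_css(body_html: str, title: str = "Document") -> str:
    """Wrap an HTML fragment in a full page that carries the print stylesheet."""
    return (
        "<!doctype html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title>\n'
        f"<style>{PRINT_CSS}</style></head>\n"
        f"<body>{body_html}</body></html>\n"
    )
```

Opening the result in Chrome or Edge and using print preview is a quick way to check the breaks before automating anything.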

johnh-hn 4 hours ago

Seconded. I went with C# + Playwright. I tried iTextSharp, iText, PDFSharp, and wkhtmltopdf, but they all had limitations. I had good results with Playwright in minutes, outside of tweaking the CSS like you mention.

I documented the process here[0] if anyone needs examples of the CSS and loading web fonts. Apologies for the article being long-winded – it was the first one I published.

[0] https://johnh.co/blog/creating-pdfs-from-html-using-csharp

benoau 12 minutes ago

Thirded, you can build this straight into your backend or into a microservice very easily.

You can also easily generate screenshots if that's more suitable than PDFs.

You can also easily use this to do stuff like jam a set of images into an HTML table and PDF or screenshot them in that format.
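
As a rough illustration of the Playwright approach, here is a sketch using the Python bindings rather than the C# mentioned above. It assumes playwright is installed and its Chromium build has been fetched with "playwright install chromium"; pdf_options, html_to_pdf, and the A4 and margin defaults are placeholders, not anything from the comments.

```python
from typing import Optional


def pdf_options(pdf_path: str, margin_mm: int = 15) -> dict:
    """Keyword arguments for page.pdf(); A4 and equal margins are arbitrary."""
    margin = f"{margin_mm}mm"
    return {
        "path": pdf_path,
        "format": "A4",
        "print_background": True,  # keep background colours and images
        "margin": {"top": margin, "right": margin, "bottom": margin, "left": margin},
    }


def html_to_pdf(html: str, pdf_path: str, screenshot_path: Optional[str] = None) -> None:
    """Render an HTML string to PDF, optionally saving a full-page PNG too."""
    # Imported lazily so pdf_options() is usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Wait for the network to go idle so web fonts and images are loaded.
        page.set_content(html, wait_until="networkidle")
        if screenshot_path:
            page.screenshot(path=screenshot_path, full_page=True)
        page.pdf(**pdf_options(pdf_path))
        browser.close()
```

In a backend or microservice you would probably keep one browser instance alive across requests instead of launching it per document, as this sketch does.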

Aachen 3 hours ago

Please don't turn nice formats into a format that's similar to screenshots of text. Pandoc has an option to pack all images and styles needed to render the page into one html file:

    pandoc --self-contained input.html -o output.html

crazygringo 2 hours ago

Or, please do?

I use PDFs so I can send them to my iPad to read offline, highlight them, annotate them, and then send them back to my filesystem with highlights and annotations intact.

I sure can't do that with any "nice formats" like HTML or TXT or EPUB or MOBI.

nine_k an hour ago

PDF is literally digital paper. HTML has logical structure, it can adapt to different displays, etc.

Sometimes you want one, sometimes the other.

mr_mitm 2 hours ago

You could, though. What you are describing are features of an editor, not a file format. I can imagine a browser addon performing the same tasks.

whenc 2 hours ago

PDF annotations sit within the file.

mr_mitm an hour ago

I know, though that depends on the editor. Okular, for example, places them in a separate file, last I checked. That's not unique to PDFs. HTML files are modifiable. There is nothing preventing an editor from putting annotations in them as well.

crazygringo 33 minutes ago

PDF is designed for annotations in the file format. You annotate in one editor, you can change the annotations in another. You can always distinguish between original content and annotations. I see no indication that Okular stores highlights or annotations in a separate file; that would be bizarre.

There is no mechanism for annotations in HTML or the other formats I listed. An editor would just be editing the original content in its own non-standardized, non-portable way, which is not desirable for a number of reasons.

So when you say:

> What you are describing are features of an editor, not a file format.

That is incorrect. It is an intentionally designed and standardized feature of the file format.

layer8 24 minutes ago

HTML+CSS+media files isn’t a nice format, and it is much less portable through time and space than PDF.

moralestapia an hour ago

Please don't police what other people do.

agedclock 3 hours ago

Pandoc would be my preferred tool. It is excellent at converting between other formats as well.

TylerE 3 hours ago

Being not so easily edited is often a feature, not a bug.

craftkiller an hour ago

If that is your goal, you should be cryptographically signing your documents with your PGP key. That way you actually have assurance the document has not been modified rather than just hoping someone hasn't modified the document. Additionally, PGP can sign anything so you are open to use whatever format you want.

ryandrake 2 hours ago

Is this really that much of a motivation in 2025? Maybe in 2000 you could publish a PDF with the assurance that only the people who paid for Acrobat would be able to edit it, but today there are a lot of accessible ways to edit PDFs. I don't think I'd choose PDF if, for whatever reason, I wanted to limit others from editing.

guywithahat 3 hours ago

I was thinking this too: PDFs exist so people don't mess with the document. Still, it's a clever feature, and pandoc can also convert HTML into a PDF when given a PDF engine, though I suspect it'll fail on anything sufficiently complex:

    pandoc input.html -o output.pdf --pdf-engine=<your engine>

Snawoot 4 hours ago

    chrome --headless --disable-gpu --print-to-pdf https://example.com

piptastic 3 hours ago

Same:

    google-chrome --headless --disable-gpu --no-pdf-header-footer --hide-scrollbars --print-to-pdf-margins="0,0,0,0" --print-to-pdf --window-size=1280,720 https://example.com

ended up using headless chrome specifically to make sure javascript things rendered properly
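
For a batch job, the same command can be driven from a script. Below is a minimal Python sketch that converts every .html file in a directory. It assumes a google-chrome binary on PATH and uses the --print-to-pdf=<path> form of the flag to choose the output file. chrome_pdf_command and convert_directory are hypothetical names, and the 60-second timeout is an arbitrary guard.

```python
import subprocess
from pathlib import Path
from typing import List


def chrome_pdf_command(source: str, pdf_path: Path, chrome: str = "google-chrome") -> List[str]:
    """Build the headless Chrome invocation for one URL or file:// URI."""
    return [
        chrome,
        "--headless",
        "--disable-gpu",
        "--no-pdf-header-footer",
        f"--print-to-pdf={pdf_path}",
        source,
    ]


def convert_directory(html_dir: Path, out_dir: Path, timeout_s: int = 60) -> List[Path]:
    """Convert every *.html file in html_dir; returns the PDF paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for html_file in sorted(html_dir.glob("*.html")):
        pdf_path = out_dir / (html_file.stem + ".pdf")
        # Chrome wants a URL, so local files are passed as file:// URIs.
        cmd = chrome_pdf_command(html_file.resolve().as_uri(), pdf_path)
        subprocess.run(cmd, check=True, timeout=timeout_s)
        written.append(pdf_path)
    return written
```

Starting one browser process per file is simple but slow. For thousands of documents, a long-lived browser driven through Puppeteer or Playwright is likely the better fit.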

HPsquared 3 hours ago

Can Chromium do this?

Edit: it appears so: https://news.ycombinator.com/item?id=15131840

nine_k an hour ago

Yes, routinely works for me.

mmphosis 2 hours ago

Can Firefox do this?

with an elaborate script that relies on xdotool

andrehacker an hour ago

Yes, kind of...

    /path/to/firefox --window-size 1700 --headless -screenshot myfile.png file://myfile.html

Easy, right?

Used this for many years... but beware:

- caveat 1: this is (or was) a more or less undocumented function and a few years ago it just disappeared only to come back in a later release.

- caveat 2: even though you can convert local files it does require internet access as any references to icons, style sheets, fonts and tracker pixels cause Firefox to attempt to retrieve them without any (sensible) timeout. So, running this on a server without internet access will make the process hang forever.
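
One way to guard against the hang in caveat 2 is to run the command under a hard timeout. This is a sketch in Python, assuming a firefox binary on PATH; firefox_screenshot is a made-up helper, the 30-second limit is arbitrary, and the run parameter exists only so the function can be exercised without Firefox.

```python
import subprocess


def firefox_screenshot(url, png_path, timeout_s=30.0, firefox="firefox", run=subprocess.run):
    """Take a headless screenshot; return False if Firefox exceeded the timeout."""
    cmd = [firefox, "--headless", "--window-size", "1700", "-screenshot", png_path, url]
    try:
        # subprocess.run kills the child process when the timeout expires.
        run(cmd, check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return True
```

A False result only tells you the process was killed. Whether a partial PNG was left behind is something to check separately.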

nine_k 44 minutes ago

Why, Firefox has a headless mode. It just can't print a document via a simple CLI command; you have to go through Selenium (or maybe Playwright, though I have not tried it in that capacity). Foxdriver would work, but its development has ceased.

efnx 3 minutes ago

pandoc

lizimo an hour ago

If generating PDF dynamically is what you really care about, consider Typst. https://typst.app/ We use it in production to generate reports, and it is amazing.

leephillips an hour ago

See https://lwn.net/Articles/1037577/ for a recent summary of what you can do with Typst.

cjm42 25 minutes ago

I've had decent results with html-pdf-chrome[0], which automates printing to PDF from Chromium or Chrome.

[0] https://github.com/westy92/html-pdf-chrome/

RiverCrochet 3 hours ago

If you don't really need the PDF but just want to archive pages, SingleFile is better. It'll capture the entire page to a single HTML file and I find this is better than the PDF if I don't want to print it. It's a browser extension, but there's also a command line version (https://github.com/gildas-lormeau/single-file-cli) that uses Chrome or Chromium's headless mode.

juice_bus 2 hours ago

I have Chromium shoved into an AWS Lambda Layer; when we need HTML-to-PDF conversion, we hand it off to that. It loads the HTML into Chromium, then "prints" it to PDF.

Glyptodon an hour ago

The last time I had to do this, I scripted a back end that scaled up headless Chrome browsers to render web pages to PDF. I think it was using Puppeteer, but that was a few years ago. (FWIW, I think the decision was mostly driven by the environment; there are other options.)

thangalin 3 hours ago

Is this an XY problem? If you have the original document (in Markdown), one possibility would be to use my software, KeenWrite[1], to convert Markdown to XHTML then typeset XHTML to PDF via ConTeXt. See the user manual[2] for an example of a Markdown document typeset in this fashion, along with usage instructions.

If you only have HTML to work with, you can also use Flying Saucer[3], which is what KeenWrite uses to preview Markdown documents when rendered as HTML. Flying Saucer uses an open-source version of iText[4] to produce PDF documents (from HTML source docs).

Another possibility is to use pandoc and LaTeX.

[1]: https://keenwrite.com/

[2]: https://keenwrite.com/docs/user-manual.pdf

[3]: https://github.com/flyingsaucerproject/flyingsaucer

[4]: https://itextpdf.com/

handzhiev 35 minutes ago

I'm surprised no one mentioned mPDF. Maybe PHP isn't very popular here :)

syngrog66 6 minutes ago

pandoc

haft 3 hours ago

The reverse of this question: what is the best way to convert PDF to HTML? We are required by accessibility law to make our PDFs WCAG compliant; however, it would be easier to convert these to HTML.

bencornia 2 hours ago

I have been using pdf2htmlEX with some success. https://github.com/pdf2htmlEX/pdf2htmlEX

freedomben 2 hours ago

I'd love to go the other way: convert a PDF into a self-contained HTML page that renders properly in a browser. It's been way harder than I thought it would. Any advice?

mr_mitm an hour ago

You could embed it as a base64 blob, embed PDF.js (which is included by browsers anyway, I think) and use that to render it in the HTML. But I realize you probably meant a static HTML without JavaScript.
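
A rough sketch of the base64 idea in Python. Note that this leans on the browser's built-in PDF viewer through an embed element rather than bundling PDF.js, which is a simpler swap for what is described above. Whether a data: URI PDF renders this way depends on the browser and the file size, so treat it as an experiment. pdf_to_html_page is a made-up name.

```python
import base64
import html


def pdf_to_html_page(pdf_bytes: bytes, title: str = "Embedded PDF") -> str:
    """Return a single HTML page with the PDF inlined as a base64 data URI."""
    payload = base64.b64encode(pdf_bytes).decode("ascii")
    return (
        "<!doctype html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        '<body style="margin:0">\n'
        f'<embed type="application/pdf" src="data:application/pdf;base64,{payload}" '
        'style="width:100vw;height:100vh">\n'
        "</body></html>\n"
    )
```

The output is roughly a third larger than the PDF because of the base64 encoding, and the content is no more reflowable or accessible than the original file.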

drabbiticus an hour ago

> renders properly

Depending on your requirements on both PDF input and HTML output, there is often no way to do this that is both easy and general. At their core, PDFs are not designed to be universally reflowable.

bob1029 2 hours ago

If your HTML is simply an intermediary to get you to a PDF, you could consider just skipping straight to building the PDF directly:

https://pdfbox.apache.org

This would be far more efficient than spinning up an entire browser and printing PDFs to disk.

deaddodo an hour ago

Building PDF directly (unless you're creating documents, especially fillables) is non-intuitive. Most PDFs are people trying to capture live data in a cached manner. If not, using a preliminary format like Markdown/HTML/LaTeX/DocX/etc to generate your PDF is almost always more intuitive.

delduca an hour ago

https://gotenberg.dev

ratStallion 2 hours ago

My website's content is xml, and I use Apache Fop to turn it into a PDF with page numbers and other nice things. It works nicely, but takes some setup.

mightjustwork 3 days ago

https://gotenberg.dev/ ...has been working well for me for the last few years. It's a headless instance of Google Chrome with a golang wrapper. Runs well in Docker or a cloud instance.

hansonkd 4 hours ago

gotenberg is really rock solid for us. Easy to deploy as a docker container to any infrastructure.
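
For anyone who wants to try it, a minimal client sketch in Python using only the standard library. It assumes a Gotenberg container listening on http://localhost:3000 and its Chromium HTML route at /forms/chromium/convert/html, which expects the uploaded file to be named index.html. Check both against the docs for the version you run. multipart_body and gotenberg_html_to_pdf are hypothetical helper names.

```python
import urllib.request
import uuid
from typing import Tuple


def multipart_body(filename: str, content: bytes, field: str = "files") -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, Content-Type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: text/html\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def gotenberg_html_to_pdf(html: str, base_url: str = "http://localhost:3000") -> bytes:
    """POST an HTML string to Gotenberg and return the PDF bytes."""
    body, content_type = multipart_body("index.html", html.encode("utf-8"))
    request = urllib.request.Request(
        f"{base_url}/forms/chromium/convert/html",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()
```

Writing the returned bytes to disk gives you the PDF. Stylesheets or images referenced by the page would need to be uploaded as additional form files, which this sketch does not handle.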

nicoburns 3 hours ago

https://github.com/plutoprint/plutobook was a recent Show HN and looks excellent

gigatexal 39 minutes ago

pandoc is your friend.

lucis 37 minutes ago

jsPDF is a work of art https://parall.ax/products/jspdf

throw03172019 3 days ago

I run chromium on my server and render the PDF from there using puppeteer.

hhthrowaway1230 2 hours ago

5k pdfs a month for archival purposes, must be pdf, customers demand this

zja 3 days ago

pandoc

w10-1 3 hours ago

To reinforce this: pandoc has been the go-to for a long, long time, and its maintainers have encountered and addressed tons of issues, which is especially important for two underspecified and over-provisioned formats like HTML and PDF.

Go through the revision and bug history to see a sample of issues you're avoiding by using a highly-trafficked, well-supported solution.

The only reason not to use it is when they say they don't support a given feature that you need; and the nice thing there is that they'll usually say it, and have a good reason why.

The other reason to use pandoc is that while you might currently want PDF as your outbound format, you might end up preferring some other format (structured logically instead of by layout); with pandoc that change would be easy.

Finally, pandoc is extensible. If you do find that you want different output in some respect, you can easily write a plugin (in Python or Haskell or ...) to make exactly the tweak you need.

beeforpork 3 hours ago

Does pandoc do JavaScript? For stuff that is rendered (I don't want animated, interactive PDFs...).

hhthrowaway1230 3 days ago

doesn't pandoc rely on some engine itself?

cpach a day ago

Yep, you need something like XeTeX in order to render the PDF.
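
To make the dependency concrete, here is a small Python sketch that looks for an installed engine before calling pandoc. The candidate list and its order are an arbitrary assumption, and the function names are made up. As far as I know, pandoc goes through HTML rather than LaTeX when the engine is weasyprint or wkhtmltopdf.

```python
import shutil
import subprocess

# Engine names pandoc accepts for --pdf-engine; the order is an arbitrary preference.
CANDIDATE_ENGINES = ["weasyprint", "wkhtmltopdf", "xelatex", "pdflatex"]


def pick_engine(candidates=CANDIDATE_ENGINES, which=shutil.which):
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if which(name):
            return name
    return None


def pandoc_command(src, dst, engine):
    """Build the pandoc invocation for an explicit PDF engine."""
    return ["pandoc", src, "-o", dst, f"--pdf-engine={engine}"]


def convert(src, dst):
    """Convert src to PDF with whichever engine is available."""
    engine = pick_engine()
    if engine is None:
        raise RuntimeError("no PDF engine found on PATH")
    subprocess.run(pandoc_command(src, dst, engine), check=True)
```

Pinning one engine in production is probably wiser than auto-detecting, since the output differs noticeably between the HTML-based and LaTeX-based engines.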

brudgers a day ago

Curious why that matters to you?

I mean everything has dependencies (some of the solutions elsewhere require Chrome and other common solutions require the JVM). At least Pandoc is GPL.

kakokiyrvoooo 13 hours ago

It matters because pandoc is not rendering the website to PDF; it converts the HTML to LaTeX and then uses a LaTeX engine to render the PDF.

brudgers an hour ago

Forgive me, but I don’t understand why that matters to you, and I am trying to understand what the issue with LaTeX is.

Because lots of things work this way. For example, compilers built on LLVM use an intermediate representation and Python uses bytecode.

I suspect some HTML-to-PDF tools go through PostScript.

kreetx 3 hours ago

There are multiple ways to "depend", so if pandoc hands all of the work to some external tool, you might as well use that tool directly. You will get more control over how the conversion happens, know what to search for when in trouble, etc.

brudgers an hour ago

My understanding and experience is that LaTeX has a significant learning curve and Pandoc provides a more gentle front end.

Of course LaTeX gives you fine control to hand tune the engine…but that doesn’t seem like what the OP is looking for.

exabrial 4 hours ago

openhtmltopdf is what we're using. Some outdated versions.

supersaw an hour ago

Been using this as well. It's worth noting that while the original project appears to have been abandoned, it has since been forked and is currently maintained here: https://github.com/openhtmltopdf/openhtmltopdf

ftchd 2 hours ago

the only thing I found to work reliably well is simply Chromium's print feature

fogzen 16 hours ago

Don’t. Show a web page and open the print dialog, and tell people to save as PDF. All major browsers support this, and the browser HTML to PDF code is the most robust and accurate.

crazygringo 3 hours ago

There's nothing in OP's question that suggests this is a one-off operation in response to a user action.

It's very likely to be a massive batch operation of a ton of HTML files that might not even be their own site.

hhthrowaway1230 2 hours ago

this is the case indeed

chibbell 3 hours ago

That does make sense where possible. I do feel like OP's question is super relevant if you are doing anything where the PDF has to be rendered server side, say as part of a larger data process that produces an exportable report in PDF format.

journal 16 hours ago

if you are doing html to pdf, you might also need the ability to merge. a few more features and you're better off with a commercial solution.

crazygringo 3 hours ago

Merge what?

pentium166 3 hours ago

I assume combining 2+ documents. For example, attaching a cover page with document owner/version control/lifecycle information to an existing PDF.

crazygringo 2 hours ago

That's the easiest thing in the world with free software.

One way is to install poppler-utils and use pdfunite. There are many other open-source packages you can use as well.
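
As a sketch of the pdfunite route from Python, assuming poppler-utils is installed: pdfunite takes the input files in order followed by the output file. pdfunite_command and merge_pdfs are made-up helper names.

```python
import subprocess
from typing import List


def pdfunite_command(inputs: List[str], output: str) -> List[str]:
    """Build the pdfunite call: inputs in merge order, output file last."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return ["pdfunite", *inputs, output]


def merge_pdfs(inputs: List[str], output: str) -> None:
    """Concatenate the given PDFs into output, raising if pdfunite fails."""
    subprocess.run(pdfunite_command(inputs, output), check=True)
```

For the cover-page case above, merge_pdfs(["cover.pdf", "report.pdf"], "final.pdf") puts the cover first.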