Just print to PDF in a browser, or automate that using a browser automation tool. For a non-browser-based open source solution, WeasyPrint.
https://weasyprint.org/
For a proprietary solution, try Prince XML:
https://www.princexml.com/
+1 to WeasyPrint. I have used it with a Django production system for a few years now, and it works well enough that I never have to think about it. I'm not doing anything fancy, though.
WeasyPrint works really well for me. It can support all of the languages and fonts I need. I run it on AWS Lambda and in Docker as a web service.
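For anyone who hasn't tried it, a minimal sketch of the WeasyPrint route could look like the following. It assumes the weasyprint package and its system libraries are installed; the function name and the sample markup are made up for illustration, not taken from anyone's production setup.

```python
from typing import Optional


def html_to_pdf_bytes(html: str, base_url: Optional[str] = None) -> bytes:
    """Render an HTML string to PDF bytes with WeasyPrint."""
    # Imported lazily so this module still loads where WeasyPrint is missing.
    from weasyprint import HTML

    # base_url is used to resolve relative links to images and stylesheets.
    # Calling write_pdf() without a target returns the PDF as bytes.
    return HTML(string=html, base_url=base_url).write_pdf()


# Example use (hypothetical file name):
#   with open("out.pdf", "wb") as fh:
#       fh.write(html_to_pdf_bytes("<h1>Report</h1><p>Hello.</p>"))
```

Returning bytes rather than writing a file keeps the function easy to call from a web handler or a Lambda-style function, which is how the comment above describes running it.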
I previously used wkhtmltopdf, but it hasn't been supported for years and doesn't support the latest CSS, etc. It does support JS, but if you need that I'd probably look at headless Chromium or another solution.
Edit: Previous post with some good discussion: https://news.ycombinator.com/item?id=26578826
Most websites do not have print CSS, so they don’t print that nicely to PDF.
But I upvote WeasyPrint for that instead.
Prince XML looks nice but what about creating a PDF directly from a website? This often adds some problems, for example links still pointing to other pages on the web. But in my experience printing to PDF is often not good enough.
Yes, I did that for a recent small program. The @media print media query is powerful enough for most of the stuff I wanted to format nicely. Even page breaks are possible.
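To make the print-CSS point concrete, here is a small sketch that wraps HTML fragments in a page whose stylesheet only applies when printing. The helper and the class names are invented for the example; the CSS features themselves (@page, @media print, break-after, break-inside) are standard.

```python
from html import escape
from typing import Sequence

PRINT_CSS = """
@page { size: A4; margin: 20mm; }
@media print {
  nav, .no-print { display: none; }          /* hide screen-only elements */
  section.sheet { break-after: page; }       /* next section starts a new page */
  section.sheet:last-child { break-after: auto; }
  table, figure { break-inside: avoid; }     /* keep these in one piece */
}
"""


def printable_document(title: str, sections: Sequence[str]) -> str:
    """Return a full HTML page; each item in sections is trusted HTML."""
    body = "\n".join(f'<section class="sheet">{s}</section>' for s in sections)
    return (
        "<!doctype html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title>\n'
        f"<style>{PRINT_CSS}</style></head>\n"
        f"<body>{body}</body></html>"
    )
```

The result looks normal on screen and only paginates when it goes through a print path, whether that is the browser's print dialog or a headless renderer.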
I’ve had excellent experience with Prince XML and poor experience with everything else I’ve tried. Prince is fast, robust and reliable.
Yes it costs money. So does developer time.
Agreed. Prince also has a lot of good features for headers, footers, page numbering, etc, that make it very powerful.
Puppeteer and Playwright are the main open-source options nowadays, both solid for HTML → PDF once your print CSS is sorted. Don’t forget proper page breaks (break-before, break-after, break-inside). For example, break-after: page works in Chromium, while break-after: always doesn’t. For trickier pagination you can look at Paged.js, and I’d test layouts in Chrome/Edge before automating.
Shameless plug: I run yakpdf.com, a hosted Puppeteer-based service if you want to avoid self-hosting. https://rapidapi.com/yakpdf-yakpdf/api/yakpdf
Seconded. I went with C# + Playwright. I tried iTextSharp, iText, PDFSharp, and wkhtmltopdf, but they all had limitations. I had good results with Playwright in minutes, outside of tweaking the CSS like you mention.
I documented the process here[0] if anyone needs examples of the CSS and loading web fonts. Apologies for the article being long-winded – it was the first one I published.
[0] https://johnh.co/blog/creating-pdfs-from-html-using-csharp
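The linked article uses C#. As a rough equivalent, here is the same idea sketched with Playwright's Python bindings. It assumes `pip install playwright` and `playwright install chromium` have been run; the option values and function names are illustrative choices, not the article's.

```python
def pdf_options(path: str) -> dict:
    """Keyword arguments for page.pdf(); the values are illustrative defaults."""
    return {
        "path": path,
        "format": "A4",
        "print_background": True,       # keep background colours and images
        "prefer_css_page_size": True,   # let @page rules in the CSS take priority
        "margin": {"top": "20mm", "bottom": "20mm", "left": "15mm", "right": "15mm"},
    }


def render_html_to_pdf(html: str, path: str) -> None:
    """Load an HTML string in headless Chromium and print it to a PDF file."""
    # Lazy import so the module loads even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            # Wait for the network to go quiet so web fonts get a chance to load.
            page.set_content(html, wait_until="networkidle")
            page.pdf(**pdf_options(path))
        finally:
            browser.close()
```

Keeping the options in one function makes the CSS-versus-renderer tweaking mentioned above a one-place change.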
Thirded, you can build this straight into your backend or into a microservice very easily.
You can also easily generate screenshots if that's more suitable than PDFs.
You can also easily use this to do stuff like jam a set of images into an HTML table and then PDF or screenshot them in that format.
Please don't turn nice formats into a format that's similar to screenshots of text. Pandoc has an option to pack all images and styles needed to render the page into one html file:
Or, please do?
I use PDFs so I can send them to my iPad to read offline, highlight them, annotate them, and then send them back to my filesystem with highlights and annotations intact.
I sure can't do that with any "nice formats" like HTML or TXT or EPUB or MOBI.
PDF is literally digital paper. HTML has logical structure, it can adapt to different displays, etc.
Sometimes you want one, sometimes, the other.
You could, though. What you are describing are features of an editor, not a file format. I can imagine a browser addon performing the same tasks.
PDF annotations sit within the file.
I know, though that depends on the editor. Okular, for example, places them in an extra file, last I checked. That's not unique to PDFs. HTML files are modifiable. There is nothing preventing an editor from putting annotations in them as well.
PDF is designed for annotations in the file format. You annotate in one editor, you can change the annotations in another. You can always distinguish between original content and annotations. I see no indication that Okular stores highlights or annotations in a separate file; that would be bizarre.
There is no mechanism for annotations in HTML or the other formats I listed. An editor would just be editing the original content in its own non-standardized, non-portable way, which is not desirable for a number of reasons.
So when you say:
> What you are describing are features of an editor, not a file format.
That is incorrect. It is an intentionally designed and standardized feature of the file format.
HTML+CSS+media files isn’t a nice format, and it is much less portable through time and space than PDF.
Please don't police what other people do.
Pandoc would be my preferred tool. It is excellent at converting between other formats as well.
Being (not so easily) edited is often a feature, not a bug.
If that is your goal, you should be cryptographically signing your documents with your PGP key. That way you actually have assurance the document has not been modified rather than just hoping someone hasn't modified the document. Additionally, PGP can sign anything so you are open to use whatever format you want.
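As a sketch of what that could look like in a pipeline, the helpers below build gpg command lines for a detached signature and for verifying it. They assume gpg is installed and a default signing key is configured; the file naming (document plus ".asc") is just a convention picked for the example.

```python
import subprocess
from typing import List


def detach_sign_command(document: str) -> List[str]:
    """argv that writes an ASCII-armored detached signature to <document>.asc."""
    return ["gpg", "--armor", "--output", f"{document}.asc", "--detach-sign", document]


def verify_command(document: str) -> List[str]:
    """argv that checks <document> against its detached signature."""
    return ["gpg", "--verify", f"{document}.asc", document]


def sign(document: str) -> None:
    """Sign a file; raises CalledProcessError if gpg exits non-zero."""
    subprocess.run(detach_sign_command(document), check=True)
```

A detached signature leaves the document itself untouched, so it works the same for a PDF, an HTML file, or anything else.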
Is this really that much of a motivation in 2025? Maybe in 2000 you could publish a PDF with the assurance that only the people who paid for Acrobat would be able to edit it, but today there are a lot of accessible ways to edit PDFs. I don't think I'd choose PDF if I, for whatever reason, wanted to limit others from editing.
I was thinking this too: PDFs exist so people don't mess with the document. Still, it's a clever feature, and pandoc can convert HTML into a PDF as well, given a conversion engine. I suspect it'll fail on anything sufficiently complex, though.
pandoc input.html -o output.pdf --pdf-engine=<your engine>
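If you want to drive that command from a script, a minimal sketch might look like this. It assumes pandoc and the chosen engine are both on PATH. The default engine below is only an example; as far as I know, pandoc's documentation lists weasyprint, wkhtmltopdf, prince and the LaTeX engines such as xelatex among the accepted values.

```python
import subprocess
from typing import List


def pandoc_pdf_command(src: str, dest: str, engine: str) -> List[str]:
    """Build the argv for converting an HTML file to PDF with pandoc."""
    return ["pandoc", src, "-o", dest, f"--pdf-engine={engine}"]


def convert(src: str, dest: str, engine: str = "weasyprint") -> None:
    """Run pandoc; raises CalledProcessError if pandoc exits non-zero."""
    subprocess.run(pandoc_pdf_command(src, dest, engine), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.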
chrome --headless --disable-gpu --print-to-pdf https://example.com
same: google-chrome --headless --disable-gpu --no-pdf-header-footer --hide-scrollbars --print-to-pdf-margins="0,0,0,0" --print-to-pdf --window-size=1280,720 https://example.com
ended up using headless chrome specifically to make sure javascript things rendered properly
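For batch use, the same command can be wrapped in a few lines of Python. This is only a sketch: the flags are the ones from the commands above (with the output path given as --print-to-pdf=PATH), while the binary name, the default timeout and the function names are assumptions for illustration.

```python
import subprocess
from typing import List


def chrome_pdf_command(url: str, out_pdf: str, binary: str = "google-chrome") -> List[str]:
    """Build the argv for a one-shot headless print-to-PDF run."""
    return [
        binary,
        "--headless",
        "--disable-gpu",
        "--no-pdf-header-footer",      # drop the default date/URL header and footer
        f"--print-to-pdf={out_pdf}",   # write the PDF to this path
        url,
    ]


def print_to_pdf(url: str, out_pdf: str, timeout_s: float = 60.0) -> None:
    """Run Chrome; raises on a non-zero exit status or if the timeout expires."""
    subprocess.run(chrome_pdf_command(url, out_pdf), check=True, timeout=timeout_s)
```

On some systems the binary is called chromium or chrome instead, which is why it is a parameter.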
Can Chromium do this?
Edit: it appears so: https://news.ycombinator.com/item?id=15131840
Yes, routinely works for me.
Can Firefox do this?
with an elaborate script that relies on xdotool
Yes, kind of...
/path/to/firefox --window-size 1700 --headless -screenshot myfile.png file://myfile.html
Easy, right?
Used this for many years... but beware:
- caveat 1: this is (or was) a more or less undocumented function and a few years ago it just disappeared only to come back in a later release.
- caveat 2: even though you can convert local files it does require internet access as any references to icons, style sheets, fonts and tracker pixels cause Firefox to attempt to retrieve them without any (sensible) timeout. So, running this on a server without internet access will make the process hang forever.
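One way to guard against the hang described in caveat 2 is to put a hard deadline on the browser process. A minimal sketch, using only the standard library (the helper name and the 30-second figure are arbitrary choices for the example):

```python
import subprocess
from typing import Sequence


def run_with_deadline(argv: Sequence[str], timeout_s: float) -> bool:
    """Run a command; return True on success, False on failure or timeout."""
    try:
        # If the timeout expires, subprocess.run kills the child and re-raises.
        result = subprocess.run(
            argv,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Example use with the Firefox command from above:
#   ok = run_with_deadline(
#       ["/path/to/firefox", "--window-size", "1700", "--headless",
#        "-screenshot", "myfile.png", "file://myfile.html"],
#       timeout_s=30,
#   )
```

This only kills the process that was started directly. A browser may spawn helper processes of its own, so on a server it can be worth running the whole thing in a container or process group that gets cleaned up as a unit.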
Why, Firefox has a headless mode. It can't just print a document via a simple CLI command, though; you have to go for Selenium (or maybe Playwright, I did not try it in that capacity). Foxdriver would work, but its development has ceased.
pandoc
If generating PDF dynamically is what you really care about, consider Typst. https://typst.app/ We use it in production to generate reports, and it is amazing.
See https://lwn.net/Articles/1037577/ for a recent summary of what you can do with Typst.
I've had decent results with html-pdf-chrome[0], which automates printing to PDF from Chromium or Chrome.
[0] https://github.com/westy92/html-pdf-chrome/
If you don't really need the PDF but just want to archive pages, SingleFile is better. It'll capture the entire page to a single HTML file and I find this is better than the PDF if I don't want to print it. It's a browser extension, but there's also a command line version (https://github.com/gildas-lormeau/single-file-cli) that uses Chrome or Chromium's headless mode.
I have Chromium shoved into an AWS Lambda Layer; when we need HTML to PDF conversion we shove it off onto that. It loads the HTML into Chromium then "prints" it to PDF.
The last time I had to do this I scripted a back-end that scaled up headless Chrome browsers to render web pages to PDF. I think it was using Puppeteer, but it was a few years ago. (FWIW the decision I think was mostly driven by the environment, I think there are other options.)
Is this an xy problem? If you have the original document (in Markdown), one possibility would be to use my software, KeenWrite[1], to convert Markdown to XHTML then typeset XHTML to PDF via ConTeXt. See the user manual[2] for an example of a Markdown document typeset in this fashion, along with usage instructions.
If you only have HTML to work with, you can also use Flying Saucer[3], which is what KeenWrite uses to preview Markdown documents when rendered as HTML. Flying Saucer uses an open-source version of iText[4] to produce PDF documents (from HTML source docs).
Another possibility is to use pandoc and LaTeX.
[1]: https://keenwrite.com/
[2]: https://keenwrite.com/docs/user-manual.pdf
[3]: https://github.com/flyingsaucerproject/flyingsaucer
[4]: https://itextpdf.com/
I'm surprised no one mentioned mPDF. Maybe PHP isn't very popular here :)
pandoc
A reverse of this question; what is the best way to convert pdf to html? We are required by accessibility law to make our PDFs WCAG compliant however it would be easier to convert these to HTML.
I have been using pdf2htmlEX with some success. https://github.com/pdf2htmlEX/pdf2htmlEX
I'd love to go the other way: convert a PDF into a self contained HTML page that renders properly in a browser. It's been way harder than I thought it would. Any advice?
You could embed it as a base64 blob, embed PDF.js (which is included by browsers anyway, I think) and use that to render it in the HTML. But I realize you probably meant a static HTML without JavaScript.
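A rough sketch of the base64 part of that idea is below. Note the swap: instead of bundling PDF.js, it hands the data URI to an embed element and relies on the browser's built-in PDF viewer, which keeps the example short but means the result depends on the browser. Some browsers restrict or size-limit data URIs, so treat this as an experiment rather than a general answer.

```python
import base64
from html import escape


def pdf_as_data_uri(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as a data: URI."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return f"data:application/pdf;base64,{encoded}"


def self_contained_page(pdf_bytes: bytes, title: str = "Document") -> str:
    """Return a single HTML file that carries the PDF inside itself."""
    return (
        "<!doctype html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        '<body style="margin:0">\n'
        f'<embed type="application/pdf" src="{pdf_as_data_uri(pdf_bytes)}" '
        'style="width:100vw;height:100vh">\n'
        "</body></html>"
    )
```

The output is one file with no external references, but it is still a PDF being displayed, not reflowable HTML, so it does nothing for the accessibility case raised above.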
> renders properly
Depending on your requirements on both PDF input and HTML output, there is often no way to do this that is both easy and general. At their core, PDFs are not designed to be universally reflowable.
If your HTML is simply an intermediary to get you to a PDF, you could consider just skipping straight to building the PDF directly:
https://pdfbox.apache.org
This would be far more efficient than spinning up an entire browser and printing PDFs to disk.
Building PDF directly (unless you're creating documents, especially fillables) is non-intuitive. Most PDFs are people trying to capture live data in a cached manner. If not, using a preliminary format like Markdown/HTML/LaTeX/DocX/etc to generate your PDF is almost always more intuitive.
https://gotenberg.dev
My website's content is XML, and I use Apache FOP to turn it into a PDF with page numbers and other nice things. It works nicely, but takes some setup.
https://gotenberg.dev/ ...has been working well for me for the last few years. It's a headless instance of Google Chrome with a golang wrapper. Runs well in Docker or a cloud instance.
gotenberg is really rock solid for us. Easy to deploy as a docker container to any infrastructure.
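For a sense of what calling it looks like, here is a standard-library-only sketch. It assumes a Gotenberg container listening on localhost:3000 and the Chromium HTML route /forms/chromium/convert/html with the file uploaded as index.html, which is how I understand the Gotenberg 7/8 documentation; check the docs for the version you deploy. The function names are made up.

```python
import urllib.request
import uuid
from typing import Tuple


def multipart_html(html: str) -> Tuple[bytes, str]:
    """Encode html as a multipart/form-data body holding one file, index.html."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="files"; filename="index.html"\r\n'
        "Content-Type: text/html\r\n"
        "\r\n"
        f"{html}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def gotenberg_html_to_pdf(html: str, base_url: str = "http://localhost:3000") -> bytes:
    """POST the HTML to a running Gotenberg instance and return the PDF bytes."""
    body, content_type = multipart_html(html)
    request = urllib.request.Request(
        f"{base_url}/forms/chromium/convert/html",  # assumed route, see note above
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()
```

Because the renderer sits behind HTTP, the calling service needs no browser or PDF library of its own.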
https://github.com/plutoprint/plutobook was a recent Show HN and looks excellent
pandoc is your friend.
A reverse of this question; what is the best way to convert pdf to html? We are required by accessibility law to make our PDFs WCAG compliant however it would be easier to convert these to HTML.
jsPDF is a work of art https://parall.ax/products/jspdf
I run chromium on my server and render the PDF from there using puppeteer.
5k pdfs a month for archival purposes, must be pdf, customers demand this
pandoc
To reinforce this: pandoc has been the go-to for a long, long time and they have encountered and addressed tons of issues, which is especially important for two underspecified and over-provisioned formats like HTML and pdf.
Go through the revision and bug history to see a sample of issues you're avoiding by using a highly-trafficked, well-supported solution.
The only reason not to use it is when they say they don't support a given feature that you need; and the nice thing there is that they'll usually say it, and have a good reason why.
The other reason to use pandoc is that while you might currently want PDF as your outbound format, you might end up preferring some other format (structured logically instead of by layout); with pandoc that change would be easy.
Finally, pandoc is extensible. If you do find that you want different output in some respect, you can easily write a plugin (in Python or Haskell or ...) to make exactly the tweak you need.
Does pandoc do JavaScript? For stuff that is rendered (I don't want animated, interactive PDFs...).
doesn't pandoc rely on some engine itself?
Yep, you need something like XeTeX in order to render the PDF.
Curious why that matters to you?
I mean everything has dependencies (some of the solutions elsewhere require Chrome and other common solutions require the JVM). At least Pandoc is GPL.
It matters because pandoc is not rendering the website to PDF; it converts the HTML to LaTeX and then uses a LaTeX engine to render the PDF.
Forgive me but I don’t understand why that matters to you and am trying to understand what the issue with LaTeX is.
Because lots of things work this way. For example, compilers built on LLVM use an intermediate language and Python uses byte code.
I suspect some HTML to PDF tools go through PostScript.
There are multiple ways to "depend", so if pandoc hands all of the work to some external tool, then you might as well use that external tool directly. You will get more control over how the conversion happens, know what to search for when in trouble, etc.
My understanding and experience is that LaTeX has a significant learning curve and Pandoc provides a more gentle front end.
Of course LaTeX gives you fine control to hand tune the engine… but that doesn’t seem like what the OP is looking for.
openhtmltopdf is what we're using. Some outdated versions.
Been using this as well. It's worth noting that while the original project appears to have been abandoned, it has since been forked and is currently maintained here: https://github.com/openhtmltopdf/openhtmltopdf
the only thing I found to work reliably well is simply Chromium's print feature
Don’t. Show a web page and open the print dialog, and tell people to save as PDF. All major browsers support this, and the browser HTML to PDF code is the most robust and accurate.
There's nothing in OP's question that suggests this is a one-off operation in response to a user action.
It's very likely to be a massive batch operation of a ton of HTML files that might not even be their own site.
this is the case indeed
That does make sense where possible. I do feel like OP's question is super relevant if you are doing anything where the PDF has to be rendered server side, like say as part of a larger data process when producing an exportable report in PDF format.
If you are doing HTML to PDF, you might also need the ability to merge. A few more features and you're better off with a commercial solution.
Merge what?
I assume combining 2+ documents. For example, attaching a cover page with document owner/version control/lifecycle information to an existing PDF.
That's the easiest thing in the world with free software.
One way is to install poppler-utils and use pdfunite. There are many other open-source packages you can use as well.
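As a sketch of the cover-page case, the helper below builds a pdfunite command line: input files in order, output file last. It assumes poppler-utils is installed; the file names are placeholders, and the two-input minimum is my own guard rather than a pdfunite rule.

```python
import subprocess
from typing import List, Sequence


def pdfunite_command(inputs: Sequence[str], output: str) -> List[str]:
    """argv for poppler's pdfunite: inputs in order, output path last."""
    if len(inputs) < 2:
        # Own sanity check: merging fewer than two files is probably a mistake.
        raise ValueError("need at least two PDFs to merge")
    return ["pdfunite", *inputs, output]


def merge_pdfs(inputs: Sequence[str], output: str) -> None:
    """Run pdfunite; raises CalledProcessError if it exits non-zero."""
    subprocess.run(pdfunite_command(inputs, output), check=True)


# Example use (placeholder file names):
#   merge_pdfs(["cover.pdf", "report.pdf"], "final.pdf")
```

Since the output path is simply the last argument, getting the order wrong would overwrite an input, which is a good reason to keep it behind a helper.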