I built a Git-tracked book production pipeline

(djspeckhals.com)

261 points | by dustin1114 5 days ago

66 comments

As someone who worked for years in commercial print, before most manufacturing moved overseas, I recall the workflows the article discusses as being more automatable than the author seems to understand. For example, "Making the slightest change became a chore. [1.] Update the 'master' DOCX. [2.] Update the InDesign file ..." The appropriate way to use an external document as master in InDesign is to use the Place command, which autoupdates text changes as they are made in Word. As another example, InDesign supports multiple formats of EPUB by direct export. I also question the author's familiarity with common LaTeX workflows. "'Why didn’t you just author it in LaTeX? ...' you might ask. [B]ut I prefer writing novels in a word processor, not a text editor." And, "How do I convert an ODT file to TeX?" Word processors offer exports of all kinds, including to plain text, and the purpose of a TeX editor is, like InDesign, to typeset text that is often written elsewhere. Capturing the styling from the word processor seems antithetical to the desire for an advanced typesetting tool.

Overall, as a technical writeup I enjoyed the article; however, I would caution that the author seems to approach publishing from an amateur perspective.

TheOtherHobbes 17 hours ago

The place command does not autoupdate. At least not in the most recent version.

Text is either embedded, in which case it's baked in, or linked, in which case you have to manually tell ID to update the link to reload the text.

But InDesign's EPUB output is horrifically terrible, especially if you're trying to use custom fonts/graphics for page headings. (Basically - no.)

And the CSS is... really not great.

The best fiction off-the-shelf option for EPUB gen is Vellum. It's a one-off payment of around $250 and you can get an EPUB-only version, or EPUB+PDF for print. It's not very customisable, but the presets - there aren't many - all look good.

For anything more sophisticated, options are limited. I spent far too long creating a non-fiction EPUB in ID a couple of years ago. I got there in the end but it was an extremely painful process and I ended up automating a lot of the workflow in JSX.

For fiction I created my own MD -> EPUB pipeline with a custom MD -> HTML parser for custom markup not handled by pure MD. Then a custom EPUB builder which does all the wrapping and general EPUB bureaucracy based on my own CSS.
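
The "EPUB bureaucracy" mentioned here is mostly packaging: an EPUB is a ZIP archive whose first entry is an uncompressed `mimetype` file, plus a `META-INF/container.xml` that points at the package document. A minimal sketch of that wrapping step follows. It is not the commenter's builder; the `OEBPS/` folder name, the `build_epub_skeleton` helper, and the caller-supplied `opf_xml` are assumptions for illustration, and a valid EPUB also needs a complete package document and a navigation file, which this sketch does not generate.

```python
import io
import zipfile

# Standard container file: tells the reading system where the package document lives.
# The OEBPS/content.opf path is a common convention, assumed here for illustration.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub_skeleton(chapters: dict[str, str], opf_xml: str) -> bytes:
    """Zip XHTML chapters and a package document into an EPUB-shaped archive.

    `chapters` maps file names to XHTML strings; `opf_xml` is a package
    document the caller has already written (hypothetical inputs).
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry must come first and must be stored uncompressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        archive.writestr("OEBPS/content.opf", opf_xml,
                         compress_type=zipfile.ZIP_DEFLATED)
        for filename, xhtml in chapters.items():
            archive.writestr(f"OEBPS/{filename}", xhtml,
                             compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()
```

Writing the result to a file with an `.epub` extension and running it through a validator such as EPUBCheck would show what the skeleton still lacks.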

Python has libraries for Pandoc, native DOCX, and MD (up to a point) so the basics were all there. The rest was glue.

It was a moderately-sized hobby project - would probably go much faster with AI now.

philistine 11 hours ago

Yeah, OP’s answer tells me why the big publishers’ EPUB books have always been subpar.

PxldLtd 17 hours ago

> The rest was glue

Oh how often I keep saying that these days... "All the parts are there! Why hasn't anyone piped this into that?"

Barbing 16 hours ago

Then I check 30 to 90 days later and sure enough, at least one person has done it.

raddan 18 hours ago

> Overall, as a technical writeup I enjoyed the article; however, I would caution that the author seems to approach publishing from an amateur perspective.

I also worked at a publishing company (for ~6 years) in the early 2000s. While you are right that the pros have some tricks to make the process easier, the fact remains that the process is not easy at all. Unlike in academic publishing, where nothing stands between the author and the reader, at a commercial publishing company (at least one of the majors), there are legions of people working behind the scenes. Editors communicate with authors; editorial assistants help the editors with fact-checking, drafts, basic organization and comprehensibility; copyeditors get all pedantic about formatting and word choice (sometimes resulting in arguments with authors that the editors need to smooth over); production departments that make the books look pretty, contain images whose copyrights are cleared and that can be legibly printed within a reasonable budget; graphic designers who develop house styles or even a custom style for a book and even original cover art; lawyers who negotiate copyrights for excerpts, images, and other ancillary materials; and on and on.

I know all this because I worked on a custom content management system for this company and in so doing I discovered that the process was incredibly complex. One of the major pet peeves of everybody involved was when an author thought they were doing anybody a favor by trying to format things in Microsoft Word. Most of that information was thrown away and the real layout was done by people who thought in terms of widows, orphans, kerning, and leading (and so on). Once you know what all the people in a top publishing company do, the difference between an amateur publication and a professional one becomes immediately apparent. So I don't fault the author for getting a bit technical. The SE approach sounds like an epic attempt to make a complicated subject at least somewhat approachable.

raincole 12 hours ago

I didn't know what widows and orphans were, so I looked it up.

> Widow (sometimes called orphan)

> Orphan (sometimes called widow)

> Runt (sometimes called widow or orphan)

Yeah I'm glad we programmers are not the only ones bad at naming things...

[0]: https://en.wikipedia.org/wiki/Widows_and_orphans

chrismorgan 11 hours ago

I am firmly convinced that the customary mapping of widow/orphan is back to front. You’re really trying to convince me that the one that has been cut off from its antecedents is the widow? It should obviously be the orphan.

So, no wonder people confuse them, because the popular mapping is wrong.

dustin1114 11 hours ago

Yes! All the typographical techniques and terminology are fascinating (and confusing at times). Widow and orphan control really fight against text justification. Finding the right balance is tricky, but LaTeX has all the little knobs to tweak and find what's right for your uses (fiction for me).

elevation 16 hours ago

> Once you know what all the people in a top publishing company do, the difference between an amateur publication and a professional one becomes immediately apparent.

Any advice for developing this sense?

I will never work in a top publishing company, but I have been able to approximate good design by first studying the fundamentals, then reproducing the layouts I see in popular media. I can make text into a beautiful book, and I see poor design choices in the corporate communications of billion-dollar companies.

But it feels like there’s a lot more I don’t know, and you never know what you don’t know, and it makes me wish I could absorb more from working under an expert.

Exoristos 16 hours ago

There's no substitute for apprenticeship (by whatever name). Unfortunately, skills of this kind may be close to extinction. For someone like you just interested in getting better at layout design, I'd recommend something like 'The Elements of Typographic Style', by Bringhurst; this concentrates mostly on books, but much applies to other layouts. Of more general interest -- i.e., beyond layout design -- might be 'An Encyclopedia of the Book', by Glaister. There's a wealth of valuable design and print resources from the '60s - '90s if you can find them -- some libraries still have high-quality examples, but most have replaced them with much less-valuable contemporary resources. Look for book and magazine sales by university departments, businesses, etc.

elevation 14 hours ago

Thank you! I have been absorbing Bringhurst methodically the past year.

I had not heard of Glaister, will be on the lookout.

Good point about library and corporate sales. My main supply of materials from the 60s has been from estate sales -- not for instructional materials, but for well composed period pieces. Older letterfaces and color palettes are so evocative; seeing the label of a 70 year old oil can with so much more personality than the products of today makes me want to bottle this style for my own future use. And it feels good to hold something back from the landfill.

Exoristos 14 hours ago

Have to say I'm finding the story of your efforts so far uplifting. Keep up the good work, and good luck.

bombcar 13 hours ago

“A Few Notes on Book Design” is also worth a read.

https://texdoc.org/serve/memdesign/0

If you have a decent TeX distribution installed you have a copy for long flights.

Cassell 16 hours ago

The trouble with our age is that, despite the abundance of intermediate-level information, expert teachers in specific, and shrinking, professions are as hard as ever to access, if not more so.

cryo32 8 hours ago

Would just like to add that academic publishers have to deal with a lot of rubbish too.

Try getting the psychology department to use anything other than O365. We have our own typesetting contractors who deal with the muck they produce.

Finnucane 18 hours ago

When we get a Word doc from an author it is sent to the typesetter for reformatting. A standard set of style codes is applied and other corrections made so it can be directly imported into the design template. This is the version the copyeditor works on. Also: once proofs are set this version is basically trash. In ye olde dayes, when this was all done on paper, the edited ms would eventually go back to the author, but sometimes they didn't want it. Now when the book is done the production manuscript files get deleted.

For ebook production, you could definitely do worse than follow Standard Ebooks' method. That will get you a decent standards-compliant file with basic accessibility features accounted for.

Exoristos 17 hours ago

Maybe we worked at the same firm. You never know.

munificent 16 hours ago

> For example, "Making the slightest change became a chore. [1.] Update the 'master' DOCX. [2.] Update the InDesign file ..." --the appropriate way to use an external document as master in InDesign is to use the Place command, which autoupdates text changes as they are made in Word.

It does not auto-update. Even if it did, you wouldn't necessarily want it to auto-update, because it's very hard to tell if changing one sentence in your manuscript has borked the layout of dozens of pages. Once you have rules set up around widow and orphan control, it's very easy for even tiny text changes to have large downstream layout effects.

Also, frankly, InDesign is kind of flaky and will sometimes change layout or make other visual changes in response to apparently nothing at all. I ran into a bug where it would just silently drop underlines on some elements and jiggling them a bit would bring them back.

For my two books, I ended up writing a script that would generate a visual diff of the entire book from the PDF export of the InDesign files so that I could tell for certain if InDesign had gotten itself confused. InDesign can produce beautiful output, but like a lot of Adobe software, it's temperamental and opaque.
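
As a much cruder stand-in for the visual diff described here (not the commenter's script), the following sketch only reports which pages changed between two exports. It assumes each PDF has already been rasterized to one image per page by an external tool, and that the page images are passed in as raw bytes; the `changed_pages` helper is hypothetical.

```python
from itertools import zip_longest


def changed_pages(old_pages: list[bytes], new_pages: list[bytes]) -> list[int]:
    """Return the 1-based numbers of pages whose rendered bytes differ.

    Pages present in only one export (because the page count changed)
    are reported as changed too.
    """
    changed = []
    pairs = zip_longest(old_pages, new_pages, fillvalue=None)
    for number, (old, new) in enumerate(pairs, start=1):
        if old != new:
            changed.append(number)
    return changed
```

Byte-for-byte comparison is only meaningful when the rasterizer is deterministic; a real visual diff would compare pixels and highlight the regions that moved.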

WillAdams 18 hours ago

For my part, my approach was to set up a Word .docx file with styles, which would import into Adobe InDesign, mapping style-to-style, and if need be, pre-process w/ one or more AppleScripts and page as normal, then when it was time to return the edited manuscript to the author(s), select all the text and remove over-rides and export the text as a .rtf from InDesign, open that in Microsoft Word and re-save as a .docx.

dustin1114 11 hours ago

Hi, OP here. I'm glad you enjoyed the writeup.

Amateur...you're probably right. It reminds me of my home improvement project I've been working on this evening: interior painting. My ceiling lines are probably perfect to houseguests (if they notice at all). But if a professional painter got up on a ladder and looked closely, he'd probably shake his head and chuckle.

As for InDesign and EPUB, I've found the auto-generated output not up to the standard I was after. Worse, I've seen output differ between InDesign versions, which scared me.

I have an acquaintance who works for a "Big 5" publisher, and he recounted their process to me once. In short, the indd file became the source of truth. They would generate an EPUB from it but then hand edit it for many hours to bring it up to their house style. If there was a text change (rare in fiction) they updated the indd and EPUB separately. Going back to the Word file basically never happened. If the author, copyeditor, or proofreader had more extensive changes (like a full revision), it was close to a brand new publication.

The visual styling from the word processor isn't interesting. It's the "tagging" that paragraph and character styles bring that's helpful. It's not dissimilar from an HTML class, which scripting can transform into truly semantic text. I hope that clarifies some points. BTW, it's pretty cool to hear from people in the real print industry. I'm always fascinated by their workflows.
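
The idea of treating styles as tags can be sketched as a lookup from style name to semantic markup. This is an illustration only, not the article's actual script: the style names, the `STYLE_TO_ELEMENT` table, and the `paragraph_to_xhtml` helper are all hypothetical, and extracting the (style, text) pairs from the ODT file is assumed to have happened already.

```python
from html import escape

# Hypothetical paragraph style names mapped to (element, CSS class).
# A real manuscript would define its own styles.
STYLE_TO_ELEMENT = {
    "Chapter Title": ("h2", None),
    "Body Text": ("p", None),
    "Letter": ("p", "letter"),
    "Scene Break": ("hr", None),
}


def paragraph_to_xhtml(style: str, text: str) -> str:
    """Turn one styled paragraph into an XHTML fragment.

    Unknown styles fall back to a plain paragraph, so visual-only
    formatting is dropped rather than carried over.
    """
    tag, css_class = STYLE_TO_ELEMENT.get(style, ("p", None))
    if tag == "hr":
        return "<hr/>"  # a scene break carries no text
    attr = f' class="{css_class}"' if css_class else ""
    return f"<{tag}{attr}>{escape(text)}</{tag}>"
```

The same table could drive a second emitter that writes TeX macros instead of XHTML, which is what makes the style name more useful than the visual formatting.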

diamondap 19 hours ago

Kudos to you for doing that.

I've been publishing print and ebooks since 2015, and I can attest to the fact that the Word to PDF/X-1a to epub/kindle pipeline is painful. Making minor edits after publication is also painful, as the author notes, and can be error-prone if you fail to make identical changes to all formats.

The problem was bad enough that I built my own markdown to HTML to PDF/X-1a processor using Python, WeasyPrint, and ghostscript. This also allows me to use git for version control, and I can make formatting changes using vanilla CSS. My tools are currently too crude for the average non-tech writer to use, but they save me hours every time I use them.

For any of you hackers out there looking for an untapped market, try making a user-friendly tool that converts Word, PDF and/or similar formats to the print-ready PDF/X-1a, PDF/X-3 and PDF/X-4 formats. At the moment, all the existing tools are proprietary and expensive, and many are difficult to use. This won't be a big money maker, but it will certainly be welcomed by many indie authors.

everybodyknows 13 hours ago

> ... markdown to HTML to PDF/X-1a processor using Python, WeasyPrint, and ghostscript.

I've been converting HTML to PDF by running WeasyPrint (latest version) with options I hoped were sufficient to satisfy the X-1a rules -- can it not quite do that? Is that why you need ghostscript?

diamondap 12 hours ago

I tried to make my PDFs X-1a compliant with WeasyPrint, then ran them through Adobe's PDF/X validator and they kept failing. I was in a bit of a hurry and found a way to do it with ghostscript. I would like to remove ghostscript from the mix, so when I have some time, I may try again to do it all with WeasyPrint.

HanClinto 18 hours ago

Setting up good book publishing pipelines with version control + CI/CD might sound simple, but I don't think it's trivial.

One of the best examples of this that I've ever seen is The Sourdough Framework [0] -- really impressed with the way that versioning and publishing is integrated in that book.

And yes -- I know it sounds like yet another Javascript library -- but it's actually a book about sourdough bread making. It's been discussed here several times before, but this one from 2023 [1] may have been the most popular (103 comments)

[0] - https://github.com/hendricius/the-sourdough-framework

[1] - https://news.ycombinator.com/item?id=35961590

dustin1114 11 hours ago

This was an early inspiration for me that I failed to mention in the article. I'm glad you mentioned it. It really does have a lot of good examples, especially the complex lists and diagrams it implements in TeX.

theknarf 4 hours ago

You'll get pretty far if you start off with Obsidian + Markdown + a makefile with Pandoc. You can even combine Markdown and LaTeX files together with Pandoc. This gives you an easy workflow with all the power you need, using LaTeX as an escape hatch. And Obsidian has enough plugins to do whatever you want (or swap it for any other Markdown or code editor of your choice).

raybb 15 hours ago

I've been making ebooks for a nonprofit using typst and pandoc for a few years and it works quite well.

We generate a PDF ebook, a print version, and an EPUB. They each have little tweaks but are all defined conditionally using sys.inputs.

It was rough at first and I've had to open around a dozen or so issues for pandoc to improve things. Now it's pretty seamless.

dustin1114 11 hours ago

I saw typst in my explorations but LaTeX had a few more of the controls I was looking for in print, and I really wanted a Standard Ebooks compliant EPUB. I might revisit at some time though. Thanks for bringing it up.

TeaVMFan 16 hours ago

I have a related pipeline that is based on HTML, EPublish, and Calibre:

https://frequal.com/forwriters/

I used it for a recent novel: https://www.amazon.com/dp/B0GYCZJVGX

meonkeys 11 hours ago

I enjoyed using Asciidoctor to write a book. It necessitates using a text editor instead of a word processor so it doesn't fit DJ's use case, but it really is quite nice.

I'm also fascinated by the build for Ada & Zangemann, a FOSS illustrated full-color children's book. It looks rather complex, but it handles translations, beautiful typesetting, and was remarkably fast when I tried running the build locally.

dustin1114 10 hours ago

Asciidoctor was in the running months ago. I like the idea of a single set of files, but yes, word processors are my weakness.

voidUpdate 5 hours ago

> "I would love if the XHTML and TeX were artifacts rather than code"

What's an "artifact"? I don't come from a writing background, so it may be obvious to some people, but I only know that word in a historical-ish context, as something old and important, which doesn't seem to make sense in this context

theknarf 4 hours ago

An artifact is the output of an automated process that takes some input and produces output files. It's a generic term that can mean all kinds of things depending on the process and type of output. For example, if you have a program that takes an Open Office document in and produces a pdf and an epub file out, then "pdf" and "epub" would be the artifacts.

Rp8yXmdmr 4 hours ago

An artifact in this context is whatever is produced by the build process. That is the common convention in a CI/CD context. And the base definition of "artifact" is very wide: anything artificial, as in not natural but made by humans.

voidUpdate 3 hours ago

Aren't the xhtml and TeX already artifacts though? They are produced from a script that parses the ODT

donalhunt 5 hours ago

In software development, an artifact is a deployable file produced during the build process, such as a .jar, .zip, .exe, or Docker image.

In the publishing world, an artifact is something that is a product of processing code. e.g. the OP wants their code to generate files in various formats.

helterskelter 19 hours ago

My only problem using git and a text editor is deciding whether I want hard or soft wraps. Vim handles hard wraps better IMO and you can change the git diff engine to something like difft, which makes it much more bearable than the default for hard wrap prose.

But softwrap definitely has its advantages: no hard line breaks makes copying the text into other mediums easier, git diffs show only which paragraphs you edited and not a bunch of line diff noise no matter which engine you use. Only problem is it breaks my yy, dd, cc muscle memory, as AFAIK you can't force those to work on virtual (vs logical) lines.

dustin1114 11 hours ago

I use VS Code soft-wrapping, and `git diff --word-diff` does all I need, though there probably are better methods.
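
To show why a word-level diff suits soft-wrapped prose, here is a rough approximation in Python using the standard `difflib` module. It is not how Git implements `--word-diff`; it only imitates the `[-removed-]{+added+}` notation, splits on whitespace, and the `word_diff` helper is hypothetical.

```python
import difflib


def word_diff(old: str, new: str) -> str:
    """Mark word-level changes between two versions of a paragraph.

    Removed words are wrapped in [-...-] and added words in {+...+}.
    """
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            parts.append(" ".join(old_words[i1:i2]))
            continue
        piece = ""
        if i2 > i1:  # words that exist only in the old version
            piece += "[-" + " ".join(old_words[i1:i2]) + "-]"
        if j2 > j1:  # words that exist only in the new version
            piece += "{+" + " ".join(new_words[j1:j2]) + "+}"
        parts.append(piece)
    return " ".join(parts)
```

With a whole paragraph on one logical line, a line-based diff flags the entire paragraph, while this kind of output points at the single word that changed.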

BrenBarn 18 hours ago

The annoyance of using "soft wraps" with various kinds of tools is one of the maddening irritations of our software landscape. Inserting non-semantic newlines in content just to make things fit the screen is insane.

somat 10 hours ago

It is not just to fit the screen; it also fits our line-oriented version control better.

I don't know if this is suitable for large works (books), but for technical documentation I keep my plain-text source at one line per sentence. Actually, I go further than that and usually have one line per punctuation mark. The raw source reads a little hard, but the version control diffs are much cleaner and editing is easier. Most formats (HTML, troff, TeX) ignore manual line returns anyway.
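
A minimal sketch of the one-line-per-sentence convention, assuming a sentence ends at `.`, `!`, or `?` followed by whitespace. That rule is naive (abbreviations such as "Dr." will be split wrongly), and the `sentence_per_line` helper is hypothetical rather than the commenter's tooling.

```python
import re


def sentence_per_line(paragraph: str) -> str:
    """Rewrite a paragraph so that each sentence sits on its own line."""
    # Collapse existing line breaks and repeated spaces first.
    flat = " ".join(paragraph.split())
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    return "\n".join(sentences)
```

Since formats like HTML and TeX treat a single newline as a space, the rendered output stays the same while each edit touches only the line of the sentence that changed.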

skydhash 16 hours ago

I think most authoring formats require a blank line to mark a paragraph. In Emacs and in Vim, you can easily reflow such a block (and on Unix there’s the fmt command).
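
The reflow operation can be sketched with Python's standard `textwrap` module. This is only an approximation of what `fmt` or an editor's reflow command does; the `reflow` helper and the default width of 72 are assumptions.

```python
import re
import textwrap


def reflow(text: str, width: int = 72) -> str:
    """Re-wrap each blank-line-separated paragraph to the given width."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = [
        # Join the paragraph into one logical line, then hard-wrap it again.
        textwrap.fill(" ".join(paragraph.split()), width=width)
        for paragraph in paragraphs
    ]
    return "\n\n".join(wrapped)
```

Because blank lines are the only paragraph markers that survive, hard-wrapped and soft-wrapped sources can be converted into each other without changing the rendered text.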

moopie 17 hours ago

Sad that typst wasn’t mentioned, wonder how it compares to the setup in the article.

g42gregory 19 hours ago

Hopefully some of the writers are reading this:

I love buying and reading physical books. However, about half of the books (I read mostly programming books) have letters that are printed pixelated. This is infuriating to me. No one bothers to run a trial print and see what comes out?

The root cause of this: PDF will look fine, but the text color is usually set slightly off black (why!!??). The eye couldn’t really see the difference and PDF renders smoothly. However, commercial printers couldn’t handle that properly.

Solution: set the text color to full black; most of the time you are using a black-and-white printer!

You might need to have two PDF versions: one for printing and one for digital distribution (but why would you have off-black text anyway?).

robinsonb5 16 hours ago

> but the text color is usually set slightly off black (why!!??)

This can be caused by colour management. If the black is defined in terms of RGB and then converted to CMYK as part of the pre-press workflow, you'll typically have a mix of all four inks, and not necessarily 100% K - it depends on the colour profiles. For a black-only print job the C, M and Y channels will then be discarded, leaving a maybe-not-pure black.
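
A simplified illustration of how off-black text ends up below 100% K: the sketch uses the naive, profile-free RGB to CMYK formula. Real pre-press conversion goes through ICC colour profiles, which this does not model (so it cannot reproduce the four-ink mix described above); the `rgb_to_cmyk` helper is hypothetical.

```python
def rgb_to_cmyk(r: int, g: int, b: int) -> tuple[float, float, float, float]:
    """Naive conversion of 8-bit RGB to CMYK fractions between 0.0 and 1.0."""
    red, green, blue = r / 255, g / 255, b / 255
    k = 1 - max(red, green, blue)
    if k == 1.0:
        # Pure black: avoid dividing by zero, use black ink only.
        return (0.0, 0.0, 0.0, 1.0)
    c = (1 - red - k) / (1 - k)
    m = (1 - green - k) / (1 - k)
    y = (1 - blue - k) / (1 - k)
    return (c, m, y, k)
```

Under this formula pure black (`0, 0, 0`) maps to 100% K, while a dark grey such as `34, 34, 34` maps to roughly 87% K, which a black-only press can reproduce only by screening the ink into dots.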

tvmalsv 12 hours ago

>> (but why would you have off-black text anyway?)

I had read a long time ago that when doing web design you should avoid pure white and pure black, especially when one is on top of the other. I presume it’s to avoid harshness or to keep the “white” from halo’ing into the black (made even worse on a CRT display).

That is probably the worst advice when doing a printed medium, though. Different medium targets sometimes have conflicting “best practices.”

sscaryterry 18 hours ago

This is why PDF/X exists

KPGv2 18 hours ago

> why is the text color set slightly off black

Because pure black causes eye strain. Dark gray on white is superior for long reading sessions when your paper is white. The contrast really hurts after a while if you do pure black on pure white. This is a known phenomenon.

In fact, there's experimental evidence (https://www.nature.com/articles/s41598-018-28904-x) that this high contrast plays a hand in the onset of myopia, which in extreme forms is correlated with glaucoma and other vision disorders.

munificent 16 hours ago

> Dark gray on white is superior for long reading sessions when your paper is white.

Color is the ink's job. Approximating a lighter shade of black than the ink produces by speckling the output with tiny white pixels is definitely not an improvement in readability.

jtbayly 16 hours ago

So the solution is to have blurry text?

Most paper in books isn’t pure white. Leave the text completely black.

huijzer 17 hours ago

Uhm why not Typst? I published my thesis and another book in it and it worked great. They are also working on HTML output which should make it easier to create EPUBs. Until then Pandoc should work I think

dustin1114 10 hours ago

typst is great. I experimented with it, but it simply didn't have the fine-tuning and maturity of LaTeX. For example, widow/orphan control is a binary on/off, while LaTeX calculates by penalties at a much lower level. Pandoc is also great (I used it often for unrelated workflows), but it can't map custom styles from ODT files (not sure about Word).

huijzer 8 hours ago

Seems it’s currently a percentage: https://forum.typst.app/t/how-to-leave-a-single-line-of-para...

On Pandoc I agree. Mapping Word custom styles is possible, I believe, but it will be a mess (as usual with Word).

genewitch 10 hours ago

https://standardebooks.org/contribute/producing-an-ebook-ste...

as linked in the article, looks like a nightmare. i was hyped that i could recommend something to author friends, but, i can hear it now, "Maaaaaaaaaaaaaaaaaan!"

oh well, they'll have to pay someone that understands all of that, because i don't.

arikrahman 19 hours ago

Did the author create the Christian novellas he's mentioned? Can't tell by the phrasing. That would be impressive enough on its own, combined with the tech stack?

gchamonlive 19 hours ago

From the about page:

  D. J. Speckhals is the author of the “Witnesses of the Light” historical fiction trilogy, which transports readers to fifteenth-century Europe to explore the resilient faith of the Waldensians.

dustin1114 11 hours ago

Author here. Yes, I wrote the books and glued everything together. If I failed to mention it in the article, it's because I was trying not to self-promote so much. Thanks for the compliment. It really was fun to figure it all out, if that wasn't clear :)

kyboren 19 hours ago

AKA what CS PhD students have been doing ~forever.

I guess this is like medical researchers "discovering" basic calculus or an office worker discovering that SFTP, sshfs, and git work fine and they don't need Dropbox after all.

What's common knowledge in one field can apparently still be alien to people outside the field, even in the age of LLMs.

Just wait until the author finds out about Overleaf...

macintux 14 hours ago

I hope the negative reactions here are to the truncated title (HN drops the “How” from titles). The author doesn’t seem to be claiming anything revolutionary, just describing how they created their pipeline.

Good grief.

> Please don't fulminate. Please don't sneer, including at the rest of the community.

dustin1114 11 hours ago

Author here. You're exactly right. All of my pre-grad education was liberal arts. I had never once heard of LaTeX until I entered the software world years later, and even then only from a coworker with a CS PhD.

KPGv2 18 hours ago

> what CS PhD students have been doing ~forever.

Or what every researcher has been doing for literally decades (except with other versioning systems, but still typesetting without Word or Adobe).

No need for techbros to pat themselves on the back as innovators.

I typeset my novels in LaTeX and use Git. I even just clone a base repo whenever I'm going to release another.

kyboren 15 hours ago

Considering LaTeX came from legendary CS PhD and Turing award winner Leslie Lamport's need to typeset a book, and was built on the shoulders of legendary CS PhD and Turing award winner Donald Knuth's work on TeX, I think "techbros" can safely pat themselves on the back as innovators in this case.

skydhash 16 hours ago

I don’t have anything to publish, but one of these days, I’d like to try the troff suite (with eqn, pic, and tbl).

dghf 24 minutes ago

I keep on meaning to try these out: https://www.schaffter.ca/mom/

MagicMoonlight 8 hours ago

You use Scrivener and then Vellum. Nobody uses word or adobe slop anymore.