There’s something that bothers me about these sorts of recollections that make git seem… inevitable.
There’s this whole creation myth of how Git came to be that kind of paints Linus as some prophet reading from golden tablets written by the CS gods themselves.
Granted, this particular narrative in the blog post does humanise the story a bit more, remembering the stumbling steps, how Linus never intended for git itself to be the UI, how there wasn’t even a git commit command in the beginning, but it still paints the whole thing in somewhat romantic tones, as if the blob-tree-commit-ref data structure were the perfect representation of data.
One particular aspect that often gets left out of this creation myth, especially by the author of Github is that Mercurial had a prominent role. It was created by Olivia Mackall, another kernel hacker, at the same time as git, for the same purpose as git. Olivia offered Mercurial to Linus, but Linus didn’t look upon it with favour, and stuck to his guns. Unlike git, Mercurial had a UI from the very start. Its UI was very similar to Subversion, which at the time was the dominant VCS, so Mercurial always aimed for familiarity without sacrificing user flexibility. In the beginning, both VCSes had mindshare, and even today the mindshare of Mercurial lives on in hg itself as well as in worthy git successors such as jujutsu.
And the git data structure isn’t the only thing that could have ever possibly worked. It falls apart for large files. There are workarounds and things you can patch on top, but there are also completely different data structures that would be appropriate for larger bits of data.
Git isn’t just plain wonderful, and in my view, it’s not inevitable either. I still look forward to a world beyond git, whether jujutsu or whatever else may come.
A lot of the ideas around git were known at the time. People have already mentioned Monotone. Still, Linus got the initial design wrong by computing the hash of the compressed content (which is a performance issue and would also make it difficult to replace the compression algorithm), something I had pointed out early on [1] and that he later changed.
I think the reason git was successful is that it is a small, practical, and very efficient no-nonsense tool written in C. This made it much more appealing to many than the alternatives written in C++ or Python.
[1]: https://marc.info/?l=git&m=111366245411304&w=2
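To make the eventual design concrete: git hashes the raw object (a "blob <size>" header plus the content) and compresses only for storage, so the hash is independent of the compression. A quick illustrative check, assuming a POSIX shell with git installed:

    # git's own object id for some content
    $ echo 'test content' | git hash-object --stdin
    # the same id, computed by hand: SHA-1 over "blob <size>\0<content>";
    # zlib never enters the picture, so both commands print the same id
    $ printf 'blob 13\0test content\n' | sha1sum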
And because of Linux offering free PR for git (especially since it was backed by the main Linux dev).
Human factors matter, as much as programmers like to pretend they don't.
I'm curious why you think hg had a prominent role in this. I mean, it did pop up at almost exactly the same time for exactly the same reasons (BK, kernel drama) but I don't see evidence of Matt's benchmarks or development affecting the Git design decisions at all.
Here's one of the first threads where Matt (Olivia) introduces the project and benchmarks, but it seems like the list finds it unremarkable enough comparatively to not dig into it much:
https://lore.kernel.org/git/Pine.LNX.4.58.0504251859550.1890...
I agree that the UI is generally better and some decisions were arguably better (changeset evolution, which came much later, is pretty amazing) but I have a hard time agreeing that hg influenced Git in some fundamental way.
Please don't do that. Don't deadname someone.
I'm not saying that hg influenced git, but I'm saying that at the time, both were seen as worthy alternatives. Lots of big projects were using hg at one point: Python, Mozilla, Netbeans, Unity.
Sure, you managed to get Github in front of everyone's face and therefore git. For a while, Bitbucket was a viable alternative to many.
Were you involved with the decision to sponsor hg-git? I understand that at one point it was hoped that this would help move more people from hg into Github, just like Subversion support for Github would. I think the latter is still there.
"One particular aspect that often gets left out of this creation myth, especially by the author of Github is that Mercurial had a prominent role." implies to me that Hg had a role in the creation of Git, which is why I was reacting to that.
For the deadnaming comment, it wasn't out of disrespect, but when referring to an email chain, it could otherwise be confusing if you're not aware of her transition.
I wasn't sponsoring hg-git, I wrote it. I also wrote the original Subversion bridge for GitHub, which was actually recently deprecated.
https://github.blog/news-insights/product-news/sunsetting-su...
> For the deadnaming comment, it wasn't out of disrespect, but when referring to an email chain, it could otherwise be confusing if you're not aware of her transition.
I assumed it was innocent. But the norm when naming a married woman or another person who changed their name is to call them their current name and append the clarifying information. Not vice versa. Jane Jones née Smith. Olivia (then Matt).
> Please don't do that. Don't deadname someone.
Is this not a case where it is justified, given that she at that time was named Matt, and it's crucial information to understand the mail thread linked to? I certainly would not understand at all without that context.
Wait a second. You're saying now hg didn't influence git, but how does that fit with your previous comment?
> One particular aspect that often gets left out of this creation myth, especially by the author of Github is that Mercurial had a prominent role
I'm not sure where you're getting your facts from.
I do think an open source, distributed, content addressable VCS was inevitable. Not git itself, but something with similar features/workflows.
Nobody was really happy with the VCS situation in 2005. Most people were still using CVS, or something commercial. SVN did exist, it had only just reached version 1.0 in 2004, but your platforms like SourceForge still only offered CVS hosting. SVN was considered to be a more refined CVS, but it wasn't that much better and still shared all the same fundamental flaws from its centralised nature.
On the other hand, "distributed" was a hot new buzzword in 2005. The recent success of Bittorrent (especially its hot new DHT feature) and other file sharing platforms had pushed the concept mainstream.
Even if it hadn't been for the BitKeeper incident, I do think we would have seen something pop up by 2008 at the latest. It might not have caught on as fast as git did, but you must remember the thing that shot git to popularity was GitHub, not the Linux kernel.
The file-tree-snapshot-ref structure is pretty good, but it lacks chunking at the file and tree layers, which makes it inefficient with large files and trees that don't change a lot. Modern backup tools like restic/borg/etc use something similar, but with chunking included.
> there are also completely different data structures that would be appropriate for larger bits of data.
Can you talk a little bit about this? My assumption was that the only way to deal with large files properly was to go back to centralised VCS, I'd be interested to hear what different data structures could obviate the issue.
>And the git data structure... falls apart for large files.
I'm good with this. In my over 25 years of professional experience, having used cvs, svn, perforce, and git, it's almost always a mistake keeping non-source files in the VCS. Digital assets and giant data files are nearly always better off being served from artifact repositories or CDN systems (including in-house flavors of these). I've worked at EA Sports and Rockstar Games and the number of times dev teams went backwards in versions with digital assets can be counted on the fingers of a single hand.
Non-source files should indeed never be in the VCS, but source files can still be binary, or large, or both. It depends on how you are editing the source and building the source into non-source files.
I think this conflates "non-source" with "large". Yes, it's often the case that source files are smaller than generated output files (especially for graphics artifacts), but this is really just a lucky coincidence that prevents the awkwardness of dealing with large files in version control from becoming as much of a hassle as it might be. Having a VCS that dealt with large files comfortably would free our minds and open up new vistas.
I think the key issue is actually how to sensibly diff and merge these other formats. Levenshtein-distance-based diffing is good enough for many text-based formats (like typical program code), but there is scope for so much better. Perhaps progress will come from designing file formats (including binary formats) specifically with "diffability" in mind -- similar to the way that, say, Java was designed with IDE support in mind.
Another alternative is the patch-theory approach from Darcs and now Pijul. It's a fundamentally different way of thinking about version control—I haven't actually used it myself but, from reading about it, I find thinking in patches matches my natural intuition better than git's model. Darcs had some engineering limitations that could lead to really bad performance in certain cases, but I understand Pijul fixes that.
I was a bit confused about the key point of patch-based versus snapshot-based, but I got some clarity in this thread: https://news.ycombinator.com/item?id=39453146
In the early 2000s I was researching VCSs for work and also helping a little with developing arch, bazaar, and then (less so) bzr. I trialed Bitkeeper for work. We went with Subversion eventually. I think I tried Monotone but it was glacially slow. I looked at Mercurial. It didn't click.
When I first used Git I thought YES! This is it. This is the one. The model was so compelling, the speed phenomenal.
I never again used anything else unless forced -- typically Subversion, mostly for inertia reasons.
The article is written by a co-founder of github and not Linus Torvalds.
git is just a tool to do stuff. Its name (chosen by that Finnish bloke) is remarkably apt - it's for gits!
It's not Mercurial, nor github, nor anything else. It's git.
It wasn't invented for you or you or even you. It was a hack to do a job: sort out control of the Linux kernel source when BitKeeper went off the rails as far as the Linux kernel devs were concerned.
It seems to have worked out rather well.
> There’s this whole creation myth of how Git came to be that kind of paints Linus as some prophet reading from golden tablets written by the CS gods themselves.
What?
> Git isn’t just plain wonderful, and in my view, it’s not inevitable either.
I mean, the proof is in the pudding. So why did we end up with Git? Was it just dumb luck? Maybe. But I was there at the start for both Git and Mercurial (as I comment elsewhere in this post). I used them both equally at first, and as a Python aficionado should've gravitated to Mercurial.
But I like to understand how tools work, and I personally found Mercurial harder to understand, slower to use, and much less flexible. It was great for certain workflows, but if those workflows didn't match what you wanted to do, it was rigid (I can't really expound on this; it's been more than a decade). Surprisingly (as I was coding almost entirely in Python at the time), I also found it harder to contribute to than Git.
Now, I'm just one random guy, but here we are, with the not plain wonderful stupid (but extremely fast) directory content manager.
> But I like to understand how tools work, and I personally found Mercurial harder to understand, slower to use, and much less flexible.
It's a relief to hear someone else say something like this, it's so rare to find anything but praise for mercurial in threads like these.
It was similar for me: in the early/mid 2010s I tried both git and mercurial after having only subversion experience, and found something about how mercurial handled branches extremely confusing (I don't remember what, it's been so long). On the other hand, I found git very intuitive and have never had issues with it.
For me, the real problem at the time is that "rebase" was a second class feature.
I think too many folks at the time thought that full immutability was what folks wanted and got hung up on that. Turns out that almost everyone wanted to hide their mistakes, badly structured commits, and typos out of the box.
Good point. Git succeeded in the same way that Unix/Linux succeeded. Yes, it sucks in many ways, but it is flexible and powerful enough to be worth it. Meanwhile, something that is "better" but not flexible, powerful, or hackable is not evolutionarily successful.
In fact, now that I've used the term "evolution", Terran life/DNA functions much the same way. Adaptability trumps perfection every time.
I just wish they'd extend git to have better binary file diffs and moved file tracking.
Remembering the real history matters, because preserving history is valuable by itself, but I'm also really glad that VCS is for most people completely solved: there's nothing besides Git you have to pay attention to; you learn it once and use it your whole career.
> I just wish they'd extend git to have better binary file diffs
It's not built-in to git itself, but I remember seeing demos where git could be configured to use an external tool to do a visual diff any time git tried to show a diff of image files.
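For reference, the usual setup is a custom diff driver in .gitattributes; for example (assuming exiftool is installed; a true side-by-side image view needs `git difftool` with an image-capable viewer instead):

    # use a custom diff driver for PNGs
    $ echo '*.png diff=exif' >> .gitattributes
    # the driver turns the binary into text (here: EXIF metadata) before diffing
    $ git config diff.exif.textconv exiftool
    # git diff / git log -p now show metadata changes instead of "binary files differ"
    $ git diff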
> and moved file tracking.
Check out -C and -M in the help for git log and blame. Git's move tracking is a bit weirder than others (it reconstructs moves/copies from history rather than recording them at commit), but I've found it more powerful than others because you don't need to remember a special "move" or "copy" command, plus it can track combining two files in a way others can't.
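Roughly, it looks like this in practice (the file paths are just placeholders):

    # follow one file across renames
    $ git log --follow -p -- src/util.c
    # detect renames (-M) and copies (-C) while walking history
    $ git log -M -C --stat
    # blame with movement detection; the second -C also looks at
    # other files changed in the same commits
    $ git blame -M -C -C renamed_or_merged.c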
I was always under the impression Monotone - which was released two years before Mercurial - was the inspiration for git, and that this was pretty well known.
This is all fairly speculative, but I didn't get the impression that Monotone was a main inspiration for Git. I think BitKeeper was, in that it was a tool that Linus actually liked using. Monotone had the content-addressable system, which was clearly an inspiration, but that's the only thing I've seen Linus reference from Monotone. He tried using it and bailed because it was slow, but he took the one idea he found interesting and built a very different thing with that concept as one part of it; that is how I would interpret the history between these projects.
Linus was definitely aware of and mentioned Monotone. But to call it an inspiration might be too far. Content Addressable Stores were around a long time before that, mostly for backup purposes afaik. See Plan9's Venti file system.
Yes, Monotone partly inspired both. You can see that both hash contents. But both git and hg were intended to replace Bitkeeper. Mercurial is even named after Larry McVoy, who changed his mind. He was, you know, mercurial in his moods.
> There’s this whole creation myth of how Git came to be that kind of paints Linus as some prophet reading from golden tablets written by the CS gods themselves.
Linus absolutely had a couple of brilliant insights:
1. Content-addressable storage for the source tree.
At that time, I was using SVN and experimenting with Hg and Bazaar. Both were too "magical" for me, with unclear rules for merging, branching, rebasing.
Then came git. I read its description "source code trees, identified by their hashes, with file content movement deduced from diffs", and it immediately clicked. It's such an easy mental model, and you can immediately understand what operations mean.
> At that time, I was using SVN and experimenting with Hg and Bazaar. Both were too "magical" for me, with unclear rules for merging, branching, rebasing.
I have no idea what you mean.
> It's such an easy mental model, and you can immediately understand what operations mean.
A lot of things in Mercurial kind of geared you towards using it more like Subversion was used. You pretty much could use Mercurial just like git was, and is, used, but the defaults didn't guide you in that direction.
One bigger difference I can think of is that Mercurial has permanently named branches (the branch name is written into the commit), whereas in git branches are just named pointers. Mercurial got bookmarks in 2008 as an extension, and they were added to the core in 2011. If you used unnamed branches and bookmarks, you could use Mercurial exactly like git. But git was published in 2005.
Another is git's staging area. You can get pretty much the same functionality by repeatedly using `hg commit --amend`, but again, in git the default gears you towards using the staging approach, while in Mercurial you have to specifically search for a way to get it to function this way.
I wonder this too. My guess is that he did not like "heavyweight" branching and the lack of cherry-pick/rebase. At any rate that is why I didn't like it back then.
Sun Microsystems (RIP) back then went with Mercurial instead of Git mainly because Mercurial had better support for file renames than Git did, but at Sun we used a rebase workflow with Mercurial even though Mercurial didn't have a rebase command. Sun had been using a rebase workflow since 1992. Rebase with Mercurial was wonky, but we were used to wonky workflows with Teamware anyways. Going with Mercurial was a mistake. Idk what Oracle does now internally, but I bet they use Git. Illumos uses Git.
Ah, right, at Sun we used MQ. But anyways, just glancing at the hg evolve docs I'm unconvinced. And anyways, it's an "extension". Mercurial fought rebase for a long time, and they did dumb things like "rebase is non-interactive, and histedit is for when you want to edit the history".
And that is partly why Mercurial lost. They insisted on being opinionated about workflows being merge-based. Git is not opinionated. Git lets you use merge workflows if you like that, and rebase workflows if you like that, and none of this judgment about devs editing local history -- how dare the VCS tell me what to do with my local history?!
I may be misremembering, but C vs Python was a part of it. I don't think Linus thought too highly of Python, or any interpreted languages, except shell perhaps, and didn't want to deal with installing and managing Python packages.
> and didn't want to deal with installing and managing python packages.
Based on the fact that the ecosystem torpedoed an entire major version of the language, and that there are a bazillion competing and incompatible dependency managers, it seems that bet turned out well
Around 2002 or so, I had an idea to tag every part of a project with a unique hash code. With a hash code, one could download the corresponding file. A hash code for the whole project would be a file containing a list of hash codes for the files that make up the project. Hash codes could represent the compiler that builds it, along with the library(s) it links with.
I showed it to a couple software entrepreneurs (Wild Tangent and Chromium), but they had no interest in it.
I never did anything else with it, and so it goes.
I had actually done a writeup on it, and thought I had lost it. I found it, dated 2/15/2002:
---
Consider that any D app is completely specified by a list of .module files and the tools necessary to compile them. Assign a unique GUID to each unique .module file. Then, an app is specified by a list of .module GUIDs. Each app is also assigned a GUID.
On the client's machine is stored a pool of already downloaded .module files. When a new app is downloaded, what is actually downloaded is just a GUID. The client sees if that GUID is an already built app in the pool, then he's done. If not, the client requests the manifest for the GUID, a manifest being a list of .module GUIDs. Each GUID in the manifest is checked against the client pool, any that are not found are downloaded and added to the pool.
Once the client has all the .module files for the GUIDs that make up an app, they can all be compiled, linked, and the result cached in the pool.
Thus, if an app is updated, only the changed .module files ever need to get downloaded. This can be taken a step further and a changed .module file can be represented as a diff from a previous .module.
Since .module files are tokenized source, two source files that differ only in comments and whitespace will have identical .module files.
There will be a master pool of .module files on WT's server. When an app is ready to release, it is "checked in" to the master pool by assigning GUIDs to its .module files. This master pool is what is consulted by the client when requesting .module files by GUID.
The D "VM" compiler, linker, engine, etc., can also be identified by GUIDs. This way, if an app is developed with a particular combination of tools, it can specify the GUIDs for them in the manifest. Hence the client will automatically download "VM" updates to get the exact tools needed to duplicate the app exactly.
Check it out. The whitepaper's a fairly digestible read, too, and may get you excited about the whole concept (which is VERY different from how things are normally done, but ends up giving you guarantees)
The problem with NixOS is all the effort to capture software closures is rendered moot by Linux namespaces, which are a more complete solution to the same problem.
Of course we didn't have them when the white paper was written, so that's fair, but technology has moved on.
Nix(OS) is aware of namespaces, and can use them (in fact, the aforementioned gaming support relies on them), but versioning packages still works better than versioning the system in most cases.
Consider three packages, A, B, and C. B has two versions, A and C have one.
- A-1.0.0 depends on B-2.0.0 and C-1.0.0.
- C-1.0.0 depends on B-1.0.0.
If A gets a path to a file in B-2.0.0 and wants to share it with C (for example, C might provide binaries it can run on files, or C might be a daemon), it needs C to be in a mount namespace with B-2.0.0. However, without Nix-store-like directory structure, mounting B-2.0.0's files will overwrite B-1.0.0's, so C may fail to start or misbehave.
Your description (including the detailed description in the reply) seems to be missing the crucial difference that git uses - the hash code of the object is not some GUID, it is literally the hash of the content of the object. This makes a big difference as you don't need some central registry that maps the GUID to the object.
The RFC defining them says they're the same and has since the earliest draft I can find, also from 2002. You should offer more explanation when you take a stance contrary to what is well documented.
Check my other comment as to how GUIDs are created in many ioquake3 forks. They MD5 hashed a randomly generated file.
Theoretically speaking, UUIDs have a semantic guarantee that each generated identifier is unique across all systems, times, and contexts, whereas cryptographic hashes are deterministic functions (i.e. they produce the same output for the same input); there is no inherent randomness or timestamping unless you deliberately inject it, as the ioquake3 forks did with GUID.
UUIDv4 has 122 usable random bits, so roughly a 1 in 2^122 chance that any given pair collides, whereas SHA-512 and BLAKE2b have 512-bit outputs, giving about 2^256 collision resistance under the birthday bound.
In any case, SHA-256, SHA-512, and BLAKE2b (cryptographic hashes) are unique in practice, meaning they are extremely unlikely to collide, even more so than UUIDv4, despite UUIDv4 being non-deterministic while cryptographic hashes are deterministic.
Of course, you should still know when to use cryptographic hashes vs. UUIDs. UUIDs are good for database primary keys, identifying users globally, tracking events, and so on; the rest, such as verifying file content, deduplicating data by content, and tamper detection, is the job of a cryptographic hash.
But to cut to the chase: GUIDs (Globally Unique Identifiers) are also known as UUIDs (Universally Unique Identifiers), so they are the same!
I hope this answers OP's (kbolino) question. He was right, GUIDs are the same as UUIDs. Parent confused GUIDs with cryptographic hashes, most likely.
---
FWIW, collision resistance (i.e. birthday bound) is not improved by post-quantum algorithms. It remains inherently limited by 2^{n/2}, no matter what, as long as they use hashing.
---
TL;DR: GUIDs (Globally Unique Identifiers) are also known as UUIDs (Universally Unique Identifiers), so they are the same, i.e. GUIDs and UUIDs are NOT different!
So, in this case, GUID is the MD5 hash of the generated qkey file. See "CL_GenerateQKey" for details.
> On startup, the client engine looks for a file called qkey. If it does not exist, 2KiB worth of random binary data is inserted into the qkey file. A MD5 digest is then made of the qkey file and it is inserted into the cl_guid cvar.
UUIDs have RFCs, GUIDs apparently do not, but AFAIK UUIDs are also named GUIDs, so...
Every git repo has a copy of that mapping instead of there being a central registry, though. And because the commit author's name and email, the date of the commit, and a commit message (among other things) go into the hash that represents a commit, it's not that big a difference, is it? Given a collection of files, but not the git repo they're from, and libgit, I can't say whether those files match a git tag hash if I don't also have the metadata that makes up the commit, and not just the files inside of it.
Yes, but the commit object (which includes metadata) references a tree object by its hash. The tree object is a text representation of a directory tree, basically, referencing file blobs by hash. So yes, you can recognize identical files between commits. It's true there's no fast indexing: if you want to ask the question "which commits contain exactly this file?" you have to search every commit. But you don't need to delta the file contents itself.
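To make that concrete, a small sketch (the file path is a placeholder):

    # a commit object is metadata plus a single tree id
    $ git cat-file -p HEAD
    # list every blob that tree references, with its hash
    $ git ls-tree -r HEAD
    # hash a file on disk exactly the way git does; if it matches a blob id
    # above, the contents are byte-for-byte identical
    $ git hash-object path/to/file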
but people don't use the file hash, that's internal to git. I go to the centralized repository of repositories at github.com and look up tagged version 1.0.0 of whatever software, which refers to a git tag which references a commit hash (which yes it references a tree object as you said).
And in any case you had a specific requirement above ("Given a collection of files, but not the git repo they're from, and libgit, I can't say if those files match a git tag hash"), and in fact this can be done!
The git tag hash references a commit. Without the commit metadata, you don't have a tree object and thus don't know any hashes. You can take the files on disk and compute their hashes, and furthermore you can take those hashes and make a tree object. But without the commit, all you can say is that you have a tree object; you don't have the tree object for the commit in question to compare it to.
That's for human consumption though, which is what frustrates so many "hashing will solve everything!" schemes - it breaks as soon as you need a bug fix.
At the end of the day none of us want "exactly this hash"; we want "latest". Exact hashes and other reproducibility are things which are useful when debugging or providing traceability - valuable, but also not the human side of the equation.
Except that instead of a GUID, it's just a hash of the binary data itself, which ends up being more useful because it is a natural key and doesn't require storing a separate mapping
Git hasn't quite taken the step of making the hash the URL you use to download a file, any file, and be assured it is exactly what you thought it was, as the hash of the file must match its URL.
This is currently done in a haphazard way, not particularly organized.
> I started using Git for something you might not imagine it was intended for, only a few months after it’s first commit
I started using git around 2007 or so because that company I worked for at the time used ClearCase, without a doubt the most painful version manager I have ever used (especially running it from a Linux workstation). So I wrote a few scripts that would let me mirror a directory into a git repo, do all my committing in git, then replay those commits back to ClearCase.
I can't recall how Git came to me attention in the first place, but by late 2008 I was contributing patches to Git itself. Junio was a kind but exacting maintainer, and I learned a lot about contributing to open source from his stewardship. I even attended one of the early GitTogethers.
As far as I can recall, I've never really struggled with git. I think that's because I like to dissect how things work, and under the covers git is quite simple. So I never had too much trouble with its terribly baroque CLI.
At my next job, I was at a startup that was building upon a fork of Chromium. At the time, Chromium was using subversion. But at this startup, we were using git, and I was responsible for keeping our git mirror up-to-date. I also had the terrible tedious job of rebasing our fork with Chromium's upstream changes. But boy did I get good at resolving merge conflicts.
Git may be the CLI I've used most consistently for nearly two decades. I'm disappointed that GitHub became the main code-review tool for Git, but I'll never be disappointed that Git beat out Mercurial, which I always found overly rigid and was never able to adapt to my workflow.
> I started using git around 2007 or so because that company I worked for at the time used ClearCase, without a doubt the most painful version manager I have ever used
Ah, ClearCase! The biggest pain was in your wallet! I saw the prices my company paid per-seat for that privilege -- yikes!
As someone who wrote my first line of code in approx 2010 and used git & GH for the first time in… 2013? it kind of amazes me to remember that Git is only 20 years old. It doesn't surprise me, for instance, that GitHub is <20 years old, but `git` not existing before 2005 somehow always feels shocking to me. Obviously there were other alternatives (to some extent) for version control, but git just has the feeling of a tool that is timeless and so ingrained in the culture that it is hard to imagine (for me) the idea of people being software developers in the post-mainframe age without it. It feels like something that would have been born in the same era as Vim, SSH, etc (i.e. early 90s). This is obviously just because, from the perspective of my programming consciousness beginning, it was so mature and entrenched already, but still.
I’ve never used other source control options besides git, and I sometimes wonder if I ever will!
What surprises me more is how young Subversion is in comparison to git, it's barely older.
I guess I started software dev at a magic moment, pre-git but after SVN was basically everywhere, and it felt even more like it had been around forever vs the upstart git.
Any version control where you had to manually (and globally) "check out" (lock) files for editing was terrible and near unusable above about 3 people.
Version control systems where you didn't have shallow branches (and thus each "branch" took a full copy / the disk space of all files) were awful.
Version control systems which would corrupt their databases (here's to you, Visual SourceSafe) were awful.
Subversion managed to do better on all those issues, but it still didn't adequately solve distributed working issues.
It also didn't help that people often configured SVN to run with the option to add global locks back in, because they didn't understand the benefit of letting two people edit the same file at the same time.
I have a soft spot for SVN. It was a lot better than it got credit for, but git very much took the wind out of its sails by solving distributed (and critically, disconnected/offline) workflows just enough better that developers could overlook the much worse UX, which remains bad to this day.
>It also didn't help that people often configured SVN to run with the option to add global locks back in, because they didn't understand the benefit of letting two people edit the same file at the same time.
I think it was more that they were afraid that a merge might some day be non-trivial. Amazing how that fear goes away once you've actually had the experience.
(I had to check because of this thread. SVN and Git initial releases were apparently about 4 and a half years apart. I think it was probably about 6 years between the time I first used SVN and the time I first used Git.)
Yeah, odd to learn. I remember dipping my toes into source control, playing around with CVS and SVN right around when git was originally announced and it felt so "modern" and "fresh" compared to these legacy systems I was learning.
There were far, far worse things out there than Subversion. VSS, ClearCase, an obscure commercial one written in Java whose name escapes me now..
Subversion was basically a better CVS. My recollection is that plenty of people were more than happy to switch to CVS or Subversion (even on Windows) if it meant they could escape from something as legitimately awful as VSS. Whereas the switch from Subversion to Git or Mercurial had more to do with the additional powers of the newer tools than the problems of the older ones.
Good pull. I was wondering if that was a true statement or not. I am curious if Linus knew about that or made it up independently, or if both came from somewhere else. I really don't know.
> He meant to build an efficient tarball history database toolset, not really a version control system. He assumed that someone else would write that layer.
Famous last words: "We'll do it the right way later!"
On the flip side: when you do intend to make a larger project like that, consciously focusing on the internal utility piece first is often a good move. For example, Pip doesn't offer a real API; anyone who wants their project to install "extra" dependencies dynamically is expected to (https://stackoverflow.com/questions/12332975) run it as a subprocess with its own command line. I suspect that maintaining Pip nowadays would be much easier if it had been designed from that perspective first, which is why I'm taking that approach with Paper.
FWIW, I just found out you can sign commits using ssh keys. Due to how pinentry + gnupg + git has issues on OpenBSD with commit signing, I just moved to signing via ssh. I had a workaround, but it was a real hack, now no issues!
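For anyone who wants to try it, the setup is roughly this (the key path is whatever your public key is):

    # sign with SSH keys instead of OpenPGP
    $ git config --global gpg.format ssh
    $ git config --global user.signingkey ~/.ssh/id_ed25519.pub
    # sign every commit without typing -S each time
    $ git config --global commit.gpgsign true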
20 years, wow seems like yesterday I moved my work items from cvs to git. I miss one item in cvs ($Id$), but I learned to do without it.
Oh yeah, SSH signing is incredible. I've also migrated to it and didn't look back.
A couple of differences:
- it's possible to specify signing keys in a file inside the repository, and configure git to verify on merge (https://github.com/wiktor-k/ssh-signing/); see the sketch after this list. I'm using that for my dot config repo to make sure I'm pulling only stuff I committed on my machines.
- SSH has TPM key support via PKCS11 or external agents; this makes it possible to easily roll out hardware-backed keys
- SSH signatures have context separation, that is it's not possible to take your SSH commit signature and repurpose it (unlike OpenPGP)
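A rough sketch of that first point (the file name, email, and key are placeholders):

    # allowed-signers file tracked in the repo: principal + public key per line
    $ cat .allowed_signers
    me@example.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA...placeholder
    # tell git to use it when verifying SSH signatures
    $ git config gpg.ssh.allowedSignersFile .allowed_signers
    # verify explicitly, or require valid signatures when merging
    $ git verify-commit HEAD
    $ git config merge.verifySignatures true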
AFAIR keyword substitution of $Id$ included the revision number. That would be the commit hash in Git. For obvious reasons you cannot insert a hash value in content from which that hash value is being computed.
You can use smudge and clean filters to expand this into something on disk and then remove it again before the hash computation runs.
However, I don't think you would want to use the SHA, since that's somewhat meaningless to read. You would probably want to expand ID to `git describe SHA` so it's more like `v1.0.1-4-ga691733dc`, so you can see something more similar to a version number.
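For what it's worth, git ships a limited built-in for the $Id$ case, and anything fancier (like a `git describe` string) needs a custom smudge/clean filter pair; the filter name and scripts below are made up for illustration:

    # built-in: expand $Id$ to $Id: <blob sha>$ on checkout, strip it on commit
    $ echo '*.c ident' >> .gitattributes
    # custom: run your own expand/strip scripts instead (hypothetical scripts
    # that read the file on stdin and write the result to stdout)
    $ git config filter.version.smudge ./expand-version.sh
    $ git config filter.version.clean ./strip-version.sh
    $ echo '*.c filter=version' >> .gitattributes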
Professionally, I went from nothing to RCS then to CVS then to git.
In all cases I was the one who insisted on using some kind of source code control in the group I worked with.
Not being an admin, I set up RCS on the server, then later found some other group that allowed us to use their CVS instance. Then when M/S bought GitHub, the company got religion and purchased a contract for git.
Getting people to use any kind of SC was a nightmare; this was at a Fortune 500 company. When I left, a new hire saw the benefit of SC and took over for me :)
In the old days, losing source happened a lot; I did not want that to happen when I was working at that company.
Thanks for the useful article!
In addition to a lot of interesting info, it led me to this repo containing an intro to git internals[1]. Would highly recommend everyone take a look
[1] https://github.com/pluralsight/git-internals-pdf
Ah yes. It was pretty cool that when Peepcode was acquired, Pluralsight asked me what I wanted to do with my royalties there and was fine with me waiving them and just open-sourcing the content.
It also is a testament to the backwards compatibility of Git that even after 17 years, most of the contents of that book are still relevant.
You'll be the first to know when I write it. However, if anything, GitHub sort of killed the mailing list as a generally viable collaboration format outside of very specific use cases, so I'm not sure if I'm the right person to do it justice. However, it is a very cool and unique format that has several benefits that GitHub PR based workflows really lose out on.
By far my biggest complaint about the GitHub pull request model right now is that it doesn't treat the eventual commit message of a squashed commit (or even independent commits that will be rebased on the target) as part of the review process, like Gerrit does. I can't believe I'm the only person that is upset by this!
You are not alone. Coming from Gerrit myself, I hate that GitHub does not allow for commenting on the commit message itself. Neither does Gitlab.
Also, in a PR, I find that people just switch to Files Changed, disregarding the sequence of the commits involved.
This intentional de-emphasis of the importance of commit messages and the individual commits leads to lower quality of the git history of the codebase.
I never understood the "git cli sucks" thing until I used jj. The thing is, git's great, but it was also grown, over time, and that means that there's some amount of incoherence.
Furthermore, it's a leaky abstraction, that is, some commands only make sense if you grok the underlying model. See the perennial complaints about how 'git checkout' does more than one thing. It doesn't. But only if you understand the underlying model. If you think about it from a workflow perspective, it feels inconsistent. Hence why newer commands (like git switch) speak to the workflow, not to the model.
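A small illustration of that point (branch and path names are placeholders):

    # old interface: one command, two very different jobs
    $ git checkout my-branch        # move HEAD to another branch
    $ git checkout -- path/to/file  # discard local edits to a file
    # newer commands (git 2.23+) named after the workflow instead
    $ git switch my-branch
    $ git restore path/to/file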
Furthermore, some features just feel tacked on. Take stashing, for example. These are pseudo-commits, that exist outside of your real commit graph. As a feature, it doesn't feel well integrated into the rest of git.
Rebasing is continually re-applying `git am`. This is elegant in a UNIXy way, but is annoying in a usability way. It's slow, because it goes through the filesystem to do its job. It forces you to deal with conflicts right away, because git has no way of modelling conflicts in its data model.
Basically, git's underlying model is great, but not perfect, and its CLI was grown, not designed. As such, it has weird rough edges. But that doesn't mean it's a bad tool. It's a pretty darn good one. But you can say that it is while acknowledging that it does have shortcomings.
Are you worried about hash collisions from different objects? The probability of a collision among N distinct objects with SHA-1 is about (N choose 2) / 2^160. For a trillion objects the probability is about 3.4 x 10^-25. I think we can safely write code without collisions until the sun goes supernova.
> patches and tarballs workflow is sort of the first distributed version control system - everyone has a local copy, the changes can be made locally, access to "merge" is whomever can push a new tarball to the server.
Nitpick, but that's not what makes it a distributed workflow. It's distributed because anyone can run patch locally and serve the results themselves. There were well-known alternative git branches back then, like the "mm" tree run by Andrew Morton.
The distributed nature of git is one of the most confusing for people who have been raised on centralised systems. The fact that your "master" is different to my "master" is something people have difficulty with. I don't think it's a huge mental leap, but too many people just start using Gitlab etc. without anyone telling them.
GitHub executed better than Bitbucket. And the Ruby community adopted GitHub early. Comparisons around 2010 said GitHub's UX and network effects were top reasons to choose Git. Mercurial's UX and Windows support were top reasons to choose Mercurial.
For me it won (10+ years ago) because for some reason git (a deeply Linux-oriented piece of software) had better Windows support than Mercurial (which boasted about its Windows support). You could even add files with names in various writing systems to git. I am not sure that Mercurial can do that even now.
Mercurial on windows was "download tortoisehg, use it", whereas git didn't have a good GUI and was full of footguns about line endings and case-insensitivity of branch names and the like.
Nowadays I use sublime merge on Windows and Linux alike and it's fine. Which solves the GUI issue, though the line ending issue is the same as it's always been (it's fine if you remember to just set it to "don't change line endings" globally but you have to remember to do that), and I'm not sure about case insensitivity of branch names.
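For reference, the global "don't touch my line endings" setting is roughly:

    # never rewrite line endings on checkout or commit
    $ git config --global core.autocrlf false
    # or enforce it per-repo regardless of anyone's local config
    $ echo '* -text' >> .gitattributes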
Pretty sure Mercurial handles arbitrary filenames as UTF-8 encoded bytestrings, whether there was a problem with this in the past I can't recall, but would be very surprised if there was now.
Edit: does seem there at least used to be issues around this:
When evaluating successors to CVS back in 2007, Mozilla chose Mercurial because it had better Windows support than git. 18 years later, Mozilla is now migrating from Mercurial to git.
since it seems it has been forgotten, remember the reason Git was created is that Larry McVoy, who ran BitMover, which had been donating proprietary software licenses for BitKeeper to core kernel devs, got increasingly shirty at people working on tools to make BK interoperate with Free tools, culminating in Tridge showing in an LCA talk that you could telnet to the BK server and it would just spew out the whole history as SCCS files.
Larry shortly told everyone he wasn't going to keep giving BK away for free, so Linus went off for a weekend and wrote a crappy "content manager" called git, on top of which perhaps he thought someone might write a proper VC system.
I think that was the first time I ever saw Tridge deliver a conference presentation and it was to a packed lecture theatre at the ANU.
He described how he 'hacked' BitKeeper by connecting to the server via telnet and using the sophisticated hacker tools at his disposal to convince Bitkeeper to divulge its secrets, he typed:
Speaking of git front ends, I want to give a shout-out to Jujutsu. I suspect most people here have probably at least heard of it by now, but it has fully replaced my git cli usage for over a year in my day to day work. It feels like the interface that jj provides has made the underlying git data structures feel incredibly clear to me, and easy to manipulate.
Once you transition your mental model from working branch with a staging area to working revision that is continuously tracking changes, it's very hard to want to go back.
This is exciting, convergence is always good, but I'm confused about the value of putting the tracking information in a git commit header as opposed to a git trailer [1] where it currently lives.
In both cases, it's just metadata that tooling can extract.
Edit: then again, I've dealt with user error with the fragile semantics of trailers, so perhaps a header is just more robust?
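For context, a Change-Id trailer is just a "Key: value" line at the end of the commit message, and stock git can already write and read it (the id below is a placeholder):

    # add the trailer to the current commit's message
    $ git commit --amend --no-edit --trailer 'Change-Id: I7a1f9e2c0b00...'
    # pull it back out of history
    $ git log -1 --format='%(trailers:key=Change-Id)'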
Mostly because it is jarring for users that want to interact with tools which require these footers -- and the setups to apply them, like Gerrit's change-id script -- are often annoying, for example supporting Windows users but without needing stuff like bash. Now, I wrote the prototype integration between Gerrit and Jujutsu (which is not mainline, but people use it) and it applies Change-Id trailers automatically to your commit messages, for any commits you send out. It's not the worst thing in the world and it is a little fiddly bit of code.
But ignore all that: the actual _outcome_ we want is that it is just really nice to run 'jj gerrit send' and not think about anything else, and that you can pull changes back in (TBD) just as easily. I was not ever going to be happy with some solution that was like, "Do some weird git push to a special remote after you fix up all your commits or add some script to do it." That's what people do now, and it's not good enough. People hate that shit and rail at you about it. They will make a million reasons up why they hate it; it doesn't matter though. It should work out of the box and do what you expect. The current design does that now, and moving to use change-id headers will make that functionality more seamless for our users, easier to implement for us, and hopefully it will be useful to others, as well.
In the grand scheme it's a small detail, I guess. But small details matter to us.
I don't know if it's the only or original reason, but one nice consequence of the reverse hex choice is that it means change IDs and commit IDs have completely different alphabets ('0-9a-f' versus 'z-k'), so you can never have an ambiguous overlap between the two.
Jujutsu mostly doesn't care about the real "format" of a ChangeId, though. It's "really" just any arbitrary Vec<u8> and the backend itself has to define in some way and describe a little bit; the example backend has a 64-byte change ID, for example.[1] To the extent the reverse hex format matters it's mostly used in the template language for rendering things to the user. But you could also extend that with other render methods too.
That's a downside of using headers, not a reason for using them. If upstream git changes to help this, it would involve having those preserve the headers. (though cherry-pick has good arguments of preserving vs generating a new one)
Not parent, but for me it was a couple hours of reading (jj docs and steve's tutorial), another couple hours playing around with a test repo, then a couple weeks using it in place of git on actual projects where I was a bit slower. After that it's been all net positive.
Been using it on top of git, collaborating with people via Github repos for ~11 mos now. I'm more efficient than I was in git, and it's a smoother experience. Every once in a while I'll hit something that I have to dig into, but the Discord is great for help. I don't ever want to go back to git.
As with anything, it varies: I've heard some folks say "a few hours" and I've had friends who have bounced off two or three times before it clicks.
Personally, I did some reading about it, didn't really stick. One Saturday morning I woke up early and decided to give it a real try, and I was mostly good by the end of the day, swore of git entirely a week later.
> I assume the organization uses git and you use jujitsu locally, as a layer on top?
This is basically 100% of usage outside of Google, yeah. The git backend is the only open source one I'm aware of. Eventually that will change...
It's really not that long. Once you figure out that
1. jj operates on revisions. Changes to revisions are tracked automatically whenever you run jj CLI
2. revisions are mutable and created before starting working on a change (unlike immutable commits, created after you are done)
3. you are not working on a "working directory" that has to be "committed"; you are just editing the latest revision
everything just clicks and feels very natural and gets out of the way. Want to create a new revision, whether it's a merge, a new branch, or even inserting a revision between some other revisions? That's jj new. Want to move/reorder revisions? That's jj rebase. Want to squash or split revisions? That's jj squash and jj split respectively. A much more user-friendly conflict resolution workflow is a really nice bonus (although, given that jj does rebasing automatically, it's more of a requirement)
One notable workflow difference is the absence of branches in the git sense, and getting used to mainly referring to individual revisions, but after understanding the things above, such a workflow makes perfect sense.
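For the curious, a minimal session looks something like this (assuming a repo with a main bookmark; the message is made up):

    # start a new working revision on top of main; no staging, no "commit" step
    $ jj new main
    # edits are picked up automatically; describe the revision whenever you like
    $ jj describe -m "teach the frobnicator to frob twice"
    # inspect the revision graph
    $ jj log
    # reshuffle as needed: move a revision elsewhere in the graph, or split it up
    $ jj rebase -d main
    $ jj split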
I don't really remember exactly how long until I felt particularly comfortable with it. Probably on the order of days? I have never really used any other VCS besides vanilla git before, so I didn't really have any mental model of how different VCSs could be different. The whole working revision vs working branch + staging area was the biggest hurdle for me to overcome, and then it was off to the races.
We should say thank you to the greedy BitKeeper VCS owners, who wanted Linus Torvalds to pay them money for keeping the Linux source in their system. They managed to piss off Linus sufficiently, so he sat down and created Git.
I think the git usage patterns we've developed and grown accustomed to are proving inadequate for AI-first development. Maybe under the hood it will still be git, but the DX needs a huge revamp.
If I'm working in Cursor, for example, ideally the entire chat history and the proposed changes after each prompt need to be stored. That doesn't fit cleanly into current git development patterns. I don't want to have to type commit messages any more. If I ever need to look at the change history, let AI generate a summary of the changes at that point. Let me ask an LLM questions about a given set (or range) of changes if I need to. We need a more natural branching model. I don't want to have to create branches and switch between them and merge them. Less typing, more speaking.
There’s something that bothers me about these sorts of recollections that make git seem… inevitable.
There’s this whole creation myth of how Git came to be that kind of paints Linus as some prophet reading from golden tablets written by the CS gods themselves.
Granted, this particular narrative in the blog post does humanise a bit more, remembering the stumbling steps, how Linus never intended for git itself to be the UI, how there wasn’t even a git commit command in the beginning, but it still paints the whole thing in somewhat romantic tones, as if the blob-tree-commit-ref data structure were the perfect representation of data.
One particular aspect that often gets left out of this creation myth, especially by the author of Github is that Mercurial had a prominent role. It was created by Olivia Mackall, another kernel hacker, at the same time as git, for the same purpose as git. Olivia offered Mercurial to Linus, but Linus didn’t look upon favour with it, and stuck to his guns. Unlike git, Mercurial had a UI at the very start. Its UI was very similar to Subversion, which at the time was the dominant VCS, so Mercurial always aimed for familiarity without sacrificing user flexibility. In the beginning, both VCSes had mind share, and even today, the mindshare of Mercurial lives on in hg itself as well as in worthy git successors such as jujutsu.
And the git data structure isn’t the only thing that could have ever possibly worked. It falls apart for large files. There are workaround and things you can patch on top, but there are also completely different data structures that would be appropriate for larger bits of data.
Git isn’t just plain wonderful, and in my view, it’s not inevitable either. I still look forward to a world beyond git, whether jujutsu or whatever else may come.
A lot of the ideas around git were known at this time. People mentioned monotone already. Still, Linus got the initial design wrong by computing the hash of the compressed content (which is a performance issue and also would make it difficult to replace the compression algorithm). Something I had pointed out early [1] and he later changed it.
I think the reason git then was successful was because it is a small, practical, and very efficient no-nonsense tool written in C. This made it much more appealing to many than the alternatives written in C++ or Python.
[1]: https://marc.info/?l=git&m=111366245411304&w=2
And because of Linux offering free PR for git (especially since it was backed by the main Linux dev).
Human factors matter, as much as programmers like to pretend they don't.
I'm curious why you think hg had a prominent role in this. I mean, it did pop up at almost exactly the same time for exactly the same reasons (BK, kernel drama) but I don't see evidence of Matt's benchmarks or development affecting the Git design decisions at all.
Here's one of the first threads where Matt (Olivia) introduces the project and benchmarks, but it seems like the list finds it unremarkable enough comparatively to not dig into it much:
https://lore.kernel.org/git/Pine.LNX.4.58.0504251859550.1890...
I agree that the UI is generally better and some decisions where arguably better (changeset evolution, which came much later, is pretty amazing) but I have a hard time agreeing that hg influenced Git in some fundamental way.
Please don't do that. Don't deadname someone.
I'm not saying that hg influenced git, but I'm saying that at the time, both were seen as worthy alternatives. Lots of big projects were using hg at one point: Python, Mozilla, Netbeans, Unity.
Sure, you managed to get Github in front of everyone's face and therefore git. For a while, Bitbucket was a viable alternative to many.
Were you involved with the decision to sponsor hg-git? I understand that at one point it was hoped that this would help move more people from hg into Github, just like Subversion support for Github would. I think the latter is still there.
"One particular aspect that often gets left out of this creation myth, especially by the author of Github is that Mercurial had a prominent role." implies to me that Hg had a role in the creation of Git, which is why I was reacting to that.
For the deadnaming comment, it wasn't out of disrespect, but when referring to an email chain, it could otherwise be confusing if you're not aware of her transition.
I wasn't sponsoring hg-git, I wrote it. I also wrote the original Subversion bridge for GitHub, which was actually recently deprecated.
https://github.blog/news-insights/product-news/sunsetting-su...
> For the deadnaming comment, it wasn't out of disrespect, but when referring to an email chain, it could otherwise be confusing if you're not aware of her transition.
I assumed it was innocent. But the norm when naming a married woman or another person who changed their name is to call them their current name and append the clarifying information. Not vice versa. Jane Jones née Smith. Olivia (then Matt).
> Please don't do that. Don't deadname someone.
Is this not a case where it is justified, given that she at that time was named Matt, and it's crucial information to understand the mail thread linked to? I certainly would not understand at all without that context.
Wait a second. You're saying now hg didn't influence git, but how does that fit with your previous comment?
> One particular aspect that often gets left out of this creation myth, especially by the author of Github is that Mercurial had a prominent role
I'm not sure where you're getting your facts from.
I do think an open source, distributed, content addressable VCS was inevitable. Not git itself, but something with similar features/workflows.
Nobody was really happy with the VCS situation in 2005. Most people were still using CVS, or something commercial. SVN did exist, it had only just reached version 1.0 in 2004, but your platforms like SourceForge still only offered CVS hosting. SVN was considered to be a more refined CVS, but it wasn't that much better and still shared all the same fundamental flaws from its centralised nature.
On the other hand, "distributed" was a hot new buzzword in 2005. The recent success of Bittorrent (especially its hot new DHT feature) and other file sharing platforms had pushed the concept mainstream.
Even if it wasn't for the Bitkeeper incident, I do think we would have seen something pop up by 2008 at the latest. It might not have caught on as fast as git did, but you must remember the thing that shot git to popularity was GitHub, not the linux kernel.
The file-tree-snapshot-ref structure is pretty good, but it lacks chunking at the file and tree layers, which makes it inefficient with large files and trees that don't change a lot. Modern backup tools like restic/borg/etc use something similar, but with chunking included.
> there are also completely different data structures that would be appropriate for larger bits of data.
Can you talk a little bit about this? My assumption was that the only way to deal with large files properly was to go back to centralised VCS, I'd be interested to hear what different data structures could obviate the issue.
>And the git data structure... falls apart for large files.
I'm good with this. In my over 25 years of professional experience, having used cvs, svn, perforce, and git, it's almost always a mistake keeping non-source files in the VCS. Digital assets and giant data files are nearly always better off being served from artifact repositories or CDN systems (including in-house flavors of these). I've worked at EA Sports and Rockstar Games and the number of times dev teams went backwards in versions with digital assets can be counted on the fingers of a single hand.
Non-source files should indeed never be in the VCS, but source files can still be binary, or large, or both. It depends on how you are editing the source and building the source into non-source files.
I think this conflates "non-source" with "large". Yes, it's often the case that source files are smaller than generated output files (especially for graphics artifacts), but this is really just a lucky coincidence that prevents the awkwardness of dealing with large files in version control from becoming as much of a hassle as it might be. Having a VCS that dealt with large files comfortably would free our minds and open up new vistas.
I think the key issue is actually how to sensibly diff and merge these other formats. Levenshtein-distance-based diffing is good enough for many text-based formats (like typical program code), but there is scope for so much better. Perhaps progress will come from designing file formats (including binary formats) specifically with "diffability" in mind -- similar to the way that, say, Java was designed with IDE support in mind.
Another alternative is the patch-theory approach from Darcs and now Pijul. It's a fundamentally different way of thinking about version control—I haven't actually used it myself but, from reading about it, I find thinking in patches matches my natural intuition better than git's model. Darcs had some engineering limitations that could lead to really bad performance in certain cases, but I understand Pijul fixes that.
I was a bit confused about the key point of patch-based versus snapshot-based, but I got some clarity in this thread: https://news.ycombinator.com/item?id=39453146
In early 2000s I was researching VCSs for work and also helping a little developing arch, bazaar then (less so) bzr. I trialed Bitkeeper for work. We went with Subversion eventually. I think I tried Monotone but it was glacially slow. I looked at Mercurial. It didn't click.
When I first used Git I thought YES! This is it. This is the one. The model was so compelling, the speed phenomenal.
I never again used anything else unless forced -- typically Subversion, mostly for inertia reasons.
The article is written by a co-founder of github and not Linus Torvalds.
git is just a tool to do stuff. Its name (chosen by that Finnish bloke) is remarkably apt - it's for gits!
It's not Mercurial, nor GitHub, nor anything else. It's git.
It wasn't invented for you or you or even you. It was a hack to do a job: sort out control of the Linux kernel source when BitKeeper went off the rails as far as the Linux kernel devs were concerned.
It seems to have worked out rather well.
> There’s this whole creation myth of how Git came to be that kind of paints Linus as some prophet reading from golden tablets written by the CS gods themselves.
What?
> Git isn’t just plain wonderful, and in my view, it’s not inevitable either.
I mean, the proof is in the pudding. So why did we end up with Git? Was it just dumb luck? Maybe. But I was there at the start for both Git and Mercurial (as I comment elsewhere in this post). I used them both equally at first, and as a Python aficionado should've gravitated to Mercurial.
But I like to understand how tools work, and I personally found Mercurial harder to understand, slower to use, and much less flexible. It was great for certain workflows, but if those workflows didn't match what you wanted to do, it was rigid (I can't really expound on this; it's been more than a decade). Surprisingly (as I was coding almost entirely in Python at the time), I also found it harder to contribute to than Git.
Now, I'm just one random guy, but here we are, with the not plain wonderful stupid (but extremely fast) directory content manager.
> But I like to understand how tools work, and I personally found Mercurial harder to understand, slower to use, and much less flexible.
It's a relief to hear someone else say something like this; it's so rare to find anything but praise for Mercurial in threads like these.
It was similar for me: In the early/mid 2010s I tried both git and mercurial after having only subversion experience, and found something with how mercurial handled branches extremely confusing (don't remember what, it's been so long). On the other hand, I found git very intuitive and have never had issues with it.
For me, the real problem at the time is that "rebase" was a second class feature.
I think too many folks at the time thought that full immutability was what people wanted and got hung up on that. Turns out that almost everyone wanted to hide their mistakes, badly structured commits, and typos out of the box.
It didn't help that mercurial was slower as well.
Good point. Git succeeded in the same way that Unix/Linux succeeded. Yes, it sucks in many ways, but it is flexible and powerful enough to be worth it. Meanwhile, something that is "better" but not flexible, powerful, or hackable is not evolutionarily successful.
In fact, now that I've used the term "evolution", Terran life/DNA functions much the same way. Adaptability trumps perfection every time.
Don’t forget Fossil, which started around the same time…
https://fossil-scm.org/home/doc/trunk/www/history.md
I just wish they'd extend git to have better binary file diffs and moved file tracking.
Remembering the real history matters, because preserving history is valuable by itself. But I'm also really glad that VCS is, for most people, completely solved: there's nothing besides Git you have to pay attention to; you learn it once and use it your whole career.
> I just wish they'd extend git to have better binary file diffs
It's not built-in to git itself, but I remember seeing demos where git could be configured to use an external tool to do a visual diff any time git tried to show a diff of image files.
> and moved file tracking.
Check out -C and -M in the help for git log and blame. Git's move tracking is a bit weirder than others (it reconstructs moves/copies from history rather than recording them at commit), but I've found it more powerful than others because you don't need to remember a special "move" or "copy" command, plus it can track combining two files in a way others can't.
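For example (the path here is made up):

    # Follow one file's history across renames
    git log --follow -p -- src/widget.c

    # Show renames/copies detected in each commit of the log
    git log -M -C --summary

    # Blame with move/copy detection: one -C looks at files changed in the
    # same commit, two also check the commit that created the file,
    # three check every commit
    git blame -C -C -C src/widget.c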
I am rooting for pijul.
I was always under the impression Monotone - which was released two years before Mercurial - was the inspiration for git, and that this was pretty well known.
This is all fairly speculative, but I didn't get the impression that Monotone was a main inspiration for Git. I think BitKeeper was, in that it was a tool that Linus actually liked using. Monotone had the content-addressable system, which was clearly an inspiration, but that's the only thing I've seen Linus reference from Monotone. He tried using it, bailed because it was slow, but took the one idea he found interesting and built a very different thing with that concept as one part of it. That's how I would interpret the history between these projects.
Linus was definitely aware of and mentioned Monotone. But to call it an inspiration might be too far. Content Addressable Stores were around a long time before that, mostly for backup purposes afaik. See Plan9's Venti file system.
Yes, Monotone partly inspired both. You can see that both hash contents. But both git and hg were intended to replace Bitkeeper. Mercurial is even named after Larry McVoy, who changed his mind. He was, you know, mercurial in his moods.
> There’s this whole creation myth of how Git came to be that kind of paints Linus as some prophet reading from golden tablets written by the CS gods themselves.
Linus absolutely had a couple of brilliant insights:
1. Content-addressable storage for the source tree.
2. Files do not matter: https://gist.github.com/borekb/3a548596ffd27ad6d948854751756...
At that time, I was using SVN and experimenting with Hg and Bazaar. Both were too "magical" for me, with unclear rules for merging, branching, rebasing.
Then came git. I read its description "source code trees, identified by their hashes, with file content movement deduced from diffs", and it immediately clicked. It's such an easy mental model, and you can immediately understand what operations mean.
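You can still poke at that model directly with plumbing commands in any repo (the file name below is just a placeholder):

    # A commit is a small text object pointing at a tree, plus metadata
    git cat-file -p HEAD

    # The tree lists (mode, type, hash, name) per entry; blobs are file
    # contents addressed purely by their hash
    git cat-file -p 'HEAD^{tree}'

    # The hash of any content can be computed without committing anything
    git hash-object README.md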
> 2. Files do not matter
I wish weekly for explicit renames.
> At that time, I was using SVN and experimenting with Hg and Bazaar. Both were too "magical" for me, with unclear rules for merging, branching, rebasing.
I have no idea what you mean.
> It's such an easy mental model, and you can immediately understand what operations mean.
Clearly, many people disagree.
Do you happen to know what Linus didn't like about Mercurial?
A lot of things in Mercurial kind of geared you towards using it more like Subversion was used. You pretty much could use Mercurial just like git was, and is, used, but the defaults didn't guide you in that direction.
One bigger difference I can think of: Mercurial has permanently named branches (the branch name is written into the commit), whereas in git branches are just named pointers. Mercurial got bookmarks in 2008 as an extension, and they were added to the core in 2011. If you used unnamed branches and bookmarks, you could use Mercurial exactly like git. But git was published in 2005.
Another is git's staging area. You can get pretty much the same functionality by repeatedly using `hg commit --amend`, but again, in git the defaults gear you towards using the staging approach, while in Mercurial you have to specifically search for a way to get it to function this way.
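Roughly, the two defaults side by side (a sketch from memory; the file names are invented):

    # git: build the commit up in the staging area before committing
    git add src/parser.c          # stage just what belongs in this commit
    git add -p src/lexer.c        # or stage selected hunks interactively
    git commit -m "Refactor tokenizer"

    # hg: commit first, then keep folding further work into that commit
    hg commit -m "Refactor tokenizer"
    hg commit --amend             # absorb the next round of edits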
I wonder this too. My guess is that he did not like "heavyweight" branching and the lack of cherry-pick/rebase. At any rate that is why I didn't like it back then.
Sun Microsystems (RIP) back then went with Mercurial instead of Git mainly because Mercurial had better support for file renames than Git did, but at Sun we used a rebase workflow with Mercurial even though Mercurial didn't have a rebase command. Sun had been using a rebase workflow since 1992. Rebase with Mercurial was wonky, but we were used to wonky workflows with Teamware anyways. Going with Mercurial was a mistake. Idk what Oracle does now internally, but I bet they use Git. Illumos uses Git.
I watched that whole process with fascination. It was long, careful, thorough ... and chose wrong.
A part of me thinks that there was a Sun users aversion to anything Linux related.
> A part of me thinks that there was a Sun users aversion to anything Linux related.
It wasn't that. It really was just about file renaming.
Ironically hg now has better rebasing than git e.g. the evolve extension
Ah, right, at Sun we used MQ. But anyways, just glancing at the hg evolve docs I'm unconvinced. And anyways, it's an "extension". Mercurial fought rebase for a long time, and they did dumb things like "rebase is non-interactive, and histedit is for when you want to edit the history".
And that is partly why Mercurial lost. They insisted on being opinionated about workflows being merge-based. Git is not opinionated. Git lets you use merge workflows if you like that, and rebase workflows if you like that, and none of this judgment about devs editing local history -- how dare the VCS tell me what to do with my local history?!
I may be misremembering, but C vs Python was a part of it. I don't think Linus thought too highly of Python, or any interpreted language except perhaps shell, and he didn't want to deal with installing and managing Python packages.
> and didn't want to deal with installing and managing python packages.
Based on the fact that the ecosystem torpedoed an entire major version of the language, and that there are a bazillion competing and incompatible dependency managers, it seems that bet turned out well.
I like Python and hate to admit it but you’re right.
Linus worried Mercurial was similar enough to BitKeeper that BitMover might threaten people who worked on it. Probably he had other complaints too.
Around 2002 or so, I had an idea to tag every part of a project with a unique hash code. With a hash code, one could download the corresponding file. A hash code for the whole project would be a file containing a list of hash codes for the files that make up the project. Hash codes could represent the compiler that builds it, along with the library(s) it links with.
I showed it to a couple of software entrepreneurs (Wild Tangent and Chromium), but they had no interest in it.
I never did anything else with it, and so it goes.
I had actually done a writeup on it, and thought I had lost it. I found it, dated 2/15/2002:
---
Consider that any D app is completely specified by a list of .module files and the tools necessary to compile them. Assign a unique GUID to each unique .module file. Then, an app is specified by a list of .module GUIDs. Each app is also assigned a GUID.
On the client's machine is stored a pool of already downloaded .module files. When a new app is downloaded, what is actually downloaded is just a GUID. The client sees if that GUID is an already built app in the pool, then he's done. If not, the client requests the manifest for the GUID, a manifest being a list of .module GUIDs. Each GUID in the manifest is checked against the client pool, any that are not found are downloaded and added to the pool.
Once the client has all the .module files for the GUIDs that make up an app, they can all be compiled, linked, and the result cached in the pool.
Thus, if an app is updated, only the changed .module files ever need to get downloaded. This can be taken a step further and a changed .module file can be represented as a diff from a previous .module.
Since .module files are tokenized source, two source files that differ only in comments and whitespace will have identical .module files.
There will be a master pool of .module files on WT's server. When an app is ready to release, it is "checked in" to the master pool by assigning GUIDs to its .module files. This master pool is what is consulted by the client when requesting .module files by GUID.
The D "VM" compiler, linker, engine, etc., can also be identified by GUIDs. This way, if an app is developed with a particular combination of tools, it can specify the GUIDs for them in the manifest. Hence the client will automatically download "VM" updates to get the exact tools needed to duplicate the app exactly.
yeah, allow me to introduce you to the Nix whitepaper, which is essentially this, and thus worth a read for you:
https://edolstra.github.io/pubs/nspfssd-lisa2004-final.pdf
Another possibly related idea is the language Unison:
https://www.unison-lang.org/
Thank you. Looks like my idea precedes Nix by 2 years!
NixOS may end up being "the last OS I ever use" (especially now that gaming is viable on it):
https://nixos.org/
Check it out. The whitepaper's a fairly digestible read, too, and may get you excited about the whole concept (which is VERY different from how things are normally done, but ends up giving you guarantees)
The problem with NixOS is that all the effort to capture software closures is rendered moot by Linux namespaces, which are a more complete solution to the same problem.
Of course we didn't have them when the white paper was written, so that's fair but technology has moved on.
Nix(OS) is aware of namespaces, and can use them (in fact, the aforementioned gaming support relies on them), but versioning packages still works better than versioning the system in most cases.
Consider three packages, A, B, and C. B has two versions, A and C have one.
- A-1.0.0 depends on B-2.0.0 and C-1.0.0.
- C-1.0.0 depends on B-1.0.0.
If A gets a path to a file in B-2.0.0 and wants to share it with C (for example, C might provide binaries it can run on files, or C might be a daemon), it needs C to be in a mount namespace with B-2.0.0. However, without Nix-store-like directory structure, mounting B-2.0.0's files will overwrite B-1.0.0's, so C may fail to start or misbehave.
I don't think that's true. How would you compile a program that has conflicting dependencies with a Linux namespace?
Sounds like it's also halfway to a version of Nix designed specifically for D toolchains, too, using GUIDs instead of hashing inputs.
It wasn't designed specifically for D toolchains, that was just an example of what it could do.
Interesting. I thought calling a program an "app" came with the smartphone era much later.
Your description (including the detailed description in the reply) seems to be missing the crucial difference that git uses - the hash code of the object is not some GUID, it is literally the hash of the content of the object. This makes a big difference as you don't need some central registry that maps the GUID to the object.
There doesn't need to be a single central repository, there can be many partial ones. But if they are merged, they won't collide.
The GUID can certainly be a hash.
> The GUID can certainly be a hash.
It can’t be, because a GUID is supposed to be globally unique. The point is, it needs to instead be the hash of the content.
This can’t be an afterthought.
UUID versions 3 and 5 are derived from hashes (MD5 and SHA1 respectively).
GUID and UUID are different.
The RFC defining them says they're the same and has since the earliest draft I can find, also from 2002. You should offer more explanation when you take a stance contrary to what is well documented.
A hash is not globally unique. I'm not sure what more explanation is needed.
Check my other comment as to how GUIDs are created in many ioquake3 forks. They MD5 hashed a randomly generated file.
Theoretically speaking, UUIDs carry a semantic guarantee that each generated identifier is unique across all systems, times, and contexts, whereas cryptographic hashes are deterministic functions (i.e. they produce the same output for the same input); there is no inherent randomness or timestamping unless you deliberately inject it, such as the way ioquake3 forks did with their GUIDs.
UUIDv4 has 122 usable bits, so roughly a 1 in 2^122 chance that two random UUIDs collide, whereas SHA-512 and BLAKE2b have 512-bit outputs, which gives 2^256 collision resistance, bounded by the birthday problem.
In any case, SHA-256, SHA-512, BLAKE2b (cryptographic hashes) are unique in practice, meaning they are extremely unlikely to collide, more so than UUIDv4, despite UUIDv4 being non-deterministic, while cryptographic hashes are deterministic.
Of course, you should still know when to use cryptographic hashes vs. UUIDs. UUIDs are good for database primary keys, identifying users globally, tracking events, and so on; verifying file content, deduplicating data by content, and tamper detection are the job of a cryptographic hash.
But to cut to the chase: GUIDs (Globally Unique Identifiers) are also known as UUIDs (Universally Unique Identifiers), so they are the same!
I hope this answers OP's (kbolino) question. He was right, GUIDs are the same as UUIDs. Parent confused GUIDs with cryptographic hashes, most likely.
---
FWIW, collision resistance (i.e. birthday bound) is not improved by post-quantum algorithms. It remains inherently limited by 2^{n/2}, no matter what, as long as they use hashing.
---
TL;DR: GUIDs (Globally Unique Identifiers) are also known as UUIDs (Universally Unique Identifiers), so they are the same, i.e. GUIDs and UUIDs are NOT different!
How so? I thought they are the same, at least almost.
Tremulous (ioquake3 fork) had GUIDs from qkeys.
https://icculus.org/pipermail/quake3/2006-April/000951.html
You can see how qkeys are generated, and essentially a GUID is:
> On startup, the client engine looks for a file called qkey. If it does not exist, 2KiB worth of random binary data is inserted into the qkey file. A MD5 digest is then made of the qkey file and it is inserted into the cl_guid cvar.
So, in this case, the GUID is the MD5 hash of the generated qkey file. See "CL_GenerateQKey" for details.
UUIDs have RFCs, GUIDs apparently do not, but AFAIK UUIDs are also named GUIDs, so...
Bitkeeper maybe somewhat of a precedent (2000)?
Every git repo has a copy of that mapping instead of there being a central registry, though. And because the commit author's name and email, the date of the commit, and the commit message (among other things) go into the hash that represents a commit, it's not that big a difference, is it? Given a collection of files, but not the git repo they're from, and libgit, I can't say whether those files match a git tag hash if I don't also have the metadata that makes up the commit, and not just the files inside of it.
Yes, but the commit object (which includes metadata) references a tree object by its hash. The tree object is a text representation of a directory tree, basically, referencing file blobs by hash. So yes, you can recognize identical files between commits. It's true there's no fast indexing: if you want to ask the question "which commits contain exactly this file?" you have to search every commit. But you don't need to delta the file contents itself.
but people don't use the file hash, that's internal to git. I go to the centralized repository of repositories at github.com and look up tagged version 1.0.0 of whatever software, which refers to a git tag which references a commit hash (which yes it references a tree object as you said).
"People" don't commonly use them, no. But it's a real and documented API to do this (see e.g. https://git-scm.com/book/en/v2/Git-Internals-Git-Objects).
And in any case you had a specific requirement above ("Given a collection of files, but not the git repo they're from, and libgit, I can't say if those files match a git tag hash"), and in fact this can be done!
The git tag hash references a commit. Without the commit metadata, you don't have a tree object and thus don't know any hashes. You can take the files on disk and compute the hash and furthermore you can take that hash and make a tree object. but without the commit, all you can say is you have a tree object, you don't have a tree object for the commit in question to compare it to.
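For what it's worth, the plumbing for the comparison looks roughly like this (the tag name and paths are invented), and it does illustrate the point: the second command still needs the real repo to tell you which tree hash the tag ultimately points at.

    # In a scratch repo containing only the files in question
    git init -q scratch && cd scratch
    cp -r /path/to/files/. . && git add -A
    git write-tree                    # tree hash of exactly these files

    # In the real repo: tag -> commit -> tree is one command away
    git rev-parse 'v1.0.0^{tree}'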
That's for human consumption though, which is what frustrates so many "hashing will solve everything!" schemes - it breaks as soon as you need a bug fix.
At the end of the day none of us want "exactly this hash" we want "latest". Exact hashes and other reproducibility are things which are useful when debugging or providing traceability - valuable but also not the human side of the equation.
Isn't this basically... a Merkle Tree, the underlying storage architecture of things like git and Nix?
https://en.wikipedia.org/wiki/Merkle_tree
Except that instead of a GUID, it's just a hash of the binary data itself, which ends up being more useful because it is a natural key and doesn't require storing a separate mapping
I'd never heard of a Merkle Tree before, thanks for the reference.
While I get how that's like git, it sounds even closer to unison:
https://softwaremill.com/trying-out-unison-part-1-code-as-ha...
20 years later :-)
Hey Walter, what would you improve with Git?
Git hasn't quite taken the step of making the hash the URL you use to download a file, any file, and be assured it is exactly what you thought it was, as the hash of the file must match its URL.
This is currently done in a haphazard way, not particularly organized.
git over ipfs then?
I believe that's approximately what this is trying to do https://radicle.xyz/#:~:text=radicle%20is%20an%20open%20sour... although evidently using a custom protocol not ipfs itself
Similarly but I also had rsync or rdiff as a central character in my mental model of a VCS.
So you invented nix :-D
> I started using Git for something you might not imagine it was intended for, only a few months after it’s first commit
I started using git around 2007 or so because that company I worked for at the time used ClearCase, without a doubt the most painful version manager I have ever used (especially running it from a Linux workstation). So I wrote a few scripts that would let me mirror a directory into a git repo, do all my committing in git, then replay those commits back to ClearCase.
I can't recall how Git came to my attention in the first place, but by late 2008 I was contributing patches to Git itself. Junio was a kind but exacting maintainer, and I learned a lot about contributing to open source from his stewardship. I even attended one of the early GitTogethers.
As far as I can recall, I've never really struggled with git. I think that's because I like to dissect how things work, and under the covers git is quite simple. So I never had too much trouble with its terribly baroque CLI.
At my next job, I was at a startup that was building upon a fork of Chromium. At the time, Chromium was using subversion. But at this startup, we were using git, and I was responsible for keeping our git mirror up-to-date. I also had the terrible tedious job of rebasing our fork with Chromium's upstream changes. But boy did I get good at resolving merge conflicts.
Git may be the CLI I've used most consistently for nearly two decades. I'm disappointed that GitHub became the main code-review tool for Git, but I'll never be disappointed that Git beat out Mercurial, which I always found overly rigid and was never able to adapt to my workflow.
> I started using git around 2007 or so because that company I worked for at the time used ClearCase, without a doubt the most painful version manager I have ever used
Ah, ClearCase! The biggest pain was in your wallet! I saw the prices my company paid per-seat for that privilege -- yikes!
As someone who wrote my first line of code in approx 2010 and used git & GH for the first time in… 2013? it kind of amazes me to remember that Git is only 20 years old. GitHub for instance doesn’t seem surprising to me that is <20 years old, but `git` not existing before 2005 somehow always feels shocking to me. Obviously there were other alternatives (to some extent) for version control, but git just has the feeling of a tool that is timeless and so ingrained in the culture that it is hard to imagine (for me) the idea of people being software developers in the post-mainframe age without it. It feels like something that would have been born in the same era as Vim, SSH, etc (ie early 90s). This is obviously just because from the perspective of my programming consciousness beginning, it was so mature and entrenched already, but still.
I’ve never used other source control options besides git, and I sometimes wonder if I ever will!
What surprises me more is how young Subversion is in comparison to git, it's barely older.
I guess I started software dev at a magic moment pre-git but after SVN was basically everywhere, but it felt even more like it had been around forever vs the upstart git.
I'm old enough to have used RCS. Very primitive and CVS was soon in use. Git is a breath of fresh air compared to these ones.
Any version control where you had to manually (and globally) "check out" (lock) files for editing was terrible and near unusable above about 3 people.
Version control systems where you didn't have shallow branches ( and thus each "branch" took a full copy / disk space of files) were awful.
Version control systems which would corrupt their databases (here's to you, Visual SourceSafe) were awful.
Subversion managed to do better on all those issues, but it still didn't adequately solve distributed working issues.
It also didn't help that people often configured SVN to run with the option to add global locks back in, because they didn't understand the benefit of letting two people edit the same file at the same time.
I have a soft spot for SVN. It was a lot better than it got credit for, but git very much took the wind out of its sails by solving distributed (and critically, disconnected/offline) workflows just enough better that developers could overlook the much worse UX, which remains bad to this day.
>It also didn't help that people often configured SVN to run with the option to add global locks back in, because they didn't understand the benefit of letting two people edit the same file at the same time.
I think it was more that they were afraid that a merge might some day be non-trivial. Amazing how that fear goes away once you've actually had the experience.
(I had to check because of this thread. SVN and Git initial releases were apparently about 4 and a half years apart. I think it was probably about 6 years between the time I first used SVN and the time I first used Git.)
I still use RCS, typically for admin files like fstab or other config files in /etc.
Doing `ci -l` on a file is better and faster than `cp fstab fstab.$(date +%Y%m%d.%H%M%S)`
Yeah, odd to learn. I remember dipping my toes into source control, playing around with CVS and SVN right around when git was originally announced and it felt so "modern" and "fresh" compared to these legacy systems I was learning.
> What surprises me more is how young Subversion is in comparison to git, it's barely older.
Subversion was so awful that it had to be replaced ASAP.
There were far, far worse things out there than Subversion. VSS, ClearCase, an obscure commercial one written in Java whose name escapes me now..
Subversion was basically a better CVS. My recollection is that plenty of people were more than happy to switch to CVS or Subversion (even on Windows) if it meant they could escape from something as legitimately awful as VSS. Whereas the switch from Subversion to Git or Mercurial had more to do with the additional powers of the newer tools than the problems of the older ones.
True. Also, Subversion was so great that it very quickly replaced the alternatives that predated it.
Not true. CVS stuck around a while longer.
Very interesting to get some historical context! Thanks for sharing Scott.
Small remark:
> As far as I can tell, this is the first time the phrase “rebase” was used in version control
ClearCase (which I had a displeasure to use) has been using the term "rebase" as well. Googling "clearcase rebase before:2005" finds [0] from 1999.
(by the way, a ClearCase rebase was literally taking up to half an hour on the codebase I was working on - in 2012; instant git rebases blew my mind).
[0] https://public.dhe.ibm.com/software/rational/docs/documentat...
Good pull. I was wondering if that was a true statement or not. I am curious if Linus knew about that or made it up independently, or if both came from somewhere else. I really don't know.
> He meant to build an efficient tarball history database toolset, not really a version control system. He assumed that someone else would write that layer.
Famous last words: "We'll do it the right way later!"
On the flip side: when you do intend to make a larger project like that, consciously focusing on the internal utility piece first is often a good move. For example, Pip doesn't offer a real API; anyone who wants their project to install "extra" dependencies dynamically is expected to (https://stackoverflow.com/questions/12332975) run it as a subprocess with its own command line. I suspect that maintaining Pip nowadays would be much easier if it had been designed from that perspective first, which is why I'm taking that approach with Paper.
Yes, still odd, but I can deal with it.
FWIW, I just found out you can sign commits using SSH keys. Due to how pinentry + gnupg + git have issues on OpenBSD with commit signing, I just moved to signing via SSH. I had a workaround, but it was a real hack; now no issues!
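For anyone curious, the switch is only a few config lines (the key path and email are placeholders):

    # Sign with SSH instead of OpenPGP
    git config --global gpg.format ssh
    git config --global user.signingkey ~/.ssh/id_ed25519.pub
    git config --global commit.gpgsign true

    # Verification needs an allowed-signers file mapping identities to keys
    git config --global gpg.ssh.allowedSignersFile ~/.ssh/allowed_signers
    echo "you@example.com $(cat ~/.ssh/id_ed25519.pub)" >> ~/.ssh/allowed_signers
    git log --show-signature -1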
20 years, wow seems like yesterday I moved my work items from cvs to git. I miss one item in cvs ($Id$), but I learned to do without it.
Oh yeah, SSH signing is incredible. I've also migrated to it and didn't look back.
A couple of differences:
- it's possible to specify signing keys in a file inside the repository, and configure git to verify on merge (https://github.com/wiktor-k/ssh-signing/). I'm using that for my dot config repo to make sure I'm pulling only stuff I committed on my machines.
- SSH has TPM key support via PKCS11 or external agents, this makes it possible to easily roll out hardware backed keys
- SSH signatures have context separation, that is it's not possible to take your SSH commit signature and repurpose it (unlike OpenPGP)
- due to SSH keys being small the policy file is also small and readable, compare https://github.com/openssh/openssh-portable/blob/master/.git... with equivalent OpenPGP https://gitlab.com/sequoia-pgp/sequoia/-/blob/main/openpgp-p...
Wow that allowed signers feature is cool. should pair nicely with ssh key support in sops
AFAIR keyword substitution of $Id$ included the revision number. That would be the commit hash in Git. For obvious reasons you cannot insert a hash value in content from which that hash value is being computed.
You can use smudge and clean filters to expand this into something on disk and then remove it again before the hash computation runs.
However, I don't think you would want to use the SHA, since that's somewhat meaningless to read. You would probably want to expand ID to `git describe SHA` so it's more like `v1.0.1-4-ga691733dc`, so you can see something more similar to a version number.
You can probably setup smudge and clean filters in Git to do keyword expansion in a CVS-like way.
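A rough sketch of how that could look (the filter name, helper scripts, and sed expressions are just one way to do it, assuming the two scripts live at the repo root and are executable):

    # .gitattributes: mark which files get keyword expansion
    *.c filter=keyword

    # expand-id: on checkout, replace the $Id$ placeholder with `git describe`
    #!/bin/sh
    version=$(git describe --always --tags 2>/dev/null || echo unknown)
    sed "s|\\\$Id\\\$|\$Id: $version \$|"

    # collapse-id: strip the expansion again so the stored blob never changes
    #!/bin/sh
    sed 's|\$Id:[^$]*\$|$Id$|'

    # Wire them up as the smudge/clean filter
    git config filter.keyword.smudge ./expand-id
    git config filter.keyword.clean  ./collapse-id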
I don't know if it's lucky or unlucky for you that you managed to skip Subversion
:)
Professionally, I went from nothing to RCS then to CVS then to git.
In all cases I was the one who insisted on using some kind of source code control in the group I worked with.
Not being an admin, I set up RCS on the server, then later found some other group that allowed us to use their CVS instance. Then when M/S bought github the company got religion and purchased a contract for git.
Getting people to use any kind of SC was a nightmare, this was at a fortune 500 company. When I left, a new hire saw the benefit of SC and took over for me :)
In the old days, losing source happened a lot; I did not want that to happen while I was working at that company.
Decentralized, but centralized.
Thanks for the useful article! In addition to a lot of interesting info, it led me to this repo containing an intro to git internals[1]. Would highly recommend everyone take a look. [1] https://github.com/pluralsight/git-internals-pdf
Ah yes. It was pretty cool that when Peepcode was acquired, Pluralsight asked me what I wanted to do with my royalties there and was fine with me waiving them and just open-sourcing the content.
It also is a testament to the backwards compatibility of Git that even after 17 years, most of the contents of that book are still relevant.
> I would love to do a whole blog post about how mailing list collaboration works and how cool various aspects of it are, but that’s for another time.
This is actually the part I would be interested in, coming from a GitHub cofounder.
You'll be the first to know when I write it. However, if anything, GitHub sort of killed the mailing list as a generally viable collaboration format outside of very specific use cases, so I'm not sure if I'm the right person to do it justice. However, it is a very cool and unique format that has several benefits that GitHub PR based workflows really lose out on.
By far my biggest complaint about the GitHub pull request model right now is that it doesn't treat the eventual commit message of a squashed commit (or even independent commits that will be rebased on the target) as part of the review process, like Gerrit does. I can't believe I'm the only person that is upset by this!
If this is something you're interested in, you may want to try the patch-based review system that we recently launched for GitButler: https://blog.gitbutler.com/gitbutlers-new-patch-based-code-r...
This does look interesting! I’ll take a closer look.
You are not alone. Coming from Gerrit myself, I hate that GitHub does not allow for commenting on the commit message itself. Neither does Gitlab.
Also, in a PR, I find that people just switch to Files Changed, disregarding the sequence of the commits involved.
This intentional de-emphasis of the importance of commit messages and the individual commits leads to lower quality of the git history of the codebase.
Of all the many source control systems I've used, git has the worst usability, yet it's my favorite.
Why?
Not your parent.
I never understood the "git cli sucks" thing until I used jj. The thing is, git's great, but it was also grown, over time, and that means that there's some amount of incoherence.
Furthermore, it's a leaky abstraction, that is, some commands only make sense if you grok the underlying model. See the perennial complaints about how 'git checkout' does more than one thing. It doesn't. But only if you understand the underlying model. If you think about it from a workflow perspective, it feels inconsistent. Hence why newer commands (like git switch) speak to the workflow, not to the model.
Furthermore, some features just feel tacked on. Take stashing, for example. Stashes are pseudo-commits that exist outside of your real commit graph. As a feature, it doesn't feel well integrated into the rest of git.
Rebasing is continually re-applying `git am`. This is elegant in a UNIXy way, but is annoying in a usability way. It's slow, because it goes through the filesystem to do its job. It forces you to deal with conflicts right away, because git has no way of modelling conflicts in its data model.
Basically, git's underlying model is great, but not perfect, and its CLI was grown, not designed. As such, it has weird rough edges. But that doesn't mean it's a bad tool. It's a pretty darn good one. But you can say that it is while acknowledging that it does have shortcomings.
Sometimes I ask myself if Torvald's greater contribution to society wouldn't be Git, instead of Linux.
20 years! Which recent Git features do you find useful? I think I've never used any feature less than 10 years old. I'm probably missing something
When are we moving to SHA-256? Some code bases must be getting massive by now after 20 years.
Are you worried about hash collisions from different objects? The probability of a collision among N distinct objects with SHA-1 is about (N choose 2) / 2^160. For a trillion objects that's about 3.4 x 10^-25. I think we can safely write code without collisions until the sun goes supernova.
> patches and tarballs workflow is sort of the first distributed version control system - everyone has a local copy, the changes can be made locally, access to "merge" is whomever can push a new tarball to the server.
Nitpick, but that's not what makes it a distributed workflow. It's distributed because anyone can run patch locally and serve the results themselves. There were well-known alternative git branches back then, like the "mm" tree run by Andrew Morton.
The distributed nature of git is one of the most confusing for people who have been raised on centralised systems. The fact that your "master" is different to my "master" is something people have difficulty with. I don't think it's a huge mental leap, but too many people just start using Gitlab etc. without anyone telling them.
Why did git 'win' over Mercurial?
Because Github was better than Bitbucket? Or maybe because of the influence of kernel devs?
GitHub executed better than Bitbucket. And the Ruby community adopted GitHub early. Comparisons around 2010 said GitHub's UX and network effects were top reasons to choose Git. Mercurial's UX and Windows support were top reasons to choose Mercurial.
> Because Github was better than Bitbucket?
GitHub was more popular than Bitbucket, so git unfortunately won.
For me it won (10+ years ago) because, for some reason, git (a deeply Linux-oriented piece of software) had better Windows support than Mercurial (which boasted about its Windows support). You could even add files with names in various writing systems to git. I am not sure that Mercurial can do that even now.
Huh, that's not my recollection.
Mercurial on windows was "download tortoisehg, use it", whereas git didn't have a good GUI and was full of footguns about line endings and case-insensitivity of branch names and the like.
Nowadays I use sublime merge on Windows and Linux alike and it's fine. Which solves the GUI issue, though the line ending issue is the same as it's always been (it's fine if you remember to just set it to "don't change line endings" globally but you have to remember to do that), and I'm not sure about case insensitivity of branch names.
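(For reference, the global "leave my line endings alone" setting is:

    git config --global core.autocrlf false

which stops git from converting line endings on checkout or commit unless a .gitattributes rule says otherwise.)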
Pretty sure Mercurial handles arbitrary filenames as UTF-8 encoded bytestrings, whether there was a problem with this in the past I can't recall, but would be very surprised if there was now.
Edit: does seem there at least used to be issues around this:
https://stackoverflow.com/questions/7256708/mercurial-proble...
though google does show at least some results for similar issues with git
When evaluating successors to CVS back in 2007, Mozilla chose Mercurial because it had better Windows support than git. 18 years later, Mozilla is now migrating from Mercurial to git.
For what it's worth, I've encountered filenames which cannot be decoded as utf-8.
since it seems it has been forgotten, remember the reason Git was created is that Larry McVoy, who ran BitMover, which had been donating proprietary software licenses for BitKeeper to core kernel devs, got increasingly shirty at people working on tools to make BK interoperate with Free tools, culminating in Tridge showing in an LCA talk that you could telnet to the BK server and it would just spew out the whole history as SCCS files.
Larry shortly told everyone he wasn't going to keep giving BK away for free, so Linus went off for a weekend and wrote a crappy "content manager" called git, on top of which perhaps he thought someone might write a proper VC system.
and here we are.
a side note was someone hacking the BitKeeper-CVS "mirror" (linear-ish approximation of the BK DAG) with probably the cleverest backdoor I'll ever see: https://blog.citp.princeton.edu/2013/10/09/the-linux-backdoo...
see if you can spot the small edit that made this a backdoor:
if ((options == (__WCLONE|__WALL)) && (current->uid = 0)) retval = -EINVAL;
I think that was the first time I ever saw Tridge deliver a conference presentation, and it was to a packed lecture theatre at the ANU. He described how he 'hacked' BitKeeper by connecting to the server via telnet and using the sophisticated hacker tools at his disposal to convince BitKeeper to divulge its secrets. He typed:
help
The room erupted with applause and laughter.
Speaking of git front ends, I want to give a shout-out to Jujutsu. I suspect most people here have probably at least heard of it by now, but it has fully replaced my git cli usage for over a year in my day to day work. It feels like the interface that jj provides has made the underlying git data structures feel incredibly clear to me, and easy to manipulate.
Once you transition your mental model from working branch with a staging area to working revision that is continuously tracking changes, it's very hard to want to go back.
GitButler and jj are very friendly with each other, as projects, and are even teaming up with Gerrit to collaborate on the change-id concept, and maybe even have it upstreamed someday: https://lore.kernel.org/git/CAESOdVAspxUJKGAA58i0tvks4ZOfoGf...
This is exciting, convergence is always good, but I'm confused about the value of putting the tracking information in a git commit header as opposed to a git trailer [1] where it currently lives.
In both cases, it's just metadata that tooling can extract.
Edit: then again, I've dealt with user error with the fragile semantics of trailers, so perhaps a header is just more robust?
[1] https://git-scm.com/docs/git-interpret-trailers
Mostly because it is jarring for users who want to interact with tools that require these footers, and the setups to apply them (like Gerrit's change-id script) are often annoying -- for example, supporting Windows users without needing stuff like bash. Now, I wrote the prototype integration between Gerrit and Jujutsu (which is not mainline, but people use it) and it applies Change-Id trailers automatically to your commit messages, for any commits you send out. It's not the worst thing in the world and it is a little fiddly bit of code.
But ignore all that: the actual _outcome_ we want is that it is just really nice to run 'jj gerrit send' and not think about anything else, and that you can pull changes back in (TBD) just as easily. I was not ever going to be happy with some solution that was like, "Do some weird git push to a special remote after you fix up all your commits or add some script to do it." That's what people do now, and it's not good enough. People hate that shit and rail at you about it. They will make a million reasons up why they hate it; it doesn't matter though. It should work out of the box and do what you expect. The current design does that now, and moving to use change-id headers will make that functionality more seamless for our users, easier to implement for us, and hopefully it will be useful to others, as well.
In the grand scheme it's a small detail, I guess. But small details matter to us.
Thanks for the explanation!
While you're around, do you know why Jujutsu created its own change-id format (the reverse hex), rather than use hashes (like Git & Gerrit)?
I don't know if it's the only or original reason, but one nice consequence of the reverse hex choice is that it means change IDs and commit IDs have completely different alphabets ('0-9a-f' versus 'z-k'), so you can never have an ambiguous overlap between the two.
Jujutsu mostly doesn't care about the real "format" of a ChangeId, though. It's "really" just any arbitrary Vec<u8> and the backend itself has to define in some way and describe a little bit; the example backend has a 64-byte change ID, for example.[1] To the extent the reverse hex format matters it's mostly used in the template language for rendering things to the user. But you could also extend that with other render methods too.
[1] https://github.com/jj-vcs/jj/blob/5dc9da3c2b8f502b4f93ab336b...
Yes, it was to avoid ambiguity between the two kinds of IDs. See https://github.com/jj-vcs/jj/pull/1238 (see the individual commits).
Interesting, that was just a few short months before I showed up. :)
I'm not an expert on this corner of git, but a guess: trailer keys are not unique, that is
is totally fine, but is not. I've also heard of issues with people copy/pasting commit messages and including bits of trailers they shouldn't have, I believe.
~I think it's more that not all existing git commands (rebase, am, cherry-pick?) preserve all headers.~
ignore, misread the above
That's a downside of using headers, not a reason for using them. If upstream git changes to help this, it would involve having those preserve the headers. (though cherry-pick has good arguments of preserving vs generating a new one)
ah, I'm sorry, I misread your comment (and should have mentioned the cherry-pick thing anyway).
It’s all good!
jj is fantastic and I love it so much. Takes the best things I liked about hg but applies it to a version control system people actually use!
Can second this. Beware that your old git habits may die hard though. (It's nice that it uses Git as its storage backend, for now, though)
How long did it take you to become proficient? I assume your organization uses git and you use jujitsu locally, as a layer on top?
Not parent, but for me it was a couple hours of reading (jj docs and steve's tutorial), another couple hours playing around with a test repo, then a couple weeks using it in place of git on actual projects where I was a bit slower. After that it's been all net positive.
Been using it on top of git, collaborating with people via Github repos for ~11 mos now. I'm more efficient than I was in git, and it's a smoother experience. Every once and a while I'll hit something that I have to dig into, but the Discord is great for help. I don't ever want to go back to git.
And yes, jj on top of git in colocated repos (https://jj-vcs.github.io/jj/v0.27.0/git-compatibility/#co-lo...).
If you set explicit bookmark/branch names when pushing to git remotes, no one can tell you use jj.
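Something like this, if memory serves (the bookmark and revision here are just examples):

    # Point a bookmark (jj's equivalent of a branch pointer) at the parent
    # of the working-copy revision, then push it as an ordinary git branch
    jj bookmark set main -r @-
    jj git push --bookmark main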
> Every once and a while…
The expression is “every once in a while” :).
Oops :)
(not your parent)
> How long did it take you to become proficient?
As with anything, it varies: I've heard some folks say "a few hours" and I've had friends who have bounced off two or three times before it clicks.
Personally, I did some reading about it, and it didn't really stick. One Saturday morning I woke up early and decided to give it a real try, and I was mostly good by the end of the day; I swore off git entirely a week later.
> I assume the organization uses git and you use jujitsu locally, as a layer on top?
This is basically 100% of usage outside of Google, yeah. The git backend is the only open source one I'm aware of. Eventually that will change...
It's really not that long. Once you figure out that
1. jj operates on revisions. Changes to revisions are tracked automatically whenever you run jj CLI
2. revisions are mutable and created before starting working on a change (unlike immutable commits, created after you are done)
3. you are not working on a "working directory" that has to be "commited", you are just editing the latest revision
everything just clicks and feels very natural and gets out of the way. Want to create a new revision, whether it's merge, a new branch, or even insert a revision between some other revisions? That's jj new. Want to move/reorder revisions? That's jj rebase. Want to squash or split revisions? That's jj squash and jj split respectively. A much more user-friendly conflict resolution workflow is a really nice bonus (although, given that jj does rebasing automatically, it's more of a requirement)
One notable workflow difference is the absence of branches in the git sense and getting used to mainly referring to individual revisions, but after understanding the things above, such a workflow makes perfect sense.
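A tiny sketch of that flow (the bookmark name is just an example):

    jj new              # start a new revision on top of the current one
    # ...edit files; the working copy is snapshotted into it automatically
    jj describe -m "Teach the parser about widgets"
    jj new              # stack another revision on top
    jj squash           # fold the current revision into its parent
    jj split            # or interactively split it in two
    jj rebase -d main   # move the stack onto another revision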
I don't really remember exactly how long until I felt particularly comfortable with it. Probably on the order of days? I have never really used any other VCS besides vanilla git before, so I didn't really have any mental model of how different VCSs could be different. The whole working revision vs working branch + staging area was the biggest hurdle for me to overcome, and then it was off to the races.
And yes, I use jj locally with remote git repos.
We should say thank you to the greedy BitKeeper VCS owners, who wanted Linus Torvalds to pay them money for keeping the Linux source in their system. They managed to piss off Linus sufficiently that he sat down and created Git.
Git, still a ripoff of BitKeeper. All innovation begins as closed source.
> 20 years ago
> 2005
wow.
I think the git usage patterns we've developed and grown accustomed to are proving inadequate for AI-first development. Maybe under the hood it will still be git, but the DX needs a huge revamp.
What do you mean?
If I'm working in Cursor, for example, ideally the entire chat history and the proposed changes after each prompt need to be stored. That doesn't fit cleanly into current git development patterns. I don't want to have to type commit messages any more. If I ever need to look at the change history, let AI generate a summary of the changes at that point. Let me ask an LLM questions about a given set (or range) of changes if I need to. We need a more natural branching model. I don't want to have to create branches and switch between them and merge them. Less typing, more speaking.
You should ask a LLM to create a VCS that will let you not do all these things you don't want to do.