
Motiejus Jakštys 3778b35f12 smart-bundling: no longer a draft

2022-05-12 16:32:49 +03:00

title	date	slug
Smart Bundling	2022-05-12T05:55:00+03:00	smart-bundling

TLDR

Could our package managers bundle our dependencies in git so it's robust and frictionless? What about this:

- "smart" vendoring to protect ourselves from things disappearing off the internet, and
- write/have tools that make this vendoring easy for us.

Number of dependencies

All of the programming languages I've used professionally, the names of which do not start with "c"¹, have package managers², which make "dependency management" easy. These package managers will, as part of the project's build process, download and build dependencies. They are easy enough to use that there is virtually no resistance to adding a dependency whenever a developer deems it necessary.

Dependencies are usually stored outside of the project's code repository; either looked up in the system (common for C/C++) or downloaded from the Internet (common for everything else). Having many system dependencies irritates users, so developers are incentivized to reduce them. However, there is no incentive to have few statically linked, downloaded-from-the-internet dependencies (I call them "external"), which brings us to this post.

Adding external dependencies is like eating candy: the initial cost is nearly zero and it tastes good while eating, but the long-term effects are ill and dangerous. Why and how to be cautious of external dependencies is a post for another day, but suffice it to say, I have a checklist and am prepared to do the work to avoid adding a dependency if I can.

If at least one external dependency disappears, we have serious problems building our project.
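To get a feel for how quickly this risk grows, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every dependency disappears independently with the same fixed probability per year; the function name and the 1% figure are made up, not measurements.

```python
def chance_of_broken_build(num_deps: int, yearly_loss: float, years: int) -> float:
    """Probability that at least one of `num_deps` dependencies is gone after `years`.

    Assumption (hypothetical): each dependency vanishes independently with
    probability `yearly_loss` in any given year.
    """
    # Every dependency has to survive every single year for the build to work.
    all_survive = (1.0 - yearly_loss) ** (num_deps * years)
    return 1.0 - all_survive


if __name__ == "__main__":
    # Illustrative numbers only: a 1% yearly loss rate over ten years.
    for deps in (5, 50, 500):
        print(deps, "dependencies:", round(chance_of_broken_build(deps, 0.01, 10), 3))
```

Whatever the real per-dependency rate is, the exponent is the product of the dependency count and the number of years, so both push the result towards 1.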

{{<img src="_/2022/brick-house.jpg" alt="House made out of Duplo pieces" caption="Just like this brick house, 'modern' package managers are optimized for building, not maintenance. House by my sons, photo mine." hint="photo" >}}

C++ programs I wrote a decade ago still generally build and run; Erlang, Java and Python ones generally don't. Judging by the way "modern" languages handle dependencies, it is fair to say that they optimize for initial development, but not maintenance. Ten years ago I did not think this would happen; I am less naïve now. As Corbet says, "We can't understand why Kids These Days just don't want to live that way". Kids want to build, John, not think about the future. A 4-letter Danish corporation made a fortune by selling toys that are designed to be disassembled and built anew. Look ma', no maintenance! Kids are still kids: growing up and sticking to the rules, even if they are ours, requires discipline.

If we require Something On The Internet to be available to build our application, it will inevitably go away. The more things we rely on, and the more time passes, the higher the chance of misery when it does. We cannot abolish dependencies these days, since some of them are too good to ignore (hello SQLite, 241,245 lines of top-quality C). So we need to find a balance: how can we have dependencies to satisfy the kids, but be mature and strategic in the long term? We have a few options today:

1. Mirror everything to an internal system, which never deletes code. Change the package manager to read from there instead. Convenience aside, some companies must absolutely have every line of code of their every build for decades, and be able to rebuild it. Think about the firmware of your car's ABS or Boeing's infamous MCAS. This problem alone is a whole B2B business segment and costs big money.
2. Copy the dependency verbatim to deps/<dependency>. While easy to do, this loses the history of the dependency and rewrites the hashes, also making it difficult to distinguish "our" changes from the upstream ones. Upgrades become cumbersome, leading to the only obvious outcome of never upgrading after the initial import.
3. A step up from (2): use git-subtree to copy the dependency into the application tree while preserving the dependency's history. This messes up the hashes, so refs inside the dependency, like Reverts <commit>, no longer make sense in isolation. Upgrades are somewhat easier than with (2), because the history is still sort of there, but they remain cumbersome, leading to the same unfortunate outcome.
4. Download the dependencies at build time and store them in a "safe place", like go-mod-archiver does. It does not change how day-to-day development works with Go modules, but offers a lifeboat when a dependency disappears. History-wise it is still the same as (2): the dependency tree is copied without its history, so if a dependency does go away, bringing it back under our own wing is an exertion. As it does not change the development process, it is quite easy to sell to any team.

Option (1) is viable for very specific audiences and costs big money. Options (2) and (3) blur the line between our application and dependencies and rewrite the git history. Option (4) serves a different purpose: it is not a dependency management system but a lifeboat for when dependencies inevitably disappear, and dependencies are still downloaded from the internet on every build.

The number of approaches seems to suggest there is an appetite for solutions that protect us when dependencies disappear (vendoring of increasing sophistication), preserve git history in some way (git-subtree) and do not get in the way of using the language's build tool (go-mod-archiver). But the problem is not yet solved for any of the languages that I have worked with.

So what about all of the below:

- "smart" vendoring to protect ourselves from things disappearing, and
- no friction when doing it?

Avery Pennarun, the creator of git-subtree, wrote git-subtrac, which vendors dependencies in a special branch without rewriting their history (i.e. leaving the hashes intact). Wait, stop here. Repeat after me: git-subtrac vendors our dependencies, but all refs stay in our repository. Since all the dependencies are in our repo, no external force can make our dependency unavailable. Let's discuss its advantages:

- The dependency keeps its hashes, so its history is left intact (as a side effect, git show <hash of the dependency> in our repository will, surprisingly, work).
- The dependency is vendored to our tree, so it will not disappear.
- Because humans are more sensitive to download times than to build times, it will keep a nice check on the overall size of the repository, hopefully preventing us (or our kids) from pulling in V8 (over 2M lines of code) just to interpret a couple hundred lines of JavaScript³.
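Why do hashes survive when objects are copied untouched, yet change as soon as history is rewritten? A git object id is a hash of the object's type, size and content, and a commit's content names its tree and its parents. Below is a minimal sketch of that rule for SHA-1 repositories. The helper names and the author line are made up for illustration, and the commit body is simplified.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Object id for a SHA-1 repository: sha1 of '<kind> <size>\\0' followed by the body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, parent: str, message: str) -> bytes:
    """Simplified commit body; the author/committer line is a placeholder."""
    who = "Example Author <author@example.org> 0 +0000"
    return (
        f"tree {tree}\nparent {parent}\n"
        f"author {who}\ncommitter {who}\n\n{message}\n"
    ).encode()


if __name__ == "__main__":
    parent = "0" * 40  # hypothetical parent id
    upstream_tree = git_object_id("tree", b"")  # stand-in for the upstream tree
    moved_tree = git_object_id("blob", b"moved under deps/\n")  # stand-in for a rewritten tree

    original = git_object_id("commit", commit_body(upstream_tree, parent, "Fix bug"))
    copied = git_object_id("commit", commit_body(upstream_tree, parent, "Fix bug"))
    rewritten = git_object_id("commit", commit_body(moved_tree, parent, "Fix bug"))

    print("copied verbatim keeps the id:", original == copied)
    print("changing the tree changes the id:", original != rewritten)
```

Storing upstream objects as they are therefore keeps every id valid, while any tool that rewrites commits so that their files live in a subdirectory necessarily produces new ids.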

Some of my friends point out that it has a disadvantage by design: it uses git submodules. Submodules are the only way to convince git to check out another repository (i.e. the dependency) into our tree without git thinking it's part of our code. Submodules are infamous for their footguns when used directly. The higher-level git-subtrac shields us from being overly exposed to git submodules, keeping footguns to a minimum. Oh, this description also applies to the other 150+ git plumbing commands⁴, so nothing new here.

Andrew meets Git Subtrac (for 5 seconds)

A couple of weeks ago in a park in Milan I was selling git-subtrac to Andrew Kelley as a default package manager for Zig (Zig does not have one yet). Our conversation went like this:

me: "git-subtrac yadda yadda yadda submodules but better yadda yadda yadda".
Andrew: "If I clone a repository that uses subtrac with no extra parameters, will it work as expected?"
me: "No, you have to pass --recursive, so git will check out submodules... even if they are already fetched."
Andrew: "Then it's a piece-of-shit-approach."

And I agree: git-subtrac is a tool for managing submodules, and does not try to be anything more: it is not a package manager, nor is it a dependency management tool. As far as potential Zig users are concerned, it should be a "git plumbing" command.

If we never expose the nitty-gritty handling of git submodules (like git-subtrac, but more sophisticated), maybe it's OK? I have tried to manage two projects with git-subtrac, and it comes quite close to letting us pretend we are not using submodules.

Zig and subtrac?

A package manager does much more than just downloading the dependencies:

- Resolve the right versions of direct dependencies.
- Resolve and download the right transitive dependencies.
- Figure out the diamond dependencies and incompatibilities.
- Provide one-click means to upgrade everything.
- Build the dependencies (it is part of the build system, but usually the build system and package manager are coupled).
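To make the "diamond" item concrete: a diamond appears when two dependencies each pull in a third one, possibly at different versions. Here is a deliberately naive sketch of detecting that. It assumes exact version pins rather than ranges, and the function name and data layout are invented for illustration.

```python
from collections import defaultdict


def find_version_conflicts(requirements):
    """Return {package: sorted versions} for packages requested at more than one version.

    `requirements` maps a package name to {dependency name: exact version}.
    Simplification: every entry is considered, whether or not the root reaches it.
    """
    requested = defaultdict(set)
    for deps in requirements.values():
        for name, version in deps.items():
            requested[name].add(version)
    return {name: sorted(versions) for name, versions in requested.items() if len(versions) > 1}


if __name__ == "__main__":
    # app -> a -> c 1.0 and app -> b -> c 2.0: the classic diamond.
    graph = {
        "app": {"a": "1.0", "b": "1.0"},
        "a": {"c": "1.0"},
        "b": {"c": "2.0"},
        "c": {},
    }
    print(find_version_conflicts(graph))
```

A real resolver then has to decide what to do about the conflict (pick one version, build both, or refuse); this sketch only spots it.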

Just like git will not build our code, neither will git-subtrac. What if we make zig pkg rely on git-subtrac (or, if we are serious, its reimplementation) to manage directory trees?

Think about it for a minute. Imagine this workflow:

Step 1: clone

git clone https://git.example.org/project

- Download the application source.
- Download all dependencies to .git/, but do not check them out (due to the nature of git submodules). deps/ is an empty directory at this point.

Step 2: build

zig pkg build:

- Check out dependencies in deps/ using git's plumbing commands. No network involved.
- Build dependencies, transitive dependencies and the application.

At this point, the dependency is checked out in deps/, ready for hacking. If we change the code there, git makes it obvious, but does not forbid us from doing so, which is nice when hacking.

Step 3: add a dependency

zig pkg get https://git.example.org/dep1

- Record the path of the dependency (just the user's intent, as typed) in zigpkg's config file.
- Download the latest (tagged release|ref) and put it among the other dependencies.

Step 4: upgrading the dependencies

zig pkg upgrade

- Go through the list of dependencies recorded in step 3 and try to fetch the updated dependency versions.
- With hand-holding and guardrails:
  - If a dependency no longer exists, inform the user and advise a further course of action.
  - If what we have now is not an ancestor of the "newest version", warn the user, as it's not an upgrade, but a new thing.

See, no lock file needed! Just the list of dependency URLs, which translate to the exact refs.
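The second guardrail boils down to an ancestry check: an update only counts as an upgrade if the commit we currently have is reachable from the candidate by following parent links. A minimal sketch over an in-memory commit graph follows. The function names and the graph are hypothetical; on a real repository, git merge-base --is-ancestor answers the same question.

```python
def is_ancestor(parents, ancestor, descendant):
    """True if `ancestor` is reachable from `descendant` by following parent links.

    `parents` maps a commit id to the list of its parent commit ids.
    A commit counts as its own ancestor here.
    """
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def classify_update(parents, current, candidate):
    """Label what moving from `current` to `candidate` would mean."""
    if candidate == current:
        return "up-to-date"
    if is_ancestor(parents, current, candidate):
        return "upgrade"
    if is_ancestor(parents, candidate, current):
        return "downgrade"
    return "diverged"  # not an upgrade, but a new thing: warn the user


if __name__ == "__main__":
    # Hypothetical history: c1 <- c2 <- c3, plus x1 branching off c1.
    history = {"c1": [], "c2": ["c1"], "c3": ["c2"], "x1": ["c1"]}
    for candidate in ("c2", "c3", "c1", "x1"):
        print("c2 ->", candidate, ":", classify_update(history, "c2", candidate))
```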

This sums up the basic workflow.

Drawbacks

There are a few:

- From my experience with git-subtrac, git submodules are still a leaky abstraction. Since we are using git too, I am not fully convinced we can hide all of it from the unsuspecting user.
- Obviously, this only works when both our repository and the dependencies are in git. This may not be good if you are in Fossil land (SQLite and SpatiaLite come to mind).

Did I miss something? Tell me.

Credits

Many thanks to Johny Marler and Anton Lavrik for reading drafts of this.

¹ Alphabetically: Erlang, Go, Java, JavaScript, PHP, Perl, Python.

² Usually written in the same language. The zoo of package managers (sometimes a couple of popular ones for the same programming language) is a can of worms in and of itself, worth another blog post.

³ True story.

⁴ git plumbing commands are ones that users should almost never use, but are used by git itself or other low-level tools. E.g. git cat-file and git hash-object are plumbing commands, which, I can assure you, 99% of git users have never heard of.
