---
title: "Smart Bundling"
date: 2022-05-12T05:55:00+03:00
slug: smart-bundling
description: "Bundling dependencies is undeservedly frowned upon. Let's explore how to do it reasonably. Update: Zig's package manager will support a reasonable mode of bundling dependencies!"
---
TLDR
----
Could our package managers bundle our dependencies in git so it's robust and
frictionless? What about this:
- "smart" vendoring to protect ourselves from things disappearing off the
internet, and
- write/have tools that make this vendoring easy for us.

Update 2023-02
--------------
Zig's upcoming [package manager](https://github.com/ziglang/zig/pull/14265)
will have the ability to use bundled dependencies from the local checkout: exactly
how it's envisioned in the remainder of the article. Details in
[ziglang/zig#14293][14293]. Looking forward!

Number of dependencies
----------------------
All of the programming languages I've used professionally, the names of which
do not start with "c"[^1], have package managers[^2], which make "dependency
management" easy. These package managers will, as part of the project's build
process, download and build dependencies. They are easy enough to use that
there is virtually no resistance to adding dependencies whenever developers deem
them necessary.

Dependencies are usually stored outside of the project's code repository;
either looked up in the system (common for C/C++) or downloaded from the
Internet (common for everything else). Having many system dependencies irritates
users, so developers are incentivized to reduce them. However, there is no
incentive to have few statically linked, downloaded-from-the-internet
dependencies (I call them "external"), which brings us to this post.

Adding external dependencies is like candy: the initial costs are nearly zero,
tastes good while eating, but the long-term effects are ill and dangerous. Why
and how to be cautious of external dependencies is a post for another day, but
suffice to say, I have a checklist and am prepared to do the work to avoid
adding a dependency if I can.

If at least one external dependency [disappears][crash-of-leftpad], we have
serious problems building our project.

{{<img src="_/2022/brick-house.jpg"
alt="House made out of Duplo pieces"
caption="Just like this brick house, \"modern\" package managers are optimized for building, not maintenance. House by my sons, photo mine."
hint="photo"
>}}

C++ programs I wrote a decade ago still generally build and run; Erlang, Java
and Python generally don't. Judging by the way "modern" languages handle
dependencies, it is fair to say that they optimize for initial development, but
not maintenance. Ten years ago I didn't think this would happen; I am less naïve
now. As [Corbet says][linux-rust], "We can't understand why Kids These Days
just don't want to live that way". Kids want to build, John, not think about
the future. A 4-letter Danish corporation made a fortune by selling toys that
are designed to be disassembled and built anew. Look ma', no maintenance! Kids
are still kids: growing up and sticking to the rules, even if they are ours,
requires discipline.

If we require Something On The Internet to be available to build our
application, it will inevitably go away. The more things we rely on, and the
more time passes, the higher the chance of misery when it does. We cannot abolish
dependencies these days, since some of them are too good to ignore (hello
SQLite, 241,245 lines of top-quality C). So we need to find a balance: how can
we have dependencies to satisfy the kids, but be mature and strategic in the
long term? We have a few options today:

1. Mirror everything to an internal system, which never deletes code. Change
package manager to read from there instead. Discounting convenience, some
companies must absolutely have every line of code of their every build for
decades, and be able to rebuild it. Think about the firmware of your car's
[ABS][abs] or Boeing's infamous [MCAS][MCAS]. This problem alone is a
whole B2B business segment and costs big money.
2. Copy the dependency verbatim to `deps/<dependency>`. While easy to do, this
loses history of the dependency and rewrites the hashes, also making it
difficult to distinguish "our" from the upstream changes. Upgrades become
cumbersome, leading to the only obvious outcome of never upgrading after the
initial import.
3. Step up from (2): use [`git-subtree`][git-subtree] to copy the dependency to
the application tree, but preserve the history of the dependency. This
messes up the hashes. Therefore all refs in the dependency, like `Reverts
<commit>` do not make sense in isolation. Upgrades are somewhat easier than
with (2), because history is still sort-of there, but still cumbersome,
leading to the same unfortunate outcome.
4. Download the dependencies at build time and store them in a "safe place",
like [go-mod-archiver][go-mod-archiver]. It does not change how day-to-day
development works with go modules, but offers a lifeboat when a dependency
disappears. History-wise it is still the same as (2): copying the dependency
tree without its history; if a dependency does go away, bringing it back
under our own wing is an exertion. As it does not change the development
process, it is quite easy to sell to any team.

Option (1) is viable for very specific audiences and costs big money. Options
(2) and (3) blur the line between our application and dependencies and rewrite
the git history. Option (4) serves a different purpose: it is not a dependency
management system; it is a lifeboat when they inevitably disappear:
dependencies are still downloaded from the internet on every build.
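Why would re-creating a dependency's commits under `deps/<dependency>` change their hashes? In a default SHA-1 repository, a git object id is the hash of a short header plus the object's content, and a commit's content names its tree. The Python sketch below is only an illustration, not part of any tool: the file name, the `deps/dep1` path, the author line and the commit message are made up, and tree serialization is simplified (real git sorts directory names as if they ended in `/`).

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Return the SHA-1 id git assigns to an object: sha1("<kind> <size>\\0" + body)."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: list[tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, hex id) entries as git stores a tree (simplified sort)."""
    out = b""
    for mode, name, hex_id in sorted(entries, key=lambda entry: entry[1]):
        out += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(hex_id)
    return out


def commit_body(tree_id: str, message: str) -> bytes:
    """Serialize a parentless commit with a fixed, made-up author and committer."""
    who = "A U Thor <author@example.org> 0 +0000"
    return f"tree {tree_id}\nauthor {who}\ncommitter {who}\n\n{message}\n".encode()


blob_id = git_object_id("blob", b"hello world\n")

# Upstream layout: the file sits at the root of the dependency.
upstream_tree = git_object_id("tree", tree_body([("100644", "hello.txt", blob_id)]))

# Vendored layout: the same file, now at deps/dep1/hello.txt.
# The dep1 subtree is byte-identical to the upstream root tree...
deps_tree = git_object_id("tree", tree_body([("40000", "dep1", upstream_tree)]))
# ...but the root tree differs, and so does every commit that points at it.
vendored_tree = git_object_id("tree", tree_body([("40000", "deps", deps_tree)]))

upstream_commit = git_object_id("commit", commit_body(upstream_tree, "Add hello"))
vendored_commit = git_object_id("commit", commit_body(vendored_tree, "Add hello"))

print("blob:           ", blob_id)
print("upstream commit:", upstream_commit)
print("vendored commit:", vendored_commit)
```

The blob id matches what `git hash-object` reports for the same content, while the two commit ids differ even though author, message and file content are identical. Any ref that mentions the upstream commit id therefore points at nothing in the rewritten history.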
The number of options above seems to suggest there is an appetite for tools
that protect us when dependencies disappear (vendoring of increasing sophistication),
preserve git history in some way (`git-subtree`) and do not get in the way
of using the language's build tool (`go-mod-archiver`). But the problem is not
yet solved for any of the languages that I have worked with.

So what about all of the below:
- "smart" vendoring to protect ourselves from things disappearing, and
- no friction when doing it?

Sharing code hygienically
-------------------------
[Avery Pennarun][apenwarr], the creator of `git-subtree`, wrote
[`git-subtrac`][git-subtrac], which vendors dependencies in a special branch
without rewriting their history (i.e. leaving the hashes intact). Wait, stop
here. Repeat after me: _git-subtrac vendors our dependencies, but all refs stay
in our repository_. Since all the dependencies are in our repo, no external
force can make our dependency unavailable. Let's discuss its advantages:

1. The dependency keeps its hashes, so its history is left intact (as a
side-effect, `git show <hash of the dependency>` in our repository, will,
surprisingly, work).
2. The dependency is vendored to our tree, so it will not disappear.
3. Because humans are more observant of download times than build times, it
will keep a nice check on the overall size of the repository. Hopefully
preventing us (or our kids) from pulling in V8 (over 2M lines of code) just
to interpret a couple hundred lines of JavaScript[^3].

Some of my friends point out that it has a disadvantage by design: it uses [git
submodules][git-submodules]. Submodules are the only way to convince git to
check out another repository (i.e. the dependency) into our tree without git
thinking it's part of our code. Submodules are infamous for their footguns when
used directly. Higher-level `git-subtrac` shields us from being overly exposed
to git submodules, keeping footguns to a minimum. Oh, this description also
applies to the other 150+ git plumbing commands[^4], so nothing new here.

Andrew meets Git Subtrac (for 5 seconds)
----------------------------------------
A couple of weeks ago in a park in Milan I was selling `git-subtrac` to Andrew
Kelley as a default package manager for Zig ([zig does not have one yet][zig-pkg-manager]). Our
conversation went like this:
- me: "git-subtrac yadda yadda yadda submodules but better yadda yadda yadda".
- Andrew: "If I clone a repository that uses subtrac with no extra parameters,
will it work as expected?"
- me: "No, you have to pass `--recursive`, so git will checkout submodules...
even if they are already fetched."
- Andrew: "Then it's a piece-of-shit-approach."

And I agree: `git-subtrac` is a tool for managing submodules, and does not try
to be anything more: it is not a package manager, nor is it a dependency
management tool. As far as potential Zig users are concerned, it should
be a "git plumbing" command.

If we never expose the nitty-gritty handling of git submodules (like
`git-subtrac`, but more sophisticated), maybe it's OK? I have tried to manage
two projects with `git-subtrac`, and it comes quite close to letting us pretend
we are not using submodules.

Zig and subtrac?
----------------
A package manager does much more than just downloading the dependencies:
- Resolve the right versions of direct dependencies.
- Resolve and download the right transitive dependencies.
- Figure out the diamond dependencies and incompatibilities.
- Provide one-click means to upgrade everything.
- Build the dependencies (it is part of the build system, but usually the build
system and package manager are coupled).
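To make the "diamond dependencies" item concrete, here is a minimal Python sketch of spotting one: two packages that both need a third, at different versions. The package names, versions and the `find_conflicts` helper are hypothetical, and a real resolver would compare version ranges rather than exact strings.

```python
from collections import defaultdict

# Hypothetical dependency graph: package -> list of (dependency, required version).
REQUIREMENTS = {
    "app": [("http", "1.2.0"), ("json", "2.0.0")],
    "http": [("tls", "1.0.0"), ("log", "0.9.0")],
    "json": [("log", "1.1.0")],
    "tls": [],
    "log": [],
}


def find_conflicts(requirements, root):
    """Walk the graph from root; return dependencies requested at more than one version."""
    wanted = defaultdict(set)  # dependency name -> {(version, requested by)}
    seen = set()
    stack = [root]
    while stack:
        package = stack.pop()
        if package in seen:
            continue
        seen.add(package)
        for dependency, version in requirements.get(package, []):
            wanted[dependency].add((version, package))
            stack.append(dependency)
    return {
        dependency: sorted(requests)
        for dependency, requests in wanted.items()
        if len({version for version, _ in requests}) > 1
    }


for name, requests in find_conflicts(REQUIREMENTS, "app").items():
    for version, requested_by in requests:
        print(f"{name} {version} is required by {requested_by}")
```

Here `log` sits at the bottom of the diamond: `http` and `json` both depend on it, but disagree on the version, which is exactly the case a package manager has to detect and report.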
Just like git will not build our code, neither will `git-subtrac`. What if we
make `zig pkg` rely on `git-subtrac` (or, if we are serious, its
reimplementation) to manage directory trees?

Think about it for a minute. Imagine this workflow:

**Step 1: clone**

`git clone https://git.example.org/project`

- Download the application source.
- Download all dependencies to `.git/`, but not check them out (due to the
nature of git submodules). `deps/` is an empty directory at this point.

**Step 2: build**

`zig pkg build`:

- Check out dependencies in `deps/` using git's plumbing commands. No network
involved.
- Build dependencies, transitive dependencies and the application.

At this point, the dependency is checked out in `deps/`, ready for hacking. If
we change the code there, git makes it obvious, but does not forbid us from
doing so, which is nice when hacking.
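Building "dependencies, transitive dependencies and the application" implies an order: everything a package depends on is built before the package itself. A minimal Python sketch of that ordering, using the standard library's `graphlib` (Python 3.9+); the package names and the `build_order` helper are hypothetical, and this is not how `zig pkg` is specified to work:

```python
from graphlib import TopologicalSorter

# Hypothetical project: each key maps to the set of packages it depends on.
DEPENDS_ON = {
    "app": {"dep1", "dep2"},
    "dep1": {"dep3"},
    "dep2": {"dep3"},
    "dep3": set(),
}


def build_order(depends_on):
    """Return packages in an order where dependencies come before dependents."""
    return list(TopologicalSorter(depends_on).static_order())


for package in build_order(DEPENDS_ON):
    print("build", package)
```

`dep3` comes out first and `app` last; the relative order of `dep1` and `dep2` does not matter. `TopologicalSorter` raises `graphlib.CycleError` if the dependencies form a cycle.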
**Step 3: add a dependency**

`zig pkg get https://git.example.org/dep1`

- record the path of the dependency (just the user's *intent*, as typed) to the
zigpkg's config file.
- download the latest (tagged release|ref) and put it amongst other dependencies.

**Step 4: upgrading the dependencies**

`zig pkg upgrade`

- Go through the list of dependencies recorded in step 3, try to fetch the
updated dependency versions.
- With hand holding and guardrails:
  - If a dependency no longer exists, inform the user and advise on a further
    course of action.
  - If the "newest version" is not a descendant of what we have now, warn the
    user as it's not an upgrade, but a new thing.
- See, no lock file needed! Just the list of dependency URLs, which translate
to the exact refs.
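The guardrail in step 4 boils down to an ancestry check: a candidate counts as an upgrade only if the version we already have is somewhere in its history. A minimal Python sketch over a toy commit graph follows; the commit ids and the `classify_upgrade` helper are hypothetical. Against a real repository, `git merge-base --is-ancestor <current> <candidate>` answers the same question.

```python
from collections import deque

# Hypothetical commit graph: commit id -> list of parent ids.
PARENTS = {
    "a1": [],
    "b2": ["a1"],
    "c3": ["b2"],  # linear history: a1 <- b2 <- c3
    "x9": ["a1"],  # a side branch that forked from a1
}


def is_ancestor(parents, ancestor, descendant):
    """Return True if `ancestor` is reachable from `descendant` via parent links."""
    queue = deque([descendant])
    seen = set()
    while queue:
        commit = queue.popleft()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents.get(commit, []))
    return False


def classify_upgrade(parents, current, candidate):
    """Decide what to tell the user about moving from `current` to `candidate`."""
    if candidate == current:
        return "up to date"
    if is_ancestor(parents, current, candidate):
        return "upgrade"
    return "warn: not a descendant of the current version"


print(classify_upgrade(PARENTS, "b2", "c3"))
print(classify_upgrade(PARENTS, "b2", "x9"))
```

Moving from `b2` to `c3` is an upgrade; moving from `b2` to `x9` triggers the warning, because `x9` forked before `b2` and does not contain it.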
This sums up the basic workflow.

Drawbacks
---------
There are a few:
- From my experience with `git-subtrac`, git submodules are still a leaky
abstraction. Since we are using git too, I am not fully convinced we may hide
*all* of it from the unsuspecting user.
- Obviously, this only works when both our repository and the dependencies are
in git. This may not be good if you are in the Fossil land (SQLite and
SpatiaLite come to mind).

Did I miss something? Tell me.

Credits
-------
Many thanks to Johny Marler and Anton Lavrik for reading drafts of this.

[^1]: Alphabetically: Erlang, Go, Java, JavaScript, PHP, Perl, Python.
[^2]: Usually written in the same language. Zoo of package managers (sometimes
a couple of popular ones for the same programming language) is a can of worms
in and of itself worth another blog post.
[^3]: True story.
[^4]: git plumbing commands are ones that the users should almost never use,
but are used by git itself or other low-level tools. E.g. `git cat-file` and
`git hash-object` are plumbing commands, which, I can assure, 99% of the git
users have never heard of.

[linux-rust]: https://lwn.net/SubscriberLink/889924/a733d6630e3b5115/
[git-subtrac]: https://github.com/apenwarr/git-subtrac/
[apenwarr-subtrac]: https://apenwarr.ca/log/20191109
[git-subtree]: https://manpages.debian.org/testing/git-man/git-subtree.1.en.html
[go-mod-archiver]: https://github.com/tailscale/go-mod-archiver
[crash-of-leftpad]: https://drewdevault.com/2021/11/16/Cash-for-leftpad.html
[git-submodules]: https://git-scm.com/book/en/v2/Git-Tools-Submodules
[MCAS]: https://en.wikipedia.org/wiki/Maneuvering_Characteristics_Augmentation_System
[ABS]: https://en.wikipedia.org/wiki/Anti-lock_braking_system
[apenwarr]: https://apenwarr.ca/
[zig-pkg-manager]: https://github.com/ziglang/zig/issues/943
[14293]: https://github.com/ziglang/zig/issues/14293