---
title: "Web Compression"
date: 2024-08-07T13:53:10+03:00
---
I wrote [this comment][1] about pre-compressing web artifacts with zstd:
> I have read somewhere (can't find links handy) that for web server case, zstd
> may not be as useful as brotli due to longer decompression speed, but I may
> be wrong here.
That felt wrong: someone is suggesting a cool change in a module, and I am just
FUDing it. If I were the PR submitter, I would certainly not appreciate this
comment. So I decided to conduct a non-scientific experiment: take a big piece
of JavaScript and compare brotli with zstd.
Executive Summary
-----------------
* brotli compresses my chosen piece of JavaScript better than zstd: the zstd
  output is 4-22% larger at levels 6 through 22, and 37% larger at level 3.
* zstd decompresses faster: brotli takes 50-80% longer (depending on platform)
  and uses more system resources than zstd.
As a result, I will re-phrase my comment on GitHub and welcome `zstd` to the
default compressors.
Benchmark Setup
---------------
Hardware:
1. AMD Ryzen 7 7840HS, DDR5-5600.
2. Raspberry Pi 4. Linux v6.6.44. NixOS [24.05-2908-g883180e6550c][2].
CPU scaling governor set to `performance` on both nodes:
```
for f in /sys/devices/system/cpu/cpufreq/*/scaling_governor; do echo 'performance' | sudo tee $f; done
```
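
To double-check that the governor change took effect, something along these lines could be run on each node. It is a minimal sketch: the `list_governors` name and its optional directory argument are made up for illustration, and it only reads the same sysfs files the command above writes to.

```sh
#!/bin/sh
# Print "<policy>: <governor>" for every cpufreq policy under the given root.
# The default root is the sysfs path used above; the optional argument is a
# hypothetical convenience so the function can be pointed elsewhere.
list_governors() {
    root=${1:-/sys/devices/system/cpu/cpufreq}
    for f in "$root"/*/scaling_governor; do
        [ -r "$f" ] || continue   # glob did not match, or file is unreadable
        policy=$(basename "$(dirname "$f")")
        printf '%s: %s\n' "$policy" "$(cat "$f")"
    done
}

list_governors
```

Every line should end in `performance` before the benchmarks are started.
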
Software:
- NixOS [24.05-2908-g883180e6550c][2].
- Linux v6.6.44.
- brotli 1.1.0 from the distribution.
- zstd v1.5.6 from the distribution.
Test Harness
------------
I picked YouTube's `desktop_polymer.js`, because:
1. YouTube is a [somewhat frequently accessed website][3], so that file is
   frequently downloaded and decompressed, making it somewhat representative,
   albeit anecdatal[^1].
2. That file weighs 8.52MB.
Acquiring and compressing it:
```
$ wget https://www.youtube.com/s/desktop/bf8c00d7/jsbin/desktop_polymer.vflset/desktop_polymer.js -O y.js
$ for prog in 'zstd -3' 'zstd -6' 'zstd -9' 'zstd -12' 'zstd -15' 'zstd -19' 'zstd --ultra -22'; do $prog y.js -o y.js.${prog##*-}.zst; done
$ brotli y.js
```
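
The output names in the loop above come from the `${prog##*-}` expansion, which strips the longest prefix ending in `-` and so leaves only the level. A small sketch of that step in isolation (the `zst_name` helper is hypothetical, introduced only to show the expansion):

```sh
#!/bin/sh
# Derive the output file name the compression loop would use.
# ${prog##*-} removes the longest prefix ending in "-", leaving the level.
zst_name() {
    prog=$1   # e.g. 'zstd -3' or 'zstd --ultra -22'
    file=$2   # e.g. y.js
    printf '%s.%s.zst\n' "$file" "${prog##*-}"
}

zst_name 'zstd -3' y.js            # prints y.js.3.zst
zst_name 'zstd --ultra -22' y.js   # prints y.js.22.zst
```

This is why `zstd --ultra -22` ends up as `y.js.22.zst` rather than something containing `ultra`.
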
[poop](https://github.com/andrewrk/poop) accepts a single command to run, so we
have this wrapper:
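```sh
#!/bin/sh
# Decompress y.js to stdout; the algorithm and level come from the name
# this script is invoked as (./brotli or ./zstd-<level>).
set -e
FILE=y.js
case "$0" in
    ./brotli)
        exec brotli -cd "${FILE}.br"
        ;;
    ./zstd-*)
        level=${0#./zstd-}
        exec zstd -cd "${FILE}.${level}.zst"
        ;;
    *)
        >&2 echo "invalid program $0"
        exit 1
        ;;
esac
```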
Then symlink to it for each compression level:
```
$ for l in 3 6 9 19 22; do ln -s wrap zstd-${l}; done
$ ln -s wrap brotli
```
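
Before timing anything, it may be worth confirming that each symlinked wrapper emits exactly the original bytes. The sketch below is not part of the original harness; `same_output` is a hypothetical helper that pipes a command's stdout into `cmp` against a reference file.

```sh
#!/bin/sh
# same_output ORIGINAL CMD [ARGS...]
# Succeeds when CMD writes exactly the bytes of ORIGINAL to stdout.
same_output() {
    original=$1
    shift
    "$@" | cmp -s - "$original"
}

# Intended use with the wrappers above (left commented out, since it needs
# y.js and the compressed files to be present):
#   for w in ./brotli ./zstd-*; do
#       same_output y.js "$w" || echo "mismatch: $w"
#   done
```

The function's exit status is that of `cmp`, so a wrapper that produces truncated or different output is reported as a mismatch.
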
Compression Ratio
-----------------
```
Filename      Bytes     % larger than br
y.js.br       1380352     0.00%
y.js.22.zst   1437696     4.15%
y.js.19.zst   1437696     4.15%
y.js.15.zst   1548288    12.17%
y.js.12.zst   1581056    14.54%
y.js.9.zst    1609728    16.62%
y.js.6.zst    1687552    22.26%
y.js.3.zst    1892352    37.09%
y.js          8519680   517.21%
```
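
The last column is each size relative to `y.js.br`, minus one, as a percentage. A minimal sketch of that arithmetic, assuming `awk` is available (the `pct_larger` name is made up for illustration):

```sh
#!/bin/sh
# pct_larger SIZE BASELINE
# Prints how much larger SIZE is than BASELINE, in percent, two decimals.
pct_larger() {
    awk -v size="$1" -v base="$2" \
        'BEGIN { printf "%.2f\n", (size / base - 1) * 100 }'
}

pct_larger 1437696 1380352   # y.js.19.zst vs y.js.br, prints 4.15
pct_larger 8519680 1380352   # uncompressed y.js vs y.js.br, prints 517.21
```

With the files at hand, the byte counts could be fed in with something like `pct_larger "$(wc -c < y.js.19.zst)" "$(wc -c < y.js.br)"`.
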
As we can see, `zstd -19` produced about 4% larger output for this file than
brotli. We should keep in mind that [brotli has web-specific tricks][4], giving
brotli somewhat of an advantage with this corpus.
Since `zstd -19` and `zstd -22` yield the same compression ratio, I will
exclude `zstd -22` from most of the tests; it appears only in the Raspberry Pi
timing table.
Decompression Speed
-------------------
```
hyperfine --export-markdown $(hostname) -w 1 -N ./brotli ./zstd-{3,6,9,19}
```
## AMD Ryzen
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./zstd-6` | 8.4 ± 0.4 | 7.8 | 11.3 | 1.00 |
| `./zstd-3` | 8.7 ± 0.4 | 8.0 | 11.6 | 1.03 ± 0.07 |
| `./zstd-9` | 9.1 ± 0.6 | 8.3 | 13.6 | 1.08 ± 0.09 |
| `./zstd-19` | 10.8 ± 0.7 | 9.8 | 14.1 | 1.28 ± 0.10 |
| `./brotli` | 15.5 ± 0.8 | 14.4 | 19.4 | 1.83 ± 0.13 |
## Raspberry Pi 4
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./zstd-6` | 53.9 ± 0.6 | 52.6 | 55.3 | 1.00 |
| `./zstd-3` | 56.2 ± 2.3 | 54.9 | 68.2 | 1.04 ± 0.04 |
| `./zstd-9` | 57.7 ± 0.6 | 56.7 | 59.3 | 1.07 ± 0.02 |
| `./zstd-19` | 65.3 ± 0.6 | 64.3 | 66.8 | 1.21 ± 0.02 |
| `./zstd-22` | 65.1 ± 0.5 | 64.2 | 66.3 | 1.21 ± 0.02 |
| `./brotli` | 82.7 ± 2.2 | 81.5 | 91.2 | 1.53 ± 0.04 |
Summary: `zstd -6` is fastest, brotli is slower by 50-80%.
Memory Usage
------------
```
poop ./zstd-{6,3,9,19} ./brotli
```
## AMD Ryzen
```
Benchmark 1 (563 runs): ./zstd-6
measurement mean ± σ min … max outliers delta
wall_time 8.83ms ± 1.02ms 7.79ms … 11.8ms 89 (16%) 0%
peak_rss 6.31MB ± 84.0KB 6.03MB … 6.42MB 1 ( 0%) 0%
cpu_cycles 29.1M ± 565K 28.6M … 37.5M 43 ( 8%) 0%
instructions 90.6M ± 14.9K 90.6M … 90.7M 20 ( 4%) 0%
cache_references 1.76M ± 23.0K 1.72M … 1.99M 26 ( 5%) 0%
cache_misses 122K ± 2.76K 116K … 136K 3 ( 1%) 0%
branch_misses 334K ± 1.25K 332K … 342K 51 ( 9%) 0%
Benchmark 2 (546 runs): ./zstd-3
measurement mean ± σ min … max outliers delta
wall_time 9.11ms ± 1.01ms 8.13ms … 12.0ms 84 (15%) 💩+ 3.2% ± 1.4%
peak_rss 6.31MB ± 84.9KB 6.03MB … 6.42MB 1 ( 0%) - 0.1% ± 0.2%
cpu_cycles 30.3M ± 428K 29.9M … 34.3M 44 ( 8%) 💩+ 4.2% ± 0.2%
instructions 97.5M ± 14.7K 97.5M … 97.5M 24 ( 4%) 💩+ 7.6% ± 0.0%
cache_references 1.85M ± 21.9K 1.80M … 2.05M 29 ( 5%) 💩+ 4.8% ± 0.2%
cache_misses 125K ± 2.57K 119K … 142K 12 ( 2%) 💩+ 2.2% ± 0.3%
branch_misses 319K ± 1.62K 317K … 329K 49 ( 9%) ⚡- 4.3% ± 0.1%
Benchmark 3 (524 runs): ./zstd-9
measurement mean ± σ min … max outliers delta
wall_time 9.51ms ± 1.03ms 8.36ms … 12.7ms 82 (16%) 💩+ 7.6% ± 1.4%
peak_rss 8.40MB ± 84.2KB 8.13MB … 8.65MB 4 ( 1%) 💩+ 33.1% ± 0.2%
cpu_cycles 29.1M ± 1.21M 28.1M … 33.7M 80 (15%) - 0.1% ± 0.4%
instructions 85.2M ± 14.1K 85.2M … 85.2M 20 ( 4%) ⚡- 6.0% ± 0.0%
cache_references 1.79M ± 16.4K 1.76M … 1.87M 19 ( 4%) 💩+ 1.7% ± 0.1%
cache_misses 156K ± 2.76K 151K … 168K 10 ( 2%) 💩+ 28.4% ± 0.3%
branch_misses 331K ± 1.04K 329K … 336K 39 ( 7%) - 0.9% ± 0.0%
Benchmark 4 (442 runs): ./zstd-19
measurement mean ± σ min … max outliers delta
wall_time 11.3ms ± 1.19ms 10.00ms … 15.0ms 86 (19%) 💩+ 27.6% ± 1.5%
peak_rss 12.5MB ± 87.9KB 12.2MB … 12.6MB 1 ( 0%) 💩+ 97.6% ± 0.2%
cpu_cycles 31.3M ± 1.58M 30.2M … 39.2M 56 (13%) 💩+ 7.5% ± 0.5%
instructions 88.5M ± 15.5K 88.4M … 88.5M 22 ( 5%) ⚡- 2.4% ± 0.0%
cache_references 1.81M ± 18.2K 1.77M … 1.94M 20 ( 5%) 💩+ 2.6% ± 0.1%
cache_misses 192K ± 2.68K 186K … 200K 3 ( 1%) 💩+ 57.3% ± 0.3%
branch_misses 346K ± 1.06K 344K … 352K 23 ( 5%) 💩+ 3.6% ± 0.0%
Benchmark 5 (316 runs): ./brotli
measurement mean ± σ min … max outliers delta
wall_time 15.8ms ± 1.27ms 14.6ms … 20.5ms 72 (23%) 💩+ 78.7% ± 1.7%
peak_rss 12.0MB ± 102KB 11.7MB … 12.2MB 2 ( 1%) 💩+ 90.3% ± 0.2%
cpu_cycles 52.9M ± 1.53M 51.8M … 71.4M 12 ( 4%) 💩+ 81.7% ± 0.5%
instructions 101M ± 14.4K 101M … 101M 8 ( 3%) 💩+ 11.8% ± 0.0%
cache_references 1.96M ± 155K 1.91M … 3.45M 11 ( 3%) 💩+ 11.3% ± 0.7%
cache_misses 165K ± 1.60K 161K … 172K 1 ( 0%) 💩+ 35.5% ± 0.3%
branch_misses 898K ± 905 896K … 903K 9 ( 3%) 💩+169.1% ± 0.0%
```
## Raspberry Pi 4
```
Benchmark 1 (91 runs): ./zstd-6
measurement mean ± σ min … max outliers delta
wall_time 54.8ms ± 1.71ms 53.0ms … 63.7ms 6 ( 7%) 0%
peak_rss 5.69MB ± 69.8KB 5.51MB … 5.77MB 0 ( 0%) 0%
cpu_cycles 65.7M ± 2.23M 63.4M … 77.1M 12 (13%) 0%
instructions 82.2M ± 888 82.2M … 82.2M 4 ( 4%) 0%
cache_references 29.2M ± 14.1K 29.2M … 29.3M 2 ( 2%) 0%
cache_misses 666K ± 116K 553K … 1.02M 11 (12%) 0%
branch_misses 344K ± 1.50K 341K … 349K 1 ( 1%) 0%
Benchmark 2 (89 runs): ./zstd-3
measurement mean ± σ min … max outliers delta
wall_time 56.2ms ± 871us 55.0ms … 60.0ms 3 ( 3%) 💩+ 2.6% ± 0.7%
peak_rss 5.68MB ± 67.3KB 5.51MB … 5.77MB 0 ( 0%) - 0.2% ± 0.4%
cpu_cycles 68.1M ± 1.14M 66.5M … 73.7M 4 ( 4%) 💩+ 3.6% ± 0.8%
instructions 88.5M ± 436 88.5M … 88.5M 2 ( 2%) 💩+ 7.7% ± 0.0%
cache_references 31.4M ± 10.8K 31.4M … 31.4M 6 ( 7%) 💩+ 7.4% ± 0.0%
cache_misses 676K ± 100K 577K … 1.06M 5 ( 6%) + 1.5% ± 4.8%
branch_misses 326K ± 1.36K 322K … 328K 0 ( 0%) ⚡- 5.3% ± 0.1%
Benchmark 3 (85 runs): ./zstd-9
measurement mean ± σ min … max outliers delta
wall_time 58.6ms ± 2.36ms 56.8ms … 70.5ms 5 ( 6%) 💩+ 7.0% ± 1.1%
peak_rss 7.77MB ± 72.7KB 7.60MB … 8.00MB 0 ( 0%) 💩+ 36.6% ± 0.4%
cpu_cycles 67.7M ± 2.64M 65.8M … 81.7M 7 ( 8%) 💩+ 3.1% ± 1.1%
instructions 77.4M ± 923 77.4M … 77.4M 6 ( 7%) ⚡- 5.8% ± 0.0%
cache_references 27.7M ± 11.9K 27.7M … 27.7M 8 ( 9%) ⚡- 5.4% ± 0.0%
cache_misses 661K ± 86.0K 563K … 958K 6 ( 7%) - 0.8% ± 4.6%
branch_misses 341K ± 1.23K 338K … 344K 0 ( 0%) - 0.8% ± 0.1%
Benchmark 4 (76 runs): ./zstd-19
measurement mean ± σ min … max outliers delta
wall_time 66.0ms ± 811us 64.7ms … 68.9ms 4 ( 5%) 💩+ 20.5% ± 0.8%
peak_rss 11.9MB ± 49.9KB 11.8MB … 12.1MB 11 (14%) 💩+109.8% ± 0.3%
cpu_cycles 71.4M ± 981K 70.0M … 75.1M 7 ( 9%) 💩+ 8.7% ± 0.8%
instructions 80.1M ± 413 80.1M … 80.1M 0 ( 0%) ⚡- 2.5% ± 0.0%
cache_references 28.6M ± 11.8K 28.6M … 28.6M 12 (16%) ⚡- 2.3% ± 0.0%
cache_misses 670K ± 96.4K 560K … 990K 7 ( 9%) + 0.6% ± 4.9%
branch_misses 355K ± 726 352K … 356K 2 ( 3%) 💩+ 3.1% ± 0.1%
Benchmark 5 (61 runs): ./brotli
measurement mean ± σ min … max outliers delta
wall_time 82.7ms ± 1.69ms 81.1ms … 91.8ms 2 ( 3%) 💩+ 50.9% ± 1.0%
peak_rss 11.4MB ± 0 11.4MB … 11.4MB 0 ( 0%) 💩+100.5% ± 0.3%
cpu_cycles 94.1M ± 2.12M 92.1M … 105M 2 ( 3%) 💩+ 43.2% ± 1.1%
instructions 98.8M ± 11.2 98.8M … 98.8M 10 (16%) 💩+ 20.2% ± 0.0%
cache_references 42.9M ± 46.3K 42.8M … 43.0M 1 ( 2%) 💩+ 46.6% ± 0.0%
cache_misses 933K ± 53.1K 874K … 1.06M 0 ( 0%) 💩+ 40.0% ± 4.7%
branch_misses 999K ± 931 997K … 1.00M 1 ( 2%) 💩+190.5% ± 0.1%
```
Summary: during decompression, `brotli` uses roughly twice the peak memory of `zstd -6` and 40-80% more CPU cycles. `zstd -19` has about the same peak memory footprint as `brotli`.
[1]: https://github.com/NixOS/nixpkgs/pull/332752#issuecomment-2271803132
[2]: https://github.com/NixOS/nixpkgs/commit/883180e6550c1723395a3a342f830bfc5c371f6b
[3]: https://moz.com/top500
[4]: https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c370f0f3#file-dictionary-bin-L9850
[^1]: anecdatal is a creative variation of
[anecdata](https://www.urbandictionary.com/define.php?term=anecdata).