Web Compression

2024-08-07 14:56:11 +03:00 · 2024-08-07 14:56:11 +03:00 · ddb706884a
commit ddb706884a
parent 0b7b7473d7
1 changed files with 263 additions and 0 deletions
--- a/content/log/2024/web-compression.md
+++ b/content/log/2024/web-compression.md
@@ -0,0 +1,263 @@
+---
+title: "Web Compression"
+date: 2024-08-07T13:53:10+03:00
+---
+
+I wrote [this comment][1] about pre-compressing web artifacts with zstd:
+
+> I have read somewhere (can't find links handy) that for web server case, zstd
+> may not be as useful as brotli due to longer decompression speed, but I may
+> be wrong here.
+
+That felt wrong — someone suggesting a cool change in a module, and I am just
+FUDing it. If I were the PR submitter, I would certainly not appreciate this
+comment. So I decided to conduct a non-scientific experiment: take a big piece
+of JavaScript and compare brotli with zstd.
+
+Executive Summary
+-----------------
+
+* For my chosen JavaScript file, zstd output is 4-37% larger than brotli's.
+* zstd decompresses faster: brotli takes 50-80% longer (depending on platform)
+  and uses more system resources.
+
+As a result, I will re-phrase my comment on GitHub and welcome `zstd` to the
+default compressors.
+
+Benchmark Setup
+---------------
+
+Hardware:
+
+1. AMD Ryzen 7 7840HS, DDR5-5600.
+2. Raspberry Pi 4. Linux v6.6.44. NixOS [24.05-2908-g883180e6550c][2].
+
+CPU scaling governor set to `performance` on both nodes:
+
+```
+for f in  /sys/devices/system/cpu/cpufreq/*/scaling_governor; do echo 'performance' | sudo tee $f; done
+```
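To confirm the governor change took effect, the same sysfs files can be read back. A minimal sketch, assuming the standard cpufreq layout; the `show_governors` helper is hypothetical and takes an optional directory argument so it can be pointed elsewhere:

```sh
#!/bin/sh
# Hypothetical helper: print the governor of every cpufreq policy found
# under the given directory (defaults to the usual sysfs location).
show_governors() {
    dir=${1:-/sys/devices/system/cpu/cpufreq}
    for f in "$dir"/*/scaling_governor; do
        # Skip the unexpanded glob when no policy directories exist.
        [ -r "$f" ] || continue
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}

show_governors
```

Every line should end in `performance` before the benchmarks are started.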
+
+Software:
+- NixOS [24.05-2908-g883180e6550c][2].
+- Linux v6.6.44.
+- brotli 1.1.0 from the distribution.
+- zstd v1.5.6 from the distribution.
+
+Test Harness
+------------
+
+I picked YouTube's `desktop_polymer.js`, because:
+
+1. That file weighs 8.52MB.
+2. YouTube is a [somewhat frequently accessed website][3], so that file is
+   frequently downloaded and decompressed, making it somewhat representative,
+   albeit anecdatal[^1].
+
+Acquiring and compressing it:
+
+```
+$ wget https://www.youtube.com/s/desktop/bf8c00d7/jsbin/desktop_polymer.vflset/desktop_polymer.js -O y.js
+$ for prog in 'zstd -3' 'zstd -6' 'zstd -9' 'zstd -12' 'zstd -15' 'zstd -19' 'zstd --ultra -22'; do $prog y.js -o y.js.${prog##*-}.zst; done
+$ brotli y.js
+```
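The output names in the loop above come from the `${prog##*-}` expansion, which removes the longest prefix ending in `-` and so leaves only the compression level. A minimal sketch that prints the resulting names without running zstd; the `zst_name` helper is hypothetical and not part of the original setup:

```sh
#!/bin/sh
# Hypothetical helper: derive the output file name used by the loop above.
# ${1##*-} strips the longest prefix matching "*-", so only the text after
# the last dash (the compression level) remains.
zst_name() {
    printf 'y.js.%s.zst\n' "${1##*-}"
}

for prog in 'zstd -3' 'zstd -19' 'zstd --ultra -22'; do
    zst_name "$prog"
done
```

This is why `zstd --ultra -22` ends up as `y.js.22.zst`: the `--ultra` flag sits before the last dash and is stripped along with the rest of the prefix.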
+
+[poop](https://github.com/andrewrk/poop) accepts a single command to run, so we
+have this wrapper:
+
+```sh
+#!/bin/sh
+set -e
+FILE=y.js
+case "$0" in
+    ./brotli)
+        exec brotli -cd ${FILE}.br
+        ;;
+    ./zstd-*)
+        level=${0#./zstd-}
+        exec zstd -cd ${FILE}.${level}.zst
+        ;;
+    *)
+        >&2 echo "invalid program $0"
+        exit 1
+        ;;
+esac
+```
+
+Then symlink to it for each compression level:
+
+```
+$ for l in 3 6 9 19 22; do ln -s wrap zstd-${l}; done
+$ ln -s wrap brotli
+```
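The wrapper picks its command from the name it was invoked under (`$0`), which is why one script plus a handful of symlinks is enough. A minimal sketch of that dispatch, written as a function that prints the command instead of exec'ing it; the `pick_command` name is hypothetical:

```sh
#!/bin/sh
# Hypothetical helper mirroring the wrapper's case statement: it prints the
# command that would run for a given invocation name instead of exec'ing it.
pick_command() {
    case "$1" in
        ./brotli)
            echo "brotli -cd y.js.br"
            ;;
        ./zstd-*)
            # ${1#./zstd-} drops the "./zstd-" prefix, leaving the level.
            echo "zstd -cd y.js.${1#./zstd-}.zst"
            ;;
        *)
            echo "invalid program $1" >&2
            return 1
            ;;
    esac
}

pick_command ./zstd-19
pick_command ./brotli
```

Keeping the dispatch inside a single script means every benchmarked command goes through the same shell startup path, so that overhead is the same for each of them.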
+
+Compression Ratio
+-----------------
+
+```
+Filename     Bytes    % larger than br
+y.js.br      1380352             0.00%
+y.js.22.zst  1437696             4.15%
+y.js.19.zst  1437696             4.15%
+y.js.15.zst  1548288            12.17%
+y.js.12.zst  1581056            14.54%
+y.js.9.zst   1609728            16.62%
+y.js.6.zst   1687552            22.26%
+y.js.3.zst   1892352            37.09%
+y.js         8519680           517.21%
+```
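The last column of the table above can be recomputed from the byte counts as `(size - base) / base * 100`. A minimal sketch using `awk` for the floating-point arithmetic; the `pct_larger` helper is hypothetical:

```sh
#!/bin/sh
# Hypothetical helper: how much larger (in percent) a file of $1 bytes is
# than a baseline of $2 bytes, printed with two decimals.
pct_larger() {
    awk -v size="$1" -v base="$2" \
        'BEGIN { printf "%.2f%%\n", (size - base) / base * 100 }'
}

pct_larger 1437696 1380352   # y.js.19.zst vs y.js.br
pct_larger 1892352 1380352   # y.js.3.zst vs y.js.br
```

With the sizes from the table, these two calls print `4.15%` and `37.09%`.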
+
+As we can see, `zstd -19` yielded 4% worse compression for this file than
+brotli. We should keep in mind that [brotli has web-specific tricks][4], giving
+it somewhat of an advantage over zstd with this corpus.
+
+Since `zstd -19` and `zstd -22` yield the same compression ratio, I will
+exclude `zstd -22` from the tests, except the Raspberry Pi 4 timing table.
+
+Decompression Speed
+-------------------
+
+```
+hyperfine --export-markdown $(hostname) -w 1 -N ./brotli ./zstd-{3,6,9,19}
+```
+
+## AMD Ryzen
+
+| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
+|:---|---:|---:|---:|---:|
+| `./zstd-6` | 8.4 ± 0.4 | 7.8 | 11.3 | 1.00 |
+| `./zstd-3` | 8.7 ± 0.4 | 8.0 | 11.6 | 1.03 ± 0.07 |
+| `./zstd-9` | 9.1 ± 0.6 | 8.3 | 13.6 | 1.08 ± 0.09 |
+| `./zstd-19` | 10.8 ± 0.7 | 9.8 | 14.1 | 1.28 ± 0.10 |
+| `./brotli` | 15.5 ± 0.8 | 14.4 | 19.4 | 1.83 ± 0.13 |
+
+
+## Raspberry Pi 4
+
+| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
+|:---|---:|---:|---:|---:|
+| `./zstd-6` | 53.9 ± 0.6 | 52.6 | 55.3 | 1.00 |
+| `./zstd-3` | 56.2 ± 2.3 | 54.9 | 68.2 | 1.04 ± 0.04 |
+| `./zstd-9` | 57.7 ± 0.6 | 56.7 | 59.3 | 1.07 ± 0.02 |
+| `./zstd-19` | 65.3 ± 0.6 | 64.3 | 66.8 | 1.21 ± 0.02 |
+| `./zstd-22` | 65.1 ± 0.5 | 64.2 | 66.3 | 1.21 ± 0.02 |
+| `./brotli` | 82.7 ± 2.2 | 81.5 | 91.2 | 1.53 ± 0.04 |
+
+Summary: `zstd -6` is fastest, brotli is slower by 50-80%.
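The "slower by" figure is the ratio of mean times minus one, which is the `Relative` column expressed as a percentage. A minimal sketch with a hypothetical `slower_pct` helper, fed the rounded Raspberry Pi 4 means from the table above:

```sh
#!/bin/sh
# Hypothetical helper: percentage by which a mean time $1 exceeds the
# baseline mean $2 (both in the same unit, here milliseconds).
slower_pct() {
    awk -v t="$1" -v base="$2" \
        'BEGIN { printf "%.1f%%\n", (t / base - 1) * 100 }'
}

slower_pct 82.7 53.9   # brotli vs zstd-6 on the Raspberry Pi 4
```

This prints `53.4%`, in line with the 1.53 reported in the `Relative` column. Because the table shows rounded means, recomputing a ratio this way can differ slightly from the value hyperfine reports.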
+
+Memory Usage
+------------
+
+```
+poop ./zstd-{6,3,9,19} ./brotli
+```
+
+## AMD Ryzen
+
+```
+Benchmark 1 (563 runs): ./zstd-6
+  measurement          mean ± σ            min … max           outliers         delta
+  wall_time          8.83ms ± 1.02ms    7.79ms … 11.8ms         89 (16%)        0%
+  peak_rss           6.31MB ± 84.0KB    6.03MB … 6.42MB          1 ( 0%)        0%
+  cpu_cycles         29.1M  ±  565K     28.6M  … 37.5M          43 ( 8%)        0%
+  instructions       90.6M  ± 14.9K     90.6M  … 90.7M          20 ( 4%)        0%
+  cache_references   1.76M  ± 23.0K     1.72M  … 1.99M          26 ( 5%)        0%
+  cache_misses        122K  ± 2.76K      116K  …  136K           3 ( 1%)        0%
+  branch_misses       334K  ± 1.25K      332K  …  342K          51 ( 9%)        0%
+Benchmark 2 (546 runs): ./zstd-3
+  measurement          mean ± σ            min … max           outliers         delta
+  wall_time          9.11ms ± 1.01ms    8.13ms … 12.0ms         84 (15%)        💩+  3.2% ±  1.4%
+  peak_rss           6.31MB ± 84.9KB    6.03MB … 6.42MB          1 ( 0%)          -  0.1% ±  0.2%
+  cpu_cycles         30.3M  ±  428K     29.9M  … 34.3M          44 ( 8%)        💩+  4.2% ±  0.2%
+  instructions       97.5M  ± 14.7K     97.5M  … 97.5M          24 ( 4%)        💩+  7.6% ±  0.0%
+  cache_references   1.85M  ± 21.9K     1.80M  … 2.05M          29 ( 5%)        💩+  4.8% ±  0.2%
+  cache_misses        125K  ± 2.57K      119K  …  142K          12 ( 2%)        💩+  2.2% ±  0.3%
+  branch_misses       319K  ± 1.62K      317K  …  329K          49 ( 9%)        ⚡-  4.3% ±  0.1%
+Benchmark 3 (524 runs): ./zstd-9
+  measurement          mean ± σ            min … max           outliers         delta
+  wall_time          9.51ms ± 1.03ms    8.36ms … 12.7ms         82 (16%)        💩+  7.6% ±  1.4%
+  peak_rss           8.40MB ± 84.2KB    8.13MB … 8.65MB          4 ( 1%)        💩+ 33.1% ±  0.2%
+  cpu_cycles         29.1M  ± 1.21M     28.1M  … 33.7M          80 (15%)          -  0.1% ±  0.4%
+  instructions       85.2M  ± 14.1K     85.2M  … 85.2M          20 ( 4%)        ⚡-  6.0% ±  0.0%
+  cache_references   1.79M  ± 16.4K     1.76M  … 1.87M          19 ( 4%)        💩+  1.7% ±  0.1%
+  cache_misses        156K  ± 2.76K      151K  …  168K          10 ( 2%)        💩+ 28.4% ±  0.3%
+  branch_misses       331K  ± 1.04K      329K  …  336K          39 ( 7%)          -  0.9% ±  0.0%
+Benchmark 4 (442 runs): ./zstd-19
+  measurement          mean ± σ            min … max           outliers         delta
+  wall_time          11.3ms ± 1.19ms    10.00ms … 15.0ms        86 (19%)        💩+ 27.6% ±  1.5%
+  peak_rss           12.5MB ± 87.9KB    12.2MB … 12.6MB          1 ( 0%)        💩+ 97.6% ±  0.2%
+  cpu_cycles         31.3M  ± 1.58M     30.2M  … 39.2M          56 (13%)        💩+  7.5% ±  0.5%
+  instructions       88.5M  ± 15.5K     88.4M  … 88.5M          22 ( 5%)        ⚡-  2.4% ±  0.0%
+  cache_references   1.81M  ± 18.2K     1.77M  … 1.94M          20 ( 5%)        💩+  2.6% ±  0.1%
+  cache_misses        192K  ± 2.68K      186K  …  200K           3 ( 1%)        💩+ 57.3% ±  0.3%
+  branch_misses       346K  ± 1.06K      344K  …  352K          23 ( 5%)        💩+  3.6% ±  0.0%
+Benchmark 5 (316 runs): ./brotli
+  measurement          mean ± σ            min … max           outliers         delta
+  wall_time          15.8ms ± 1.27ms    14.6ms … 20.5ms         72 (23%)        💩+ 78.7% ±  1.7%
+  peak_rss           12.0MB ±  102KB    11.7MB … 12.2MB          2 ( 1%)        💩+ 90.3% ±  0.2%
+  cpu_cycles         52.9M  ± 1.53M     51.8M  … 71.4M          12 ( 4%)        💩+ 81.7% ±  0.5%
+  instructions        101M  ± 14.4K      101M  …  101M           8 ( 3%)        💩+ 11.8% ±  0.0%
+  cache_references   1.96M  ±  155K     1.91M  … 3.45M          11 ( 3%)        💩+ 11.3% ±  0.7%
+  cache_misses        165K  ± 1.60K      161K  …  172K           1 ( 0%)        💩+ 35.5% ±  0.3%
+  branch_misses       898K  ±  905       896K  …  903K           9 ( 3%)        💩+169.1% ±  0.0%
+```
+
+## Raspberry Pi 4
+
+```
+Benchmark 1 (91 runs): ./zstd-6
+  measurement          mean ± σ            min … max           outliers         delta
+  wall_time          54.8ms ± 1.71ms    53.0ms … 63.7ms          6 ( 7%)        0%
+  peak_rss           5.69MB ± 69.8KB    5.51MB … 5.77MB          0 ( 0%)        0%
+  cpu_cycles         65.7M  ± 2.23M     63.4M  … 77.1M          12 (13%)        0%
+  instructions       82.2M  ±  888      82.2M  … 82.2M           4 ( 4%)        0%
+  cache_references   29.2M  ± 14.1K     29.2M  … 29.3M           2 ( 2%)        0%
+  cache_misses        666K  ±  116K      553K  … 1.02M          11 (12%)        0%
+  branch_misses       344K  ± 1.50K      341K  …  349K           1 ( 1%)        0%
+Benchmark 2 (89 runs): ./zstd-3
+  measurement          mean ± σ            min … max           outliers         delta
+  wall_time          56.2ms ±  871us    55.0ms … 60.0ms          3 ( 3%)        💩+  2.6% ±  0.7%
+  peak_rss           5.68MB ± 67.3KB    5.51MB … 5.77MB          0 ( 0%)          -  0.2% ±  0.4%
+  cpu_cycles         68.1M  ± 1.14M     66.5M  … 73.7M           4 ( 4%)        💩+  3.6% ±  0.8%
+  instructions       88.5M  ±  436      88.5M  … 88.5M           2 ( 2%)        💩+  7.7% ±  0.0%
+  cache_references   31.4M  ± 10.8K     31.4M  … 31.4M           6 ( 7%)        💩+  7.4% ±  0.0%
+  cache_misses        676K  ±  100K      577K  … 1.06M           5 ( 6%)          +  1.5% ±  4.8%
+  branch_misses       326K  ± 1.36K      322K  …  328K           0 ( 0%)        ⚡-  5.3% ±  0.1%
+Benchmark 3 (85 runs): ./zstd-9
+  measurement          mean ± σ            min … max           outliers         delta
+  wall_time          58.6ms ± 2.36ms    56.8ms … 70.5ms          5 ( 6%)        💩+  7.0% ±  1.1%
+  peak_rss           7.77MB ± 72.7KB    7.60MB … 8.00MB          0 ( 0%)        💩+ 36.6% ±  0.4%
+  cpu_cycles         67.7M  ± 2.64M     65.8M  … 81.7M           7 ( 8%)        💩+  3.1% ±  1.1%
+  instructions       77.4M  ±  923      77.4M  … 77.4M           6 ( 7%)        ⚡-  5.8% ±  0.0%
+  cache_references   27.7M  ± 11.9K     27.7M  … 27.7M           8 ( 9%)        ⚡-  5.4% ±  0.0%
+  cache_misses        661K  ± 86.0K      563K  …  958K           6 ( 7%)          -  0.8% ±  4.6%
+  branch_misses       341K  ± 1.23K      338K  …  344K           0 ( 0%)          -  0.8% ±  0.1%
+Benchmark 4 (76 runs): ./zstd-19
+  measurement          mean ± σ            min … max           outliers         delta
+  wall_time          66.0ms ±  811us    64.7ms … 68.9ms          4 ( 5%)        💩+ 20.5% ±  0.8%
+  peak_rss           11.9MB ± 49.9KB    11.8MB … 12.1MB         11 (14%)        💩+109.8% ±  0.3%
+  cpu_cycles         71.4M  ±  981K     70.0M  … 75.1M           7 ( 9%)        💩+  8.7% ±  0.8%
+  instructions       80.1M  ±  413      80.1M  … 80.1M           0 ( 0%)        ⚡-  2.5% ±  0.0%
+  cache_references   28.6M  ± 11.8K     28.6M  … 28.6M          12 (16%)        ⚡-  2.3% ±  0.0%
+  cache_misses        670K  ± 96.4K      560K  …  990K           7 ( 9%)          +  0.6% ±  4.9%
+  branch_misses       355K  ±  726       352K  …  356K           2 ( 3%)        💩+  3.1% ±  0.1%
+Benchmark 5 (61 runs): ./brotli
+  measurement          mean ± σ            min … max           outliers         delta
+  wall_time          82.7ms ± 1.69ms    81.1ms … 91.8ms          2 ( 3%)        💩+ 50.9% ±  1.0%
+  peak_rss           11.4MB ±    0      11.4MB … 11.4MB          0 ( 0%)        💩+100.5% ±  0.3%
+  cpu_cycles         94.1M  ± 2.12M     92.1M  …  105M           2 ( 3%)        💩+ 43.2% ±  1.1%
+  instructions       98.8M  ± 11.2      98.8M  … 98.8M          10 (16%)        💩+ 20.2% ±  0.0%
+  cache_references   42.9M  ± 46.3K     42.8M  … 43.0M           1 ( 2%)        💩+ 46.6% ±  0.0%
+  cache_misses        933K  ± 53.1K      874K  … 1.06M           0 ( 0%)        💩+ 40.0% ±  4.7%
+  branch_misses       999K  ±  931       997K  … 1.00M           1 ( 2%)        💩+190.5% ±  0.1%
+```
+
+Summary: during decompression, `brotli` uses significantly more resources than `zstd -6`: about twice the peak memory and 43-82% more CPU cycles.
+
+[1]: https://github.com/NixOS/nixpkgs/pull/332752#issuecomment-2271803132
+[2]: https://github.com/NixOS/nixpkgs/commit/883180e6550c1723395a3a342f830bfc5c371f6b
+[3]: https://moz.com/top500
+[4]: https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c370f0f3#file-dictionary-bin-L9850
+
+[^1]: anecdatal is a creative variation of
+    [anecdata](https://www.urbandictionary.com/define.php?term=anecdata).