This also modifies the inline assembly to be more optimizable - instead of
doing explicit movs, we instead communicate to LLVM which registers we
would like to, somehow, have the correct values. This is how the x86_64
code already worked and thus allows the code to be unified across the
two architectures.
As a bonus, I threw in x86 support.