Recently I spent some time thinking about rowhammer and integrity issues in general. So, when a colleague asked about software-based mitigations to detect errors on faulty hardware, mainly in the context of CPUs in outer space suffering from cosmic rays corruption, it got me thinking.
I was aware that spaceships usually employ–very expensive–hardened CPUs, with special materials, shielding, and redundancy (both spatial, i.e., add resoures, and temporal, i.e., repeat computations). For instance, Perseverance, the recently arrived to Mars rover, runs on a RAD750 PowerPC CPU that costs ~$300,000; or critical systems (like plains or rockets) also tend to duplicate (or triplicate) resources and rely on some majority vote mechanism.
Redundancy is easy to implement and quite useful, and there are many interesting things that I would love more people to research. But what occurred to me was: what if we translate values into codewords that preserve the semantic of the operations while allowing us to detect errors? 64 bits are a lot of bits for many use cases, and the CPU computes over all of them anyway, modulo optimizations and power consumption savings due to 0s.
So I got some pen and paper and played a bit with this idea, and at least for addition an subtraction it was easy to come up with a working code: just multiply by a constant. Only values that are multiple of N
are valid codewords, any other is an error. We assign the codeword N*x
to x
, and the original domain of 2^64 values is reduced to 2^64/N. It is easy to see that If the encoding function is E()
, then E(a)+E(b) = E(a+b)
.
I immediately found this extremely appealing. What about more complex operators? Multiplication, division, logic operators, signed values, floating point? At this point I started googling for things like “computing with error codes”, and as you can imagine most of it had already been studied in the 70s. Still, it’s always fun to rediscover an idea :)
My simple code turns out to be the AN-code, which is just an example of arithmetic codes (not to confuse with arithmetic coding). For a nice introduction to the topic and more details I recommend this thesis from 2011, “Hardware Error Detection Using AN-Codes”.
From there I extracted plenty of interesting references. P. Forin proposed in 1989 the error model, which describes all different types of error that can occur in a computation:
- Exchanged operand
- Exchanged operator
- Faulty operation
- Lost update
- Modified operand
Any type of error can be represented as a combiantion of these.
They also distinguish between deterministic, probabilistic, and permanent errors, but the only reason why I mention all this is because different types of codes protect (i.e., detect or correct) against different types of errors, and as you can imagine there are tradeoffs to be made. Some codes are more efficient than others, spacewise or in terms of instructions overhead, the later is because operations over codewords might require additional computation. To show you a simple example, let’s consider multiplication over our previous code: E(a) * E(b) = N*a * N*b = N^2 * a * b != E(a * b)
.
In order to solve this we can modify the operator *
over codewords to further divide the result by N
. This modified operators can be implemented in hardware or in software, the compiler can add any extra instructions.
Even more interesting are the following definitions:
- Systematic: A code where each each codeword is represented by
n
digits is systematic if there exists a set ofk
digits (k < n
) representing the function value and the remainingn-k
representing the check. - Separate: If the codeword is a tuple of the functional value and its check bits, and the encoded operations computes these in parallel; if both the function value and the check bits are processed together by the same operation the code is non-separate. Separate codes are also systematic.
I do not want to spoil all the fun, if you like the idea there seems to be a lot of related work, although to be honest I’m not sure how much recent. Someone also pointed me to private circuits, or more concretely to tamperable circuits, which seem able to provide much stronger guarantees.
I’m specially curious about codes that do not require modifying operators, in fact I was planning to try to throw the problem to an SMT solver. Has anyone done that? I bet the answer is yes. Likewise, modern CPUs have support for SIMD instructions, can that serve to build codes for the full 64-bit values domain? Can we just have a compiler pass for doing that?
Another type of questions that I haven’t pondered much about yet are the potential security use cases. For instance, if the valid codeword for a value (i.e., the remainder) in an AN-code was secret and dynamic, how much harder would explotation be? I know checksums are generally not useful to protect against arbitrary corruption and one needs cryptographic primitives, but who knows. In some way this all reminds me of PAC and MTE.
Anyway, today I just wanted to share something that a few days ago I had no idea about. Disovering new fields/methods always brings me joy :) I’ll probably dig a bit more, mainly to see how far people have gotten. If anyone has more interesting pointers, please share.