473: Math Is Not the Answer
Transcript from 437: Math Is Not the Answer with Philip Koopman, Christopher White, and Elecia White.
EW (00:00:07):
Welcome to Embedded. I am Elecia White, alongside Christopher White. Our guest this week is Philip Koopman, author of a new book, "Understanding Checksums and Cyclic Redundancy Checks." Now wait, before you wander off, there is a lot more we are going to talk about and it is going to be very fun.
CW (00:00:27):
Cyclic redundancy checks are quite fun. <laugh> Hello, Philip. Welcome back to the show. <laugh>
PK (00:00:33):
Thanks for having me. I was going to jump in right there, but I will be snarky later. We are good.
EW (00:00:38):
Could you tell us about yourself as if we met on, I do not know, Open House Day at CMU?
PK (00:00:45):
You are starting with a hard question, because I had done a whole bunch of things. So here is the short version. We are going to come back to my career, maybe, and that is fine.
(00:00:55):
I was in the military. I was in industry. I am currently ending my cycle at university, shifting over to being consultant. And I have been doing embedded systems and embedded networks for a really long time. And also safety, in a variety of places.
(00:01:13):
So I worked for the Navy. I worked for United Technologies. I was a CPU designer at Harris Semiconductor. I have been teaching at Carnegie Mellon University.
(00:01:22):
And I stopped counting at 200 design reviews for industries. Get in a car, get on a plane, go look at some embedded stuff, get them some ideas how to improve. I stopped counting at 200 and that has got to be ten years ago. So, done a lot of stuff. How is that?
EW (00:01:40):
Well, then all this stuff leads to a couple of pretty interesting books about what you have done.
PK (00:01:46):
Yeah. I have been busy lately. There is the "orange book," as I call it, which is the "Better Embedded System Software," which is now, what, more than ten years old, but still going fine. That is what I learned in my first 90 design reviews. I have learned a lot more since then. But that is not the book we are here to talk about today.
EW (00:02:08):
Actually, are you thinking about doing a second edition for that?
PK (00:02:12):
I have to decide. I need- As you well know, Elecia, putting out a book is a somewhat traumatic experience, so I have to recover.
CW (00:02:23):
It turns out the second time is no less traumatic.
EW (00:02:24):
<laugh>
PK (00:02:27):
That is correct. <laugh>
CW (00:02:29):
You think you are going to go in, and just do a second edition and make a few edits. And then suddenly you have rewritten the whole book, and added a hundred pages.
EW (00:02:35):
<laugh> Yes.
PK (00:02:36):
And at this point, I do not know that a second edition is the right thing, because that book has held up remarkably well. The hardcover printing sold out. So it is now paper back in Amazon and doing way better than I expected. But it has not changed that much.
(00:02:52):
So now I think if I were to do a second edition, it might be better to leave as it is and do something else that is more helpful. Rather than putting a lot of effort in just replowing the same ground.
EW (00:03:08):
I did not modify much from my book. I added stuff. I moved it around. I made the instructions better.
PK (00:03:16):
And at some point when you are an inch thick, maybe it is time for another book, right?
CW (00:03:19):
<laugh>
EW (00:03:21):
Yeah. Okay. Are you ready for lightning round?
PK (00:03:24):
Sure. Lay it on me. <laugh>
CW (00:03:27):
Okay. Least favorite polynomial?
PK (00:03:34):
82608EDB. That is my story. I am sticking to it.
EW (00:03:38):
On a scale of one to ten, where one is a Flintstones car, and ten is hover cars of the future, where are we with self-driving?
CW (00:03:45):
<laugh>
PK (00:03:48):
Well, like four or a five.
EW (00:03:52):
<laugh>
CW (00:03:54):
I do not know. That scale is kind of hard to calibrate.
PK (00:03:56):
It is a hard scale.
CW (00:03:58):
So four or five could be just awful, or mediocre. <laugh>
PK (00:04:03):
And that is exactly the issue why I was not sure <laugh>. We can come back- At some point if you want to come back and talk about that, we certainly can. But goodness knows, I have done enough podcasts on that. This is a chance for a breath of fresh air for me.
CW (00:04:17):
Favorite course to teach?
PK (00:04:19):
The one I am teaching now, which is half embedded code quality, one quarter safety, one quarter security.
EW (00:04:28):
What is your preferred programming language?
PK (00:04:31):
I have been using C and C++ because I am old. <laugh> I have stopped counting at 50 programming languages. I will use whatever is thrown at me. But really in embedded, that is still what is out there in embedded. Rust may come by someday, but it is going to take a while. I am kind of language agnostic. Whatever it takes. It is fine.
CW (00:04:55):
Favorite bagel place in Pittsburgh?
PK (00:04:58):
Oh, well, I guess that Bruegger's does pretty well for me. But there are a lot of other places you can go. Boutique places. So it depends what mood I am in.
EW (00:05:10):
Do you have a favorite fictional robot?
PK (00:05:13):
Well, it is a tie. For some reasons it is Huey, Dewey and Louie from "Silent Running." Partly because they are obscure, and partly because that is what I named my Roombas. Okay.
CW (00:05:24):
<laugh>
PK (00:05:24):
But I would be remiss if I did not mention Johnny Cab, because- From the first movie, right? Because Johnny Cab, right? He is hysterical. "We hope you enjoyed the ride." <laugh>
CW (00:05:37):
What movie was that?
PK (00:05:38):
"Terminator."[Total Recall]
CW (00:05:39):
Thank you. Yes.
PK (00:05:40):
1991. Yeah.
EW (00:05:45):
<music> I would like to thank our sponsor this week. Nordic Semiconductor specializes in ultra-low-power wireless communication. They have a wide technology portfolio including Bluetooth Low Energy, low energy audio, Bluetooth mesh, Thread, Matter, Cellular IoT and Wi-Fi.
(00:06:04):
They have thousands of customers worldwide, with 40% market share in Bluetooth Low Energy. They have nearly two million System On a Chips produced every day. Sounds like a system that you cannot go wrong with. So please check out nordicsemi.com.
(00:06:21):
And if you would like to win one of the Power Profiling Kit IIs that our sponsor is so generously handing out, please send us an email. That is, show@embedded.fm and tell us what your favorite PPK2 feature or spec is. Again, thank you to Nordic for sponsoring this week's show. <music>
(00:06:51):
Okay, let us get to the real questions.
PK (00:06:53):
Sure.
EW (00:06:53):
You mentioned an orange book, the "Better Embedded Systems." That has been around a while.
PK (00:07:02):
That was like 2010, I think. The hardcover. Yeah, it has been a while.
EW (00:07:07):
And now you have a book that just recently came out. It is about CRCs and checksums? Is that not a done sort of field?
PK (00:07:19):
Well, it is pretty niche, right? I will save you having to say that. It goes back to the sixties, I guess. While the mechanics of CRCs are done, the understanding of them took decades. And there are still some open questions, as it turns out. I said, "Okay, I think I understand enough now to write the book."
(00:07:43):
And then I got- There is a thing in Chapter 10. I was like, "You know what? All the stuff that has been written in the last 20 years sort of missed the important part. And I do not know the answer either. So we are going to leave that part open." This is the kind of thing that takes decades to figure out.
(00:07:57):
But to be sure, inside the book is the answers to all the questions I have really heard. And there are one or two niche, really, really narrow places that are still open. But for most people's purposes, the answers are all there.
(00:08:13):
It really takes a lot of work to nail it down, because there is so much going on. Really, there is a lot of folklore. There is a lot of misinformation. There are a lot of people just saying, "Oh, this will be fine." And no it is not. So there is just a lot of that.
(00:08:28):
Inventing the technology was very cool, done a long time ago. Understanding it has taken a long time.
EW (00:08:37):
Do you not just take your list of parameters about how long you are willing for your check bits to be, and how- Christopher keeps trying to interrupt- <laugh> And how long it takes to calculate, and how long it takes to verify? You just take all these parameters and you shove them in, and you find out which CRC is best.
CW (00:08:57):
Doing this every show now. Can we start with a definition of what a CRC or checksum is, for people who might not be familiar?
EW (00:09:03):
Why?
CW (00:09:03):
Because I want people to understand what we are talking about, before we ask which elliptical curve equation is the right one to use for some-
PK (00:09:10):
<laugh>
EW (00:09:12):
Okay, let us start with data word-
PK (00:09:15):
In fairness, so you do not beat up Christopher. I was going to go there too. So we are good.
EW (00:09:19):
Okay, so let us start with what is a CRC, and why would I use one?
PK (00:09:25):
I want to start earlier. Let us start with a checksum, and then talk about CRCs. Because they are really different technology.
EW (00:09:32):
Yes.
PK (00:09:32):
Right. Perfect.
EW (00:09:32):
Checksums are easy. Let us all do checksums.
PK (00:09:35):
Well.
CW (00:09:35):
<laugh>
PK (00:09:35):
Yeah, actually not. It depends which one. <laugh> All right, so let us start at the basics. You got a piece of data, and things go wrong in the world. This is how I got involved, because I do safety, I do dependability.
(00:09:49):
Data rots, the bits go bad, bits rot, they flip, whatever. You get a bad bit in a data transmission. You have a piece of memory and it suffers a bit flip before you read it back. And so data can accumulate faults, especially in embedded systems which have long operational cycles and noisy environments.
(00:10:09):
You are going to get bit errors in your network messages and in data storage. We may talk more about network messages, but the principles are pretty much the same.
(00:10:21):
If you have a piece of data and a bit goes bad, you really want to know that it has gone bad. So the idea is you take this piece of data you care about, and you add another piece of data next to it, which is sort of a summary, a digest. It is all the bits mixed together into something really short.
(00:10:38):
What you want is with high probability, if the data you care about has been messed with, if it has gone bad, that this little digest thing next to it will tell you it went bad. So you can do a consistency check between the data and the little digest, the checksum. And the checksum will say, "You know what? That data is not what it is supposed to be, because I have a different answer."
(00:11:05):
When you store the data, you store the data and you run a computation across the data, that crunches it down into a little digest. And you store the digest as well. Then you send it, or you put in memory, whatever. When you pull them back out, you take the data, you run the same computation, and you compare the result to the stored digest. If they are different, you know something went bad. So the idea of a checksum is to give you a high probability way of detecting faults.
EW (00:11:35):
You could do something really silly. Like you could take a blob of data and then you could make a copy of it. But that does not tell you which one is right. So you have to make another copy, so you can vote between three.
CW (00:11:47):
<laugh>
PK (00:11:48):
Oh, it is even worse than that, if you make two or three copies. The typical faults are the bigger the data is, the more probability something will go wrong. So making two copies and comparing, is like the worst idea ever. Because you just doubled the chance one of them will be wrong. And as you said, you do not know which one.
(00:12:06):
So then you send three, and if two compare, probably that is the right one. But what if all three are different?
CW (00:12:13):
50 copies and you average them.
EW (00:12:15):
<laugh>
PK (00:12:19):
Probably that is not how you want to go. The point of checksums- I am going to use checksums generically sometimes, so I include CRCs in them, right. The point of checksums as a generic, is to do better by saying, "Well, I have got a thousand bits or a few kilobits of data-" You can tell I am an embedded network guy.
(00:12:39):
If you have got a few kilobits of data, and I have eight bits or 16 bits. That is really all I need to, with very high probability, detect that something went wrong in the hundreds or thousands of bits. That is the idea.
EW (00:12:53):
So you have these 16, 32 bits that are checking your kilobytes of data. But it only tells you if something is wrong. It does not tell you what is wrong. That is the error correcting.
PK (00:13:07):
So "error detection" is something is wrong. It is a yes-no question. Actually, it is a yes and probably no question.
EW (00:13:14):
<laugh> Right.
PK (00:13:16):
Because there is some probability you got unlucky, and there is a fault in the check value that compensates for the fault in the data value. So a lot of why you would use one checksum or another, or one CRC or another, is to reduce that residual probability of an undetected fault. But yeah, it is telling you it is wrong.
(00:13:39):
Now, you can do better if you have a higher powered- I am using that term loosely. A higher powered check computation. You can actually do error correction, but this book does not go there.
EW (00:13:56):
Okay.
PK (00:13:57):
The reason is there are all these books on coding theory. It is like the world arguably does not need yet another book on coding theory. <laugh> Okay? There are plenty of books on coding theory. Aficionados may differ, and I respect that. That is fine. It is not what I have done for a living.
(00:14:13):
But they all have this one page with some dense maths saying, "Here are CRCs. Now let us get to the cool stuff."
CW (00:14:19):
Right.
EW (00:14:19):
<laugh>
CW (00:14:22):
If you are doing embedded networks, you care about the CRCs and checksums, and that page of math does not do you much good. So this is the book that takes that page or two, expands it to a book. And calls all of those guys out on all the hand waving they are doing, that actually does not give you the answers you need.
EW (00:14:40):
Does your book answer the question of unbalanced zeros or ones? I always thought that if you are dealing with flash, it is not a fifty-fifty probability when a bit goes bad, whether it is going to zero or one.
PK (00:14:58):
That is right. It does in two different dimensions. So what I do- At the beginning of the book, it runs through all the algorithms. We can run down the list, if you want at some point. But it runs down single-sum and dual-sum-
(00:15:10):
I invented a new checksum writing this book, the "Koopman checksum," because my wife said I had to name it after me, and that is a true story. And CRCs. They all have different performance trade-offs.
(00:15:23):
But then I have some chapters saying, "What about this? What about that? What about the other thing?" And one of the sections is, "Well, what about if it is not all zeros or all ones?" Right? For some of the checksums, the data values matter.
(00:15:38):
For any of the checksums that involve integer addition, the data values matter. And the more ones or zeros you have- If it is all zeros or all one, the checksums do really well. And the worst case is actually 50% bit mix. It does not matter which way they fail, 50% bit mix is the worst case, which was kind of interesting.
(00:16:02):
For Koopman checksums and CRCs, the data values do not matter at all. What matters is how many bits have been flipped. It is the direction of the bit flip that matters, not the data values.
(00:16:15):
The technical reason is integer addition deals with carry bits. And the carry bits are affected by the data values. But CRCs are all XOR operations, so the fault pattern is completely unrelated to the data pattern.
EW (00:16:32):
Okay, wait a minute. Say that again. The fault pattern-
PK (00:16:34):
In CRCs, whether you can detect your fault or not, has absolutely nothing to do with the data values.
EW (00:16:40):
But with checksums-
PK (00:16:42):
With addition based checksums, it does. Because in CRCs it is about bit mixing-
(00:16:48):
I can try and verbally sketch that for listeners who have not heard it. Actually, let me quickly sketch them out, so we have something to work with. Okay?
(00:16:59):
The general classes are there are single addition checksums. We take all the- I am going to call them "chunks," to make it obvious that- I do not want to get too much into the specific terminology. You have a big data word, you break it up into chunks. A chunk might be 32 bits or 16 bits or eight bits, but it does not have to be.
(00:17:19):
You have data where you break it up in the chunks, and you add up all the chunks using integer addition. Whatever you get at the end is your checksum value. Whether or not that detects faults, depends on not only where the bits are, but the carry patterns during the addition. If that makes sense.
CW (00:17:37):
Okay, yeah.
PK (00:17:38):
Yeah. Because you are getting mix- If you just XOR all the chunks, what you get is a bitwise XOR, called a "longitudinal redundancy check," LRC. That one can be fooled by pairs of bits in the same location.
(00:17:50):
But if you are adding, you get carries. And you get better mixing, because the carry bits sort of mix the values across, they smear them across the width. So it performs better. You get better fault detection. But you only get the fault detection, if the values smear. And that all depends on the bit values. Right?
(00:18:10):
So depending which bit values were faulted, you will either get a detectable or undetectable two-bit error, for example. Whether it is zero to one, one to zero, does not matter. The direction does not matter. Just that the values matter.
CW (00:18:26):
I expect we will get into this, but is that something that- How do you decide <laugh> which way to go? Do you statistically analyze your data, or is that just not something that people really need to look at?
EW (00:18:40):
No. CRCs are always better.
CW (00:18:41):
Right. That is the question I am asking.
PK (00:18:45):
<laugh> I have a lot of cores on some servers, that have spent a lot of time running Monte Carlo simulations. For me, a billion simulations is just getting warmed up. <laugh> Because it takes that to get the- It is really amazing how many simulations it takes to get a smooth curve. This book took a year, and a lot of that had 8, 16, 24 cores just running flat out for the whole time, to make the graphs in the book.
(00:19:14):
I have a curve in the book that says for each type of checksum, if you have 128- I think it was 128 bit data word, just to keep tractable. If there are zero one bits, it does this. If there is one one bit, it does this. If there are two one bits- You get where this is going, right?
(00:19:30):
There are these interesting curves, that they are worse at the center and really nice at the ends. So that is how I know it is 50%. A thing is I do not actually try and do the mathematical analysis. I do some, but not a lot.
(00:19:47):
I do mostly empirical analysis, because you look at the graph and you say, "Oh, I get it." You can look at the math and, "Well maybe," but if you look at the graph, it is like, "Oh, it is obvious. Yeah, that is what is going on." So a lot of the book has graphs in it.
EW (00:20:00):
I was kind of shocked a few nights ago. I had started reading your book, and my pre-bed book is "Symmetry" by Marcus Du Sautoy- I think is his name. And it is about group theory.
(00:20:17):
I was really surprised to see error identification and error correcting codes in group theory. Now I know I did not prep you for this question at all, so I am basically asking you to think on your feet, but how is group theory related to this?
PK (00:20:38):
Cannot go there. I am not a math guy. Shockingly, perhaps.
EW (00:20:42):
Christopher?
PK (00:20:42):
I am not that heavy into math. Let Christopher take that one. <laugh>
CW (00:20:46):
Oh, yeah. Let me think back to my abstract algebra class from-
EW (00:20:49):
You are the one who said "elliptical curve," so you are on the hook here.
CW (00:20:52):
Let me cast my memory back to 1994, and... nope.
EW (00:20:54):
<laugh>
PK (00:20:54):
<laugh>
CW (00:20:54):
Group theory is very large in error correcting codes. That is all I can tell you. I do not do not remember why.
EW (00:21:02):
Because you just looked it up on your phone. I watched you type it into Wikipedia.
CW (00:21:05):
No, I was checking to make sure I turned something off.
EW (00:21:07):
Oh, all right. <laugh>
CW (00:21:09):
No, I do not have an answer for you. I am sorry.
EW (00:21:11):
But this whole area-
CW (00:21:13):
I got a B minus.
EW (00:21:14):
Of computer science is partially engineering. Because you do have to deal with all these trade-offs of speed of computation, and size of your checksum versus size of your data.
CW (00:21:25):
And not only that, electrical engineering because what kind of wire- It does not matter in networking so much, because you never know what kind of wire you are over. But in some cases-
PK (00:21:33):
Other than it is a bad one.
CW (00:21:34):
Right. You are on a high speed serial link, and then there are encoding things with high speed serial links, where they want to balance the bits. So that might influence your choice. Anyway. Yes.
PK (00:21:42):
That is also in the chapters. Yes. So the thing is that math is not actually the answer here.
EW (00:21:50):
Let me write down that as a show title. <laugh>
PK (00:21:52):
"Math is not the answer." <laugh> That is a good show title. Let me go through the different types, and then I will-
CW (00:21:59):
That is the question!
PK (00:22:01):
I am in a better position to explain why math is not the answer. Okay? So there are single-sum checksums. There are also dual-sum checksums. Many people have heard of a Fletcher checksum, or an Adler checksum. That is what we are talking about.
(00:22:15):
We actually compute two running sums at the same time. One sum is just the same as a single-sum, but the second is the sum of the values of the running sum. So you are saying, value A is chunk one, plus chunk two, plus chunk three. But value B is B plus A, B plus A, B plus A.
(00:22:37):
What that does is it implicitly multiplies the chunks by their position. And that makes you resistant to two-bit faults, because it is hard- Until you wrap around, you cannot get a two-bit fault, that can fool both the current sum, and the sum multiplied by its integer position in the chunk stream.
EW (00:22:58):
Okay.
PK (00:22:58):
Okay. So that is the deal with Fletcher checksum. Something I found out, is that some of the intuition about the modulus is wrong. So the difference between a Fletcher and an Adler, Fletcher is- I am just going to use eight bit because the numbers are small, but it scales.
(00:23:19):
If you just add mod 256, just integer add, it turns out you get some really nasty behavior on the top bit, because you throw away the carry-out. What you want to do, is you want to wrap around the carry-out, by doing mod 255, or ones' complement. It is basically ones' complement addition.
(00:23:36):
And so a Fletcher checksum uses ones' complement addition, on both the A sum and the B sum. I have seen, and actually there are standardized implementations that use mod 256, instead of mod 255, and they have horrible performance. Really important to use ones' complement.
EW (00:23:53):
Wait a minute. Why?
PK (00:23:58):
Because when you are adding up a bunch of integers- Remember I said that bit mixing is helping you, from the carries. If you just do eight bit integer adds, you throw away the carry-out, do you not?
EW (00:24:10):
Sure.
PK (00:24:12):
And throwing away the carry-out doubles the vulnerability to bit flips in the top bit. It turns out.
EW (00:24:19):
Okay. I kind of get that. Yeah.
PK (00:24:22):
And for the Fletcher checksum in particular, or a dual-sum checksum, that implicit multiplication starts losing information, if you do not wrap that carry bit back. So it becomes vulnerable. Basically it is instantly vulnerable to two-bit faults, if you use mod 256 instead of 255.
(00:24:41):
You do not get the hundred percent detection of two-bit faults. You lose it basically instantly. So it is terrible. And people make that mistake all the time. So these are the kind of things that are in the book like, "Do not make this mistake. And here is the graph that shows you why."
EW (00:24:57):
Sorry, I am just wondering how often I have made that mistake, and-
PK (00:25:00):
<laugh>
EW (00:25:01):
Whether or not I should go and fix it. I do not know that I would have...
PK (00:25:08):
Yeah. The thing is, there is no reason that anyone would even know this, unless someone did the analysis, showed them the picture. That is why I wrote the book. It is because there are so many people unknowingly making these things that seem, "Oh, it is no big deal." It is terrible, right? Unless someone shows you why, there is no way to know.
EW (00:25:29):
How much are you showing with the graphs, versus showing with the math?
PK (00:25:35):
Very little on the math <laugh>. I decided not to use an equation editor. This entire book was written without an equation editor. It is not like I have really ASCII art math, or something silly like that. It is basically the math is all on one line, with superscript and subscript type font, and that is it.
(00:25:58):
I did that on purpose, because there are all these math books. Some folks are fine with the math and that is great. Those books are already written. But some folks, and I am actually one of them, I look at the math and my eyes glaze over. I have to actually understand what is going on with the bits. Then I go back and see, "Oh, that is what the math was trying to tell me."
EW (00:26:20):
Yeah, there are many times when reading the code, is easier than reading the math.
PK (00:26:24):
I read the math, and I understand the math. That is great. But it is not how I think. This approach in this book is, "All right, here is the basics of the math, but I am not going to throw pages of math at you. I am actually going to run examples."
(00:26:37):
For the dual-sum checksum, where I said there is implicit multiplication because the way the addition works, I actually run an example, so you could see what I mean by that. I say, "This is added. This is added. This- And look at the end." At the end, there are four nines and there are three eights, and there are two sevens, and like, "Oh, look at that. It was multiplied by the position. Is that not cool?"
EW (00:27:00):
You have good use of color in the book. I liked that, because it is hard to follow these things.
PK (00:27:09):
<laugh> Oh, I am glad. I was wondering what you are going to say about that. The book is full of color, so it is a full color interior. Now why did I do that? I started, because trying to do monochrome graphs so you can understand is really painful, right?
(00:27:28):
And there are a lot of- I actually do not know how many pictures. Dozens. It depends what you mean by a picture at this point. I spent a lot of times, "I really want to make this explaining things," rather than just saying, "Look how smart I am, because I can do math." It is not. That is pointless for a book, or at least I think so.
(00:27:48):
So a lot of time explaining, a lot of pictures, a lot of graphs. And then the examples, I use color, so you can see where there is a five. "Where that five come from? Oh, that is a green five. And there is a green five. Guess what? Those are the same five moving through." So I used a lot of color on the interior, to try and make it easier. And then sometimes the word in the text is also green. It is like, "Oh, so those are all the same thing."
(00:28:12):
I tried to do that. Apparently it came out okay, because you like it. I was not sure. It was an experiment. But that is what I was trying to do, make it- I just found when I read it with the color coding, it was just so much easier for me to follow what was going on, and make sure it all lined up.
EW (00:28:27):
So much of CRCs especially is about following a bit through a pattern.
PK (00:28:36):
Yes. I did that with the color.
EW (00:28:39):
The time aspect.
PK (00:28:39):
Yes, it is. There is a step, and another step, and another step. I tried to use color to make it more obvious what was going on with the steps. That is right. I could not put video in. That is a whole different deal.
EW (00:28:52):
You do have a YouTube channel.
PK (00:28:53):
I do have a YouTube channel. One of the things I did, was I made a video showing how CRCs work. Which forgive me for staying on task, but you got the dual-sum checksums.
(00:29:09):
Then you have a Koopman checksum, which is this thing that it is actually a remainder after division in integers. Except I found a clever way to structure it, so it looks like an iterated addition. But it is actually doing long division and computing the modulus. It is kind of funky thing, but it works really well. It has much, much better fault detection than the dual-sum checksums. It is a little more expensive, but not as expensive as the CRC.
(00:29:39):
Then there is CRC. So Koopman checksums are our remainder after division in integers. CRCs are remainder after division in polynomials over Galois Field (two)[GF(2)], which is all the field theory I know. At least as much as I am going to admit.
EW (00:29:56):
Galois is my least favorite mathematician.
PK (00:30:00):
<laugh> I am not going to run down a field.
EW (00:30:01):
<laugh>
CW (00:30:02):
Understand abstract algebra is one of the coolest fields, and I do not understand why you hate it so much.
EW (00:30:06):
I do not hate it. I am reading a book about group theory. I am doing okay on that.
CW (00:30:12):
Yeah, but you hate reading it.
EW (00:30:13):
<laugh>
PK (00:30:14):
Good for you. That is not bedtime reading in my book, but for some people it is, and that is all good.
EW (00:30:24):
Oh, it is PopSci. It is not taxing.
PK (00:30:25):
So for those people, it is like, "What are they going on about Galois?" Think Boolean algebra. Galois field (two) for our purposes is Boolean algebra, and the rules are simple. And this is how CRCs work. Remember in grade school probably, you learned how to do long division? Some readers are still traumatically scarred from that, perhaps, but-
CW (00:30:48):
I think they do it a different way now that involves some-
EW (00:30:51):
Calculators?
CW (00:30:52):
No. There is the new math they teach, which has all sorts of techniques that we do not understand.
PK (00:30:57):
Hey, you know what? I had new math when I was a kid too. New math has been around a long time.
CW (00:31:02):
Okay. I guess I just said old math.
PK (00:31:06):
<laugh> I had old new math. Okay? <laugh>
CW (00:31:12):
You are right. Long division, you draw the little thing.
PK (00:31:14):
There is this thing called "long division." They used to teach it back when people dialed phones with rotary dials, and it made clicking sounds and all that other stuff. But if you look it up on Wikipedia, it is there. So you have long division.
(00:31:27):
CRCs are actually a clever hardware based idea. You can do it in software of course, but you are emulating some hardware that did long division in hardware. It was over Galois field (two). What that means is, instead of addition or subtraction, use XOR. That is it.
(00:31:47):
So you do long division and you say, "All right, I have got the dividend, and I got the divisor, and I put the divisor under the dividend. I do a subtraction and if it fits, I go with it. If it does not fit, I undo it, and I shift over one digit and do the next long division."
(00:32:03):
The only difference is you are doing- All the digits are zero or one. You are doing XOR instead of add, and also XOR instead of subtract. It is all XOR. And it is zeros and ones. And that is it. It is long division.
(00:32:19):
You do the long division of the ginormous data word, thousands of bits, by your eight bit or nine bit or ten bit, does not really care how many, or 16 bit divisor, which is the CRC polynomial. We will probably come back to that.
(00:32:33):
You just walk it across, you do the math and when you get a remainder, that is the result of the CRC calculation. That is the essence of CRCs. It is the remainder after long division, using the rule that addition and subtraction are XOR, and all the digits are binary.
EW (00:32:50):
Watching your video about this, it starts out and it shows the long division, which made sense to me. And this whole Galois (two) thing did not penetrate. I mean, that was unrelated.
PK (00:33:05):
It is just XOR.
EW (00:33:05):
XOR is just so simple once you see it.
CW (00:33:07):
You have to understand what fields are. Fields are pretty simple to understand. But anyway, if you do not know what a field is, then that is not going to make any sense.
EW (00:33:12):
It does not matter. You can just push that to the side. Then you redid it as the hardware would implement it. I think my brain exploded, at least a little bit. I am pretty sure that is what the leakage from my ear was.
PK (00:33:24):
Well, I started out as a CPU designer, so I think that way more normally, I guess. I do not know. <laugh> But the point, people have seen the shift, the shift in XOR registers. It is a bunch of bits, and there is a feedback, and it shifts in XORs.
(00:33:37):
The thing to take away is that that hardware mechanism is doing division with XORs. It is doing the division algorithm in hardware. So that shift in XOR is identical to long division. That is the thing to get out of it. It does not matter which way you do it, they are doing exactly the same computation.
(00:34:00):
It took me a long while to understand why that was really true. Part of the reason I made the video, was to give a graphical explanation to convince myself that no one was blowing smoke. In doing that, I realized that the CRC algorithms you are likely to download from the internet, do not actually work quite that way. So there you go. They almost work that way, but not quite.
EW (00:34:33):
So am I not supposed to be just downloading these?
PK (00:34:37):
You can download it from the internet. But if you try and do the polynomial division by hand, you are likely to get a different answer. In some circumstances. <laugh>
EW (00:34:49):
As long as both sides of my communication mechanism get the same answer, I am fine. Right?
PK (00:34:55):
That is right. It has to do with the initial seed value. Is the difference. The initial seed value is- If you do it pure division, you have to put the initial seed in front of the data. But no one wants to do that, because it wastes a lot of time. So they put the initial seed in in parallel.
(00:35:09):
The book has some details. But it turned out. I was trying to do the division and compare it to algorithms, and getting different answers. It took me a while to figure out that the software actually does a different thing.
(00:35:23):
Now the good news is, the different thing has exactly the same fault detection properties. But one of the things I do in the book, is I show you the math way, and then I show you the way people really do it. I give code examples for each one. If you download the code unit tests, the unit tests actually prove that they come up with the same answer, if you know what is going on.
EW (00:35:48):
I am going to take a little step back here, since we have gotten pretty far into the details. And ask a listener question.
PK (00:35:54):
Sure.
EW (00:35:55):
From Doug G, "What is the best practice? Do you put the CRC in the data to be CRC'd? And do you put something blank in, like FFF or 000? Or do you put the CRC outside your data block?"
PK (00:36:12):
Ah, okay. So there are a couple things going on there. There is whether or not you- Let us talk about the hardware shift register version, because some people find that easier to visualize. When you are computing a CRC, you have to reset the shift register, before you start feeding bits into it. You reset it to zero or you reset it to all ones.
(00:36:33):
What do you do with that? You can do anything you want and it does not affect the fault detection, other than a non-zero value makes it easy to see if a bunch of zeros have been stuck in front by accident. Which some communication networks have that failure mode. Then you run the computation. At the end, you have a value.
(00:36:53):
So you take this- You do initial value of whatever you want. Non-zero is probably better, but it is not a huge deal. You do not go back and rebuild your gizmo, just because you did not do that. And you crunch some data, and you come up with an answer you put next to it.
(00:37:07):
The question here is, when you are checking the CRC, what do you do? Well, the purely mathematical way in fact includes the check value in the computation. So you run the data through the shift register, and you run the check value through the shift register, and you check for zero at the end. If you are building hardware, you can do that.
(00:37:29):
If you are building software, it turns out that whether you shift a bunch more times to see if the result is zero, including the check value. Or you just do the data and compare it to the check value. It is the same computation. That makes no difference. So the answer is, "It makes no difference. Take your pick. There is no difference in fault detection."
EW (00:37:52):
So I can have data- My structure that has all of my data that I am serializing and sending over a pathway. I can either have this CRC on the inside, and set it to zero or all ones when I am doing the calculation. Or I can have it on the outside, and not set it at all.
PK (00:38:10):
It depends on which software algorithm you use. But the high-speed algorithms do not care what you set it to.
EW (00:38:18):
As long as I do the same on both sides.
PK (00:38:19):
As long as you do the same on both sides. But the high-speed algorithms actually do not care at all whether you pre-initialize the data in your code word or not. They do not care. They ignore it. There is some code in the book that shows how that works, and it explains why that is so.
CW (00:38:38):
Okay. I have a question. You mentioned at the top of the show, you do a lot of embedded networks, was what got you into this.
PK (00:38:45):
Yep.
CW (00:38:45):
Networking is primarily still TCP/IP, which has a checksum, which is fairly- I mean, it was established probably a very long time ago, I think. RFC-
EW (00:39:01):
With the Flintstones cars?
CW (00:39:02):
1000 or something. Is that checksum a good checksum? <laugh>
PK (00:39:08):
Not particularly. <laugh> I think you said one of your listeners wanted to know what is the worst protocol checksum.
EW (00:39:18):
Mm-hmm.
PK (00:39:18):
It is the internet checksum. The one that is in the most widespread use, that is the worst, is the internet checksum. Because it is a ones' complement addition. They could have done so much better, with very little extra compute cost. At least it is not a two's complement. That is a start, but it is- They could have done so much better.
CW (00:39:37):
Very little extra compute costs back when they were deciding it, was probably a little more than compared to now, but- <laugh>
EW (00:39:44):
They could have had none.
CW (00:39:45):
Well, there were proposals to change it, right? Was there not an alternate- I remember there was some alternate- They were going to-
PK (00:39:53):
There are some alternatives.
CW (00:39:53):
Switch to Fletcher or something.
PK (00:39:56):
Yeah. Switching to Fletcher would have been great. You can actually do better than Fletcher, it turns out. There is a thing called a "DualX checksum," which did not exist until I started writing this book. I only go into that one because that one is kind of cool. It is simple to explain.
(00:40:16):
When you are doing a Fletcher checksum- Let us say you are doing a 32 bit Fletcher checksum. You are running two 16 bit sums next to each other. You have the A sum that takes the data 16 bits at a time. And you have the B sum, which is just the sum of the A sums cycled back into the B.
(00:40:33):
So you are doing 16 bit additions. You may be doing it with a 32-bit CPU, but you have to do this pair of 16 bit additions each time. Then you have to do a mod operation to reduce it, to not lose the carry bit.
(00:40:49):
It turns out that if you run a Fletcher checksum 32 bits at a time, so you have a 16 bit addition, you add a 32 bit chunk in, and the you mod 65,535 for 16 bit to the k minus one. If you add in 32 bits and then do the mod, you can process the data twice as fast.
(00:41:19):
The cool part is Fletcher checksums give you, it is called "Hamming Distance 3." You can detect all one-bit and two-bit errors, and then there are some threes you do not get. If you do processing 32 bits at a time, not only are you faster, but you get that Hamming Distance 3 out to twice the length. So it is faster and it gives you better fault detection, by processing 32 bits at a time.
(00:41:46):
The book has an explanation for why it is, but it is like, "That is just amazing that it works that way." I have never heard anyone talk about that possibility, until I did some work last year, and sort of stumbled into this.
(00:41:59):
So they could have done that and it would run, it is like 1.5 times as much time. It is less. It is not even twice as slow. It is pretty close to the same speed as ones' complement, and gets you all two-bit faults. Which addition does not give you.
EW (00:42:21):
That is so unintuitive. And yet I have wandered around my life thinking, "All checksums and CRCs are pretty intuitive. You just make sure that the bits are what they are supposed to be." Now you are telling me that I do not need to spend a whole lot more processing, and a whole lot more communication data length, to get better effect?
PK (00:42:45):
That is right!
EW (00:42:46):
Looks like something for nothing. Very suspicious.
PK (00:42:49):
Shockingly close. Which is why- Did I mentioned the billions of simulations to make sure I did not make a mistake? <laugh>
EW (00:42:56):
<laugh>
PK (00:42:56):
I am like, "Really? Really? I am not going to believe this one, until I beat on it for a few weeks of CPU time." <laugh>
CW (00:43:04):
Is there- Were people- How do I ask this question? Is this a field that has been kind of ignored for a while? Because it is interesting to me, because- I say "because" a lot. Wow.
(00:43:15):
It is interesting to think how much computation power is going on all the time, computing checksums. Every single packet of data that is happening on the internet, is having its checksum checked. That is a lot of stuff. That is a lot of computing power devoted to just adding numbers and seeing if something went wrong.
PK (00:43:33):
Yep. Right.
EW (00:43:36):
And doing it badly, apparently.
CW (00:43:37):
Lately I have been thinking about efficiencies that we are just kind of ignoring in the world. From waste heat, to all these things. This sounds like an inefficiency that is fairly large. Although now it is all computed in custom ASICs, and maybe it does not matter that much to get any more efficiency, because-
PK (00:43:55):
Well, in embedded systems there is still a lot of software.
CW (00:43:57):
Right. Yeah.
PK (00:43:59):
But Christopher, it is actually worse. Because there are cases where you could do dramatically better, at zero cost. Those are the ones- That is why I got into this field. I found these places where, with zero cost whatsoever, other than everyone agreeing to change their protocol. But in terms of hardware-
CW (00:44:17):
Which is a big, big ask. <laugh>
PK (00:44:18):
Which is a big ask, but that is a social ask. It is not a CPU time ask. It turns out new protocols get invented all the time. So there is actually a lot at work here. You asked me what my least favorite polynomial was. That is the IEEE 802.3 polynomial, is the one I gave to you in hex notation, because it is actually pretty terrible.
(00:44:44):
People have known for decades you can do better. As you point out, it is hard to change that. But at the time they picked it, they picked for reasons, and that is fine. Because that was the best they knew, and that is fine.
(00:45:00):
But you can do so much better than that at no compute cost, just by changing a constant in your code, and boom, you are dramatically better. So that is what fascinated me about this topic. It is sort of what you are saying, is that you can do so much better for low cost or sometimes no cost, and you just have to have the knowledge to be able to do that.
EW (00:45:24):
What kind of witchery is this? It is mathematics, is it not?
PK (00:45:28):
<laugh> Well, it is kind of math, but it is kind of not. That is the part. So you asked if it is a broad field. There are very, very, very few people who are specialists. There are a lot of hobbyists, and a lot of people are sort of interested, which is cool. You would be surprised how often I get emails on this topic, from people who just think it is cool to get into this. And that is great. It is a fascinating area.
(00:45:51):
There are very few people who have really gotten to be pros. Most of them did so in previous decades, so they are not active anymore. They are certainly knowledgeable people around this in coding theory, to be sure. But the coding theory guys spend most of their time on other things, not on this. So there are not a lot of people working on it.
CW (00:46:12):
Right. Encryption is probably where the overlap in math is, and everyone will just go do that.
EW (00:46:19):
Yeah, you can make a lot more money <laugh> in encryption.
PK (00:46:22):
And that is fine. Yeah, this is a hobby. There is no big checksum lobby cutting me checks to- <laugh>
CW (00:46:29):
Are you in the pocket of big checksum?
PK (00:46:31):
I am not, sir. <laugh>
EW (00:46:33):
Well wait, actually that is a good question.
CW (00:46:36):
What! That one? No, I do not think so.
PK (00:46:37):
You are bringing back to congressional testimony that I do not want to relive, so we are just not going to go there. <laugh>
EW (00:46:45):
Why do this book? It has actually turned out to be way more interesting than I expected, for a 700 page book about CRCs.
PK (00:46:53):
Well, that is the Kindle version. The print one is half that size, just so nobody is-
EW (00:46:56):
Okay.
CW (00:46:57):
Yeah. See? Yeah.
PK (00:46:59):
So nobody is scared off. Okay?
CW (00:47:00):
Your book came up at 700 something on Kindle.
EW (00:47:02):
Oh really? So they are about the same.
PK (00:47:06):
It is like 370 something. It is under 400 printed.
EW (00:47:12):
It is a dense, detailed book. The introduction is really good. Then there are some sticky parts that definitely my brain had to go through multiple times to get, which- It is a hard problem. But you did not just do this for fun. How did you come about this?
PK (00:47:29):
Actually kind of, yeah. <laugh>
EW (00:47:30):
Oh, okay. Well, never mind. <laugh>
PK (00:47:35):
I have an odd sense of fun. What can I say? Let me give you the history of how I got sucked into this area. I did not wake up one day and say, "I think I am going to be the king of checksums." It was not my life plan when I started this. Please do not use the show title "King of Checksums." I do not want to have to-
CW (00:47:56):
You could tell she was typing that right now, could you not?
EW (00:47:58):
<laugh>
PK (00:47:58):
I know how her brain works. <laugh>
EW (00:48:00):
No. It was not going to be "King of Checksums." It was going to be, "I am going to be a King of Checksums."
PK (00:48:08):
<laugh> But that would be false, because it is not what I did. So what happened was, I was trying to get into embedded networks. Because I thought they were pretty cool. And there is a constant reinvention of embedded networks. So the thing about embedded networks, is everyone makes their own network.
CW (00:48:22):
Why?
PK (00:48:24):
It has been going on for decades. The reason is because they need optimizations that they cannot get. I am not talking reinventing the internet, that is a whole different thing. But you have got a little 8-bit micro, and you have got a sensor, and SPI does not quite do it- Although you should use it if you can. Or using SPI, but what is the protocol for transferring data on top of it, right?
CW (00:48:48):
Yes. I had this problem last week, and I made something stupid up. Yes.
PK (00:48:54):
Or using control area in network and you got these header bits, but what pattern do you use in the header bits? So embedded networks are continually reinvented, and that is not going away anytime soon. It just is.
(00:49:06):
Yes, you should use a standard network if one fits, but often it does not. Or sometimes you get folks like, "We need a better network for general aviation on top of CAN." There is a committee that is working on that, that I talk to once in a while. So this is going on.
(00:49:24):
I was contacted by some folks in both automotive and trains. It turns out there was a big checksum cartel in Switzerland-
CW (00:49:32):
<laugh>
EW (00:49:32):
<laugh>
PK (00:49:33):
Back in the nineties. <laugh> That ABB Research- This guy, I never got a chance to meet him, which is just saying. This guy, Dr. Funk, who actually did a lot of the CRC work, have found the optimum CRCs up to 16 bits and published it.
(00:49:52):
So he and this cabal of- I am having fun here, but this cabal of embedded network guys, were actually the folks who were behind a lot of the different networks. The automotive networks and the train networks. It is all the same handful of guys at ABB Research, near as I could tell.
(00:50:08):
They hired me out for a second opinion on a train control network. I also was doing automotive work, as part of the self-driving car stuff. So one of the things we did, was we just came up with Monte Carlo simulations, to see if the CRCs were really working well as they thought they were.
(00:50:29):
Monte Carlo, you take a random network data packet, and you throw some random corruption at it, and you see whether the checksum catches the fault or not. Rinse, repeat. Right?
EW (00:50:40):
Rinse, repeat a lot.
PK (00:50:41):
A lot! Yes. A lot.
CW (00:50:44):
There are a few permutations of a common tutorial problem.
PK (00:50:48):
On a spiky, noisy curve you do not need too many. If you want that curve to be- Some of the curves in the books are a little bit spiky, and that tells you how incredibly painful that one was to get to converge.
EW (00:50:59):
<laugh>
PK (00:50:59):
But most of them are smooth. Why my CPU is just- My desktop is an eight core, and I have an older eight core in the next room. They are just begging me for mercy at the end of this book. <laugh>
(00:51:14):
We did that, and I just got fascinated by the question of, "Well, is that the best one they could have used?" And the answer was, "The only way to know is to try them all." That one takes age of the universe is just getting started, if you do it the brute force way.
(00:51:37):
So the book has a section, which is probably what melted your brain Elecia, about how I sped that up by many orders of magnitude. It is okay for that to be tough sledding, but I thought I would write it down so someone else can replicate the work. Because this is the first time I have ever written down what I actually did. That is why it is there.
EW (00:51:55):
Cool.
PK (00:51:57):
That is how I got sucked into it. Then when TTTech was doing TTP, and the FlexRay Consortium was doing FlexRay, because they did not want to pay royalties, the TTP people, they all came to me and look at their checksums and their protocols. <laugh> It was pretty funny. I just got into being the CRC guy.
(00:52:16):
I got some funding from the FAA to work on CRCs and checksums, because they are like, "Well, is the airplane going to fall out of the sky because of the wrong checksum?" And it turns out nobody knew the answer. So I got a little tiny bit of research money from them.
(00:52:32):
I teamed up with some guys at Honeywell Aviation, and we co-authored- They led, but I co-authored, an aviation embedded network handbook for airplanes. This has really been my journey through embedded networks.
(00:52:44):
But while I was at it, the CRC checksum thing kept coming up. And coming up, and coming up. I was the guy who had acquired the knowledge over time to be able to answer it. So here I am decades later, I am at the university now, but I will not be there that much longer. Time to retire to do something else.
(00:53:03):
It is a good chance to just take all the stuff in my head and get it down on paper, so that it is there. Because my guess is ten years from now, 20 years from now, people are still going to want to know about checksums and CRCs. And it will be there for them to learn from my journey.
EW (00:53:20):
There is something really nice about a book, instead of trying to learn each piece from the internet. Having it laid out in order, with terminology that is consistent. Yes. I did not realize how much of it I knew, but did not know that I knew.
PK (00:53:41):
That is part of it. But writing the book is also a journey for the author. So the bunch of pieces I had not put together either, until I wrote the book. That was actually part of the satisfaction, was getting it straight in my own head. Then trying to help the reader get it straight in theirs.
EW (00:53:57):
Okay. You wrote that book, but you have written a couple of others, since you were last on the show.
PK (00:54:04):
Yeah, I guess I have been busy. That is fair.
EW (00:54:06):
"How Safe Is Safe Enough? Measuring and Predicting Autonomous Vehicle Safety," and "The UL 4600 Guidebook: What to Include in an Autonomous Vehicle Safety Case." So autonomous vehicles.
(00:54:22):
Osau wanted me to ask about this. "What is the state of autonomous driving? Where should it be heading, and where is it heading now?" Maybe pun intended, we are not sure. And basically just said that autonomous vehicles are more interesting than CRCs. But I am not as sure any longer because-
CW (00:54:42):
I am incredibly bored by autonomous vehicles. But yeah. <laugh>
EW (00:54:45):
I mean it is a hard problem, but I did not know CRCs were interesting at all.
PK (00:54:52):
If I convince you they are somewhat interesting, then that is a win. I think they are fascinating in a very geeky way. There is that.
EW (00:55:00):
They really are. That they need to be simulated via Monte Carlo, in order to find the best path is odd.
CW (00:55:07):
The thing that gets me is the lack of- Wrinkled my brain a little bit. The question about is the TCP/IP checksum a good checksum goes to that. It is like, "These are really important. It does not seem like people have really been thinking rigorously about them for a long time." So it makes me happy.
EW (00:55:27):
Did Greg Wilson say, "Maybe we should try testing this stuff, that we all agree is true"?
PK (00:55:34):
<laugh> That is fair. Let me go back. Another piece of why I wrote it. The reason I stayed with the research is I kept turning up- I was originally way back decades ago, I was going to write a book on embedded network protocols. Because I found an interesting subject. It is a way to focus your mind, start a research program.
(00:55:53):
I tried writing one, and what I found out was every time I said, "Oh, everyone says to do this and this. Is there a reason why that is a good idea?" The answer is "No. They are just making it up."
(00:56:06):
It is like, "Which CRC polynomials? This is the pattern of the feedback of the XORs. Which one should you use?" "Well, the one that is in the standard, and the ones that-" Some of the ones in the standard are just awful. And some of them are really good.
(00:56:20):
The ones that it came out from those ABB guys in Switzerland, they made good choices. The CAN polynomial is really good. The train communication network, those guys knew what they were doing. But that is the only place I have gone where I have seen impressive results. The rest has just been, it is like "Really? You decide to use that one. Are you kidding me?"
CW (00:56:39):
Parenthetically, this is a similar thing that bugs me about machine learning.
EW (00:56:42):
<laugh>
PK (00:56:42):
<laugh>
CW (00:56:45):
Because there is a lot of stuff-
PK (00:56:47):
Getting back to self-driving cars.
CW (00:56:48):
There is a lot of stuff in deep neural networks and machine learning, where it is just like-
PK (00:56:52):
Hand waving.
CW (00:56:52):
"Well, we need an activation function for these things to make it non-linear. Let us try this." And they try that. It is like, "Oh, it works all right." And then they just stick with it. It is like, "Wait a minute!" <laugh>
PK (00:57:03):
So there is an important difference there. Okay. I am an engineer, not a mathematician. So empirical evidence works for me. I do not have to have a proof. Oh, we never really answered this question. We are diving all over the place, but hopefully the listeners are okay with this.
(00:57:19):
The reason I say math is not enough, is when you are talking about detecting all two-bit errors and three-bit errors with the CRC, you can prove that mathematically. Although I give an intuitive bit motion based argument, rather than the math in my book. Then I point to the proofs.
(00:57:34):
But once it is detecting all four-bit errors and five-bit errors, which you need for safety critical, the math does not help you. The only way to find one that detects all five-bit errors, is to try them all, all possible five-bit errors, and see how it turns out.
CW (00:57:49):
That is fine. I am okay with empirical.
PK (00:57:51):
My results are that. Yeah. I am okay with empirical.
CW (00:57:51):
I am not saying you need to give first principles. But stopping and resting on laurels, is what bugs me. <laugh> It is like, "Maybe there is something better!"
PK (00:58:00):
Well, the difference is historically there was not enough compute power to try all five-bit errors.
CW (00:58:05):
Yeah, that is fair.
PK (00:58:06):
So the work I have been doing in the last ten years or whatever, is actually found algorithmic speed-ups that I can actually guarantee. No, really, I have looked at every single possible five-bit fault over 64k bits. No, really I have done it; the book talks about this. No, really I have done it. Even though if you do the math, it is not possible.
(00:58:27):
The book explains the algorithmic speed-ups that let you do it. I can prove that this has it, as opposed to, "Ah, it looks okay. Let us go with it." If you empirically have shown by brute force evaluation or something unquestionable, then I am okay not having the theory.
(00:58:46):
But you are right that what has really happened is people have just sort of said, "Ah, this looks okay to me," and, "We are using it, because they used it." There is no rigorous evaluation. It just becomes folklore. Yeah, that is why I got into this, because I kept seeing all this folklore and stuff like numerical recipes. You guys know numerical recipes? "Numerical Recipes in C," version two.
CW (00:59:11):
I have in C and Fortran.
EW (00:59:11):
<laugh>
PK (00:59:13):
Yes. So second edition has a typographical error in one of the polynomials for their CRC chapter.
CW (00:59:20):
And people have used that for thousands of years. <laugh>
PK (00:59:22):
That is right.
CW (00:59:22):
<laugh>
PK (00:59:22):
And if you look at the third edition, there is a little footnote thanking me for correcting them on that. <laugh> So that is the kind of stuff that bugs me. I was just finding the stuff all over the place. And it is not like-
CW (00:59:39):
What are you supposed to do as a software developer? You are going to look at that, and you are not going to know. There is no way for you to know that that is wrong.
PK (00:59:44):
That is right. So I decided it was a worthwhile use of my time, sort of as a hobby. I have got a little bit of research funding, but trust me, it has mostly been a hobby. Like I said, there is no big checksum- <laugh>
CW (00:59:59):
Checksum World Tour.
PK (01:00:01):
It is not the Checksum World Tour. I am very appreciative of the sponsors who did give me support when they could, but a lot of it has been my own time, which is fine. I am not complaining. It is great. But that is why I call it a hobby.
(01:00:12):
Part of my hobby was- It has been decades and people have just been getting it wrong on folklore. Why do not we write down the right answers, so people thou know the right answer. Simple as that.
EW (01:00:27):
I thought people knew the right answer, and I was just using their code.
CW (01:00:32):
Yeah, exactly.
PK (01:00:34):
In fairness, they thought they knew the right answer too.
EW (01:00:37):
But only because that was what they learned too.
PK (01:00:39):
Exactly.
CW (01:00:40):
Well, who is to fault? Who is the original person at fault here? Let us go find them. <laugh>
PK (01:00:45):
No, there is no fault. There is a bunch of engineers who had to make decisions on limited knowledge.
CW (01:00:48):
I know. Yeah.
PK (01:00:49):
And the knowledge has changed and the decisions have not.
CW (01:00:53):
That is very typical of a lot of things, is it not?
PK (01:00:55):
Is it not.
CW (01:00:57):
Speaking of which, we would be remiss if we did not answer Osau's actual question.
PK (01:01:04):
Now you have to remind me.
EW (01:01:04):
Oh, are we going back to self-driving cars?
CW (01:01:05):
Well, you asked the question, so you should probably let him answer it.
EW (01:01:10):
"What is the state of autonomous driving? Where should it be heading, and where is it heading now?" I know you have some very good YouTube videos on this question.
PK (01:01:21):
The answer is look at my YouTube channel, but I am happy to give the short answer.
EW (01:01:23):
Sure.
PK (01:01:23):
The short answer is the industry is selling on "We are busy saving lives." But the reality is nobody really knows how that is going to turn out. I did not say they are not, but clearly there is not evidence there. There are five, ten million miles of experience, fatality rates once per a hundred million miles, round numbers.
(01:01:43):
Bunch more of tens of millions of miles to go, before we know how it turns out. Now the industry would say, "We are really confident it will be fine." It is like, "Well, yeah, but there are some assumptions in the models that I do not happen to agree with, and we will see." Maybe they are right, maybe they are wrong. I do not know.
CW (01:02:01):
Some of the people who are part of that industry are not particularly, as you will not say, trustworthy.
EW (01:02:07):
Well, 16 year olds are also very confident about their driving capabilities.
CW (01:02:12):
I was not. I tended to get on the freeway the wrong direction, so I knew I was a screw up.
EW (01:02:15):
<laugh>
PK (01:02:16):
That is fair. But the hundred million includes the drunks and the teenagers.
CW (01:02:19):
Exactly.
EW (01:02:19):
Yeah.
PK (01:02:19):
<laugh> And that is where we are setting the bar, right? To be clear. So we are going to see how it turns out. I know some people think I do not like the technology. Well, if I did not like the technology, why would I have written an international standard that I wrote?
(01:02:37):
Technically, I originated it. Which means I dumped a couple hundred pages of standard into a process, and then industry helped clean it up. Then it was a real standard, consensus standard. I invest a lot of time in that. Why would I have written two books on this topic, if I did not like it. I like the technology. What I do not like is dangerous technology. And there is a big difference.
CW (01:03:00):
It does feel like to me that there was a great deal of hype for- What year is it now? 2024. My God! There was a great deal of hype, say seven years ago. Six, seven, eight years ago. Everybody was like, "This is just going to take off. What are we going to do with all these truck drivers who cannot drive? If all are going to lose their jobs."
EW (01:03:20):
I remember hearing that. Yes.
CW (01:03:22):
"It is going to affect the economy in all these amazing ways." Then it just plateaued as, "Well, it is pretty good at lane keeping. And as long as you pay attention, you can relax when you are commuting, a little bit more."
EW (01:03:34):
But as soon as you have a construction zone-
CW (01:03:35):
And that is where it kind of stopped.
EW (01:03:36):
Or an emergency vehicle, it all just goes kablooey.
PK (01:03:40):
I am not going to weigh in on that part. It is all fair. I am not going to weigh in on that. Let me give you high level perspective. I have been doing self-driving cars as long as I have been doing checksums and CRCs. So I started in, I think it was '97.
(01:03:54):
So in 1995, Carnegie Mellon had a vehicle Navlab 5 that went across the US. There is a segment to DC to Pittsburgh. But Pittsburgh to San Diego, 98 point something percent hands off the wheel. In 1995, 98% hands off the wheel. The alums of that project, all love to say, "And ever since we have been working on that last 2%." <laugh>
CW (01:04:21):
Yeah. And that probably was not using DNNs, because that was impossible back then. It was totally different technology.
PK (01:04:28):
Yeah. It was a long time ago. It was a laptop and a camera doing optical flow, is my recollection. That is good enough to keep you on the line 98% of the time. It is the 2% that is a bear.
CW (01:04:40):
<laugh> When you are alive 98% of the time, and dead 2%. Well, that is a real problem.
PK (01:04:45):
Safety lives at 99.9999999. <laugh> Statistics struggle to get you that many 99.999% without problems. Statistics struggle with that many nines, and machine learning is a statistical approach. So how do you think that is going to work out? That is why it is so hard.
CW (01:05:07):
I think that is something people do not get about machine learning, especially today, even with LLMs and stuff. It is all statistics. All the problems inherent in statistics are there in greater magnitude, because you are doing billions of things. <laugh>
PK (01:05:21):
If 90% works for you, you are great. If 99% works for you, it is a challenge, but maybe they can pull it out. But airplane falling out of the sky, there is an official FA number for that. Airplane falls out of the sky once every billion with a "b" hours, ten to minus nine per hour. That is per hour. There are a lot of seconds in an hour.
(01:05:38):
Statistics does not really get you there. You need something else.
EW (01:05:45):
You cannot just do simulation.
PK (01:05:48):
No, you cannot simulate enough. That is right. Well, that is why- So for CRCs- I said I did Monte Carlo simulation for the other checksums and stuff. But for CRCs, I actually evaluated every single bit pattern, because simulations will not find the one bit.
(01:06:05):
Let us say 32k bits, taken five in a time. How many combinations of that? And there is exactly one of them with an undetected fault. You are not going to find that with Monte Carlo simulation, now are you?
CW (01:06:22):
Check back after the heat death of the universe.
PK (01:06:24):
That is right. I actually do that calculation. It is an impressively large number in the book. That is why I switched over to, "No, no, we checked them all." That is the only way to know. That is it. You got to check them all.
CW (01:06:36):
Like Pokemon.
PK (01:06:38):
Yes. Oh, okay. So yes, informally. The folks who go out and test, and something goes wrong and they fix it. And they test. And they go out. It is like they are going out playing Pokemon Go, and trying to collect them all. The problem is there are an infinite number of Pokemon.
CW (01:06:51):
<laugh>
PK (01:06:51):
That is why it takes so long for self-driving cars to get scalable. It is as simple as that. But in CRCs- Elecia is writing that down, I know.
EW (01:07:04):
<laugh>
PK (01:07:04):
In CRCs, <laugh> the results at the table in the appendix, of these are what I think are the best ones, or at least really good ones. Those are based on, "No, really. I caught all the Pokemon."
CW (01:07:20):
All right.
PK (01:07:21):
She is writing that down as a title too. I am having so much fun with this.
CW (01:07:23):
No, we have got so many titles now, we just going to-
EW (01:07:28):
We are pretty much out of time. Phil, do you have any thoughts you would like to leave us with?
PK (01:07:33):
I will tell you, really at the high level, let me say this. A lot of this is about engineering. So what I found with the experience of the book, was that math only got me part of the way through. The rest of it had to be engineering and other techniques, that were not just mathematical analysis. That is part one.
(01:07:55):
Part two is everyone wants to know what is the best CRC. And the answer is, "It depends." But more generally- Maybe a checksum is what you need, because one of the graphs in the book is the speed versus time trade-off. If this is how much CPU power you have, and this is how much data you have, this is the algorithm you should use to get the best bang for buck.
(01:08:15):
So if you think you know the optimal answer to a general problem, probably you do not understand the question. Engineering is never about the one optimal answer. Engineering is almost always about trade-offs. The one thing I try and do this book, but also when I teach, when I do research, is to help people understand the trade-offs. So they can make an informed choice, and pick the answer that is right for them.
EW (01:08:43):
If you would like to pick up Philip Koopman's book about "Understanding Checksums and Cyclic Redundancy Checks," it is out in all forms.
(01:08:53):
Our guest has been Philip Koopman, professor at Carnegie Mellon University. He is also the author of two books about autonomous vehicle safety, and the classic "Better Embedded System Software." He also has three blogs, one about each subject here, checksums and CRCs central, safe autonomy, and better embedded software.
CW (01:09:14):
Thanks, Philip. This was really quite entertaining.
PK (01:09:17):
Thanks for having me on. You know what? I never thought I would be on a podcast for checksums and CRCs.
CW (01:09:21):
<laugh> I know. It is amazing. <laugh>
PK (01:09:22):
So thanks for making my day. <laugh>
EW (01:09:26):
Thank you to Christopher for producing and co-hosting. Thank you to our Patreon listener Slack group for questions. And of course, thank you for listening. You can always contact us at show@embedded.fm or hit the contact link on embedded.fm.
(01:09:39):
Do not forget that there is a whole page of show notes, which will include links to Phil's blogs and his books. And now a quote to leave you with, from- Well, let us just go with Koopman. "Safety is not about working right most of the time. Safety is about the rare cases where it does not work properly."