451: From Concept to Launch

Transcript from 451: From Concept to Launch with Phillip Johnston, Tyler Hoffman, Noah Pendleton, and Elecia White.

EW (00:00:06):

Hello and welcome to Embedded. I am Elecia White, and this week we have a special episode for you. Tyler Hoffman and Phillip Johnston and I were on a panel for Memfault, talking about "From Concept to Launch: What it Takes to Build and Ship a Device." This is all the software you forgot about, when you were developing your product. Take a listen, and watch Memfault's newsletter for more panels like this.

NP (00:00:34):

Welcome everyone to our quarterly embedded panel. This is a really great panel we have. As always, we have got an amazing topic to cover as well. We are going to dive right into it. To kick us off, Elecia is going to describe what the topic is for today's panel.

EW (00:00:49):

We are going to talk about "From Concept to Launch: What it Takes to Build and Ship a Device." This is all the software that you maybe forgot to write, the manufacturing software, a little bit of the cloud software that interfaces with your device. I am excited to have Tyler and Phillip join me, because they are going to be experts in this area as well. So, those things you forgot to do. This is the new list.

NP (00:01:18):

Amazing. Thank you very much. Now we are going to go through a quick round of introductions. Although, for folks who are joining us from previous sessions, they will probably know these faces. Have we got anyone today? Elecia, do you want to introduce yourself first?

EW (00:01:30):

Sure. My name is Elecia White. I host the Embedded podcast, where we talk about all sorts of embedded topics, mostly with guests, and we find out maybe more about people than technology. I also have Logical Elegance, as a consulting firm, that I run with my partner. And I wrote a book, called "Making Embedded Systems," from O'Reilly. I have taught a class by the same name, for Classpert. So yes, this is where I live. This is my field.

NP (00:02:08):

Amazing. Yeah, Embedded.fm is one of my favorite of all time podcasts. It is great to have you here today. Phil, do you want to introduce yourself?

PJ (00:02:16):

Hi everyone, I am Phillip Johnston. I am the founder of Embedded Artistry. I am an embedded systems consultant and educator. I have been doing this for about 13 years now, and I have shipped a lot of products in that time. So I have seen and experienced all the things you forget about, multiple times. So it will be fun to get into that with this crew.

NP (00:02:38):

Awesome. And Tyler, could you introduce yourself?

TH (00:02:41):

Of course. Yeah. I am Tyler, one of the co-founders of Memfault. I am not a consultant. Full-time employee over here. I got my start in firmware at a company called Pebble. We were making smartwatches, had a few mil or a couple of million of those in the field. I just found myself constantly working on tools, and then doubled down on that at Fitbit, when I was a firmware engineer there. I was building a lot of tools, a lot of the things that we will be talking about here. Did some work on the firmware, of course, as a firmware engineer, but it was not my focus. I am excited to talk to the group here. Noah, do you get to introduce yourself too?

NP (00:03:16):

Sure, why not? My name is Noah Pendleton. I am moderating today's discussion, so you will not hear too much from me, lucky for you guys. Yeah, I am a firmware engineer by trade, so I have been doing it a long time. Currently an employee at Memfault, working with Tyler every day, which is basically a dream come true. It has been a good career for me.

(00:03:31):

All right. Well thanks everyone for introductions. Let us dive into our first group question. Just a little bit of background on this format, for folks who are joining us for the first time. We do a couple of group questions, and a couple of individual questions, just to keep the discussion flowing. And yeah, keep those Q & As coming in as well. Super fun for us to answer those questions from you.

(00:03:53):

All right, so, first group question for us is, we are going to talk about some fundamentals in this sort of background in shipping an IoT product. So the question is, what pieces need to be in place when you want to have your device being ready for release to manufacturing?

TH (00:04:12):

So one of the things, of course, you need a rock solid bootloader. You need a way to update your devices. If you are even thinking about getting into manufacturing, then you have to have the ability to fix things. You have to have a bootloader, you have to have an OTA.

(00:04:30):

Then my favorite thing as well is we go through manufacturing, and leading up to it, and have bugs in the file system or other various things. I always strive to have a factory reset. I think we did not do factory reset correctly on maybe Pebble v1, but on v2 and v3 it was the most reliable, most robust. It cleared almost everything <laugh>. There was no usage of any sort of file system or anything that could be corrupted. It was the most brutal factory reset, but it was the thing that saved us probably many, many times at the end of the day. Those are a couple of my recommendations to you.

EW (00:05:11):

I would say you need to have manufacturing software. The difference between making one prototype or ten engineering prototypes, and making a thousand or a million, is huge. You cannot just fix it on your desk anymore. You have to make it so that someone else can program it, can test it, can make sure the hardware is good. And then do all of the things that we are going to talk about, with respect to does this unit connect to the right cloud? So there is a lot more that goes on.

PJ (00:05:54):

I think I will throw in maybe something even more fundamental. Before you start manufacturing your devices, you really need to have a clear idea of what "done" looks like for your system, and you really want to be there. You do not want to do what I have seen so many times, which is, we are really rushing forward on producing the hardware, so we can make our September deadline to start producing units so we can be in stores by Christmas.

(00:06:19):

But then your software team is actually another six or eight months out. So all you are doing is spending money to put things in a warehouse. You could have used that time refining your hardware, or figuring out various problems that you are glossing over, because you wanted to start producing units as early as possible.

EW (00:06:42):

But that is how we get, as customers, devices that as soon as you open them, you have to update their firmware maybe three or four times.

PJ (00:06:50):

Right. Which I do not enjoy <laugh>. It is probably my least favorite part about buying a new product.

EW (00:06:55):

It is not a good customer look.

TH (00:06:59):

The thing that I hear so many people forget about, with manufacturing in general, is that there is no internet connection, or very little internet connection, in the factory itself. We have had a number of customers that are like, "Oh, our device requires that we contact a server."

(00:07:14):

Or even maybe from the beginning, their debug flow requires an LTE connection or a cloud component, and they do not actually build the local CLI debugging experience. So they get to the factory assembly line and everything is broken <laugh> or completely dysfunctional. Have an offline mode <laugh>.

PJ (00:07:37):

Yeah, we forget that a lot of these factories are in places too, where there might not be great connectivity for you to rely on. Let alone the fact that you are dealing with some other company's network security now, and trying to get your stuff out. How are you going to put a dedicated antenna just for you on top of their factory?

(00:07:56):

It definitely is something that you do run into a lot, that can really throw a wrench in the works. If you banked on something that you just cannot do. Or it is going to take you a year to get your manufacturer to actually take care of that.

EW (00:08:11):

And do not forget the problem that if you do have a wireless device in your office, maybe you have 15 working. But now in manufacturing you are building a thousand an hour, and 200 of them are on at a time. They cannot talk to anything, because they are trashing each other's network.

(00:08:36):

All the manufacturing pieces that I have seen, I have seen that happen. Where people suddenly have enough devices that everyone in their office has one, or they are building them in manufacturing, and suddenly the device no longer works. You cannot see why. It is not telling you that it is broken.

(00:08:54):

The software that used to work, that sometimes works when everybody goes home, it suddenly does not work very often. It is not consistent, it is crunchy and hard to use. That is a manufacturing problem that comes up a lot.

NP (00:09:13):

Electrical engineers would have- They would talk about "design for manufacturing," which is more about, "Okay, how do we actually build this thing?" But from the firmware side, I think that sometimes gets forgotten. Like, you had mentioned what your manufacturing software looks like, and that is a pretty important piece in the puzzle.

EW (00:09:32):

And it is something firmware engineers end up doing, because no one else can. Who else is going to write the, "Oh, why do you not blink green if all of your hardware is in place?" That is not usually the manufacturing engineer's job. They are working on getting their manufacturing line to be efficient, not just to get it up for the first time.

TH (00:10:00):

I do not actually know if it is all too common. One of the things I loved at Pebble at least, is even if it was the form factor board, we still had a fixture that gave us every debugging utility that we could possibly want, as if it was a development board. And so it had all the pins, had JTAG, had serial <laugh>. It behaved as if it was a normal development board, which I never found the ones at Fitbit for the fixtures to allow us to do that.

(00:10:24):

But cracked open a lot of watches, had to break through the glue. Thankfully we added screws later on, because that was a pain. The ability to open the board, or the sealed unit, was incredibly important. <laugh>

PJ (00:10:40):

I think that-

EW (00:10:41):

In this- Oh, go ahead Phillip.

PJ (00:10:43):

I was going to take this to a slightly different direction, in that you mentioned cracking open units, and figuring out what is going wrong. I think another thing that is easy to forget about, is we are selling devices to customers. Some of those devices are going to break in the field, or not work, or have some performance characteristic that we do not understand, that makes the experience bad.

(00:11:01):

And so you do need to have a repair flow, and the ability for your team to actually investigate these units, and figure out what is going wrong. You need to be able to do all the things you might do at the factory, at your office or some other place where you are performing these repairs, so you can send them back out to a customer. Or you could put it into a refurbished unit box, that you are selling at a discount, or something like that. That is something that is critically important.

(00:11:27):

And also, how are you going to handle your customer support needs? Somebody needs to be able to contact you to file an issue. You need to be able to keep track of all this stuff, as it is going through the various steps in the process. All of that needs to be designed and handled.

EW (00:11:42):

Well then, usually you do not want it to be you. You are saying, like "at your desk." No, really what you want is to write your documentation well enough, that you do not need to be involved. Somebody else, who is not an engineer, can do all of the preliminary, so you only get the interesting bugs.

PJ (00:11:59):

That is right. If you have done your job right.

NP (00:12:05):

You mentioned something interesting there, Phillip, about the out of box or RMA side of things. Is that something you have experienced, with developing the piece of firmware that enables that?

PJ (00:12:16):

I have used the same manufacturing test software to help set up a repair line that can be- It is like a smaller factory, essentially. I have been involved in that process, or with early field FA, when you are having all the engineers look at the first hundred thousand product returns that come back in, and actually do that triaging yourselves, and trying to figure out what the factory process is. I have spent quite a lot of time dealing with that, but usually I just use the same manufacturing software, if at all possible, for that.

TH (00:12:53):

When you are debugging the first hundred devices that come back from the field, are you just constantly updating those devices too, to add more logs? Did you add enough to begin with? I am imagining- I was never in that position, but I am imagining at Pebble, we would just constantly update those watches. Like, every couple hours be like, "Oh, let us add a log line here," because you are trying to figure out how the hardware is failing. There probably are hardware bugs. <laugh>

PJ (00:13:17):

Yeah, it depends on the actual problem. I have done that. I have also been in the case where whatever firmware we had was good enough to get the information, and it was clearly we had a factory escape, or some other issue that happened, and that was not required. I have definitely done both. It is more difficult though if you have blown the JTAG fuses and you cannot actually easily connect to the unit to debug. But that is not always the case, thankfully.

EW (00:13:45):

Well sometimes those first beta units, the ones that do not usually go outside the building, but do go to people who are not engineers, will not blow those fuses for that reason. But once you do, it does get a lot harder to debug. In the first hundred, or first thousand, units out, if they report back to a cloud server, I have had them aggregate why they reset, and then chase down the boot causes.
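
As a rough illustration of aggregating boot causes on the cloud side, here is a minimal Python sketch. The report format, the reason names, the `EXPECTED` set, and the `summarize_reboots` helper are all hypothetical and do not come from the panel or from any particular product.

```python
from collections import Counter

# Hypothetical reboot reports as a cloud service might receive them.
# The field names and reason strings are illustrative assumptions.
reports = [
    {"serial": "A001", "reason": "user_shutdown"},
    {"serial": "A002", "reason": "watchdog"},
    {"serial": "A002", "reason": "watchdog"},
    {"serial": "A003", "reason": "assert"},
    {"serial": "A004", "reason": "battery_low"},
    {"serial": "A005", "reason": "watchdog"},
]

# Reasons treated as expected behavior rather than faults (an assumption).
EXPECTED = {"user_shutdown", "battery_low", "firmware_update"}


def summarize_reboots(reports):
    """Count reboots per reason and separate out the unexpected ones."""
    counts = Counter(r["reason"] for r in reports)
    unexpected = {k: v for k, v in counts.items() if k not in EXPECTED}
    return counts, unexpected


counts, unexpected = summarize_reboots(reports)

# List the unexpected causes, most frequent first, to decide what to chase.
for reason, n in sorted(unexpected.items(), key=lambda kv: -kv[1]):
    print(f"{reason}: {n}")
```

Sorting by frequency is just one way to prioritize; a single rare fault on many different serial numbers may matter more than a frequent one on a single unit.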

(00:14:22):

In fact, there was one in the early days of Fitbit, where I actually called the person and said, "Okay, at three o'clock your Fitbit did some weird reboot, and I do not understand. So do you know what happened?" The answer was really surprising. That was when the Fitbit went into the dryer.

TH (00:14:48):

<laugh> Covered by warranty, right? Just RMA the unit out next time <laugh>.

EW (00:14:52):

That was still internal release, but yes.

TH (00:14:55):

Got it.

EW (00:14:56):

That was not a bug that I was going to spend a lot of time chasing down, then <laugh>.

NP (00:15:03):

Amazing. What a use case. I guess it did not count too many steps during that, huh <laugh>?

EW (00:15:08):

No, actually it had <laugh>. I mean, the washing cycle...

NP (00:15:11):

Oh right <laugh>. That is great. Yeah, this has been good stuff. We are talking a lot about the manufacturing side of pieces, or the ops side of things. How does that differ from what you need to have ready on the device, prior to it landing in a customer's hands, rather than just hitting the manufacturing line?

EW (00:15:31):

For all that we hate updating devices as customers, that has to be rock solid. That is the piece that you cannot do without. But I know you guys have already talked about OTA. What else would you answer?

PJ (00:15:51):

You would certainly have other backend servers that need to be up and running, whether your device is checking in for remote monitoring, or you have some kind of IoT backend that you are dealing with. Right? That is a whole secondary software system that you are building, that has to be in place and functional and tested to actually make your device work.

(00:16:12):

Same with your phone applications, or desktop applications, or your web interface. However you are engaging with the device, that needs to be completed and ready to go too. I have certainly seen phone apps delay hardware product ship dates, because firmware is ready, the product has been built and sitting in a warehouse. But we did not finish the iOS app on time, or some other problem is gating that. So all the various pieces that go into building and managing a fleet of devices, have to be ready together to make that work.

TH (00:16:52):

And then trying to find the most stable way to update. Sorry. <muted sneeze>

EW (00:16:59):

Bless you.

TH (00:17:01):

Thank you. Trying to find the most stable way to update the device. At Pebble, the factory firmware, the firmware that the customers received never changed. And I think the only time- Maybe it changed once, because as time went on, Android phones, some of them, some of the random ones, could no longer actually do the initial firmware update on the device, because their Android Bluetooth stack was actually so bad <laugh>. Or broken in weird ways, that we actually had to add some workarounds to the factory firmware.

(00:17:31):

But the fix for that, for those customers- We had a wide test plan, but the fix for those customers was like, "Go borrow your friend's iPhone, download the Pebble app, or install the firmware." The newest firmware has all the fixes and the workarounds that we had. But that was truly the fix. It is finding the way that is the most stable over time, to basically reset or update the device, is something that I never thought about either. The firmware is stable, it is great, but it does not mean the things that you connect to are going to be stable.

NP (00:18:07):

Yeah, that is quite challenging in these fast moving IoT deployments. That is a really good point. All right, I think we are going to move on to our first individual question. So Tyler, you are in the hot seat first.

TH (00:18:20):

Yes!

NP (00:18:20):

So <laugh>, your question is, I think you are going to like this one too. For customer support flow, how do we get enough information out of our customers, to make your life easier? Especially when you are wildly successful, and you have thousands of devices hitting customers' hands.

TH (00:18:36):

Make sure to get their phone number, <laugh> so that you can call them if anything crazy goes on <laugh>. I think there is so much to touch on here. So yeah, at the very, very first thing like beta testing, with some customers that are not engineers or do not really know how to use the device, I think is incredibly important. And at that point, as long as the device is storing information on the device, and you can retrieve those devices, that is probably good enough so that you can like start debugging things.

(00:19:14):

These beta customers are usually cheerleaders of the company. I know at Pebble we had maybe a hundred or so trusted people, who were like, "Sure, I will have a beta unit, if it means I can use the product early." But their devices are crashing constantly, and they wore two watches instead of one, because they needed to have a watch that worked too.

(00:19:36):

We tried to store as many logs on those devices, knowing that we may replace them sooner, even if the flash chip burned out. That is fine for us in those cases. Getting a way to retrieve the data off that device in some manner, is also incredibly important. At Pebble our flow for that- We had a mobile app which connected to the device. Our flow for that was if you clicked "report a bug" in the mobile app, it would then pull off the logs from the device. It was not just an automated fashion, but it was an on-demand sort of thing.

(00:20:09):

And then over time, we pulled off more information from the device. We added new Bluetooth endpoints to ask the device like, "What firmware version are you running?" What are some metrics that are key, rather than just pulling off the logs. And then, of course, we built the whole system to pull it automatically, and then automate it, and collect it all. But that was probably the most important thing. One, the "report a bug" flow, making sure a customer has some way to pull data off the device.

(00:20:37):

Two, my favorite feature was, one of my favorite features of any of the internal releases at Pebble was, when the device crashed or experienced a known issue, we could basically pop up a banner that told the customer, "Please go into the mobile app and report a bug."

(00:20:56):

And so we could do that, like in special situations, maybe we are really trying to track a hardware bug for a user. We do not want to just collect their stuff automatically, but we are going to tell them, "Please file a bug <laugh>." That was good for the various employees at Pebble, legal team, people who did not really know a lot about the engineering side, but they at least knew how to report a bug.

(00:21:20):

That was because the other company I worked for, did not have that banner. And so what actually happened was a lot of bugs just went unnoticed, until a customer noticed. So even internally, we had hundreds of employees wearing the devices, all the crashes went unnoticed, because the device did not even have a bootloader, usually. Because it was like, "Oh, if you have a bootloader, then people are going to notice it is rebooting <laugh>."

(00:21:44):

Having the banner was really good for stabilizing the firmware internally. Some people got really annoyed by it, because they had to report a bug every hour. But, such is life.

PJ (00:21:57):

Part of what you are being paid for.

TH (00:21:59):

<laugh>. It is true. All I do is report bugs. I use the watch. I use it in various different environments. I click; the button mash. Every so often I do that, for trying to force crashes. It is all actually quite fun. Yeah. Go ahead.

NP (00:22:20):

Oh, I was just going to say, nice. Yeah, those would definitely be the two top of my list, for sure. Make it easy for people to report things. And then the question you have to ask yourself is, "Do you leave that banner enabled, for real customers?" <laugh>

TH (00:22:33):

We chose not to, because we were pretty confident at that point, that the device is not crashing. We had enough metrics, we had enough data to know. We would not really ship a firmware out, unless it crashed less than once every seven days on average. So that was fine. If it crashed every so often, people would notice it was crashing. But we live in the world of IoT, and that is fine.

(00:22:59):

No one was doing mission critical stuff with their Pebble smartwatch at the time. I want to believe that Fitbit, I think, was a 14 day average, before shipping. I want to say it was a little bit higher bar. Our clientele at Pebble was hackers and developers, so they were probably fine with it.
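
The release bar described here (an average of at least seven days between crashes, or fourteen for the stricter case) can be sketched as a simple calculation. This is a minimal Python sketch under assumptions: the function names and the fleet numbers are hypothetical, and the panel does not say how either company actually computed the figure.

```python
def mean_days_between_crashes(device_days, crash_count):
    """Average days of runtime per crash across a test fleet."""
    if crash_count == 0:
        return float("inf")  # no crashes observed in the window
    return device_days / crash_count


def meets_release_bar(device_days, crash_count, min_days):
    """True when the fleet averages at least min_days between crashes."""
    return mean_days_between_crashes(device_days, crash_count) >= min_days


# Hypothetical beta fleet: 100 devices observed for 14 days, 150 crashes.
device_days = 100 * 14
crashes = 150

print(meets_release_bar(device_days, crashes, min_days=7))
print(meets_release_bar(device_days, crashes, min_days=14))
```

With these made-up numbers the fleet averages a little over nine days between crashes, so it clears a seven day bar but not a fourteen day one.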

(00:23:21):

But we did not leave it enabled. No. We, of course, left almost everything enabled in terms of the "report a bug." Collected a bunch of data from the device, and then allowed engineers to really quickly figure out the actual issue that was going on. That was core dumps, metrics and logs. So that was great.

(00:23:38):

I think the other thing that was important at the time for Pebble- One thing we did collect automatically, was just raw metrics from the device. Numbers, battery life, battery drain, CPU usage, LCD backlight usage, because that also played into battery. A bunch of things around battery, and connectivity.

(00:24:00):

The support team could pull that dashboard up and see, over the last ten days, what were the metrics on the device looking like, so that they could help. For customer support, it could help paint a picture of what was going on on the device. For engineers, we could see if there was a weird regression, or if the accelerometer was stuck on, due to a new bug that we had never seen before. Or if the Bluetooth radio was being used a lot, and that was causing the [battery to] drain.

(00:24:29):

There were a lot of metrics that we would continue to add, firmware release by firmware release, and that was actually kind of fun. Debug devices remotely in production, using only numbers. You do a lot of creative things.

NP (00:24:45):

That is great. Yeah, we could do certainly a whole webinar, series of webinars probably, on that.

TH (00:24:52):

Just collecting some form of number. I think Elecia had touched on it during our- Even our chat before. But it is, collect some vital metric or heartbeat or ping or just know a device is alive, and collect maybe a reboot reason, if that is the minimal case. Know why the device rebooted. If the devices are rebooting due to the user shutting it down, that is one thing. But if it is due to a fault or an assert, it is another thing.

NP (00:25:19):

Nice. Thank you. That is a super great answer. I love all of it. All right, so next up we have Phil for an individual question. So your question is, "What are the basics that we need to think about for manufacturing tests?"

PJ (00:25:32):

Yeah, it is dangerous to get me started on this topic, because I could go on for a long time. We touched on manufacturing firmware, obviously that is essential. We do not usually want to use our customer software for manufacturing tests, for a number of reasons.

(00:25:49):

One is, our customer software is often doing things autonomously, or in response to events. Say you are building a camera, and you want to press a button, and that is going to start a recording or stop a recording. That is not really behavior I want to have happen on the manufacturing line. So I need firmware that is not really doing anything, unless it is instructed to, and it is only doing what it is instructed to. That way, I get a deterministic environment that I can actually use for testing.
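
The idea of firmware that only acts when instructed can be modeled as a command table with no autonomous behavior. The following Python sketch is a toy model, not real firmware: the `FactoryShell` class, the command names, and the response strings are all hypothetical.

```python
class FactoryShell:
    """Toy model of test firmware that acts only on explicit commands."""

    def __init__(self):
        self.recording = False
        # Command table: the only behaviors this firmware will perform.
        self.commands = {
            "record_start": self._record_start,
            "record_stop": self._record_stop,
            "status": self._status,
        }

    def on_button_press(self):
        # Customer firmware might toggle recording on a button press.
        # Here the event is deliberately ignored so the fixture stays in control.
        return "ignored"

    def handle(self, line):
        """Run one command line and return a response string."""
        handler = self.commands.get(line.strip())
        if handler is None:
            return "ERR unknown command"
        return handler()

    def _record_start(self):
        self.recording = True
        return "OK"

    def _record_stop(self):
        self.recording = False
        return "OK"

    def _status(self):
        return "recording" if self.recording else "idle"


shell = FactoryShell()
shell.on_button_press()            # no state change
print(shell.handle("status"))
print(shell.handle("record_start"))
print(shell.handle("status"))
```

Because state changes only through `handle`, a test fixture that sends the same command sequence should see the same responses every time.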

(00:26:16):

And you also need functionality, that you probably do not want your customers to have access to in their firmware. Whether that is just for the possibility of something going wrong, or for somebody nefarious trying to poke around your system. So you might add extra capabilities, just for the purposes of manufacturing tests, that you do not need in your customer firmware.

(00:26:38):

And there are other things you need to think about. Like, you need usually the ability to set up the device's initial configuration, and write all that critical information to some kind of non-volatile area in flash. So, your manufacturing firmware probably is going to have the smarts to unlock that region of flash, to write to it. But you do not want your customer firmware to have the ability to do that, so you can guarantee that in whatever process is happening, that region of flash is going to stay locked, and my factory written information will remain valid.
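
To make the locked-region idea concrete, here is a small Python simulation. It is only a model under assumptions: the `ProvisioningRegion` class and its methods are hypothetical, and on real hardware the lock would typically be enforced by the flash controller or the bootloader, which this sketch does not attempt to represent.

```python
class ProvisioningRegion:
    """Simulated flash region that refuses writes unless unlocked."""

    def __init__(self, size):
        self._data = bytearray(b"\xff" * size)  # erased flash reads as 0xFF
        self._locked = True

    def unlock(self):
        self._locked = False

    def lock(self):
        self._locked = True

    def write(self, offset, payload):
        if self._locked:
            raise PermissionError("provisioning region is locked")
        if offset < 0 or offset + len(payload) > len(self._data):
            raise ValueError("write outside region")
        self._data[offset:offset + len(payload)] = payload

    def read(self, offset, length):
        return bytes(self._data[offset:offset + length])


region = ProvisioningRegion(64)

# Manufacturing firmware path: unlock, write, then relock.
region.unlock()
region.write(0, b"SN12345678")
region.lock()

# Customer firmware path: it never calls unlock(), so the write is refused.
try:
    region.write(0, b"tampered!!")
except PermissionError as err:
    print("write refused:", err)

print(region.read(0, 10))
```

The point of the split is that the unlock capability exists only in the manufacturing image, so nothing the customer firmware does can overwrite the factory-written data.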

(00:27:12):

So, it is a second application you are writing, essentially. You can reuse a lot of what you are doing for your customer facing application, but they will diverge pretty heavily at some point.

(00:27:23):

Following that, you need to know how to test your product, and that is unique to every product. It depends on what you are doing, how complex it is, what you need to check at the factory.

(00:27:34):

But I find that every product has a pretty standard manufacturing flow, with the same basic requirements. And you can build on that basic flow. So, you have manufactured some PCBs, you need to put your manufacturing software on it, right? You need to be able to do what we call "provisioning," which is writing out critical information your device needs. So PCB serial numbers, final device serial numbers, MAC addresses for your radios, security keys for code signing, or authentication with your server, whatever it might be. That information is going to be written at the factory. And so you are going to need processes for doing that. And usually that happens alongside the flashing step.
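
A provisioning record of the kind described above could be packed into a fixed binary layout with a checksum, so the device can tell whether what it reads back is intact. This Python sketch is illustrative only: the field sizes, the layout, the use of CRC32, and the function names are assumptions rather than anything stated in the panel.

```python
import struct
import zlib

# Hypothetical layout: 16-byte serial, 6-byte MAC, 32-byte key,
# followed by a little-endian CRC32 over those fields.
RECORD_FORMAT = "<16s6s32s"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)


def pack_record(serial, mac, key):
    """Build the bytes that the factory would write to the device."""
    if len(serial) > 16 or len(mac) != 6 or len(key) != 32:
        raise ValueError("field has the wrong length")
    body = struct.pack(RECORD_FORMAT, serial.encode("ascii"), mac, key)
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_record(blob):
    """Validate the checksum and return (serial, mac, key)."""
    if len(blob) != RECORD_SIZE + 4:
        raise ValueError("provisioning record has the wrong size")
    body = blob[:RECORD_SIZE]
    (crc,) = struct.unpack("<I", blob[RECORD_SIZE:])
    if zlib.crc32(body) != crc:
        raise ValueError("provisioning record is corrupt")
    serial, mac, key = struct.unpack(RECORD_FORMAT, body)
    return serial.rstrip(b"\x00").decode("ascii"), mac, key


# Placeholder values; a real line would pull these from a factory database.
blob = pack_record("SN0001", bytes.fromhex("001122334455"), bytes(range(32)))
serial, mac, key = unpack_record(blob)
print(serial, mac.hex())
```

A CRC only detects accidental corruption. It says nothing about whether the record is authentic, which is a separate problem from the one this sketch addresses.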

(00:28:18):

You are going to want to test your PCBs, to make sure that there are no short circuits, open connections, there are no defective components, before you go through all of the effort of assembling that into a finished device.

(00:28:31):

Once you have assembled a device, you actually need to make sure that assembly went well, right? So you are going to have some tests that will run through all the basic functionality checks, to make sure your assembled device works well.

(00:28:46):

You might do some calibration steps. Say you are building a camera again, right? I am going to do some color calibration on all my cameras, so that when I record video, I am getting as close to the same colors out of all my cameras as possible.

(00:29:01):

And then at the end of the line, you are going to need to flash your customer software on it, and potentially put your device into a shipping state, if that is relevant. You might, for example, open a battery FET, to make sure you are not losing charge while you are just sitting in a box in a warehouse. So your customer actually receives a unit that they can start using immediately.

(00:29:22):

So from all of that, if you can hit all of those steps, you have got a basic manufacturing process that you can incrementally build on over time.
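
The flow just described can be sketched as an ordered list of stations that stops at the first failure. This is a minimal Python model under assumptions: each station function, the `unit` dictionary, and its keys are hypothetical stand-ins for real fixtures and test software.

```python
def flash_factory_firmware(unit):
    unit["firmware"] = "factory"
    return True


def provision(unit):
    unit["serial"] = "SN0001"  # placeholder value
    return True


def pcb_test(unit):
    # Stand-in for checks such as shorts, opens, and defective parts.
    return not unit.get("short_circuit", False)


def functional_test(unit):
    # Stand-in for the post-assembly functionality checks.
    return unit.get("assembled", False)


def calibrate(unit):
    unit["calibrated"] = True
    return True


def flash_customer_firmware(unit):
    unit["firmware"] = "customer"
    return True


def enter_ship_mode(unit):
    unit["ship_mode"] = True
    return True


STATIONS = [
    flash_factory_firmware,
    provision,
    pcb_test,
    functional_test,
    calibrate,
    flash_customer_firmware,
    enter_ship_mode,
]


def run_line(unit, stations=STATIONS):
    """Run stations in order and stop at the first failure."""
    log = []
    for step in stations:
        ok = step(unit)
        log.append((step.__name__, ok))
        if not ok:
            return False, log
    return True, log


good = {"assembled": True}
bad = {"assembled": True, "short_circuit": True}
print(run_line(good)[0], good["firmware"])
print(run_line(bad))
```

Keeping the stations in a plain list makes it easy to add or reorder steps as the line matures, which matches the idea of building on a basic flow incrementally.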

NP (00:29:35):

Nice. Thank you Phillip. As usual, very, very thorough answer. Love it <laugh>. One follow up question on that, that I wanted to ask was, this out-of-box ship mode situation. Since that is part of the manufacturing test image, would you say that in general you would recommend leaving your manufacturing test commands or whatever on the device, while it is going to the warehouse? Or would you say erase that and put in some alternate image?

PJ (00:30:02):

Usually the last step is to flash the customer firmware, which probably is not going to have those manufacturing interfaces on there. You could, if you have a bootloader and you are just going to go through an OTA process anyway, maybe you could skip that step. I do not think I have ever done that though. I think usually there is a known good, we want to ship the units with this customer firmware that we have tested, and make sure that we can update from, and we know it is reliable, just as a starting point.

NP (00:30:32):

Nice. Thank you. All right. Our next individual question is for Elecia. What are some strategies that can be employed, to store keys, credentials, securely on devices that need to connect to a network, or send data to a server?

EW (00:30:45):

Security. Well, I decided that I could not do this without slides. So I am going to start with an incredibly brief introduction to public-private key encryption. I took these images from Wikipedia. It is a great introduction, so go there. But hopefully I am just going to remind you of a few things you already know.

(00:31:11):

Public-private key encryption is a really good way to start your encryption journey. You decide you want to have security, so you do this key generation thing, where you get a private key and a public key. It does not matter what these are. Both are pretty valuable, although the private key is more valuable.

(00:31:36):

If somebody wants to talk to Alice who has the key, they take her public key and they use that to encrypt their data, and they send it to Alice, and Alice can use her private key to decrypt it. Note that the public key does not decrypt data. It can only encrypt data.

(00:31:54):

So if you are thinking about an embedded device, this is the TX line. This is the transmit line, and you are going to need another set of keys. You are going to need Bob's keys to go the other way. You can only go one way per set of keys. So here we have Bob talking to Alice, using Alice's public key, and Alice is decrypting using her private key. Go to the next slide please. Oh, up, that one.

(00:32:23):

Now that image with Bob's transmit is on the left. And on the right we have the TX and RX, where Alice has her private key and Bob's public key, and Bob has Alice's public key and Bob's private key. The thing with public-private key encryption like RSA, is that it is a pain. It is computationally intensive, it is slow, blah, blah, blah.

(00:32:51):

So you do not use these when you are communicating usually. What you do is you combine them in super secret ways, possibly with some other information, like the time of day. And then you have a shared secret, and you use that shared secret to do some form of encryption that is much simpler.
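
A minimal sketch of that idea, not something shown in the panel: each side combines its own private key with the other side's public key, mixes in a shared piece of extra information, and ends up with the same session key for a cheaper symmetric cipher. It assumes a Diffie-Hellman style exchange, which the discussion does not name, and the prime `P`, generator `G`, and the `make_keypair` and `session_key` helpers are hypothetical toy values that are far too weak for real use.

```python
import hashlib
import hmac
import secrets

# Toy parameters for illustration only. Real products use vetted groups or curves.
P = 2**127 - 1  # a Mersenne prime, far too small for real security
G = 3


def make_keypair():
    """Return (private, public) for a toy Diffie-Hellman exchange."""
    private = secrets.randbelow(P - 3) + 2  # random value in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def session_key(my_private, their_public, nonce):
    """Combine my private key, their public key, and a nonce into a 32-byte key."""
    shared = pow(their_public, my_private, P)
    shared_bytes = shared.to_bytes(16, "big")  # P fits in 16 bytes
    return hmac.new(nonce, shared_bytes, hashlib.sha256).digest()


alice_priv, alice_pub = make_keypair()
bob_priv, bob_pub = make_keypair()

# The "other information" both sides agree on, such as a timestamp.
nonce = b"2023-05-17T10:00Z"

alice_key = session_key(alice_priv, bob_pub, nonce)
bob_key = session_key(bob_priv, alice_pub, nonce)
print(alice_key == bob_key)
```

Both sides arrive at the same 32 bytes without ever sending a private key over the wire, and changing the nonce changes the session key.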

(00:33:10):

Slides are going backwards now. Uh, go down one more. So when we talk about the keys on the device, we have the device's private key and the cloud's public key. That means the device can decrypt anything sent to it with its private key, and it can encrypt anything sent to the cloud.

(00:33:32):

Why are we bothering to encrypt things? There are two main reasons for that. One, you want to make sure that the information you are getting is from who you think it is. So signatures. If, for example, Pebble took over Fitbit's device keys, it got these keys that live on the device, they could fake, they could spoof, and then get onto the Fitbit servers. And I know we are using these companies because Tyler has been involved with them, but let us stick with it. Then the Pebble users can use the Fitbit apps, and Fitbit probably does not want to support that. So the signature piece is a big piece. And going the other way, you want to make sure that your firmware updates come from your cloud, and not somebody else.

(00:34:32):

And then the second thing we want to do after signatures, is actual what we think of as encryption, which is protect it from anybody else reading the data. If you are doing a medical device, you definitely want to stay away from HIPAA, and all of the things involved with keeping patient data private. Fitbit did this too, because patient data such as exercise habits should be kept private.

(00:34:58):

So, we have signatures and encryption for our public-private key thing, that shares the secret so they can talk to each other via simpler mechanisms. Where do you put these? Phillip mentioned that you probably want to put them in in manufacturing, and you probably need a serial number to go with them. Okay, so we have a serial number, we have some keys, we compile them into the code, it will all be fine. If we ever need to update them, we will do the OTA and update them that way.

(00:35:32):

Well, what if, and hear me out here, what if some bad person, attacker, script kiddie, some interested party, a reverse engineer, hacker, whatever, they say, "This is a really interesting application, I want to know more." Well, no device is perfectly secure, once people have physical access.

(00:36:00):

Whether it is sanding down the chip to read out the code, or figuring out that there is a secret debug serial port, or forgetting to blow the flash fuses so that people can read your code out with the right tools, it does not matter. What you need is to make the process of breaking your code more expensive than anybody wants to pay for it.

(00:36:29):

That does not get you away from things like script kiddies, who do it for amusement. But it does make it so that you are not professionally attacked. The amount of effort you put into this, and it is continuing effort, it is not just once, should be a reflection of how much money your company will lose, if the data goes public.

(00:36:55):

Okay, so you can compile them into the code, and somebody can then crack them. Well now they have keys to everything. They can send data as a device. So you say, "Okay, I do not want it to be that simple. In manufacturing, I am going to put them on an external SPI flash." Okay, well then you get people like me, who can read your SPI flash as soon as they have a board.

(00:37:19):

"Okay, I will put it on the internal flash." Okay. That is not bad. And some chips have features that make that very difficult to read. Most of them are imperfect. But then you get things like the ChipWhisperer, that does magical things that will tell you these things just by sitting next to it. There are external crypto chips. The choice you make here is really about what you need to do as a company for security things.

(00:37:51):

So the next slide. Let us say they get broken. If you are using the same key for every device, all devices now are broken. But instead you can do a per device key. And that is far more secure. It means that every device has their own set of keys. If you break the keys for that device, you can only spoof that device. You cannot read traffic for other devices, or from other devices.

(00:38:26):

Except that is a huge pain. You talk about manufacturing software, every unit now has a serial number and its private key. And if you are really being fancy, maybe it has a cloud's public key as well. An individualized cloud's public key, and that will let you make sure that if anybody cracks a device, they only get that device. They cannot build a whole army of new Pebbles invading Fitbit servers.
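
To make the per-device point concrete, here is a rough sketch under loud assumptions: it uses a symmetric HMAC key per serial number as a stand-in for the private and public key pairs discussed, and the serial numbers and the `provision`, `sign`, and `cloud_accepts` helpers are all hypothetical. What it shows is the containment property: a key cracked out of one unit cannot be used to speak for another.

```python
import hashlib
import hmac
import secrets


def provision(serials):
    """Factory step: give every serial number its own random 32-byte key."""
    return {serial: secrets.token_bytes(32) for serial in serials}


def sign(device_key, message):
    """Device side: tag a message with the device's own key."""
    return hmac.new(device_key, message, hashlib.sha256).digest()


def cloud_accepts(registry, serial, message, tag):
    """Cloud side: accept only if the tag matches the key on file for that serial."""
    key = registry.get(serial)
    if key is None:
        return False
    return hmac.compare_digest(sign(key, message), tag)


registry = provision(["SN0001", "SN0002"])

# Pretend an attacker extracted the key from unit SN0001.
stolen_key = registry["SN0001"]
msg = b"steps=1234"

print(cloud_accepts(registry, "SN0001", msg, sign(stolen_key, msg)))  # the cracked unit
print(cloud_accepts(registry, "SN0002", msg, sign(stolen_key, msg)))  # any other unit
```

The cost is the one described above: the factory has to write a unique record into every unit, and the cloud has to hold the matching registry.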

(00:39:01):

One more slide. The other side of this equation are the keys that live in the cloud. Whether you do each device has its own key, or any device has its own private and public key from the cloud, does not matter. These are secret. These are very secret. Once people have these, they can pretend to be your cloud. They can take your devices, they can take your firmware, they can do anything they want. This is not the sort of thing you leave in your office drawer. This is the sort of thing that should be in a safe.

(00:39:43):

And yet we need them to actually do our work. We cannot talk to the devices without them. And this is not necessarily part of the discussion. It is just really important to understand that all of this security stuff, the easiest way to break it, is people. And so you need to keep them locked, and do not check them into GitHub. I know you forgot that one time. Now you have to change all of them.

(00:40:07):

Yeah, be aware of how the security is going to affect your manufacturing, because it will. Whether it is just you protecting your company, protecting your customers, protecting your trademarks, all of the stuff is part of the manufacturing process, that you should not wait until the end to think of.

(00:40:32):

Okay. I am sorry that was a little more prepared than you probably wanted, but what do you think Phillip?

PJ (00:40:37):

Well, I have a question. To me it seems like the most challenging part of what you described, would actually be getting the- How do I exchange data with my CM? And how do I prevent my CM from having access to all my keys? Especially given the fact that we talked about, we do not often have network connectivity between our CM site and our office. So how do you typically manage that?

EW (00:41:03):

The companies who have the most to lose, people who may have military contracts, or HIPAA violation issues, end up doing their final manufacturing in wherever domestic is.

PJ (00:41:19):

Right.

EW (00:41:19):

So if it is in the US, they do it here. That means the unit gets almost fully manufactured, may even be in its case, or may even be in its packaging. And then that last bit happens in the company. As for not having connection, well, you do not actually need to. You can send a database of keys over, and then they get programmed in. The manufacturer says, "We are done with these keys." You load them to your cloud, and it just goes in that cycle.

PJ (00:41:56):

And you just sent them half the dump. You did not send them all the information.

EW (00:41:59):

Right. I mean that is the beauty of the public keys, is that everybody can actually know the public keys. You do not really want to spread them around, but it is possible. Yeah, you do not have to send them everything.

(00:42:16):

You can also have devices that will find their own private keys. The device itself will tell the CM what its public key is, and the CM knows the public key for the cloud. So nobody but the device ever knows its private key. That is kind of mentally challenging, that your device is going to go out and make its own security. But it is one of the best ways to do it.

PJ (00:42:54):

Thanks. That was a great answer. And I just want to underscore for the topic of this panel, this is clearly not something you can cowboy at the last minute, right? A lot of thought has to go into how you are securing your devices and managing that. This is the kind of thing that if you wait until the last minute to think about it and do it, you are not going to do it well, or you are going to delay the ship date of your product, right? As we see security regulation coming down upon us, it is certainly something you cannot ignore.

EW (00:43:25):

But do look to your chip vendors. Like OTA, a lot of these things are becoming more standard, and becoming less of something you need to do yourself.

NP (00:43:36):

Yeah, that is a great comment. Buy off the shelf. Someone solved this problem <laugh>. We should do a webinar on security. I am realizing that <laugh>, that will be something TBD <laugh>. Thank you so much Elecia. That was an amazing, amazing answer.

(00:43:50):

All right, so next up we have got a group question for the panel. So that is going to be more a broad one, but what we really want to see is any common pitfalls or hurdles that you see in general when shipping IoT devices, when you are shipping these products out.

EW (00:44:09):

Bricks. Bricks are the most fun. Where you mess up OTA, and suddenly maybe the manufacturer has built a thousand of these, and the OTA does not work. So they are all just trash unless you unbox them. That is heartbreaking.

PJ (00:44:32):

There is somewhere in a factory in China, a crate of 5,000 iPhones that it was my fault for bricking <laugh>, we just never dealt with. So yeah, it is painful.

EW (00:44:44):

Let us see. 5,000 times, how much does an iPhone cost?

PJ (00:44:48):

<laugh> Yeah, I have not done this math <laugh>.

EW (00:44:51):

No.

PJ (00:44:52):

I do not think it would be my most expensive mistake, though. I will say that I think one of the biggest challenges that we face, is the fact that we have to interface between software teams and hardware teams, who have totally different jobs. And what you often see with embedded devices, is that orgs are going to be dominated by one or the other, right?

(00:45:18):

You are going to have your head of whatever you call it for the device itself, is going to be a mechanical engineer or an electrical engineer, who is going to be focused on the physical side. Or you are going to have somebody who comes from a software background, and they have this great idea that requires a product, but they have no idea what it takes to build a product.

(00:45:37):

When you are in these situations, and for example you are really an expert in hardware, but you have no idea of all the external software pieces that are required to make your device function, you are just not going to think about it. It is not going to be in your schedule. It is going to be a surprise. You are going to have continual slips as you learn this. And then vice versa, right?

(00:45:57):

If you are a software person, you do not realize that you need to get your FCC certification, and your Bluetooth radio has to be certified, because you did not pick a pre-certified module. Or you forget that you actually need to figure out how to provision all this information at a factory, right? It is the same thing.

(00:46:13):

You are just going to not know what you have to do. And you are going to be surprised. And it is going to be very, very painful when you have months of delays, because of these critical pieces that you cannot overlook, and you cannot just shoehorn in at the last minute.

TH (00:46:30):

I have definitely used the smart device, that has the same name for every single device in the Bluetooth pairing screen. Then you have to literally walk down the street to pair the thing, and then you walk back <laugh>. We also had that at Pebble one time. That was fun. <laugh>

(00:46:49):

I think one of the things at Fitbit- Honestly maybe, Noah you remember this. Shipping one of our devices, we were displeased with how slow we were provisioning data, and running manufacturing tests. So one guy, one firmware engineer, his full-time job for one or two months, was building a web application that talked locally to a fleet, or a farm, of Raspberry Pis, that would then run the manufacturing tests.

(00:47:13):

It was totally out of the blue. Was not what he was planning on doing <laugh>. Because we could not literally produce enough of these devices quickly enough, if we had just run the normal manufacturing line. We had to have this automated system, that was honestly a huge project.

NP (00:47:37):

Yeah. Those types of things are very sneaky. You do not think about, "Okay, it takes me 45 seconds to generate a private key on the device, and get the flash set up, and then put it into ship mode." Whatever. That is 45 seconds.

TH (00:47:48):

And I need a million of these by Black Friday, you know?

NP (00:47:50):

Yeah. Then the time. You cannot even do that provisioning, right <laugh>?
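
The arithmetic behind that worry is quick to work out. The 45 seconds and the million units come from the discussion; the 30-day window and 20 running hours per day are assumptions picked purely for illustration.

```python
import math

SECONDS_PER_UNIT = 45      # provisioning time per unit, from the discussion
UNITS_NEEDED = 1_000_000   # units needed, from the discussion
DAYS_AVAILABLE = 30        # assumption: one month of production
HOURS_PER_DAY = 20         # assumption: each station runs 20 hours a day


def stations_required(units, seconds_per_unit, days, hours_per_day):
    """Number of parallel stations needed to finish within the window."""
    total_station_hours = units * seconds_per_unit / 3600
    hours_per_station = days * hours_per_day
    return math.ceil(total_station_hours / hours_per_station)


total_hours = UNITS_NEEDED * SECONDS_PER_UNIT / 3600
stations = stations_required(
    UNITS_NEEDED, SECONDS_PER_UNIT, DAYS_AVAILABLE, HOURS_PER_DAY
)
print(f"{total_hours:.0f} station-hours, {stations} stations")
# prints "12500 station-hours, 21 stations"
```

That is 21 stations doing nothing but this one 45-second step, before counting retests or any other station on the line.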

PJ (00:47:56):

Yeah, that is something I really appreciate from working at Apple. I do not remember the UPH we had to hit, but I think it was something like a million. So time was very important, because you cannot fill a factory floor with just one test station. There were hundreds.

TH (00:48:14):

On hardware stage rollout.

PJ (00:48:18):

I think on the manufacturing front and the speed front, something that I see a lot is people forget about manufacturing tests between development builds. You go to a build event for your- You are producing engineering prototypes, you create your manufacturing tests, your firmware does some things. And then you are back in the office. It is in between events. You are changing your firmware, you are adding new features, you are rewriting commands, things like that.

(00:48:46):

Then you go to build more units. You have changed everything, so your firmware does not work, or your firmware and your test scripts are incompatible. And now there is a last minute scramble to get those up to date. And you have not been validating your manufacturing firmware in between builds. So now there are bugs, and your retest rate is high, which is causing you to spend twice the amount of time to get units through the production line.

(00:49:09):

You cannot just ignore manufacturing firmware if you are not doing a build event. It needs to be something that you include in your CI pipelines that you are- In fact on that point, it is a very easy way to get into hardware-in-the-loop testing and automation. Because you could for example on every build, flash a couple devices, maybe hooked up to a tester or even just running the commands. And make sure that your manufacturing test scripts still work, they execute in the expected amount of time, things like that.

(00:49:39):

You need to do that validation anyway. Right? You cannot release buggy, crashy builds to customers. You cannot release buggy, crashy builds to your factory either. So it is something to keep in mind, because I see that happen a lot, that people just forget about them.
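
A CI job along those lines can be very small. The sketch below is only an illustration: `FakeDevice` stands in for a real serial or debug-probe connection, and the `mfg ...` command names, expected responses, and time budgets are all hypothetical.

```python
import time


class FakeDevice:
    """Stand-in for a real connection to a unit on the bench (hypothetical)."""

    def send(self, command):
        responses = {
            "mfg version": "OK 1.4.2",
            "mfg selftest": "OK",
            "mfg ship_mode": "OK",
        }
        return responses.get(command, "ERR unknown command")


# (command, expected response prefix, time budget in seconds), all hypothetical
MFG_STEPS = [
    ("mfg version", "OK", 0.5),
    ("mfg selftest", "OK", 2.0),
    ("mfg ship_mode", "OK", 0.5),
]


def run_mfg_checks(device, steps):
    """Return a list of failures; an empty list means scripts and firmware agree."""
    failures = []
    for command, expected_prefix, budget_s in steps:
        start = time.monotonic()
        response = device.send(command)
        elapsed = time.monotonic() - start
        if not response.startswith(expected_prefix):
            failures.append(f"{command}: unexpected response {response!r}")
        if elapsed > budget_s:
            failures.append(f"{command}: took {elapsed:.2f}s, budget {budget_s}s")
    return failures


failures = run_mfg_checks(FakeDevice(), MFG_STEPS)
print(failures)
```

Run on every build, a check like this flags a renamed command or a step that has become too slow long before the next build event.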

TH (00:50:01):

How many firmware builds do you typically have, throughout the whole process? Do you have a release, a debug, manufacturing, inbox firmware. Like it is almost four different ones.

PJ (00:50:15):

Do you count bootloader as a different one?

TH (00:50:17):

Depends on if the bootloader is a micro image or not. The full-fledged image. I think at Pebble we had four.

PJ (00:50:26):

Yeah. I would say three to four is pretty common.

TH (00:50:29):

Dang. Yeah.

PJ (00:50:34):

It can get more complicated I guess, if you are changing your keying strategy based on whether you are a development unit or a production unit too. Or any other details, like if you have a totally separate development backend environment, which you should.

TH (00:50:50):

Yeah.

PJ (00:50:50):

Then you might hard code things, or depending on how you are handling that, you might now multiply your configs. Which points out that that stuff should probably be configurable information on your device, and not a variant if you could help it.
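
One way to picture "configurable information, not a variant" is a single firmware image that reads its environment from provisioned data. This is a sketch under assumptions: the field names, the URLs, and the choice to default to production are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceConfig:
    backend_url: str
    unit_type: str


# Values used when nothing was provisioned (hypothetical).
DEFAULTS = {
    "backend_url": "https://prod.example.invalid",
    "unit_type": "production",
}


def load_config(provisioned):
    """Merge provisioned key/value pairs over the defaults; ignore unknown keys."""
    merged = {**DEFAULTS, **provisioned}
    return DeviceConfig(
        backend_url=merged["backend_url"],
        unit_type=merged["unit_type"],
    )


# The same firmware build, two differently provisioned units.
prod_unit = load_config({})
dev_unit = load_config(
    {"backend_url": "https://dev.example.invalid", "unit_type": "development"}
)
print(prod_unit.backend_url, dev_unit.backend_url)
```

The build matrix stays at one image, and pointing a unit at the development backend becomes a provisioning step instead of a separate firmware.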

NP (00:51:07):

And that is often something that you learn the hard way as you are realizing <laugh>.

PJ (00:51:13):

Yeah, let us do development and testing and debugging on our production backend. That does not usually work out very well.

NP (00:51:19):

<laugh> Something is going to break, inevitably. <laugh>

PJ (00:51:21):

Yep.

NP (00:51:21):

Awesome. That was great. Thank you guys very much. I think we are going to jump into a few audience questions now, unless anyone else had some more items they wanted to talk about for the common hurdles section.

TH (00:51:34):

Let us do questions.

NP (00:51:38):

All right, awesome. Yeah, so we have got some great questions coming in, and please keep sending them in, for any of the attendees. Yeah, we will happily answer any we can.

(00:51:47):

The first one that I would like to ask the panel is, "What is the best way to secure a UART or a programming interface? Certainly one common password is not good enough. And if you disable it, then you are unable to debug a sealed unit." What are strategies that you might use for that?

TH (00:52:03):

I am going to pass on this question. We use a hard-coded thing at Pebble. <laugh>

PJ (00:52:11):

Yeah, I think there is a balance between going back to what Elecia said, if somebody has your device, there is a limit to what you can secure. So commonly I do see a hard-coded password. If it is beyond that, it would be you need to authenticate with some kind of key over a special Bluetooth connection endpoint or something like that. I have done that in the past.

(00:52:36):

I have seen strategies where you might have a special debug board you plug in, that has to send a password and like toggle some IO lines in a specific sequence with very precise timing. But again, these are all things that can be totally reverse engineered from your firmware, and used against you, if you are really concerned about that. Blowing fuses and just eliminating your debug interfaces altogether is- But then you cannot deal with debugging after the fact.

TH (00:53:06):

Do not even say that. Do not say that Phillip. No. <laugh>

PJ (00:53:12):

Again, it depends on the degree of concern you have, and how much you really need to protect that.

EW (00:53:19):

Another one I see a lot is your debug interface is only available for a short time. So you have to type that password in in the first two seconds after booting. Which, if you know the password, is easy. But if you do not, it is hard to start guessing, if you have to wait for reboot each time.
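
A rough sketch of that time-window approach, with assumptions flagged: the two-second window comes from the description above, while the one-guess-per-boot rule, the stored password hash, the example password, and the `DebugPort` class are hypothetical choices for illustration.

```python
import hashlib
import hmac

UNLOCK_WINDOW_S = 2.0  # "the first two seconds after booting"

# Keep only a hash of the password in the image (hypothetical example value).
PASSWORD_HASH = hashlib.sha256(b"example-password").digest()


class DebugPort:
    """Models one boot: a single unlock attempt, allowed only inside the window."""

    def __init__(self):
        self.unlocked = False
        self.attempt_used = False

    def try_unlock(self, seconds_since_boot, password):
        if self.attempt_used or seconds_since_boot > UNLOCK_WINDOW_S:
            return False
        self.attempt_used = True
        candidate = hashlib.sha256(password).digest()
        # Constant-time comparison, so timing does not leak how close a guess was.
        self.unlocked = hmac.compare_digest(candidate, PASSWORD_HASH)
        return self.unlocked


port = DebugPort()
print(port.try_unlock(0.5, b"example-password"))  # in time, correct password

late = DebugPort()
print(late.try_unlock(3.0, b"example-password"))  # correct password, too late
```

Every wrong guess costs a full reboot, which is what makes brute forcing slow; it does nothing against someone who has read the check out of the firmware.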

NP (00:53:41):

Yeah, I have definitely seen that strategy used. All right, another question from the audience we have is, "How do you manage revocation of keys for critical devices?"

EW (00:53:51):

That is probably a whole paper.

NP (00:53:54):

I was thinking that. <laugh>

EW (00:53:54):

One I have not written. It is a hard problem. It is not so much revoking the keys, it is replacing them, because you cannot- I mean, if you want to just trash a bunch of units so they cannot get to your cloud, because they have fallen off a truck or something, that is one thing. You just delete them from your access list.

(00:54:20):

But if you have devices you think may have been accessed improperly, then you probably- Just sending them new keys is not going to help, because if they have your device, then sending them a new key does not do much, you just give them the new key. So there is not really a good solution that I know of. I would be happy to be wrong if anybody else has a good solution.

PJ (00:54:53):

So I do not know how it is done, but I know from just watching security news. For example, MSI just had their BIOS updating keys leaked in a ransomware attack, and they do not have a revocation mechanism for making it so those keys cannot be used anymore.

(00:55:15):

But apparently other motherboard manufacturers have solved this problem. So I would be really curious how various vendors like that handle revocation of the UEFI keys for their secure boot processes. Maybe that would be a good model. But off the top of my head, I do not know how that is done at all.

NP (00:55:36):

Got it. So in the case where it would be needed, it is probably like a whole, <laugh> a whole month long project for someone. Multiple months probably <laugh>.

PJ (00:55:44):

Right.

NP (00:55:46):

We got a good one here. Can you talk more about bootloader design, and the concept of a micro image? I think Tyler maybe you mentioned that one.

TH (00:55:54):

Yeah, there is a webinar done by Francois, who basically I think iterated through a bunch of designs from his time at Pebble and then at Oculus. But I guess the idea is- So one, look at the webinar. It was given by an engineer for engineers, not pitching Memfault really, but talked about the multi-stage bootloader design.

(00:56:17):

But in short, it is the tiniest little bootloader, that knows where to look for the bootloader image, so that you can update the bootloader image. But then the bootloader is only responsible for verifying the new firmware image, making sure it is signed, and then knowing how to actually boot it, and knowing if you have two slots, or knowing how to update the main image.

(00:56:44):

I have seen both types of bootloaders, that one, can connect to Bluetooth or have some sort of connectivity stack, and then others that do not. What I typically work with is the bootloader itself does not have connectivity. There is usually a- What did we call it? A recovery firmware, and then a main firmware. The recovery firmware is like a full-fledged image that has a UI. It has a lot of stuff, and then that is the thing that has the connectivity. And then there is the full image that has everything.

(00:57:14):

But it is like, you are booting through stages, and then each time you are verifying, and you are seeing if you need to update the thing. You are basically performing hardware validation across, as you boot through. I am sure Phillip or Elecia have extra things to add there, but I would say watch the webinar.
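
The verify-then-boot decision at one of those stages can be sketched roughly as follows. This is not the design from the webinar: the HMAC "signature" is a stand-in for a real asymmetric signature check, and `SIGNING_KEY`, the slot layout, and `choose_boot_target` are hypothetical.

```python
import hashlib
import hmac

# Stand-in secret for the demo. A real bootloader would hold only a public key
# and verify an asymmetric signature instead.
SIGNING_KEY = b"demo-signing-key"


def sign_image(image):
    """Build-side helper: produce the tag stored next to an image."""
    return hmac.new(SIGNING_KEY, image, hashlib.sha256).digest()


def is_valid(slot):
    """A slot is bootable only if it exists and its tag matches its contents."""
    if slot is None:
        return False
    return hmac.compare_digest(sign_image(slot["image"]), slot["signature"])


def choose_boot_target(main_slot, recovery_slot):
    """Prefer the main firmware, fall back to recovery, otherwise refuse to boot."""
    if is_valid(main_slot):
        return "main"
    if is_valid(recovery_slot):
        return "recovery"
    return "halt"


main_fw = b"main firmware v2"
recovery_fw = b"recovery firmware v1"
good_main = {"image": main_fw, "signature": sign_image(main_fw)}
good_recovery = {"image": recovery_fw, "signature": sign_image(recovery_fw)}
# Simulate a failed update: the image changed but the signature did not.
corrupt_main = {"image": main_fw + b"!", "signature": sign_image(main_fw)}

print(choose_boot_target(good_main, good_recovery))
print(choose_boot_target(corrupt_main, good_recovery))
```

Each stage repeats the same pattern on the stage after it, which is why a bad update lands in recovery instead of producing a brick.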

PJ (00:57:35):

Yeah, I thought that was a good answer.

NP (00:57:39):

Nice. Another great question we have- Actually we are at the top of the hour, but we will do one more question, and I think we will close out the panel, and answer the rest in the Interrupt Slack. But this last question that we will do live, "Do you worry about security with the ability to crack open a released product?"

EW (00:57:57):

Sure. It is about how concerned you are. It is not easy. It is not usually- People are usually buying your product because they want the product. It is when you have things that your product can cause other problems. Medical devices, or locating people. That is when you start worrying about how you are doing the security.

(00:58:33):

But again, if you have one unit, you should only be able to break that one unit. If you can break all the units by cracking open one, then you have a much bigger problem.

PJ (00:58:47):

And there are strategies you can take, should your device warrant such strategies. You can create a circuit that is only closed, for example, when your product is fully enclosed. And the circuit is broken when it is open. You can use that to control behaviors. Obviously that can be faked.

(00:59:07):

There are also things like a lot of RTC components will have tamper alarms, that can be used to trigger a wake up event. And so you can leverage things like this, if you are really concerned, to for example, self-brick a unit. It is not really something that you want to do, but if your security warrants such a degree, or such a concern or consideration, that you might really want to take catastrophic extreme measures if somebody is opening your device, there are ways to detect that that is happening and respond to it.

(00:59:41):

You have to be prepared to do an RMA, in case something goes wrong. Or have some flow in place, that you could repair those units and make them good again, if it was an accident or some other thing happened. But you definitely can detect those devices being cracked open.

EW (01:00:00):

I have actually worked on one that self-bricked. The hardest part for me was at the end, when we really had to test this functionality, but we were still in engineering, so we did not have that many prototypes. But you have to test the functionality. You have to make sure it bricks. And for us, it was-

TH (01:00:20):

How did you brick it?

EW (01:00:21):

Oh, we updated- We put a bootloader into RAM, and then updated the flash to be all ones, and then all zeros, and then all ones.

PJ (01:00:31):

Yep.

EW (01:00:34):

Yeah, we did not have a recovery method, and it was a unit that we had taken out the flash fuses, because it was supposed to be very secure. So it is truly bricking the unit. There is no recovery. It is hard to do that to the unit that you have had lovingly sitting on your desk. It has been your friend through all of these debugging adventures, and now you are just going to kill it.

PJ (01:01:04):

Sometimes that is a lot of money, depending on the cost to build those prototypes. That is thousands of dollars, that can be just flushed down the drain to test this functionality.

EW (01:01:15):

Thousands of dollars, hundreds of hours. Yes. It is hard to watch. Harder to do.

NP (01:01:22):

Sometimes necessary, but heartbreaking nonetheless <laugh>. Great. Thank you guys so much. I am just going to share the panelists' information really quick. All right. Yeah, so big thank you to our panelists today. That was amazing, especially our special guest, Elecia White. So thank you guys so much for entertaining our questions. That was a ton of fun. I could go for hours.

(01:01:49):

You can see on the slide, and we will share this as part of the webinar follow-up. There is contact information for all these nice folks. Especially big shout out to embedded.fm and the Embedded Artistry content that those two guys put out, which is just amazing stuff. So please go check it out.

PJ (01:02:03):

Thank you.

EW (01:02:05):

Thank you.

NP (01:02:06):

Big thank you as well to all the attendees. Thank you for joining us while we discuss this topic, and looking forward to the next one.