Transcript from 390: Irresponsible At the Time with Tyler Hoffman, Elecia White, and Christopher White.

Welcome to Embedded. I'm Elecia White, alongside Christopher White. Today we'll be discussing why I hate the term Internet of Things. Wait, no, we'll be discussing the management of distributed systems with Memfault's Tyler Hoffman.

CW (00:00:23):

Hey, Tyler. Welcome.

TH (00:00:25):

Hello.

EW (00:00:27):

Could you tell us about yourself?

TH (00:00:29):

Yeah, for sure. Yeah. I'm Tyler Hoffman. I am generally an embedded firmware engineer. Over the last three years I've been doing, apparently like Chris, mostly Python, and building Memfault's backend services and data infrastructure to manage our device management platform and diagnostics tools.

TH (00:00:53):

Before then I was a firmware engineer at Pebble and Fitbit, where I constantly found myself doing more developer tools and infrastructure than writing firmware.

EW (00:01:06):

Alright. We will have questions about, well, writing firmware, and managing it, and all of that. But first we're going to do lightning round.

CW (00:01:16):

Narrow topic. Narrow topic.

EW (00:01:16):

Lightning round, where we ask you short questions, and we want short answers. Are you ready?

TH (00:01:21):

I am.

CW (00:01:22):

Okay. Easy one. Favorite fictional robot?

TH (00:01:25):

WALL-E.

EW (00:01:26):

IoT, edge devices or distributed systems?

TH (00:01:31):

IoT edge devices.

CW (00:01:35):

Who had a better smartwatch, Fitbit, Pebble, or Apple?

TH (00:01:39):

Pebble. That's an easy one.

EW (00:01:41):

Preferred code editing tool?

TH (00:01:43):

Now? PyCharm. It's great.

CW (00:01:46):

CMake, Make or something else?

TH (00:01:50):

CMake. But I don't know it super well.

EW (00:01:54):

Open source software, yes or no?

TH (00:01:56):

Yes.

CW (00:01:58):

Complete one project or start a dozen?

TH (00:02:02):

Finish two 80% of the way.

EW (00:02:06):

If you were teaching a course about embedded systems, what three topics should you definitely cover?

TH (00:02:15):

Unit testing, debugging, and build systems.

CW (00:02:22):

Okay. I have a late breaking question for you. Have you ever ridden the Boilermaker Express?

TH (00:02:28):

I never did, actually. I hopped aboard it when it was stationary, but never while it was moving.

CW (00:02:35):

Follow-up. Where did the name Boilermaker come from?

TH (00:02:40):

I mean, I'm going to guess here. I mean, I do know. It was from the men who worked on trains and railroads.

CW (00:02:48):

I'm asking real-time questions that are coming in to me from a fellow Purdue alum. So...if anybody's wondering what the heck is going on, that's what those questions are for.

TH (00:02:57):

I went to Purdue for undergrad, for the listeners.

EW (00:03:01):

Okay. We're going to go back to the course thing, because that was kind of important. That was -

CW (00:03:05):

I'm sorry.

EW (00:03:05):

Do you remember?

CW (00:03:06):

I forgot that you're trying to get everybody to do your homework for you.

EW (00:03:10):

That wasn't it. Okay. So listeners, sorry, Tyler, it's just going to take a second. Listeners, I am teaching a course for a company called Classpert. That's like class and expert had a little word together and called it Classpert.

TH (00:03:25):

I like it.

EW (00:03:25):

And it's about embedded systems. It goes through my book. It has a whole bunch of extra stuff. I am doing videos. I'm doing all kinds of lectures. There'll be mentors, and real time discussions, -

CW (00:03:38):

And projects.

EW (00:03:39):

Projects. And I'll put a link in the show notes, but I hope you check it out. The first class is going to be kind of small, because let's face it, I haven't done this before. But the Classpert folks seem to really have their act together, and let's face it, my logo for them is awesome.

EW (00:04:00):

Okay. Sorry, Tyler. Back to you. Debugging, unit testing, and what was the other one?

TH (00:04:07):

Build systems.

EW (00:04:09):

Well, alright. I think that's where we're going to head for the whole show. Recently on Twitter, I asked about IoT management for non-cellphone devices like BLE or Zigbee with a backhaul cell phone or coordinator, non-Linux, Ethernet devices.

EW (00:04:29):

And I wanted to know what platforms people use, and what they like, and what you'd suggest for a new small company entering the IoT space. Do you have an answer to that?

TH (00:04:41):

I think we all have somewhat strong opinions to that.

EW (00:04:46):

I didn't get any response. I mean on Twitter. I was so surprised. But yes, I have strong opinions, mostly in the, "Oh God, get me out of here" opinion.

TH (00:04:54):

Yes.

EW (00:04:54):

But you actually are in that space.

TH (00:04:57):

We're in that space. And my guess as to why people did not respond to you would be because no one has a very strong, or confident, or probably even right answer to that question. Because I feel like a lot of them are mediocre at best, a lot of these systems.

TH (00:05:14):

In terms of what platforms we've seen people use, so, yeah, so we work with a lot of customers at Memfault. We talk to a lot of engineers. I have never talked to more embedded systems engineers in my entire life than I have over the last two or three years.

TH (00:05:30):

Zephyr, Mynewt, FreeRTOS, and the Espressif IDF are the ones that come up most commonly in the customers that we talk to.

EW (00:05:40):

But those are the devices. I was looking for what happens after you get to ten units in your lab.

TH (00:05:47):

Yes. And that is a fantastic question. There's really not much.

TH (00:05:54):

Of course there are all these big cloud providers that provide some sort of IoT system, and they believe that your small embedded devices are computers sitting in your offices or in your closet somewhere and not necessarily a very small embedded device. And I know that's what you're looking for.

EW (00:06:19):

Yes. Non-Linux Ethernet devices.

TH (00:06:21):

Exactly, right? And so AWS has one, right? You can use FreeRTOS with AWS IoT, and Microsoft bought ThreadX, the RTOS there, and Espressif has their own cloud backend that they want to use as well, or they want people to use.

TH (00:06:39):

I wouldn't say all of them are good, and they weren't written to be usable, especially by engineers or people who don't know exactly how to use these systems to begin with.

CW (00:06:51):

Why is this such a hard problem? Is it because you're taking a step beyond just firmware to now having an understanding of networking and software-as-a-service kinds of things?

CW (00:07:02):

Do you have to make that kind of a jump in expertise, or is it that nobody has made a real kind of turnkey, "Okay, this is easy. We will do everything for you" kind of solution?

TH (00:07:12):

I mean, the ones writing the firmware are very much not the people writing the backhaul services. And I don't know if they talk to each other often enough.

TH (00:07:26):

And...I'm sure both of you, working at previous companies, doing embedded systems, that's probably true, is the firmware engineers very rarely talk to the cloud engineers, I believe. I know that was true for the last two companies that I worked for.

EW (00:07:40):

I think it's worse at some of the big companies. Azure and Amazon both have IoT offerings that really do seem to be written for software engineers working on computers, not written for firmware engineers, trying to squeak out one last byte of RAM.

TH (00:08:00):

Exactly, right? And especially trying to do SSL connections, and -

CW (00:08:05):

Yes, yes.

TH (00:08:05):

- HTTPS with 64K, or some of our customers like 32K of RAM. It's just not happening.

CW (00:08:12):

Yeah. I think you're right. That's a huge piece of it is, some of the things you must conform to don't really fit.

TH (00:08:21):

Yeah.

CW (00:08:21):

And they were never intended to fit.

TH (00:08:23):

And the other thing that's also tough with a lot of those platforms that exist today is they assume that these devices have infinite power, pretty much infinite resources, and they have a constant and stable internet connection to these systems.

TH (00:08:40):

And that's very rarely the case, unless you are literally a computer in a closet running Linux.

EW (00:08:48):

I've worked on two big distributed systems such that I've had to get involved with both the software and the hardware. And one was ShotSpotter, where we had dozens and dozens of sensors in each covered city. And we had dozens and dozens of covered cities.

EW (00:09:06):

And every day we wanted to know, well, was there a sensor that didn't have its heartbeat, that didn't check in, which meant its radio or power was down? Was there a sensor that had a fault, or didn't hear anything, and therefore probably...had something wrong?

EW (00:09:28):

And I mean, once you get up to a thousand sensors, it becomes hard. And we did it with Visual Basic, querying SQL tables in Excel, and color coding. My big -

TH (00:09:42):

In Excel too?

EW (00:09:43):

In Excel.

CW (00:09:44):

In your defense, AWS didn't exist. None of this stuff existed back then.

EW (00:09:49):

That's true. I mean, it was, it was 2007, 8, 9-ish.

TH (00:09:55):

That's kind of still what they want you to do though...They're just going to export your data to a CSV file in S3. And they're going to tell you to do it yourself. That's all they're going to provide.

EW (00:10:07):

But one of the other problems with that, I mean, the communication was part of it, but for some of the devices, we were on a cell modem. And so every byte you sent back actually cost money. And so we didn't want to do a heartbeat every minute, because that actually adds up to a lot of money every day.
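
As a rough worked example of how per-byte billing adds up, the sketch below estimates the monthly data cost of heartbeats for a fleet. The payload size, price per megabyte, and fleet size are made-up numbers for illustration, not figures from the episode.

```python
def monthly_heartbeat_cost(bytes_per_heartbeat, interval_s, price_per_mb,
                           devices, days=30):
    """Estimated cellular data cost of heartbeats (all inputs are hypothetical)."""
    heartbeats_per_device = days * 24 * 3600 // interval_s
    total_bytes = heartbeats_per_device * bytes_per_heartbeat * devices
    return total_bytes / 1_000_000 * price_per_mb


# Hypothetical fleet: 1,000 sensors, 200-byte heartbeat, $0.50 per MB.
every_minute = monthly_heartbeat_cost(200, 60, 0.50, 1000)
every_hour = monthly_heartbeat_cost(200, 3600, 0.50, 1000)
print(f"every minute: ${every_minute:,.2f}/month")
print(f"every hour:   ${every_hour:,.2f}/month")
```

Under these assumed numbers, moving from a one-minute to a one-hour heartbeat cuts the bill by a factor of 60, which is the kind of trade-off being described.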

EW (00:10:28):

...What is this, device management? IoT management? Takes into account the need for small data -

CW (00:10:43):

Updates. Yeah.

EW (00:10:43):

- updates.

TH (00:10:44):

And so, yeah. So can I pitch Memfault really quickly -

EW (00:10:48):

Yes. Yes, please.

TH (00:10:48):

- or just say what we're attacking, right?...So yes. What we did at Pebble and Fitbit, Pebble, we built our own, it was very simple. Our devices connected through a phone and every so often reported back to that phone to a very scalable Python application written on Heroku. Honestly, that's how we got most of our data back.

TH (00:11:11):

At Fitbit, massive systems. I'm sure both of you have some history on that and how those are built, but yeah, very complex systems, but completely homegrown.

TH (00:11:21):

And why we wanted to build Memfault was because we kept seeing this problem over and over again, where no matter what company we went to, we were going to have to build this system, or shoe horn one of these larger systems, into a hardware product embedded system again.

TH (00:11:40):

And me, Chris, and François were just like, "We can't do that again. We don't want to solve this problem for the third or the fourth time." And so that's Memfault. And...we're getting in, I would say more so device management. I think everyone defines it differently, which I guess is also part of this conversation.

TH (00:11:58):

I see it as kind of three or more things. It's provisioning, it's giving the device some sort of certificates or device serial that you put it on in the factory assembly line. It is knowing whether that device is alive and how well it's doing. And then it's also pushing new updates to those devices.

TH (00:12:19):

I think those are the three things for device management. Memfault does very well, in my opinion, the OTA delivery and the monitoring and diagnostics. We do not have, yet, maybe, any sort of provisioning services, security keys.

TH (00:12:35):

We're not doing those things yet, which I think is the one thing that AWS IoT maybe does well, but also very confusing.

EW (00:12:43):

How do you do over-the-air updates if you don't have security keys programmed in manufacturing?

TH (00:12:51):

For our customers, we assume they are going to do that themselves.

EW (00:12:55):

Fair enough.

TH (00:12:55):

So we are basically saying, bring your own system...I think other companies are attacking it in the way that, "You need to use our platform. You need to use our chips that we provide you."

CW (00:13:06):

Right.

TH (00:13:06):

"They're $10 a piece. And please use our chips. Please use our backend, and you can build your product on top of it." We're just saying -

CW (00:13:17):

Not very scalable. I mean -

TH (00:13:18):

Which is not very scalable. But I think a lot of these companies are building it for very large and expensive devices, right? If you're building a tractor, or if you're building -

CW (00:13:33):

Sure.

TH (00:13:33):

- a big machine on an assembly line, you don't care about the cost at that time. But if you're building a wearable device that costs 100 bucks, you need something that works for that company and for that business model. And there's not much there.

EW (00:13:48):

One of the problems with supporting provisioning in manufacturing that I've seen some vendors try to help with, ends up with them having the keys. And that's always been a non-starter for me.

TH (00:14:02):

Vendor lock-in.

EW (00:14:04):

Yes. Exactly. I mean, in the end, if I'm protecting the customer data or protecting my device through secure over-the-air downloads, I don't really want anyone else to have that information.

TH (00:14:22):

Correct. Yes. And so, yeah, there's no good solution. But...have you heard of providers that don't give you the private keys if you give them on a device, the provisioning?

EW (00:14:37):

I think so, because...I don't want to call out anybody, but there are some companies that provide a whole solution from network to dashboard. And you write a little bit of code for their widget, and you don't really get to know anything else about it.

TH (00:14:59):

Got it. And so you're basically writing software for this thing that exists in the environment that you're placing it in.

EW (00:15:06):

Yeah. And sometimes, I mean, over-the-air updates happen kind of magically, which is terrifying, because you don't really want over-the-air updates to happen like you want continuous integration with software.

TH (00:15:21):

Correct. And especially when it comes to hardware, because the inevitable and the worst case is you're going to brick units or have issues in the field that you can't possibly handle or want to deal with, basically.

CW (00:15:35):

Yeah. It really feels like the companies that do what you're describing Elecia is,...like we said, they have that software perspective. It's like, "Well, how can we make the device a software thing? How can we make the device just part of the cloud?"

CW (00:15:50):

And if you write software for it, it's, "We own everything that's involved with it," basically. So it's a difficult balance.

EW (00:15:59):

And when you say provisioning, you mean the security piece, not the provisioning that the customer has to do when they get it home and have to connect it.

TH (00:16:08):

Correct. Yes. I mean -

CW (00:16:10):

The certificates and stuff.

TH (00:16:12):

Flashing the device with, "This is your device serial. This is your MAC address. This is your Bluetooth ID. And this is your security token that is how you will communicate to anything." Not necessarily customer onboarding and, "Let's install your first OTA payload," and everything.

EW (00:16:31):

Is there a different word for what the customers do?

TH (00:16:34):

Honestly? I would call it onboarding, honestly.

EW (00:16:38):

Okay.

TH (00:16:38):

I think it's what I've always used. Yeah. So going back very briefly to the larger systems, what they tried to do, and what they're focusing on, is secured transport. And in my opinion, a lot of it, for OTA updates specifically, is as long as you have secure boot, you're fine.

TH (00:16:59):

As long as the payload is signed and you install it, which I think most of the boot loaders today, in the embedded system platforms that you can use, you're generally going to be fine.

EW (00:17:10):

What do I need to know as a firmware engineer about OTA when I'm thinking about these large distributed systems? Signing and hashing are important, where hashing is the checksum, but with security, and signing says it really did come from the person I said it came from.

TH (00:17:28):

Yeah, it's funny.

EW (00:17:32):

It's funny?

TH (00:17:33):

It has to be 100%, sorry. I'm only saying funny because at Pebble we actually didn't have secure boot. We didn't have signed payloads. It was more of a hacker device. And so we just assumed the device connected to the mobile app, and everything was fine.

TH (00:17:50):

But I'm thinking back now, it's funny because there was a group of people, they were called PebbleBits. And they would modify the Pebble firmware in whatever way they wanted, where they added new fonts, they built internationalization...

TH (00:18:06):

But they basically just modified our firmware in very different ways, adding really cool features. But then you would just click the link in the mobile app, and it would just automatically push that firmware to the Pebble, which I thought was fantastic.

TH (00:18:21):

But you could install whatever you wanted on the Pebble device, as long as the CRC matched.
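
To illustrate the difference between "the CRC matched" and a signed payload, here is a minimal sketch. It is not how Pebble or any particular bootloader works. Python's standard library has no asymmetric signatures, so an HMAC with a shared key stands in for the signature; a real secure boot chain would more typically verify an asymmetric signature (for example ECDSA or Ed25519) so the device only holds a public key. The key and image contents are placeholders.

```python
import hashlib
import hmac
import zlib

# Placeholder key; stands in for a signing key that only the vendor holds.
SECRET_KEY = b"hypothetical-signing-key"


def package(firmware: bytes) -> dict:
    """Build-side step: attach a CRC32 and an authentication tag to an image."""
    return {
        "image": firmware,
        "crc32": zlib.crc32(firmware),
        "tag": hmac.new(SECRET_KEY, firmware, hashlib.sha256).digest(),
    }


def crc_ok(pkg: dict) -> bool:
    """Integrity only: catches corruption, but anyone can recompute a CRC."""
    return zlib.crc32(pkg["image"]) == pkg["crc32"]


def tag_ok(pkg: dict) -> bool:
    """Authenticity: a valid tag requires knowledge of the key."""
    expected = hmac.new(SECRET_KEY, pkg["image"], hashlib.sha256).digest()
    return hmac.compare_digest(expected, pkg["tag"])


official = package(b"official firmware 2.4")

# A third party swaps the image and recomputes the CRC, but cannot forge the tag.
modified = dict(official, image=b"modified firmware 2.4")
modified["crc32"] = zlib.crc32(modified["image"])

print(crc_ok(official), tag_ok(official))  # True True
print(crc_ok(modified), tag_ok(modified))  # True False
```

A CRC-only check accepts the modified image, which is fine for a hacker device and exactly the gap that signing closes.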

EW (00:18:28):

Which is great when you have a hacker device.

TH (00:18:31):

Yes.

EW (00:18:31):

And it's great when you have it on your desk as a developer. But it is not great when the president of the United States is wearing your smartwatch.

TH (00:18:41):

Totally.

EW (00:18:41):

At that point, you want a little more security.

CW (00:18:45):

What are you saying?

TH (00:18:48):

And let's just hope that every hardware company is making sure that they are using those secure practices. All we can do is write on Interrupt...that you should do it.

EW (00:19:00):

Interrupt is your blog, right?

TH (00:19:03):

Yeah. Memfault's founders, the three of us, kind of just were like, "We need to write some content because it doesn't exist. So let's do it ourselves."

EW (00:19:12):

And it's a good blog, and I have pointed to it and been pointed to it various times. And yet I was totally unaware of the connection to Memfault or what Memfault did. Have you considered maybe just a little more promotion?

TH (00:19:30):

I mean, yeah, so our marketing employee, Colleen, would love that. There is this fine line that we are trying to balance between aggressive self promotion and also trying to build this community on the side of the company that we don't ultimately control. I've seen it time and time again.

TH (00:19:58):

And the reason I don't like a lot of the embedded systems communities is the ones that you find are almost always owned by a company or enterprise. And the largest and arguably best LinkedIn group that I found for embedded systems is blatantly owned by a consultant, an embedded systems consultancy.

TH (00:20:19):

And it's just awful, and they've actually ruined it now. And so we wanted to just not do that. But yes, we should do a little bit more self promotion now that we have a very good product that we all believe in.

TH (00:20:33):

And we do think almost any hardware company that's building on embedded systems, and now Android, and soon embedded Linux, all of them would benefit from it. And so now we're not super opposed to it.

TH (00:20:44):

We...actually just had a meeting last week about how we're going to get some more people to understand what Memfault is, who are reading Interrupt.

EW (00:20:54):

Marketing is really hard. I mean, because there is that balance between, "I did this thing. I think you'll think it's cool," that most engineers are hesitant about, and then there's this, "You know what I need? I need this thing." And not realizing that somebody else has already built it and done a good job of it.

EW (00:21:15):

I don't know how to do that. I mean, I have that problem with the podcast that I think I should be marketing more. I think there should be more out there, because I do think it's a good thing. And I think people like it, but I don't really want to market. It's no fun. And it feels wrong.

TH (00:21:34):

Yeah. It always feels like you're advertising to people that don't want to listen. And I mean, what we've learned a lot is people actually want to hear about Memfault, and read more content, and yeah, to tie it back.

TH (00:21:51):

So what we want Interrupt to become ultimately is a community of developers that, they feel like it's at least helped by Memfault. We may provide resources to the community. Eventually it could come into a more fleshed-out website that's more of a hub that you kind of hop into and learn more about embedded systems.

TH (00:22:14):

Maybe a conference in the future. But we don't want to be the company that owns it. If somebody else wants to come in and help us out, great. And we will provide resources to it, but that's kind of it. And yeah, embedded.fm should become a community that, if you two want to stop doing it, it should live on, right?

EW (00:22:34):

Yeah.

TH (00:22:35):

Hopefully.

EW (00:22:35):

Although if somebody offers us enough money for the Slack, we will totally sell, but it's going to have to be a lot.

TH (00:22:42):

Yeah.

CW (00:22:43):

Yeah. 5, 10 dollars. Maybe 20.

TH (00:22:48):

I mean, at least buy you a couple of meals in San Francisco, right?

CW (00:22:51):

Well, that's going to be more like a hundred then.

TH (00:22:53):

Yeah, exactly.

EW (00:22:55):

Okay. I want to go back to over-the-air programming, sorry. All the way back to there. Security and hash, hash is for when you get the firmware or signatures and hashes. What else as a firmware engineer do I need to be thinking about with over-the-air updates?

EW (00:23:16):

You mentioned a secure bootloader. Is that something the vendors are providing now? Or is that still something I have to write?

TH (00:23:25):

Unless you have specific needs or requirements, generally you're not writing it. I think most vendors are providing it. They're not great. And so...if you're using a standard enough chip, a bootloader is probably built for you, whether that's wolfBoot, or MCUboot, or -

EW (00:23:48):

Nordic's DFU.

TH (00:23:49):

Nordic's DFU, and I'm sure Zephyr, they don't necessarily have a bootloader, but they basically will tell you how to go about doing this.

EW (00:23:58):

Well, and TI has OAD. Why are there different initials for everybody? This seems like a term we should agree on now.

TH (00:24:06):

At least DFU. I mean, I think most people who I talk to will now use the phrase DFU, but that's also the only way that they know how to install firmware too.

EW (00:24:19):

Doing a firmware update, over-the-air programming, device firmware update, over-the-air downloads, whatever it's called, for a few units in your lab is different than deploying to a few million smartwatches.

CW (00:24:40):

Well, you don't have to go that far even. But yes.

EW (00:24:42):

No, you don't have to go that far.

TH (00:24:44):

I mean, that's drastic. Yeah. But we've done it.

EW (00:24:46):

What are the steps? What are the gotcha, and what do I need to know as a firmware engineer taking the steps to get from a few devices to consumer production level?

TH (00:25:01):

Yes...So yes, I will answer that in just a moment. Even before we get there, you asked what is a requirement. You have to just have an OTA system that works and has a fail safe.

TH (00:25:12):

And so if you ship a bad firmware, the device should be able to restore an old firmware, or restore a very, very minimal firmware that...knows how to phone home or send out a signal that some phone passing by will eventually install a firmware on it. At Pebble, we chose the minimal firmware route.

TH (00:25:38):

If you booted a firmware and it failed three times in a row within I think the span of 15 minutes, we would boot up into what we called the factory firmware, which we had tested and hardened for a very long time, that you could install a firmware over Bluetooth. And you could factory reset the watch to absolute factory conditions.
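
The fallback rule described here (three failed boots within about 15 minutes, then boot the factory firmware) can be sketched as follows. This is a simplified model, not Pebble's actual bootloader: the `BootState` structure, the function names, and the idea that the main firmware calls `mark_boot_healthy` once it is running well are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

MAX_FAILED_BOOTS = 3
FAILURE_WINDOW_S = 15 * 60


@dataclass
class BootState:
    """Stand-in for state that survives resets (hypothetical layout)."""
    failed_boot_times: list = field(default_factory=list)


def select_image(state: BootState, now_s: float) -> str:
    """Run on every boot: decide which image to start."""
    # Only count boot attempts that are still inside the failure window.
    recent = [t for t in state.failed_boot_times if now_s - t < FAILURE_WINDOW_S]
    if len(recent) >= MAX_FAILED_BOOTS:
        state.failed_boot_times = []
        return "factory"
    # Treat this attempt as failed until the main firmware says otherwise.
    state.failed_boot_times = recent + [now_s]
    return "main"


def mark_boot_healthy(state: BootState) -> None:
    """Called by the main firmware once it has been running normally."""
    state.failed_boot_times = []


# A crash loop: the main image is tried three times, then the device falls back.
state = BootState()
crash_loop = [select_image(state, t) for t in (0, 60, 120, 180)]
print(crash_loop)  # ['main', 'main', 'main', 'factory']
```

Recording each attempt as a failure up front means a firmware that crashes before it can report anything still trips the fallback.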

TH (00:26:00):

And so in my opinion, that is step one. And...you should build that when you have five devices, or at least you're starting to build sealed units. Because if you can JTAG anything, you're probably not going to care about a reliable OTA delivery system at that point. Getting to millions of devices, that's a whole different ball game.

TH (00:26:24):

I think the buzzwords and actually true words are you need staged rollouts. This is deploying to ten devices, then 100 devices, and then 1,000, 10,000, and you scale linearly, basically.

TH (00:26:43):

And that entire time you are getting data back from the devices, how are you doing? How's the new firmware behaving? And are there any new crashes or anything that I should be aware of? That's all a different system.

TH (00:26:57):

We'll talk about that later, but staged rollouts, and making sure that you get some form of ping, or heartbeat, or status after you've installed a firmware update is probably the most critical thing when you're dealing with the millions of devices.

TH (00:27:15):

Because if you update even 1,000, and you just don't hear anything from a device anymore, that's when the sirens go off and you press the big red button on the side of your desk, right?
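
A staged rollout with a check-in gate might look like the sketch below. The stage sizes come from the conversation; the hashing scheme for choosing cohorts, the 95% check-in threshold, and all function names are assumptions for illustration and do not describe Memfault's implementation.

```python
import hashlib

STAGES = [10, 100, 1_000, 10_000]  # cohort sizes mentioned in the conversation
MIN_CHECKIN_RATE = 0.95            # hypothetical threshold for "devices are reporting"


def rollout_rank(device_id: str, release: str) -> int:
    """Stable pseudo-random rank for a device within one release."""
    digest = hashlib.sha256(f"{release}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def cohort(devices, release, stage):
    """Devices eligible at a stage; each stage is a superset of the previous one."""
    ranked = sorted(devices, key=lambda d: rollout_rank(d, release))
    return ranked[:STAGES[stage]]


def next_action(updated: int, checked_in: int) -> str:
    """Advance only if enough updated devices have reported back."""
    if updated == 0:
        return "wait"
    return "advance" if checked_in / updated >= MIN_CHECKIN_RATE else "halt"


fleet = [f"device-{i:05d}" for i in range(20_000)]
stage0 = cohort(fleet, "2.4", 0)
stage1 = cohort(fleet, "2.4", 1)
print(len(stage0), len(stage1))      # 10 100
print(next_action(1_000, 990))       # advance
print(next_action(1_000, 700))       # halt
```

Ranking devices by a hash of the release and device ID keeps the cohorts stable between runs and picks a different early group for each release, so the same users are not always the first to receive new firmware.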

CW (00:27:28):

That's when you go to Reddit and see what everybody's complaining about.

TH (00:27:31):

Exactly. You start reading Amazon, you check Reddit, you check Twitter. Yeah. I mean, it's so true. You hit the nail on the head there. That's exactly how we felt at Pebble. As soon as the Reddit thread came up, "Hey, is version 2.4 broken for anyone else," we're like, "Stop everything!"

EW (00:27:50):

It's weird to get that sort of feedback from customers. I mean, that's the exact sort of feedback you desperately don't want. And yet, if there's an error that only happens on 1 out of 100 units, you're not going to find it in the first couple stages of roll out, unless you get lucky.

TH (00:28:09):

Or unless that person is a very vocal Reddit user, for sure. Yes, absolutely. I mean it's so relevant to me as well. I just remembered, one of my first tasks at Pebble, and it was so irresponsible at the time, but I came in, I'm just out of college.

TH (00:28:30):

And in my first month I did a couple tickets, fixed a couple of bugs, and then they were like, "Alright, Tyler, no one wants to be the release lead for this version 2.4." And they were like, "You are going to be the one to release this firmware." And it takes about a month, month and a half.

TH (00:28:45):

You're basically working on it full-time. You are triaging every bug that comes in. You are fixing all the bugs that are easy. And then you're kind of making sure that all the other ones that are harder or more specific to engineers, you're making sure that those all get fixed. You're deploying nightly firmware updates.

TH (00:29:02):

And ultimately what it means is you're dealing with the one or two people that just break their watches in every which way. And yeah,...that was my second month at Pebble, and it was super fun. Thankfully at that point we had logs, we had core dumps, and we had some very minimal metrics coming back from devices.

TH (00:29:26):

So we had battery life and a few heartbeats here and there. So we generally knew how things were going when we were releasing, even internally. But yeah, you have to watch Reddit, honestly. As soon as something comes up there, it's like, "Let's pause for a second."

EW (00:29:43):

But this release engineer position or role does tend to get passed around. Because it's not very fun, especially if you're the person who has to make all the versions be right, do the test verifications, make sure the security keys are in the right place, do some documentation for manufacturing, all of these little things.

EW (00:30:11):

And then you have to compile the image with the security keys, make sure that you can update the firmware, downgrade the firmware, upgrade the firmware, this whole dance to make sure that it's releasable.

EW (00:30:26):

It is a pain, but it's also one of the most important things we have to do. And...it's the least often thing we do for a lot of us. And so it's full of mistakes.

TH (00:30:42):

Yes.

EW (00:30:43):

I mean, how many times have you had to write the checklist?

TH (00:30:46):

I've had to modify the checklist and update it plenty of times. I think during my tenure at Pebble, I think I was the release lead four or five times. In every single release something changed or was out of date. And I skipped a few steps here and there, and I only messed up once.

TH (00:31:00):

I think we deployed, not a bad firmware, but an incorrectly labeled firmware to 100 to 1,000 people. It got the Git SHA, it got 2.8-abcdefg instead of 2.8. But that was my one mistake there.

EW (00:31:21):

That's not so bad.

TH (00:31:22):

It's not so bad. No, it wasn't bad at all. But it was also just something that somebody will post on Reddit and just be like, "Hey, what's going on here? This is not normal." And it just doesn't look great.

EW (00:31:35):

Working at LeapFrog on consumer devices, we didn't have the problem of over-the-air update, but we did have the problem of releasing in manufacturing...They make masked ROMs, and so you can't change them afterwards. And so you have to make sure you get the firmware right.

EW (00:31:56):

And the number of times someone had to change a version number so that it matched some document and did it with a hex editor.

TH (00:32:07):

Oh gosh.

EW (00:32:07):

Because if you recompiled, you had to go through testing again, but if you just made it match the documentation, it was all fine. Yeah. Well, sorry, distraction.

TH (00:32:21):

And when you rebuild a firmware, and you try to ship it to people,...depending on how sophisticated the company is, you're either going to say, "Alright, well now it needs 7 days of soak time or 14 days of soak time." But...what we did at Pebble instead was like, you track it for a day.

TH (00:32:43):

And if the battery life is trending in the right direction, where instead of letting every single watch run out of batteries for 14 days and then measuring the duration that each watch took to die or needed to recharge, we just said, "Okay, cool. Every single device that is out there today running our firmware dropped 7% today. Okay, great."

TH (00:33:06):

"We're ready to ship the firmware tomorrow. The battery life trends look good," rather than waiting 14 days, which I know many, many other teams basically have a requirement to do that.
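
The arithmetic behind this shortcut is simple: if a typical device drops 7% in a day, a full charge projects to roughly 100 / 7, or about 14 days. A minimal sketch follows; the sample readings, the use of the median, and the tolerance value are assumptions for illustration.

```python
from statistics import median


def projected_battery_days(daily_drop_pct):
    """Project full-charge runtime from one day of per-device battery drops."""
    typical_drop = median(daily_drop_pct)
    if typical_drop <= 0:
        return float("inf")  # no measurable drain in the sample
    return 100.0 / typical_drop


def battery_ok(candidate_drops, baseline_drops, tolerance_pct=0.5):
    """Pass if the candidate's typical daily drop is within a margin of baseline."""
    return median(candidate_drops) <= median(baseline_drops) + tolerance_pct


# Hypothetical one-day readings (percent of battery used per device).
baseline = [7.0, 6.5, 7.5, 7.0, 8.0]
candidate = [7.0, 7.2, 6.8, 7.1, 6.9]
regressed = [9.0, 9.5, 8.8]

print(round(projected_battery_days(candidate), 1))  # 14.3
print(battery_ok(candidate, baseline))              # True
print(battery_ok(regressed, baseline))              # False
```

One day of fleet data gives the same projection that a 14-day soak would eventually measure, provided usage on that day is representative.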

EW (00:33:17):

But if you have to wait that long, then if there's a really important bug, you have time, and you have more people on Reddit complaining about you.

TH (00:33:25):

You're preaching to the choir. Yep.

EW (00:33:26):

There's this balance of, "Do I let it go?" And with wearables, with Pebble, with Fitbit, that whole, "You did something to make the battery life die." If you're running on your desk with a unit that has a power supply instead of a battery, you're never going to know that.

CW (00:33:45):

Or you're just not connected to the right Android phone from -

TH (00:33:50):

Oh my gosh, yes.

CW (00:33:50):

- a particular vendor with a particular Bluetooth -

EW (00:33:54):

Right.

CW (00:33:54):

- stack in a particular day of the week. Yes.

EW (00:33:59):

And then people complain that their batteries die...Yes.

TH (00:34:04):

It's the number one complaint. Well, number two complaint. Number one complaint's probably...it doesn't connect or it drops constantly. Number two is battery life drops, or it's terrible. Yeah.

EW (00:34:15):

Okay. You mentioned monitoring the battery life,...and we've both mentioned heartbeats. Once I get my firmware out there, what else do I need to know?

TH (00:34:26):

And this is where Memfault comes in. This is our bread and butter. It's like, "Once you get the firmware out, what are you tracking, and what are you making sure looks good?" And so number one, make sure your devices are alive and reporting anything. Number two, make sure your devices aren't rebooting.

TH (00:34:46):

I think the simplest thing you can track for firmware is, count the number of times, or at least send an event, or find some way to report whether your devices are crashing, or resetting, or hitting an assert. And then ideally reporting some piece of information about how it's asserting or crashing.

TH (00:35:10):

And that's usually the program counter or the link register, or...on a more complex firmware, you can usually pull the function, or a backtrace basically. And so at least get those two things so you can kind of tell whether this firmware is more crashy or not than the other ones.
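
As a rough sketch of the "is this firmware more crashy" comparison, reset events carrying a reason plus the program counter and link register could be tallied per firmware version. The record layout and field names below are assumptions for illustration, not Memfault's schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class ResetEvent:
    device_id: str
    firmware: str
    reason: str  # hypothetical labels, e.g. "assert", "hard_fault", "watchdog"
    pc: int      # program counter captured at the fault
    lr: int      # link register captured at the fault

def resets_per_device(events, devices_per_firmware):
    """Return resets per device for each firmware version.

    Normalizing by fleet size keeps a widely deployed release from
    looking worse just because more devices are running it.
    """
    counts = Counter(event.firmware for event in events)
    return {fw: counts[fw] / n for fw, n in devices_per_firmware.items()}
```

Comparing rates rather than raw counts is the design choice that matters here; the PC and LR values are kept so an engineer can later map them back to a function with the build's symbol file.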

TH (00:35:29):

Beyond that, now you're kind of searching for trends like battery life. And you said heartbeat, that's actually the phrase that I use, and Memfault uses, for events that happen periodically. We can talk about this as well. It's like, "How often do you send these periodic heartbeats?" At Pebble, we did it every hour.

TH (00:35:55):

And so for every single hour, we would track, "How much did the battery life drop? How many ticks or seconds was the CPU active? How many seconds was the Bluetooth chip on? How many disconnects were there on Bluetooth? How much time was I connected on Bluetooth for this hour?"
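
The hourly heartbeat above might be modeled along these lines. This is a sketch in Python for readability (device code would normally be C), and the metric names are hypothetical stand-ins for the ones Tyler lists.

```python
from dataclasses import dataclass, asdict

@dataclass
class Heartbeat:
    battery_drop_pct: float = 0.0  # battery percentage points lost this interval
    cpu_active_s: int = 0          # seconds the CPU was active
    bt_radio_on_s: int = 0         # seconds the Bluetooth chip was powered
    bt_disconnects: int = 0        # number of Bluetooth disconnects
    bt_connected_s: int = 0        # seconds spent connected over Bluetooth

class HeartbeatCollector:
    """Accumulates metrics for one interval, then resets on flush."""

    def __init__(self):
        self.current = Heartbeat()

    def add(self, metric, amount=1):
        # getattr raises AttributeError for an unknown metric name,
        # so a typo fails loudly instead of creating a stray field.
        setattr(self.current, metric, getattr(self.current, metric) + amount)

    def flush(self):
        """Return the finished interval as a dict and start a fresh one."""
        done, self.current = self.current, Heartbeat()
        return asdict(done)
```

A timer would call `flush()` once per interval (hourly, in the Pebble example) and queue the result for upload.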

TH (00:36:20):

And before we shipped any firmware at Pebble, you had to at least meet or exceed those certain trends. So your battery life had to drop less than the previous one, or be within acceptable limits.

TH (00:36:32):

And if your Bluetooth connection time per hour dropped significantly, that is a regression in firmware, and we made a bug, or there's contention on the CPU, or the connection interval changed that you made, just like Chris mentioned with Android phones.

TH (00:36:52):

The connection interval changed, and it made a lot of Android phones upset. I can't tell you how many times we had to change that as well. Yeah, I mean, I think I can go on and on about this too.
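
A release gate of the kind described, where a candidate build must meet or stay within acceptable limits of the previous one, could look something like this. The metric names, the 10% tolerance, and the use of the median are all assumptions made for the sketch.

```python
from statistics import median

# Hypothetical metrics and whether a larger value is good or bad.
HIGHER_IS_BETTER = {
    "battery_drop_pct": False,  # less drain per hour is better
    "bt_connected_s": True,     # more connected time per hour is better
    "bt_disconnects": False,    # fewer disconnects is better
}

def regressions(baseline, candidate, tolerance=0.10):
    """Return the metrics where the candidate is worse than baseline.

    baseline / candidate: dicts mapping metric name to a list of
    per-heartbeat values collected from devices on that release.
    """
    failed = []
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        old = median(baseline[name])
        new = median(candidate[name])
        if higher_is_better:
            worse = new < old * (1 - tolerance)
        else:
            worse = new > old * (1 + tolerance)
        if worse:
            failed.append(name)
    return failed
```

An empty list means the candidate is within limits on every tracked metric; anything else names the trend to investigate before shipping.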

EW (00:37:05):

It's something that I've seen IoT companies not consider. On one hand, all that information is very useful. On the other hand, if you are a battery-powered device, the more often you send that information, the less often you will manage to live through the whole day or however long your battery is supposed to last.

EW (00:37:32):

There's a cost associated with sending those reports. Do you have a way to balance the trade-off there?

TH (00:37:44):

Not one that's not obvious, I guess, right? If you are on a coin cell battery where...it's one and done, maybe it's even a fixed battery, you can't send a heartbeat every minute, like you said before. Sending a heartbeat every hour, or every few hours, or even once a day is pretty good, actually. You should be able to do pretty well.

TH (00:38:07):

And if you have persistent storage, what we tell a lot of people as well is, store up a week or two of heartbeats on flash in a compressed format, and then send them up when you're ready. Slight plug there for Memfault as well. We are able to store plenty of heartbeats.

TH (00:38:28):

And now I think each metric that you track is basically six to eight bytes. And if you have a few K of flash, you can batch up...quite a few heartbeats. And yeah, when you then have a connection, or maybe even a user plugs your device into a wall socket or charges it, then you can send up everything.
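
To make the storage arithmetic concrete, here is a back-of-the-envelope sketch. The 6-byte record layout (a 2-byte metric ID plus a 4-byte value) is an assumption chosen to match the "six to eight bytes" figure; it is not Memfault's actual on-flash format.

```python
import struct

# Hypothetical layout: little-endian 2-byte metric ID + 4-byte signed value.
METRIC_FORMAT = "<Hi"
METRIC_SIZE = struct.calcsize(METRIC_FORMAT)  # 6 bytes, no padding with "<"

def pack_heartbeat(metrics):
    """Serialize {metric_id: value} into a compact byte string."""
    return b"".join(
        struct.pack(METRIC_FORMAT, metric_id, value)
        for metric_id, value in sorted(metrics.items())
    )

def heartbeats_that_fit(flash_bytes, metrics_per_heartbeat):
    """How many heartbeats fit in a flash region, ignoring any framing."""
    return flash_bytes // (METRIC_SIZE * metrics_per_heartbeat)
```

Under these assumptions, 4 KiB of flash holds 136 heartbeats of five metrics each, which is a little under six days of hourly heartbeats before any compression is applied.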

EW (00:38:52):

Yeah. I've had some devices that it's, once you are plugged in, "Okay. Now just send everything you've ever wanted to send in the past."

TH (00:39:00):

Yeah.

CW (00:39:01):

Well, and it's a balance between logging and statistics too. If you can boil stuff down to statistics, that's easy.

EW (00:39:08):

To a few numbers. Yeah.

CW (00:39:08):

Yeah. That's easier to send periodically than, "Okay. I have one megabyte of the day's event logs. I've got to ship all that up there." Both can be useful, but there's a big trade-off there.

TH (00:39:22):

Yes, for sure...Yeah. And I think there's a mixture between the two, right? You have logging. You have metrics. And then I think everyone has kind of come out with their own flavor of it...The compressed logging, we call it compact logging, other people call it hash logging.

TH (00:39:39):

But it's basically, "Take this human readable message, give it an ID. You can pass a couple arguments, and everything is basically stored as uint32s, and then you send up those." And that's much more compact and compressed than sending ASCII text.
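
A minimal sketch of the compact (hash) logging idea: the device sends a message ID and integer arguments, and the server looks up the format string. The ID table and wire layout here are hypothetical; real implementations typically generate the table at build time.

```python
import struct

# Hypothetical ID-to-format table, kept on the server rather than the device.
LOG_FORMATS = {
    1: "BT disconnected, reason=%d",
    2: "Battery at %d%% after %d s",
}

def encode_log(msg_id, *args):
    """Device side: pack [id][arg count][args...] as little-endian uint32s."""
    return struct.pack(f"<{2 + len(args)}I", msg_id, len(args), *args)

def decode_log(blob):
    """Server side: rebuild the human-readable message from the packed form."""
    msg_id, argc = struct.unpack_from("<2I", blob, 0)
    args = struct.unpack_from(f"<{argc}I", blob, 8)
    return LOG_FORMATS[msg_id] % args
```

The format strings never need to live in the firmware image, which saves both flash on the device and bytes over the air.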

EW (00:39:56):

Of course.

CW (00:39:56):

It's like how Applesoft BASIC used to work.

TH (00:39:59):

Oh, really?

EW (00:40:00):

And they wonder why I still know the ASCII table pretty darn well. So there's logging and metrics. There's...the statistics that you were mentioning, which you were calling the heartbeat. For me, a heartbeat is just anything from the unit, which usually is this little -

TH (00:40:21):

Yeah.

EW (00:40:21):

- packet of statistics. And you said those, because you have battery issues, but...also, the length of time you don't check in is the length of time it takes for a user who has changed something on the website to get that on their watch.

TH (00:40:44):

And so this is like if a user clicks install on the app store, and then they're trying to send that down to the watch?

EW (00:40:49):

Yeah. If you're only checking in once an hour, doesn't that mean it takes an hour for it to check in?

TH (00:40:57):

Oh, I mean, this is more for diagnostic data...I mean, they opt in, of course. I think that's generally the trend now is you opt in to all this diagnostic data. You have to...The device is basically in control of sending that data.

CW (00:41:11):

Yeah.

TH (00:41:12):

...At least at Pebble and Fitbit, it was like, "You're on the phone. Let me install an application." You directly connect with the device. And then that just sends it over immediately.

EW (00:41:21):

Sorry, I was back in cellphone, where if you only said, "Hello," actually, this is in the underwater thing I've been working on. If you only say, "Hello," once an hour, and you're only awake that one time to listen, then somebody has to wait an hour for you to get around to saying hello again.

CW (00:41:45):

Yeah. It's like working on a Mars rover.

EW (00:41:47):

Yeah. It's like a Mars rover.

TH (00:41:48):

I mean, you laugh at this. This is a way for people to implement their own version of staged rollouts though, right? If a device only wakes up every hour, or once every 24 hours, and checks in, "Here's my heartbeat. Do you have anything for me," and that's kind of in the payload, right?

EW (00:42:07):

Yeah.

TH (00:42:07):

You send all your information and then the server responds, "Okay, I got it. And also, here's some things you should know about the world." A lot of times that's going to be like, "Here's an OTA payload for you to install." And if you just release the firmware for 30 minutes and then turn it off, that's pretty much the staged rollout, right?
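
The check-in exchange and the "release it for 30 minutes" trick could be sketched like this on the server side. The payload fields, the `Release` type, and the handler name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Release:
    version: str
    opens_at: datetime
    open_for: timedelta

    def is_open(self, now):
        return self.opens_at <= now < self.opens_at + self.open_for

def handle_check_in(heartbeat, release, now):
    """Acknowledge a heartbeat and offer an OTA only while the window is open.

    Devices that happen to wake up inside the window receive the update,
    so a short window reaches only a slice of the fleet.
    """
    response = {"ack": True}
    if release.is_open(now) and heartbeat["firmware"] != release.version:
        response["ota"] = release.version
    return response

opens = datetime(2021, 10, 1, 12, 0, tzinfo=timezone.utc)
release = Release("2.0", opens, timedelta(minutes=30))
```

This relies on wake-up times being spread out across the fleet; if every device checks in at the top of the hour, a time window selects all of them or none.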

EW (00:42:27):

That's one way to do it, yes. Although realistically, I would rather have...the engineers, the company, and then the company's friends and family, who will report bugs directly to us, and then go out to the bigger picture, to the larger audience. Even though that means you may have a bias towards different cellphones.

TH (00:42:52):

True. Very true.

EW (00:42:53):

Or environments.

TH (00:42:55):

Yes. Or environments. I was going to say that is always one suggestion as well. We say, "Do your staged rollouts, but also have your internal developers or users," which it's usually the company employees, like, "if you're working for a hardware company, every employee should be required to test your device or use it or wear it."

TH (00:43:16):

And...another thing I always suggest is, if your device is experiencing an issue, or asserting, or has rebooted when you're doing internal testing, make that very loud. If you're making a smart lamp, even the simplest thing, make the lamp flash on for 30 seconds, on and off.

TH (00:43:40):

And that's telling the user, "This thing probably crashed. Please load up your phone and submit a bug report internally." And at least at Pebble, if the device crashed on an internal build, we had a build flag that basically said, "Pop up this window."

TH (00:43:56):

If it reset, if this was an internal build, it would pop up a screen that you couldn't do anything else. It was like, "Your Pebble just reset. Please submit a bug," and you had to dismiss it.

CW (00:44:07):

Oh, we didn't do that at Fitbit.

TH (00:44:07):

...We didn't, no. I mean, I pushed for that very hard. We did do it on, I mean, now we're getting into history, Ionic. I built it on Ionic, but it was only for internal and beta testers.

CW (00:44:19):

Not sure I ever saw that happen. [Huh.] Okay.

TH (00:44:22):

Yeah. I think it was a build flag.

CW (00:44:23):

Yeah. Okay.

TH (00:44:23):

But it was only I think if you opted -

CW (00:44:26):

Oh.

TH (00:44:26):

- in as well, and...I mean, yeah.

EW (00:44:30):

Okay. So that's the firmware side. That's some trade-offs on the firmware side, and a little bit on the management side.

EW (00:44:38):

But one of the things at ShotSpotter and Fitbit was, "Okay, now that I have thousands or hundreds of thousands of units, these 50 or 100 have had problems. How much time do I spend each day looking at those problems, or trying to find the root cause, or even finding out about those problems, which - "

TH (00:45:08):

Ding, ding, ding. Finding out about those problems was the hardest part, right? It comes back to millions of devices. Everyone's going to have a problem, right? Everyone is, I mean -

EW (00:45:22):

Everyone is going to have a problem.

TH (00:45:24):

Well, I mean, not necessarily everyone, but there will always be, at that scale -

CW (00:45:29):

Yes.

TH (00:45:29):

- there will be thousands of bug reports every single day, right? No doubt about it. Thousands. And yeah, it's generally, "My battery life was bad," and it was probably the user was out of range or something, right? And the other issues will be, "My device didn't connect to Wi-Fi or Bluetooth."

TH (00:45:49):

And it will probably be they have a weird router or phone, and it just doesn't work. In those weeds, there are actually bugs, and then trying to find those is the hardest part...And what I see people do time and time again, is they build a firmware, and they capture logs, and they send logs somewhere.

TH (00:46:11):

They usually end up in some S3 bucket or on some person's hard drive. And when you're doing 20 devices, you can look through those logs generally every single day and "Control+F" it or "Command+F" it -

EW (00:46:25):

Grep, yeah.

TH (00:46:25):

- depending on which platform you're on. And you can build some really simple Python scripts that can basically parse through some logs. But yeah, to your point, when you're doing even a thousand devices, or a million, no one is going to find the real issues, and especially the new issues that happen, right?

EW (00:46:43):

...If you've seen this issue a bunch, and you've kind of gotten the idea that it happens, and the unit resets, and...I can't find it in the code, but that's okay.

EW (00:46:55):

But when you get the new issue, and you've never seen it before, and you're like, "Oh, is this the start of the tidal wave of problems?" How do you bubble those up? How do you decide what's an important thing to tell people?

TH (00:47:13):

Yup. And this is where Memfault really comes into play, honestly. Because, yeah, quickly to cover this, what are those issues that are going to be very important, right?

TH (00:47:29):

It's probably going to be, your device is crashing, or it's going to be sounding some alarms on asserting or...your device and its heartbeat is saying bug, or issue, or holding up a red flag, right? Memfault is built in a way that when a device crashes or has a particular log, it will basically capture a signature of it.

TH (00:47:56):

It captures a core dump, or it captures a log. It sends that to our server. We basically generate a signature of it. And if it's a new signature, we will generate a new ticket. We'll send you an email. We'll send you a Slack message.

TH (00:48:10):

And we will show it on the front page, be like, "Hey, the firmware version you just updated and pushed out has a new bug." And if it's one we've seen before, we will increment a counter. And so...you're not getting a thousand new bug reports that you have to basically crawl through.
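
The deduplication flow (new signature opens a ticket, known signature bumps a counter) can be illustrated with a small sketch. How Memfault actually computes signatures is not described in this conversation; hashing the reset reason plus the top few stack frames is an assumption made purely for illustration.

```python
import hashlib

def crash_signature(reason, frames, depth=3):
    """Hash the reason and top stack frames into a short, stable signature."""
    key = reason + "|" + "|".join(frames[:depth])
    return hashlib.sha256(key.encode()).hexdigest()[:12]

class IssueTracker:
    def __init__(self):
        self.counts = {}

    def report(self, reason, frames):
        """Record a crash; return True only the first time a signature is seen.

        A True result is where a real system would open a ticket or send
        an alert; repeat reports only increment the counter.
        """
        signature = crash_signature(reason, frames)
        is_new = signature not in self.counts
        self.counts[signature] = self.counts.get(signature, 0) + 1
        return is_new
```

The counter per signature also gives a direct answer to "which issue affects the most devices," which is the prioritization rule suggested in the conversation.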

TH (00:48:28):

You're just being alerted to the one or two new ones that you have maybe that day. And to figure out which ones are actually important, it's probably the ones that are affecting the largest number of devices I would say, or the CEO's device. Usually -

EW (00:48:45):

Yep.

TH (00:48:45):

- those two.

EW (00:48:46):

Yep. Yep. The CEO's device is always high importance.

CW (00:48:49):

Or the press reviewer.

TH (00:48:52):

Or the press reviewer. Exactly. Oh, man. Yeah. We've done that as well, right? You put them into a special cohort of devices, and you do not update their firmware during the release event, or if you do you make sure it's a special build that doesn't do anything fancy. It's kind of a facade.

EW (00:49:13):

No matter what you do, whatever button you press, it goes to the next screen and looks perfect.

TH (00:49:20):

I mean, we've done it.

EW (00:49:21):

Oh, yeah.

CW (00:49:21):

It's just a sticker.

EW (00:49:24):

I remember at Fitbit, finding a new issue in the company-wide rollout of a problem and realizing I didn't know that person. But since this was important, and the bug was whacked, I mean, just crazy, couldn't figure out what it was doing, I actually called, and said, "Okay, so, at [blah, blah, blah] time - "

CW (00:50:01):

This was an internal person.

EW (00:50:02):

This was an internal person -

CW (00:50:03):

She didn't -

EW (00:50:03):

Never do this -

CW (00:50:04):

- call up -

EW (00:50:04):

- with actual customers.

TH (00:50:05):

Oh my gosh. Okay.

EW (00:50:07):

No, no. This was an internal person who knew they had -

CW (00:50:10):

"I went into the customer service database, found this person's registration. I just called them at home and said, 'Hey, I noticed your watch isn't working.' " No.

EW (00:50:15):

And they were very confused, naturally, and then looked at the time, and then said, "Oh, that's when I put it in the dryer."

TH (00:50:26):

Oh.

EW (00:50:27):

I decided I didn't have to chase that bug anymore. Yeah. And actually, the whole creepiness of that, especially as you go to customers, how do you handle those data ethics? I mean, internal customers, and Fitbit was small at that time, but I had the keys to their debug database for a little longer than I should have.

EW (00:51:00):

How do you balance the, "I need this information," versus, "Oh, this shows the customer was in such and such a place at this time and...so they must be," I don't know. This is like when the watch that people were running with was showing how the military base was set up.

CW (00:51:23):

Right. Right. The Strava. Yeah.

TH (00:51:26):

There are different types of debug information that you can send from a device, right? There are hardware metrics, like what is the readout from the sensors? Are the sensors reporting faulty information? I know we tracked some metrics at Pebble where we record the max and the min X, Y, and Z axes from the accelerometer.

TH (00:51:55):

And basically what we would verify from that is, if we just got bogus results for that hourly heartbeat, we knew that that accelerometer, either one, is completely faulty and that product should be replaced or two, something really weird went wrong during that time and...maybe there's a firmware bug.
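
A sanity check on the hourly accelerometer min/max values might look like the following. The thresholds, units (milli-g), and the specific heuristics are assumptions for illustration; the conversation does not say what Pebble's checks actually were.

```python
def accel_looks_faulty(min_xyz, max_xyz, full_scale_mg=4000):
    """Flag an hourly accelerometer min/max summary that looks bogus.

    min_xyz / max_xyz: per-axis (x, y, z) extremes in milli-g.
    full_scale_mg: hypothetical sensor range of +/- 4 g.
    """
    pairs = list(zip(min_xyz, max_xyz))
    for low, high in pairs:
        if low > high:
            return True  # inconsistent record
        if low < -full_scale_mg or high > full_scale_mg:
            return True  # outside what the sensor can physically report
    # No variation on any axis for a whole hour suggests a stuck sensor,
    # since even a device at rest normally shows some noise.
    return all(low == high for low, high in pairs)
```

A single flagged heartbeat points toward a transient firmware problem, while a device that is flagged every hour is more likely a hardware fault.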

TH (00:52:14):

And so, that's not revealing anything private about the user in any way. It's just hardware data. GPS locations are very different. That's where the product is located.

TH (00:52:26):

At least for us at Memfault,...we tell people explicitly, "Do not send us that type of information. Don't send us where people are located, how quickly they're moving, and anything that is personally identifiable."

EW (00:52:44):

What if they need that information for their own device management? Does that mean they have to split their stream of information?

TH (00:52:52):

Generally. And generally they do. Not many people use Memfault as their primary data pipe. They have some other auxiliary pipe that they basically pipe all of their product, or PII, or things that make their product completely function...

TH (00:53:12):

Memfault is currently ingesting debug and monitoring information and some sort of configuration management for some devices. A lot of times they even send all of our data to their own servers. And then they send over the Memfault specific stuff. They basically pass it over from server to server to our service.

TH (00:53:34):

And that's how they keep a lot of that stuff away from us. And yeah, at Pebble,... and at Fitbit too, we captured a lot of data, but I would say not much of it, if any of it at that time was identifiable.

TH (00:53:50):

It was just, how many times was a flash sector read, written to, or erased? How long did it take? How long was the heart rate task running? These things are critical to debug, but in no way useful information to identify a person or understand what they were doing.
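
The server-to-server split Tyler describes, where only diagnostic data is forwarded to the monitoring service, could be sketched with an allowlist. The field names are hypothetical.

```python
# Hypothetical allowlist of fields that are safe to forward as diagnostics.
DIAGNOSTIC_FIELDS = {
    "firmware",
    "battery_drop_pct",
    "bt_disconnects",
    "reset_count",
    "flash_erase_count",
}

def split_payload(payload):
    """Split a device payload into (diagnostics, product_data).

    Anything not explicitly allowlisted stays on the company's own
    servers, so a newly added field such as a GPS fix is never
    forwarded by accident.
    """
    diagnostics = {k: v for k, v in payload.items() if k in DIAGNOSTIC_FIELDS}
    product_data = {k: v for k, v in payload.items() if k not in DIAGNOSTIC_FIELDS}
    return diagnostics, product_data
```

An allowlist is used instead of a blocklist because the failure mode is safer: forgetting to list a field withholds data rather than leaking it.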

EW (00:54:08):

I have some listener questions if you don't mind. Phillip Johnston of Embedded Artistry, when I said you were on, I think he was ready to write the whole outline for me. He asked really good questions. So let's see.

EW (00:54:25):

"In most orgs I've worked in, they hesitate to outsource device management and prefer to build it in-house. Is that simply not-invented-here syndrome, or are there factors with existing services that drive companies toward that decision?"

TH (00:54:43):

I think the most obvious reason why they want to build it in-house is I think what we talked about earlier, there just doesn't seem to be a great solution out there, at least for the factory line provisioning that they need to do.

TH (00:55:00):

Generally companies are just going to build that in-house, because that's what they had to do five, ten years ago anyways. And the same people are going to be working the lines, and they know what to do...

TH (00:55:10):

Yeah, I mean, and the other existing thing is, if you're trying to use a device management tool that you don't know if it's going to exist when your product is nearing its end of life or is going to continue, you're trying to support a product for ten years, I think in the consumer space, I wish it was longer.

TH (00:55:35):

But we want a product to maybe last two, three, four, five years. But if you're building a product for government, or a city, or a sensor that's supposed to stay in the same place for 20 or 30 years, you probably should build that system yourself so that you can at some point in time lock it in a closet and never touch it again.

TH (00:55:54):

And hopefully it just continues to work forever. Who knows if AWS is going to want to, I mean, probably not Google, but who knows if these companies are going to want to support their IoT platforms in five or ten years?

EW (00:56:07):

Yeah. I don't know if Google has an IoT device management system, -

TH (00:56:11):

Oh, they do, -

EW (00:56:11):

- and I wouldn't -

TH (00:56:12):

- but I wouldn't trust it.

EW (00:56:14):

...No, they burned me after their Google Reader. I'm never trusting them again.

CW (00:56:22):

So that was it?

EW (00:56:22):

Okay. Phillip also asked, "What are the real challenges with managing a fleet of devices versus what people think are the challenges - "

CW (00:56:30):

Geez.

EW (00:56:30):

" - but turn out to be easy?"

TH (00:56:33):

Alright. Two-part question. The real challenges are what we talked about before. It's signal from the noise. I think most device management platforms today are truly built for 20 to 100 devices...

TH (00:56:52):

I think, on these dashboards that you see from these products that you're basically looking at, you're comparing your device management platforms, the dashboard that they show is a green or a red box for all of the devices in your entire fleet.

EW (00:57:06):

Yes.

TH (00:57:07):

And you're basically trying to look for the one red box, and you're like, "[Ooh], this device number 72 is offline. Let me go walk over and see what's up with it, or call the assembly line manager and ask them to go reboot it."

TH (00:57:19):

When you're doing thousands, hundreds of thousands, millions of devices, you're always going to have a thousand of them red if you're using this sort of device management tool. And so it becomes, "Is this number worse on the previous release or worse in the new release? Was there a regression or an improvement?"

TH (00:57:44):

...Memfault is getting much better at this. I think we're the only company that I've seen do it, is easily comparing release to release. So you just upgraded from 1.0 to 2.0. How do your metrics compare between them? How are your devices behaving? How did the battery life change?

TH (00:58:05):

Historically, six months ago, how was the battery life between 1.0 and 2.0? All of these things. I just don't believe these device management tools do well, if at all. And yeah, there's always going to be noise,...and there's always going to be a signal. It's just trying to figure it out.

EW (00:58:24):

...I think your statistics there and the noise definitely show your Fitbit and Pebble background. I mean, that's true on almost everything, that you have to figure out which of these bugs is important to spend your day on and which of them you have no chance of fixing until something else happens.

EW (00:58:49):

But the battery component is one of the things with wearables that just makes it that much harder. What about the other part of Phillip's question? What do people think is difficult but it turns out to be easy?

TH (00:59:05):

...Companies like to think that their product is actually the hard part...I mean, I'm just naming things randomly. It's like, "Let's go build a TV remote. You know what the hardest part is, is building that TV remote." That's what they think. And it turns out just not to be.

TH (00:59:23):

The problem is actually managing the firmware updates. It's managing customer support. And how do you get customer support to understand the low-level firmware enough to know what's a real bug, and what's not a real bug, and what's just go reset the device.

TH (00:59:41):

And yeah, I do believe that writing the firmware and building your product is probably the easy part, because you've probably hired or trained people to do that. You have not hired a bunch of people who know how to manage very low-level, very ancient-like devices in a modern world.

TH (01:00:06):

And one of the things that I think people struggle with as well is, you don't know what you don't know...And you probably have many stories about this as well...If a firmware engineer from five years ago tried to build a product in the firmware world today, they'd pull their hair out for sure.

TH (01:00:25):

They're like, "You mean I have to do what? I have to communicate to phones, routers, secure transport, firmware updates every single month, every single week, even nightly sometimes. And you have to have a beautifully crafted touchscreen display, all of it." It's just hard...

TH (01:00:49):

There's only been so much time where we've demanded these sorts of things from these low-level devices. And so I think those are the hard parts, because we've not done them before. We only did them at Pebble because we were really naive.

TH (01:01:03):

We were like, "Well, we think we need these things. We're generally software engineers. Let's learn how to write some firmware. And...if we can't find the tools that we needed in the software world, like building iOS and Android apps, we've got to build them ourselves, because that's what we know is required."

TH (01:01:22):

Whereas I think if you built hardware for a living, you don't know that these software tools are required.

EW (01:01:27):

So many of the tools that I've taken part in building up weren't designed like you're saying. They were the effect of 3:00 AM debug sessions, the realization that, "Oh, we have to monitor battery life, because if we don't, then we don't know that it's broken."

EW (01:01:56):

How do you get engineers to understand that...? I mean, that's really not something you worry about when it's on your desk or when it's in your lab. But when it turns into enough devices that people go to Reddit, I don't know why I'm picking on Reddit now.

TH (01:02:17):

Because it's noisy. It's great. I mean, it's great...Fan boys and girls.

EW (01:02:23):

I only go to the origami channel these days. It's not a channel, is it? What are the Reddits?

CW (01:02:31):

Subreddits.

EW (01:02:31):

Subreddits.

TH (01:02:35):

I think I know where your question is going. It's like, "How do you then train or get engineers to understand that they need to focus on these problems now, not when the customer support tickets come flooding in that -

EW (01:02:50):

Right.

TH (01:02:50):

- the battery life is now bad," right? Because then, as soon as you hear about it that time, then it takes you months to fix. And no one wants that two-to-three month debug session which we've probably all done.

CW (01:03:05):

Well, it's not even the two-to-three month debug session...Not, "We have to fix this problem and figure it out," but also, "Oops, we really should be tracking this since now we have to have a crash program to actually do the kind of logging and stuff that we weren't doing before," right?

EW (01:03:19):

And the bug only took -

TH (01:03:20):

Oh, yeah.

EW (01:03:20):

- two days to fix, but now you have your release process so that it doesn't have another bug in it that causes more problems.

CW (01:03:26):

Yeah, yeah. Yeah.

TH (01:03:27):

We're all forgetting the fact that you have to reproduce this issue first as well.

CW (01:03:30):

Right.

TH (01:03:30):

You have to understand -

EW (01:03:33):

Oh, geez.

TH (01:03:33):

...Oh man, I mean the number of people that are interns or sad individuals that I've talked to that are just like, "Oh, I've been trying to reproduce the bug for two weeks and it still hasn't cropped up..."

EW (01:03:48):

That's the thing with a million devices.

TH (01:03:50):

Yes.

EW (01:03:50):

If they all run for a day, you can get -

CW (01:03:56):

Truly weird things happening.

EW (01:03:56):

- truly one in a million, sort of, yeah. Bugs get weird.

TH (01:04:01):

I've talked a lot about this. A plug for an Interrupt article. It is one of my favorites...I mean, it's such a clickbait article, but I love it. "Defensive Programming - Friend or Foe?" But...what I talk about in it is more of this concept of offensive programming.

TH (01:04:24):

...Yes, when you have a million devices,...you're going to get one of every single crash that's in that firmware pretty much, or one of every single issue per day. And the goal of that offensive programming is trying to surface as many bugs as possible as quickly and as loudly as possible.

TH (01:04:51):

And what that allows you to do is fix them early, and very quickly, and ideally very easily as well. Yeah, I mean,...if you get to that point though, you need a lot of systems in place before that. You need data that the devices are sending you that allow you to track down exactly what bugs exist.

TH (01:05:17):

"And how did my devices crash? And how did my battery life drop?" What are the different metrics that pertain to battery life and kind of contribute to it? [Ooh], there's so many more ant tunnels to talk about in this topic as well.
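
The contrast between defensive and offensive programming can be shown with a toy example. The function names and the brightness scenario are invented for illustration and are not taken from the Interrupt article.

```python
class FirmwareAssert(Exception):
    """Stand-in for an assert that would record state and reset the device."""

def defensive_set_brightness(level):
    # Defensive: silently clamp bad input. The caller's bug is hidden
    # and never shows up in any crash report.
    return max(0, min(100, level))

def offensive_set_brightness(level):
    # Offensive: treat bad input as a bug and fail loudly, so it is
    # surfaced, reported, and fixed early.
    if not 0 <= level <= 100:
        raise FirmwareAssert(f"brightness out of range: {level}")
    return level
```

The offensive style only pays off when the crash reporting discussed earlier is in place; without it, a loud failure in the field is just a reboot nobody hears about.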

EW (01:05:33):

Yes. I mean, there's so much. Actually, so, I've done the role where I've monitored the devices. It's not one I'm particularly suited towards, but I've done it enough that especially as products come up and go from 100 inside a company to a couple, maybe 10,000 outside of a company. After that, I'm just not the right person.

EW (01:06:01):

I wouldn't say any firmware engineer really is, because it becomes more of a data science problem. Is there a new role, is there a new engineering title, for the person who monitors these and tries to prioritize what can happen?

TH (01:06:22):

It's called the enthusiastic firmware engineer.

EW (01:06:25):

[Ah], the intern.

CW (01:06:28):

[Ah], the under 30 set.

TH (01:06:30):

I mean, yeah, I just hit 30 this year.

CW (01:06:36):

You can turn off your enthusiasm now.

TH (01:06:38):

No, I will never. But seriously,...I mean, if we're going to be honest, that is the role that generally takes place, right? I very rarely hear about companies hiring a higher-level firmware engineer. I think that's the role -

CW (01:07:02):

Yeah.

TH (01:07:02):

- that I took at Pebble. I slowly morphed myself into higher-level firmware engineer slash Python and web app builder. I built a lot of web application tools at Pebble.

TH (01:07:14):

And at Fitbit, I kind of carved my way into this role, after nine months, that was developer productivity tools, where we built a CLI to kind of build and manage the firmware locally. And I built some web applications to parse a bunch of the data the device sent. It parsed a bunch of core dumps, parsed logs.

EW (01:07:37):

Got rid of my really bad Python script.

CW (01:07:39):

Which one was that?

TH (01:07:40):

Exactly.

EW (01:07:42):

The one that -

CW (01:07:42):

Oh.

EW (01:07:43):

- tracked the core dumps.

TH (01:07:46):

...But that role doesn't exist. It's usually the embedded engineer who spends some extra nights, or weekends, or has done it before, or yeah, who has done it for a previous company. And thankfully now there is Memfault. You integrate the SDK and you get most of this data.

TH (01:08:14):

But...you still need to understand what metrics to capture and what does it mean to have this metric be different on this release and this release. And that just happens through socializing, and talking to your community, and asking the hard questions. And you asking these questions on the podcast and hopefully people listening.

EW (01:08:36):

Well, and you are right. Because somebody who wasn't intimately familiar with the firmware couldn't look at these trends and understand where the root causes might be.

EW (01:08:50):

They could write a bug that said, "Battery life is down...in some number of units," but it would take a firmware engineer to say, "Oh, those are all iPhones. Or those are all Android phones. Or those are all units we shipped in the first month," or something.

CW (01:09:09):

Well, and it's not just that. It's somebody who has knowledge enough of the product management or the project management. I always get those confused. But to see where you are in the feature set.

CW (01:09:22):

Because maybe you turned on a new...battery-hogging feature, and now everybody's using their GPS to track something, and they weren't before. Well, then that's why you're getting 30% less battery life every day.

EW (01:09:34):

Yeah. [Woo-hoo]. Heart rate works. Oh, now my battery dies.

TH (01:09:39):

Oh, we shipped that heart rate feature, but you probably shouldn't keep it on all the time.

EW (01:09:43):

Exactly. You also do tools -

CW (01:09:50):

I think we're going to have to have him back to do the tools -

EW (01:09:52):

Tools, because I -

CW (01:09:52):

- conversation, because it's a long conversation.

EW (01:09:56):

Well, because I had a lot of questions.

CW (01:09:57):

I know. And we're already, yeah.

EW (01:09:59):

Alright.

TH (01:09:59):

How much time is it?

CW (01:10:01):

We're at an hour and 15 now.

EW (01:10:03):

Yeah.

TH (01:10:03):

Oh my gosh. Sorry.

CW (01:10:05):

No, it's great.

EW (01:10:06):

No, it's not you. This is very good.

CW (01:10:06):

But I do want to talk about tools, and we would not do it justice if we were to try to do it now.

TH (01:10:11):

I'm happy to come back. Part two. There's so much more to talk about. There's so much. Yeah.

EW (01:10:17):

And I mean, this whole device management thing is going to become a bigger problem as we go on.

CW (01:10:23):

Forever. It's always going to be a bigger problem.

EW (01:10:25):

It's going to be bigger and bigger, and I'm still going to call them distributed systems, darn it.

TH (01:10:32):

It's a good term. I just haven't heard that before when talking about embedded devices. You are actually the first one -

EW (01:10:38):

I'm so old.

CW (01:10:39):

I mean, it's not like they're working together. It's not like all the Fitbits are working together. They're all individual systems.

EW (01:10:45):

That was never what distributed systems meant.

CW (01:10:49):

It isn't?

EW (01:10:49):

It doesn't imply a mesh of any kind.

CW (01:10:53):

It doesn't?

EW (01:10:53):

Tyler, I heard Memfault is hiring. Would you like to give us more information?

TH (01:10:59):

Yes. Currently we are hiring for two roles. One is a firmware solutions engineer, and that is building up our SDK, talking to customers, and generally being an evangelist for the company. The other is a data engineer.

TH (01:11:14):

All these devices send us a bunch of data. We have to analyze it, store it and produce insights, and tell people how their devices are failing or succeeding in the field. And yeah, we're looking for a data engineer.

EW (01:11:28):

And Tyler, do you have any thoughts you'd like to leave us with?

TH (01:11:33):

Yes. It's more of a, "This is what I've learned over the last two years in COVID." But kimchi is very easy to make, and I suggest everyone try to make some kimchi at home if they like it.

EW (01:11:48):

Unexpected, but excellent. Our guest has been Tyler Hoffman, co-founder of Memfault. If you'd like to check out their blog, well, it will be in the show notes, but if you can't find that, type Interrupt and Memfault together, and you will definitely find it.

CW (01:12:04):

Thanks, Tyler.

TH (01:12:06):

Yeah. Thank you both. Have a great one.

EW (01:12:08):

Thank you to Christopher for producing and co-hosting. Thank you to our Patreon listener Slack group for questions, in particular, Phillip Johnston.

EW (01:12:17):

Which reminds me, if you've been considering supporting us on Patreon and you want to join that Slack, now is a really good time, as the book club just started some really cool new stuff. Finally, thank you for listening.

EW (01:12:31):

You can always contact us at show@embedded.fm, or hit the contact link on embedded.fm. And now a quote to leave you with. This one's from Jack Kerouac. "My fault, my failure is not in the passions I have, but in my lack of control of them."