3 Commonly Overlooked Techniques for Developing Reliable Firmware

These tips could help improve the reliability of your embedded system’s firmware.

Jacob Beningo

March 5, 2024


At a Glance

  • Consider using Error Correcting Code (ECC) peripherals and watchdog solutions
  • Other potential solutions include runtime assertions and static assertions in C and C++

Reliable software consistently performs its required functions under specified conditions for a designated time. Reliable firmware often has characteristics like correctness, robustness, fault tolerance, and consistent behavior. You might think of reliable software as software you never want to have to reset to get it working again, such as the firmware in flight controllers, brake systems, and medical equipment. This post will explore three commonly overlooked techniques for developing reliable firmware.

Tip #1: Use the Built-In Error Correcting Code (ECC) Peripheral

Many microcontrollers have a peripheral that developers easily overlook: the ECC. ECC is primarily used with memory types prone to bit errors, like flash memory or SRAM. Many ECC peripherals will connect to internal and external memories, such as those connected via QSPI.

The ECC adds additional bits (parity bits) to the data that is being stored. As you can imagine, this means that to use ECC, you will need extra memory. If you have an ECC peripheral, you typically don’t have to worry about CPU overhead because the bits are calculated using hardware-accelerated algorithms that require little to no CPU intervention. When the data is read back, the ECC logic computes the parity bits again and compares them to the stored parity. If there is a mismatch, it means there are errors in the data.

There are two types of errors that the ECC can detect: single-bit errors and multi-bit errors. When a single-bit error is detected, the ECC logic automatically corrects the data before the CPU or other peripherals use it. Multi-bit errors can be detected, but they cannot be corrected. With the single-error-correct, double-error-detect (SECDED) codes commonly used for this purpose, detection is only guaranteed for two-bit errors. In such cases, the ECC logic can signal an error condition to the system, which can then take appropriate actions, like flagging an error or initiating a system reset.
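To make the correct-one, detect-two behavior concrete, here is a minimal software sketch of a SECDED code on a 4-bit value. It is an illustration only: a real ECC peripheral does this in hardware on much wider words, and the names used here (ecc_encode, ecc_decode, EccStatus, EccResult) are hypothetical, not part of any vendor API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types for this sketch; not taken from any vendor library.
enum class EccStatus { Ok, Corrected, Uncorrectable };

struct EccResult {
    EccStatus status;
    std::uint8_t data;  // recovered 4-bit value; not meaningful if Uncorrectable
};

// XOR of all bits of an 8-bit word (1 if the number of set bits is odd).
unsigned parity8(unsigned w) {
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;
}

// Encode a 4-bit value into an 8-bit codeword.
// Bit positions: 0 = overall parity, 1 = p1, 2 = p2, 3 = d1,
//                4 = p4, 5 = d2, 6 = d3, 7 = d4.
std::uint8_t ecc_encode(std::uint8_t nibble) {
    assert(nibble < 16u);
    const unsigned d1 = nibble & 1u;
    const unsigned d2 = (nibble >> 1) & 1u;
    const unsigned d3 = (nibble >> 2) & 1u;
    const unsigned d4 = (nibble >> 3) & 1u;
    const unsigned p1 = d1 ^ d2 ^ d4;  // covers positions 1, 3, 5, 7
    const unsigned p2 = d1 ^ d3 ^ d4;  // covers positions 2, 3, 6, 7
    const unsigned p4 = d2 ^ d3 ^ d4;  // covers positions 4, 5, 6, 7
    unsigned word = (p1 << 1) | (p2 << 2) | (d1 << 3) | (p4 << 4) |
                    (d2 << 5) | (d3 << 6) | (d4 << 7);
    word |= parity8(word);  // overall parity bit makes the total bit count even
    return static_cast<std::uint8_t>(word);
}

// Recompute the parity bits on read-back and compare with the stored ones.
EccResult ecc_decode(std::uint8_t codeword) {
    unsigned word = codeword;
    auto bit = [&word](unsigned pos) { return (word >> pos) & 1u; };
    // A non-zero syndrome is the position of a single flipped bit.
    const unsigned syndrome =
        (bit(1) ^ bit(3) ^ bit(5) ^ bit(7)) |
        ((bit(2) ^ bit(3) ^ bit(6) ^ bit(7)) << 1) |
        ((bit(4) ^ bit(5) ^ bit(6) ^ bit(7)) << 2);
    const unsigned overall = parity8(word);

    EccStatus status = EccStatus::Ok;
    if (overall == 1u) {
        // Odd number of flips: assume one and repair it.
        // A syndrome of 0 means the overall parity bit itself flipped.
        word ^= (1u << syndrome);
        status = EccStatus::Corrected;
    } else if (syndrome != 0u) {
        // Even number of flips with a parity mismatch: detectable, not fixable.
        status = EccStatus::Uncorrectable;
    }
    const unsigned data = bit(3) | (bit(5) << 1) | (bit(6) << 2) | (bit(7) << 3);
    return {status, static_cast<std::uint8_t>(data)};
}
```

Flipping any one bit of a codeword returned by ecc_encode yields Corrected with the original value, while flipping any two bits yields Uncorrectable. That second case is the one where a real system would flag an error or reset.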

Developers often overlook ECC because devices deployed terrestrially have a lower probability of experiencing bit flips and single-event upsets than devices in orbit or at higher altitudes. It’s easy to assume that the system won’t experience these behaviors when you don’t see issues on a lab bench. The ECC also doesn’t appear in most configuration software because it is usually enabled by programming a single bit in the microcontroller configuration registers. Settings are often left to their defaults. 

In applications where reliability is critical, such as in medical, automotive, or space systems, using ECC can be a part of the overall strategy to achieve the required reliability levels.

Tip #2: Design a Robust Watchdog Solution

How is a watchdog a commonly overlooked technique for reliable firmware? Well, it’s widely overlooked because it’s often the last thing teams implement. Last-minute implementations are often poorly thought through and may not meet the system's needs. Teams often turn on their internal watchdog timer, create a task that kicks it periodically, and say they have a watchdog solution. That’s not the case.

A reliable watchdog solution needs to integrate closely with the software being developed. It requires an overarching strategy to track memory, task execution, application code, and drivers. The watchdog in a reliable system carefully guards the system and looks for issues. If an issue is found, it could restart the system or the thread or process that is having an issue. There are countless ways to implement watchdogs in a system based on the desired level of reliability.
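One way to tie the watchdog to task execution is to have every monitored task check in, and to kick the hardware watchdog only when all of them have done so. The sketch below is a simplified illustration of that idea. WatchdogSupervisor, kTaskCount, and the kick callback are hypothetical names, and a production design would also track timing windows, memory, and driver health.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical number of tasks being monitored in this sketch.
constexpr unsigned kTaskCount = 3;

class WatchdogSupervisor {
public:
    // Each monitored task calls this once per supervision period.
    void check_in(unsigned task_id) {
        assert(task_id < kTaskCount);
        checked_in_.fetch_or(1u << task_id);
    }

    // Called once per supervision period by the supervisor task.
    // Kicks the hardware watchdog only if every task has checked in,
    // then clears the flags for the next period.
    // Returns true if the watchdog was kicked.
    template <typename KickFn>
    bool service(KickFn kick_hardware_watchdog) {
        const std::uint32_t all_tasks = (1u << kTaskCount) - 1u;
        const std::uint32_t seen = checked_in_.exchange(0u);
        if (seen != all_tasks) {
            // A task is stuck or starved: withhold the kick so the
            // hardware watchdog expires and forces a recovery.
            return false;
        }
        kick_hardware_watchdog();
        return true;
    }

private:
    std::atomic<std::uint32_t> checked_in_{0u};
};
```

With this structure, a single hung task is enough to stop the kicks, which a lone "kick it periodically" task would never notice. On a real target, the callback passed to service would write to the watchdog peripheral's reload register.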

In some systems, a watchdog may be multi-tiered. It may utilize an internal watchdog timer, a thread that monitors software behavior, and an external watchdog that watches over the entire system. The reliability can be high in these systems, but so can the watchdog complexity. The ultimate goal is to design a watchdog that detects problems and recovers the system safely and reliably.

For example, in some satellite systems I’ve worked on, we have an internal watchdog that monitors the software within a hundred milliseconds. We might have a thread that then monitors software behavior over seconds, an external watchdog that monitors the system on a time scale of minutes to hours, and, finally, a last-ditch emergency watchdog that monitors the entire system weekly. (You’d be surprised how often these save the day!)

At the end of the day, if you want your system to behave reliably, you need to be able to detect when that system is starting to go on the fritz. Defining a robust watchdog strategy is a crucial ingredient. 

Tip #3: Use Assertions

In C and C++, assertions check conditions or assumptions made within a program and halt the program's execution if these conditions are not met. Assertions help with debugging and ensuring that your code behaves as expected. There are two types of assertions in C and C++: runtime assertions and static assertions.

Runtime assertions are implemented using the assert macro provided by the cassert (in C++) or assert.h (in C) header. The assert macro takes an expression as an argument. If the expression evaluates to false (i.e., the condition is not met), it triggers an assertion failure. The standard behavior is to print details about the failed condition and call abort(), which on many embedded targets ends in an infinite loop. Note that defining NDEBUG compiles these checks out, so they typically run only in debug builds.

Here's a simple example in C++:

#include <cassert>

int main()
{
    int x = 10;
    assert(x == 5);  // This will trigger an assertion failure since x is not equal to 5.
    return 0;
}
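In firmware, teams often route assertion failures to their own handler so the failure can be recorded before the system halts or resets. The following is a minimal sketch of that pattern. FW_ASSERT, fw_assert_failed, and AssertRecord are hypothetical names, and the handler here only records and returns so the example can run on a host machine.

```cpp
#include <cstdio>

// Hypothetical record of the most recent assertion failure.
struct AssertRecord {
    const char* file;
    int line;
    const char* expr;
};

static AssertRecord g_last_assert = {nullptr, 0, nullptr};
static int g_assert_count = 0;

// Hypothetical failure handler. On a real target this would typically
// store the record in memory that survives a reset, put outputs into a
// safe state, and then halt or reset instead of returning.
void fw_assert_failed(const char* expr, const char* file, int line) {
    g_last_assert = {file, line, expr};
    ++g_assert_count;
    std::fprintf(stderr, "ASSERT FAILED: %s (%s:%d)\n", expr, file, line);
}

// Evaluates cond once; calls the handler with the source location on failure.
#define FW_ASSERT(cond) \
    ((cond) ? (void)0 : fw_assert_failed(#cond, __FILE__, __LINE__))

// Example use: guard a function's input assumption.
int scale_percent(int value, int percent) {
    FW_ASSERT(percent >= 0 && percent <= 100);
    return value * percent / 100;
}
```

Calling scale_percent(200, 150) violates the precondition, so the handler records the expression, file, and line. Whether to compile such checks out of release builds is a design decision. Keeping them in costs code space and cycles, but catches failures in the field.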

Static assertions are used to perform compile-time checks on conditions or expressions. They are evaluated by the compiler during compilation and generate an error message if the condition is not met. In C++, static assertions are implemented using the static_assert keyword, introduced in C++11. C11 added the equivalent _Static_assert keyword, with static_assert available as a macro through assert.h.

static_assert takes a constant expression and a message as arguments; the message became optional in C++17 and C23. A compilation error is generated with the specified message if the constant expression is false.

Here's an example in C++:

#include <iostream>
#include <type_traits>

template <typename T>
void print_size()
{
    static_assert(std::is_integral<T>::value, "T must be an integral type");
    std::cout << sizeof(T) << " bytes" << std::endl;
}

int main()
{
    print_size<int>();     // This will compile successfully.
    print_size<double>();  // This will generate a compilation error.
    return 0;
}
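In firmware, static assertions are often used to lock down assumptions about data layout and configuration constants. The sketch below shows that use. TelemetryPacket, kRxBufferSize, and wrap_index are hypothetical names invented for illustration, and the size check assumes a typical platform where uint16_t and uint32_t have natural alignment.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical packet that is copied byte-for-byte onto a communication link.
struct TelemetryPacket {
    std::uint16_t sequence;
    std::uint16_t flags;
    std::uint32_t timestamp_ms;
};

// Fail the build if the compiler lays the struct out differently than the
// protocol expects (for example, because of unexpected padding).
static_assert(sizeof(TelemetryPacket) == 8,
              "TelemetryPacket must be exactly 8 bytes");
static_assert(offsetof(TelemetryPacket, timestamp_ms) == 4,
              "timestamp_ms must start at byte offset 4");
static_assert(std::is_trivially_copyable<TelemetryPacket>::value,
              "TelemetryPacket must be safe to copy with memcpy");

// Hypothetical ring-buffer size; the index math below relies on it being
// a power of two, so enforce that at compile time.
constexpr std::size_t kRxBufferSize = 64;
static_assert(kRxBufferSize != 0 && (kRxBufferSize & (kRxBufferSize - 1)) == 0,
              "kRxBufferSize must be a power of two");

// Wrap an index into the ring buffer using a mask instead of a modulo.
constexpr std::size_t wrap_index(std::size_t index) {
    return index & (kRxBufferSize - 1);
}
```

If someone later adds a field to the packet or changes the buffer size to 100, the build fails immediately with the message, rather than the device misbehaving in the field.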

Static assertions are particularly useful for catching potential issues at compile-time, while runtime assertions are used for runtime debugging and validation. I often notice that developers don’t make as much use of these features as they could. If you want to write more reliable software, using these assertions can help you find potential issues that otherwise might go unnoticed. 

Conclusions

Writing reliable firmware is an important skill set for many embedded systems professionals. Embedded systems are often deployed into the field and expected to work correctly for weeks or months without being reset. The techniques discussed in today’s post are essential for developing reliable software. They are often overlooked precisely because they seem obvious and straightforward. Yet, a lack of discipline often means they either go unimplemented or are implemented with a poor strategy. If you aren’t leveraging these techniques, I highly recommend that you do. You’ll find that they are low-hanging fruit that can improve your system’s reliability when used properly.

About the Author(s)

Jacob Beningo

Jacob Beningo is an embedded software consultant who currently works with clients in more than a dozen countries to dramatically transform their businesses by improving product quality, cost and time to market. He has published more than 300 articles on embedded software development techniques, has published several books, is a sought-after speaker and technical trainer and holds three degrees which include a Masters of Engineering from the University of Michigan.
