Mention machine safety and most people think of shatterproof
goggles and other protective gear. They might also think about physical safeguards
and emergency switches on machines, but are unlikely to consider the safety
aspects of the software that controls those machines.
When engineers think about the safety of software for
industrial devices, automotive electronics, medical equipment and consumer
devices, they should understand that their application code gets
"blended" with the real-time operating system (RTOS) to create a
single executable file. That combination occurs when any real-time operating
system, not just ThreadX, becomes part of an embedded system. (Some RTOSs also
support separate, dynamically loaded programs that get brought into memory on
demand.)
In an embedded device, the RTOS might occupy between 10 and
a few hundred kilobytes of code. Application code, on the other hand, might
consume over a megabyte. So just employing a rock-solid operating system isn't
sufficient to guarantee safe operation of the software within a device. To
properly address software safety, developers need a set of tools that permits
them to investigate, debug, and test the entire system, which includes the
RTOS.
In an RTOS, Experience Counts
Engineers have used operating systems such as ThreadX in
hundreds of thousands of varied products, so these RTOSs have encountered nearly
every imaginable circumstance: the kinds of things you can predict and test for,
as well as those you can't. Thus designers should choose a fully tested and
field-proven operating system that has seen a lot of applications in the real
world. This is the best way to avoid those unanticipated surprises that can
lead to system failure at just the wrong time.
Developers often ask, "What happens inside an RTOS to
make it rock solid?" One example of the internal checks that a good RTOS
performs is consistency checking of function parameters. When application code
calls a ThreadX RTOS service through the application programming interface, or
API, we check the parameters sent to the RTOS to ensure consistency with the
rest of the software. This step detects when programmers have referenced a
nonexistent thread or specified a CPU time longer than that available. That
type of API checking helps programmers catch errors before they get too far
into their application code.
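To make the idea concrete, here is a minimal sketch of this kind of parameter checking. It is an illustration only: the names (k_thread, k_status, k_thread_create, k_thread_time_slice_change, K_MAX_TIME_SLICE) are hypothetical and are not the ThreadX API. The sketch assumes the kernel marks each created thread with an ID value, so a handle that was never created can be rejected, and it assumes an arbitrary upper bound on the time slice.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; not the ThreadX API. */
typedef enum {
    KS_SUCCESS = 0,
    KS_THREAD_ERROR,  /* handle does not refer to a created thread */
    KS_TICKS_ERROR    /* requested time slice is out of range */
} k_status;

#define K_THREAD_MAGIC   0x54485244u  /* ID value written when a thread is created */
#define K_MAX_TIME_SLICE 1000u        /* assumed upper bound, in timer ticks */

typedef struct {
    uint32_t magic;       /* equals K_THREAD_MAGIC only after creation */
    uint32_t time_slice;  /* CPU time granted per turn, in timer ticks */
} k_thread;

/* Create a thread control block after validating the caller's arguments. */
k_status k_thread_create(k_thread *t, uint32_t time_slice)
{
    if (t == NULL)
        return KS_THREAD_ERROR;
    if (time_slice == 0 || time_slice > K_MAX_TIME_SLICE)
        return KS_TICKS_ERROR;
    t->magic = K_THREAD_MAGIC;
    t->time_slice = time_slice;
    return KS_SUCCESS;
}

/* Checked service: rejects a nonexistent thread or an impossible time slice
   before any kernel state is modified. */
k_status k_thread_time_slice_change(k_thread *t, uint32_t new_slice)
{
    if (t == NULL || t->magic != K_THREAD_MAGIC)
        return KS_THREAD_ERROR;
    if (new_slice == 0 || new_slice > K_MAX_TIME_SLICE)
        return KS_TICKS_ERROR;
    t->time_slice = new_slice;
    return KS_SUCCESS;
}
```

An application that checks the returned status after every call learns about a bad handle or an out-of-range value at the call site, rather than much later when the symptom appears somewhere else.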
Developers have the flexibility to configure the RTOS to
perform many such internal checks, and to produce a lot of information that's
helpful in checking and debugging. Just before their code goes into
"production," they might remove the debug code or they might decide
to leave it in. Although the debug code in the RTOS requires some memory space,
many developers leave it in their final code to facilitate field debugging or
testing in the event of a problem with an application.
Don't Go with the Overflow
Developers also must guard against stack overflows that can
corrupt memory and cause problems when the processor later uses the data at
those corrupted memory addresses. Then, when the problem surfaces, its symptoms
might look nothing like a stack overflow. Our StackX tool mathematically
calculates the stack use in the executable code, prior to execution, and
provides a large benefit in safety-related applications. Calculating a stack
size works better than making a rough stack estimate, adding a bit more memory
space to it, testing the application, and seeing if it crashes. StackX can
analyze any executable in a form such as the executable and link format (ELF),
an industry standard.
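The arithmetic behind a calculated stack size can be sketched as follows. This is not how StackX works internally, which the article does not describe; it is a simplified illustration that assumes the frame size of every function and the call graph are already known, and that the code contains no recursion or calls through function pointers. The function names and frame sizes in demo_graph are made up.

```c
#include <stddef.h>

#define MAX_CALLEES 4

/* One node per function: its own frame size and the functions it calls. */
typedef struct {
    const char *name;
    unsigned frame_bytes;
    int callees[MAX_CALLEES];  /* indices into the graph; -1 ends the list */
} func_node;

/* Worst-case stack bytes needed once execution enters graph[idx]:
   the function's own frame plus the deepest chain among its callees.
   Assumes an acyclic call graph. */
unsigned worst_case_stack(const func_node *graph, int idx)
{
    unsigned deepest = 0;
    for (int i = 0; i < MAX_CALLEES && graph[idx].callees[i] >= 0; i++) {
        unsigned d = worst_case_stack(graph, graph[idx].callees[i]);
        if (d > deepest)
            deepest = d;
    }
    return graph[idx].frame_bytes + deepest;
}

/* Hypothetical call graph for one thread's entry function. */
static const func_node demo_graph[] = {
    { "thread_entry",  64, {  1,  2, -1, -1 } },
    { "read_sensor",   32, {  3, -1, -1, -1 } },
    { "log_message",  128, { -1, -1, -1, -1 } },
    { "spi_transfer",  48, { -1, -1, -1, -1 } },
};
```

For this graph, worst_case_stack(demo_graph, 0) follows the deeper of the two paths, thread_entry to log_message, and returns 64 + 128 = 192 bytes. A real tool also has to account for interrupt frames and any stack the RTOS itself uses on the thread's behalf.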
Many operating systems will detect a stack overflow and
generate an error warning, but those steps take memory and processor time, and
they occur after the damage has been done. Without StackX, the developer must
choose between turning on overflow detection, which degrades performance
a bit, and trusting that they've accounted for every demand on the stack and
that the specified stack size is safe.
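One common way to implement run-time detection is to fill each stack with a known pattern and later check how much of the pattern survives. The sketch below shows the idea under stated assumptions: the stack is modeled as an ordinary array, it grows downward so the lowest addresses are used last, and the fill value and function names are arbitrary choices for illustration, not taken from any particular RTOS.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xEFEFEFEFu  /* arbitrary pattern chosen for this sketch */

/* Fill the whole stack area with the pattern before the thread first runs. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++)
        stack[i] = STACK_FILL;
}

/* Count the words at the low end that still hold the pattern.
   Assumes a descending stack, so these words have never been used. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL)
        n++;
    return n;
}

/* Nonzero when the lowest word has been overwritten, which means the
   thread used its entire stack and has probably run past the end. */
int stack_overflow_suspected(const uint32_t *stack, size_t words)
{
    return words == 0 || stack[0] != STACK_FILL;
}
```

The check costs a scan or at least a compare each time it runs, and it reports the overflow only after memory below the stack may already have been overwritten. That is exactly the tradeoff described above.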
Dump and Analyze Trace Data
When developers test and debug their application code, they
can use a tool such as TraceX that acts like a logic analyzer for the software.
TraceX displays a horizontal time axis while the vertical axis shows all the
system threads or tasks and indicates what they're doing at each time division.
So, you can examine the events that lead up to a malfunction as well as what
happens after it. When a system crashes or at any time during execution, you
can dump the target system's "trace buffer" memory to the host and analyze it.
You might find that Thread 3 never gave up the CPU, so Thread 1 never reset a
watchdog timer and that caused the system to fail. You see a graphical flow of
program execution, all application thread activity, the operating-system
services used, measurements of the percentage of CPU time each thread used, and
so on. But TraceX doesn't look inside the operating system. It simply tells you
when your application called on the operating system and what it asked the RTOS
to do. Those events include message passing, synchronization, context switches,
preemptions, suspensions, terminations and system interrupts.
A typical hardware logic analyzer has various trigger
capabilities, and you can use similar "trigger" conditions with TraceX
to trace only events of a certain type or events for a specified thread. Thus, if
you know what to look for, you can eliminate a lot of extraneous information.
In addition, you can log user events that don't relate to
operating system services but occur at critical points in your code. When the
CPU reaches such a point, it puts a log entry into the trace buffer. Then, when
analyzing the uploaded trace buffer on the host with TraceX, you can search for
user events along the time scale, and TraceX will take you right to "Event
5," for example.
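The trace-buffer mechanism described in this section can be sketched as a small circular log on the target. This is an illustration only, not the TraceX buffer format or the ThreadX trace API: the types, field layout, event names, and the capacity of eight entries are all assumptions made for the example. The buffer keeps the most recent events, lets the host walk them from oldest to newest, and supports the two lookups mentioned above: events for one thread and a numbered user event.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_CAPACITY 8u  /* assumed size; a real buffer would be far larger */

typedef enum { EV_CONTEXT_SWITCH, EV_QUEUE_SEND, EV_SEMAPHORE_GET, EV_USER } ev_type;

typedef struct {
    uint32_t timestamp;
    uint8_t  thread_id;
    ev_type  type;
    uint32_t info;       /* for EV_USER: the application's event number */
} trace_event;

typedef struct {
    trace_event events[TRACE_CAPACITY];
    size_t next;         /* slot the next event will be written to */
    size_t count;        /* valid events, at most TRACE_CAPACITY */
} trace_buffer;

/* Record one event, overwriting the oldest entry once the buffer is full. */
void trace_log(trace_buffer *tb, uint32_t ts, uint8_t thread, ev_type type, uint32_t info)
{
    trace_event *e = &tb->events[tb->next];
    e->timestamp = ts;
    e->thread_id = thread;
    e->type = type;
    e->info = info;
    tb->next = (tb->next + 1) % TRACE_CAPACITY;
    if (tb->count < TRACE_CAPACITY)
        tb->count++;
}

/* Return the i-th oldest event (0 is the oldest), or NULL if out of range. */
const trace_event *trace_get(const trace_buffer *tb, size_t i)
{
    if (i >= tb->count)
        return NULL;
    size_t oldest = (tb->next + TRACE_CAPACITY - tb->count) % TRACE_CAPACITY;
    return &tb->events[(oldest + i) % TRACE_CAPACITY];
}

/* Find the oldest retained user event with the given number, or NULL. */
const trace_event *trace_find_user_event(const trace_buffer *tb, uint32_t number)
{
    for (size_t i = 0; i < tb->count; i++) {
        const trace_event *e = trace_get(tb, i);
        if (e->type == EV_USER && e->info == number)
            return e;
    }
    return NULL;
}

/* Count the retained events that belong to one thread. */
size_t trace_count_for_thread(const trace_buffer *tb, uint8_t thread)
{
    size_t n = 0;
    for (size_t i = 0; i < tb->count; i++)
        if (trace_get(tb, i)->thread_id == thread)
            n++;
    return n;
}
```

Application code would call something like trace_log(&tb, now, my_thread, EV_USER, 5) at a critical point. After the buffer is dumped to the host, trace_find_user_event(&tb, 5) locates that entry on the time scale.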
Many of the types of tools and operations mentioned here are
not unique to Express Logic. They represent a collection of software tools
developers should know about when they test application code to ensure its
safety. If similar tools don't come with the RTOS they use, they should try to
find a way to implement them on their own.
Use Industry Standards and Guides
Businesses that produce software for medical, military or
aerospace devices must comply with standards that specify measures of safety
and how to validate the safety of system and application code. The US Food and
Drug Administration, for example, publishes the document "General
Principles of Software Validation; Final Guidance for Industry and FDA
Staff" (January 11, 2002). Because the RTOS forms such a small part of the final
application code, companies that produce the end product, not the RTOS vendors,
generally perform the validation tests, using the full RTOS source code provided
by the vendor.
The commercial-avionics world relies on the RTCA/DO-178B
document, "Software Considerations in Airborne Systems and Equipment
Certification." (RTCA stands for the Radio Technical Commission for
Aeronautics.) DO-178B defines five software levels, from A, the most critical,
through E, which has no effect on safety. Again, certification and validation
apply not just to the operating system or the application code, but to the final
product that contains both.
John A. Carbone, vice president of marketing for Express Logic, has 35 years of experience in real-time computer systems and software, including work as an embedded-system developer and field application engineer. Prior to joining Express Logic, Mr. Carbone was vice president of marketing for Green Hills Software. He has a B.S. degree in mathematics from Boston College.
