We have a mature product that has been shipping for 15 years. There have been various revisions, but it has remained largely unchanged in terms of its functionality. Recently, a unit was purchased by a new customer. She discovered a sometimes-reproducible failure mode that had never been reported before.
This device has the ability to record and play back digitized human speech. It plays back messages in the same sequence in which they were recorded. We’ve sold thousands of these, hundreds from this particular production run, and we’ve never seen this problem before.
Being local, our customer stopped by our office with her suspect device. I could not reproduce her problem, so I assumed she was doing something wrong. It couldn’t have been the device’s fault. But I learned my lesson a long time ago -- you give the customer some respect and leeway when she reports a problem. So I exchanged her device for a new one from the shelf. Much to my frustration, a few days later she let us know she was experiencing the same issue with the new device.
I decided to go to her facility to troubleshoot the problem, still convinced she was doing something wrong. The failure was this: When playing back a message, the device would sometimes stop in the middle of the message and reset. That is, the sequence was reset to message one. Watching her record and play back messages, I noticed several things: She spoke very loudly and very close to the mic; she has a high, piercing voice; the volume setting was at or near full volume; and the room we were in had poor acoustics (an echo chamber). I feared the problem might lie within the speech engine itself. If so, there might be little I could do about it.
After 30 years of troubleshooting one’s own designs, an engineer acquires an almost sixth sense about such problems. I don’t know how, but I came to believe that the messages were resetting because the embedded controller itself was physically being reset -- as in a cold start! The only way to reset the embedded controller would be with a power spike and a corresponding voltage drop on Vcc. But why would this happen?
I wasn’t pushing the limits on any of the components nor on any one spec. It was a combination of multiple components, all operating within spec, but each at the edge of min/max limits. The speakers from this production run drew more power than previous ones. The audio amp was pushing hard, and last but not least, the LDO regulator was operating at its minimum allowed value.
Cranking the volume and shouting into the mic added to the grief. But why did her voice cause a problem when mine did not? Because her voice was at a higher frequency than mine. Higher frequencies contain more energy -- they require more energy to reproduce, resulting in greater current spikes. It was the perfect storm, with components both electronic and human. Additional bypass caps were only marginally effective. The fix was to desolder the LDO regulator and replace it with a beefier one.
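A rough way to picture this kind of stack-up is a worst-case corner check: each parameter alone can sit at its limit without trouble, but all of them together pull the rail below the controller's reset threshold. The Python sketch below is only an illustration. Every number, the simple "output voltage minus current times resistance" droop model, and the names `Corner`, `rail_at_peak`, and `RESET_THRESHOLD` are hypothetical and not taken from the actual design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corner:
    """One operating corner. All values are made-up placeholders."""
    ldo_vout: float         # regulator output voltage, volts
    peak_current: float     # peak current drawn by amp + speaker, amps
    rail_resistance: float  # effective series resistance seen by the load, ohms


# Hypothetical voltage below which the controller resets.
RESET_THRESHOLD = 2.70


def rail_at_peak(c: Corner) -> float:
    """Vcc at the instant of peak load, using a simple resistive droop model."""
    return c.ldo_vout - c.peak_current * c.rail_resistance


def survives(c: Corner, threshold: float = RESET_THRESHOLD) -> bool:
    """True if the rail stays above the reset threshold at peak load."""
    return rail_at_peak(c) > threshold


# Hypothetical corners: nominal, one parameter at its limit, then all at once.
corners = {
    "nominal":          Corner(3.30, 0.50, 0.40),
    "LDO at minimum":   Corner(3.20, 0.50, 0.40),
    "hungry speaker":   Corner(3.30, 1.00, 0.40),
    "high resistance":  Corner(3.30, 0.50, 0.55),
    "everything worst": Corner(3.20, 1.00, 0.55),
}

if __name__ == "__main__":
    for name, corner in corners.items():
        status = "ok" if survives(corner) else "RESET"
        print(f"{name:18s} Vcc at peak = {rail_at_peak(corner):.2f} V  {status}")
```

With these placeholder values, each single-parameter excursion stays above the threshold and only the combined corner dips below it, which is the pattern described above. A real analysis would substitute datasheet min/max figures and measured load transients.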
Troubleshooting a flawed design should never end with simply fixing the problem and moving on. The engineer (or engineering team) should critically evaluate why the flaw slipped through the initial design-review-debug process. Steps should be added to the design process to evaluate the next design more thoroughly. But how on earth can one catch problems with so many tentacles? Sometimes you can’t. You just hope that your follow-up support is robust and responsive.
This entry was submitted by Jonathan Eckrich and edited by Rob Spiegel.
Jonathan Eckrich has been president of Adaptivation since 1998. Much of his job experience is with designing industrial (VME-based) computer systems. He holds a Master of Computer Engineering (1985) and a Bachelor of Computer Science (1982) from Iowa State University.
Tell us your experience in solving a knotty engineering problem. Send stories to Rob Spiegel for Sherlock Ohms.