A computer and digital signal processor for a military airborne radar system I worked on was based on mid-1970s technology. That meant multilayer printed circuit boards and through-pin DIPs, usually in 16-pin packages. In order to load, trace, and test the computer’s software, you had to connect a test console with a paper-tape reader to a test port on the outside of the computer box. The console port connected directly to the address and data pins of the CPU, so the cable was kept very short, less than eight inches, to prevent inductive/capacitive loading from degrading the signals.
In the early 1990s, our customer started to experience spurious computer resets. They were rare and would only happen on a test bench when the console was connected. The first assumption was a bad cable or circuit in the console, but swapping those produced no result. Eventually, they sent a computer that was intermittently failing back to the factory for further analysis. I was the lead test and integration engineer, but this problem strained my nascent troubleshooting skills.
I put a signal analyzer anywhere I thought we might find the problem, but that yielded no results. The signals were clean as a whistle. I rarely saw the reset condition, and it was nearly impossible to catch one on a logic analyzer.
Rare, intermittent problems like this are difficult to investigate under the best conditions. In a meeting to discuss the situation with some of our engineering managers, one of them asked me if I remembered a similar problem from the time when the system was in development. “When was that?” I asked. He replied, “Oh, around 1972 or 73.” “No,” I told him. “I was in fifth or sixth grade then.” I never quite caught the stream of epithets he muttered under his breath, but it included something about smart-aleck young engineers.
Since the problem only occurred when the console cable was connected, we decided to build a longer cable to see if that would make the problem happen more consistently. We had the technicians build a three-foot-long cable extender, and, bingo, with it connected, we had a nearly continuous string of resets. We quickly traced the problem to one of the lines that could trigger the reset pin on the CPU. The line was buffered and didn’t connect to the console, which confounded us some more.
We got the layouts for the printed circuit board and looked at the reset signal paths and the paths of nearby circuits. We found one of the CPU address lines (which WAS connected to the console interface) that ran across the board and made a U-turn right around a pin connected to the reset circuit. When the console was attached, the address line could sometimes trigger a reset. To test it, we isolated that trace on the address signal and reconnected the circuit with short point-to-point wiring soldered in place. Voilà! No more resets, even with our long extension cable attached.
We also noticed the board was a recent revision, and this layout was unique to the revision. It was typical to re-layout boards as parts became unavailable and newer substitutes were inserted into the design. But board-level and assembly-level testing don’t generally use the test console, so we didn’t catch this new problem.
This entry was submitted by John Shepley and edited by Rob Spiegel.
Tell us your experience in solving a knotty engineering problem. Send to Rob Spiegel for Sherlock Ohms.
Ohm's Law on reset signals usually has something to do with signal impedance and stray ingress capacitance creating a lower impedance and thus overriding the pull-up current. It's often a good measure to buffer the long lines with a small cap to suppress the much smaller stray capacitance. I used to get stray resets when my lab of many computers was located in a carpeted, unused office space. There the ingress signal was many kV of ESD from a stray engineer opening the metal cage door to the lab. Zap! The fix was a cheap can of anti-static carpet spray every week or two, and making sure the cage was not grounded, so that the impedance could be high and thus reduce the dv/dt when high static charges were created. Slowing down the discharge rate with a 1 Mohm bleed resistor helps reduce the induced voltage more than grounding the conducting metal cage.
Both solutions worked independently and even better together. It's just a simple application of Ohm's law to leakage and the static discharge time constant.
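The time-constant argument in the comment above can be put in rough numbers. This is only a minimal sketch of an ideal RC discharge: the 10 kV charge and the 200 pF cage capacitance are assumed values chosen for illustration, not figures from the comment, and the 1 ohm "hard ground" path is likewise hypothetical. Only the 1 Mohm bleed resistor comes from the text.

```python
def discharge(voltage_v, resistance_ohm, capacitance_f):
    """Return (peak current in A, time constant in s) for an ideal RC discharge."""
    peak_current_a = voltage_v / resistance_ohm  # Ohm's law at the instant of contact
    tau_s = resistance_ohm * capacitance_f       # RC time constant of the discharge
    return peak_current_a, tau_s


V_STATIC = 10_000.0  # assumed static charge, volts (illustrative only)
C_CAGE = 200e-12     # assumed cage-to-ground capacitance, farads (illustrative only)

# Compare a hypothetical low-resistance ground path with the 1 Mohm bleed resistor.
for label, r_ohm in (("hard ground (assumed 1 ohm)", 1.0), ("1 Mohm bleed", 1e6)):
    i_peak, tau = discharge(V_STATIC, r_ohm, C_CAGE)
    print(f"{label}: peak current {i_peak:.3g} A, time constant {tau:.3g} s")
```

Under those assumptions the bleed resistor limits the peak current to about 10 mA and stretches the discharge to a time constant of about 200 microseconds, while the idealized hard-ground path dumps the same charge in a fraction of a nanosecond. A real discharge would be limited by inductance and arc resistance, so the hard-ground figures are only an upper bound on how abrupt the event could be.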
This underlines the need for automated test development and regression at all levels, from formal design verification to post-silicon or board diagnostics, each aimed at various aspects of fault detection, from design through manufacturing processes. These days even small changes can and do have very complicated and likely poorly understood consequences.
Many companies, especially smaller ones, still have little or no program to encompass this ever-growing need. Even in ones that do, the value of the diagnostic developer is often poorly understood or underappreciated. Only when bugs are found are you valued. When no bugs are found, you are viewed as overhead.
As someone who has been in this role at large companies, from pre-silicon all the way down to board manufacturing, for over a dozen years, I can attest to the above.
Little by little, the need for such comprehensive strategies is gaining traction, and is most prevalent in the chip industry. As illustrated by the article, the manufacturer often gets away without performing any regressions for years, and then a problem pops up out of nowhere. A problem such as this is more expensive than ever. And no cost is higher than the potential loss of confidence by the customer, or the impact on normal schedules (is there any such thing?) of the kind of interrupt that a situation such as this generates when it rears its ugly head.
These issues are always tough to troubleshoot without a good investment of time. Kudos on finding it!
I know that these issues can still occur, even with stringent DRC and multiple "Live Eye" checks, and it makes testing new devices troublesome at best for us. Is it the board, or the device? Also, a lot of "green" layout designers aren't always up to snuff with doing the "Live Eye" check as they lay out a design, and instead rely only on the DRC that the layout software provides. It is an area that I wish were focused on more in training.