In the early 1990s, our firm was hired to design and install several piping-fatigue monitoring systems on large-diameter gas and oil gathering piping in an arctic environment. The data acquisition system had to continuously acquire signals from 40 to 45 strain gauges at up to 500 samples per second, determine when a significant event occurred, then rain-flow the events to estimate fatigue damage. An indicator was generated if fatigue usage rates exceeded a 25-year time-to-failure.
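The 25-year criterion can be read as simple arithmetic. What follows is only a rough sketch, assuming the accumulated damage is a Miner's-rule style fraction (1.0 at failure) and that the usage rate is extrapolated linearly; the function names and the hours-per-year constant are illustrative and do not come from the original system.

```python
HOURS_PER_YEAR = 8766.0      # assumed: 365.25 days x 24 hours
DESIGN_LIFE_YEARS = 25.0     # threshold quoted in the story


def projected_life_years(damage_fraction: float, elapsed_hours: float) -> float:
    """Linearly extrapolate time-to-failure from damage accumulated so far.

    damage_fraction is assumed to be a cumulative usage fraction in which
    1.0 corresponds to failure (hypothetical convention for this sketch).
    """
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    if damage_fraction <= 0:
        return float("inf")  # no measurable damage yet
    usage_per_year = damage_fraction / (elapsed_hours / HOURS_PER_YEAR)
    return 1.0 / usage_per_year


def usage_alarm(damage_fraction: float, elapsed_hours: float) -> bool:
    """True when the projected life is shorter than the 25-year target."""
    return projected_life_years(damage_fraction, elapsed_hours) < DESIGN_LIFE_YEARS
```

For example, 5% usage in one year projects to a 20-year life and would raise the indicator, while 1% in one year projects to 100 years and would not.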
The computational hardware available at the time was an 80386-based PC (12 MHz!) running MS-DOS and various tutti-frutti memory and disk optimizers. To get the required channel count at a reasonable cost, an ISA card 12-bit A/D was used with an external 64-channel multiplexor board. This board had a cascade of 4-1 multiplexor chips that, when properly addressed, would select a channel, feed it to the A/D board, request a conversion, then switch to the next channel until all channels were scanned. A short ribbon cable was used to supply the necessary timing and control signals from the A/D board in the PC to the multiplexor board. Since we only needed data at a few hundred hertz, the 20 MHz scan rate of the multiplexor provided essentially a simultaneous scan.
Strain gauges were attached to piping, wiring was run, amplifiers were installed and calibrated, and the computer hardware was installed in various control rooms. We had the systems operating by late November, but by January, one of the monitors began to throw unrealistic fatigue estimates. A review of the data streams showed that at seemingly random intervals, several of the channels would exchange places. The vagaries of the fatigue calculation resulted in ridiculously high numbers for some locations, and near zero at other locations.
Our initial investigations focused on the various bits of software that unwrapped the multiplexed channels. To break the 640KB memory "barrier" in DOS (we needed a whopping 1.5MB of code and data space), there were several layers of memory extenders and virtual memory hacks whose memory addressing schemes were notorious for being one of the usual suspects. Not this time.
The next most likely culprit was the multiplexor board. This suspicion was intensified when a further data review indicated that the channels were rotating in groups of four channels -- which was the number of inputs for each multiplexor chip. A logic probe finally showed that occasionally, a spurious control signal would occur that caused the multiplexors to advance without the A/D board's knowledge.
But what was causing the glitch, and why on just this machine? The board vendor claimed it had never heard of such an issue, so that was no help. We started thinking that something in the hardware was bad, so we exchanged monitoring hardware, but the problem remained at the location, not with the hardware. We suspected there might be some local electrical noise, so we tried various shielding arrangements within the PC and the ribbon cable but were unsuccessful at finding the problem. Alternative grounding schemes for data and power supply seemed to help at first, but the problem ultimately returned.
In desperation, we decided on a workaround. We tied two spare adjacent channels in one group of four to +5 V and to ground. We wrote a subroutine (we called it MuxFux) so that for every scan, the voltage of these channels could be checked to ensure that the multiplexor hadn't slipped a cog. If we discovered an error, that scan was discarded and the board was reset.
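The idea behind the check can be sketched in a few lines. This is not the original MuxFux routine (which ran under DOS on the 386); it is a minimal illustration that assumes a 12-bit unipolar converter with a 0 to +5 V input range, and the channel indices, tolerance, and `reset_board` callback are all hypothetical.

```python
from typing import Callable, Iterable, List, Sequence

ADC_FULL_SCALE = 4095   # assumed: 12-bit unipolar A/D, 0..+5 V input range
HIGH_SENTINEL = 62      # hypothetical channel index tied to +5 V
LOW_SENTINEL = 63       # hypothetical adjacent channel tied to ground
TOLERANCE = 100         # assumed allowable deviation, in A/D counts


def scan_is_aligned(scan: Sequence[int]) -> bool:
    """Return True if both reference channels read their expected rails."""
    high_ok = scan[HIGH_SENTINEL] >= ADC_FULL_SCALE - TOLERANCE
    low_ok = scan[LOW_SENTINEL] <= TOLERANCE
    return high_ok and low_ok


def filter_scans(scans: Iterable[Sequence[int]],
                 reset_board: Callable[[], None]) -> List[List[int]]:
    """Keep aligned scans; discard any slipped scan and request a reset."""
    good: List[List[int]] = []
    for scan in scans:
        if scan_is_aligned(scan):
            good.append(list(scan))
        else:
            reset_board()   # hypothetical hook that re-addresses the multiplexor
    return good
```

Because the two references sit next to each other at opposite rails, any rotation within their group of four moves at least one of them off its expected position, provided the strain channels in that group never sit at either rail themselves.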
While implementing this solution, we noticed that the multiplexor error occurred only when someone walked into the control room from the arctic outside and took off their parka! We subsequently proved that we could initiate the error at will by vigorously shaking a nearby parka. At least we had a cause, but short of banning parkas, we could not eliminate the issue.
Although not ideal, the software workaround proved satisfactory, and the monitoring systems functioned correctly until they were decommissioned years later.
This entry was submitted by Stephen Price, a senior staff engineer at Engineering Dynamic, and edited by Rob Spiegel.
Tell us your experience in solving a knotty engineering problem. Send to Rob Spiegel for Sherlock Ohms.
Static discharge certainly appears to be the initiator here but I don't agree that it is the cause. Static electricity is a normal and expected part of our environment and, while we could mitigate it somewhat by adding humidification, banning parkas on the premises, requiring the use of static dissipative devices, etc., none of these really addresses the true problem, which is that for whatever reason the multiplexer circuit was not designed to tolerate the higher level of static found in the arctic environment. What the author did by programming in error checking and automatic reset made the system fault tolerant, which is what the circuit designer should have done in the first place.
Incidentally, I would like to thank the author for the spelling lesson. The use of multiplexor (with an 'o') raised an eyebrow and caused me to look it up. Turns out multiplexor and multiplexer are both technically correct spellings.
I would bet that you might find the computer case wasn't really grounded to the mother board (paint on the screw bosses), or the I/O board shield wasn't tied to the computer shield, or something like that.
I agree that ESD is the likely cause, and it MAY have been cheaper to just build a mud room outside the entrance, so the parkas could be left there... but you ran the chance that you'd still have unexplainable transient events, with improper grounding and shielding.
Extremely likely ESD was the source of the problem.
Had a similar problem with ESD on a system with just two boards mounted to an isolated, unpainted common metal chassis (connected directly together). ESD to the chassis would corrupt the processor's operation and/or reset the system. Extensive shielding, grounding experiments, embedding all signals on inside layers of the PCB with ground pours on the external layers, etc.: nothing improved the system's ability to handle ESD events. The system just couldn't handle anything above 4 kV discharges (we needed a minimum of 8-10 kV human body model for CE certification).
Final solution: a small amount of resistance (33-75 ohms) in series with all signals going between the two boards (except power and ground). This didn't affect the intended signals (minimal capacitance on these lines). So a bunch of very small SMT resistor packs (4 per 1206 size) were installed.
At the time, there were no ESD devices available that were small enough, with low enough stray capacitance, to work in the space we had.
Apparently, the slight decoupling (detuning/termination?) of these signal traces kept the energy being discharged through the isolated chassis from being picked up by the processor circuitry. It is very hard to keep all grounds in a multiboard system at the same potential when extreme rise times are involved.
After the change, we could handle 15 kV!
Being where there is no humidity (Arctic or Arizona), I have seen ESD in excess of 100 kV. So even a well-engineered system can see extremes beyond expected levels of abuse.
I appreciate everyone's comments. We, too, were not happy with abandoning the hunt before the root cause was found, but this was just one of many issues with these systems in that environment. Considering all of the constraints, we were forced to apply a bit of Engineering Triage and forge ahead.
I'd like to use this story (Extremely likely ESD was the source of the problem) for an upcoming Sherlock Ohms blog. You game? We would need your name and a short bio (two or three sentences). Also, it would help to give a bit of background on the setting -- what type of company, your role, etc.
If you're willing, please let me know at rob.spiegel@ubm.com
I am kind of bummed the root cause wasn't discovered as well. I know most everybody is looking at ESD as the likely cause, but with so many other options it would be interesting to know. What I find most interesting is how it appears to be just the one position that was affected by ESD. Any idea why just that spot, or did I miss it in the article?
Great sleuthing for sure. And I applaud the solution of using reference channels and checking software. That sounds a bit similar to one of the "data qualifying" steps that we once put into a system that could not tolerate being wrong, ever. It is a completely valid solution.
The problem sounds like a missing shield ground connection, most likely in the area of the multiplexer selection control lines, or perhaps in the selection signal logic area. Of course, it could also have been a "slightly bad" connection in the channel selection reporting area, which was probably in a less protected section of the board. That type of problem may also have been on the system backplane, if there was one. It would be interesting to look at that retired hardware now and see what the problem was.