Great sleuthing for sure. And I applaud the solution of using reference channels and checking software. That sounds a bit similar to one of the "data qualifying" steps that we once put into a system that could not tolerate being wrong, ever. It is a completely valid solution.
The problem sounds like a missing shield ground connection, most likely in the area of the multiplexer selection control lines, or perhaps in the selection signal logic area. Of course, it could also have been a "slightly bad" connection in the channel selection reporting area, which was probably in a less protected section of the board. That type of problem may also have been on the system backplane, if there was one. It would be interesting to look at that retired hardware now and see what the problem was.
I'd like to use this story (Extremely likely ESD was the source of the problem) for an upcoming Sherlock Ohms blog. You game? We would need your name and a short bio (two or three sentences). Also, it would help to give a bit of background on the setting -- what type of company, your role, etc.
If you're willing, please let me know at email@example.com
I appreciate everyone's comments. We, too, were not happy with abandoning the hunt before the root cause was found, but this was just one of many issues with these systems in that environment. Considering all of the constraints, we were forced to apply a bit of Engineering Triage and forge ahead.
I am kind of bummed the root cause wasn't discovered as well. I know most everybody is looking at ESD as the likely cause, but with so many other options it would be interesting to know. What I find most interesting is how it appears to be just the one position that was affected by ESD. Any idea why just that spot, or did I miss it in the article?
Extremely likely ESD was the source of the problem.
Had a similar problem with ESD on a system with just two boards mounted to an isolated common metal / unpainted chassis (connected directly together)... ESD to the chassis would corrupt the processor's operation and / or reset the system. Extensive shielding, grounding experiments, embedding all signals on inside layers of the PCB with ground pours on external layers, etc... nothing improved the system's ability to handle ESD events. The system just couldn't handle anything above 4 kV discharges (we needed a minimum of 8-10 kV human body model for CE certification).
Final solution: a small amount of resistance (33-75 ohms) in series with all signals going between the two boards (except power / ground). Didn't affect intended signals (minimal capacitance on these lines). So... a bunch of very small SMT resistor packs (4 per 1206 size) were installed.
At the time, there were no ESD devices small enough, with low enough stray capacitance, available to work in the space we had.
Apparently, the slight de-coupling (de-tuning/termination?) of these signal traces kept the energy (being discharged through the isolated chassis) from being picked up by the processor circuitry... and it is very hard to keep all grounds in a multiboard system at the same potential (with extreme rise times being involved).
After the change... we could handle 15 kV!
Having been where there is no humidity (Arctic or Arizona), I have seen ESD in excess of 100 kV. So even a well-engineered system can see extremes beyond expected levels of abuse.
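For anyone wondering why 33-75 ohms in series "didn't affect intended signals," a quick back-of-the-envelope RC calculation helps. This is only a rough sketch: the 10 pF line capacitance below is an assumed, illustrative figure (the comment says only that capacitance was "minimal"), and the function name is made up for the example. It says nothing definitive about why the resistors helped with ESD; it only estimates how much they would slow a normal signal edge.

```python
import math


def rc_time_constant(resistance_ohms, capacitance_farads):
    """Time constant (seconds) of a series resistor driving a capacitive line."""
    return resistance_ohms * capacitance_farads


def rc_cutoff_hz(resistance_ohms, capacitance_farads):
    """-3 dB corner frequency (Hz) of the resulting first-order low-pass."""
    return 1.0 / (2.0 * math.pi * resistance_ohms * capacitance_farads)


# ASSUMPTION: 10 pF is a hypothetical stand-in for the "minimal" line capacitance.
ASSUMED_LINE_CAPACITANCE_F = 10e-12

for r in (33, 75):  # the resistor range quoted in the comment
    tau_ns = rc_time_constant(r, ASSUMED_LINE_CAPACITANCE_F) * 1e9
    f_mhz = rc_cutoff_hz(r, ASSUMED_LINE_CAPACITANCE_F) / 1e6
    print(f"{r} ohm: tau = {tau_ns:.2f} ns, corner = {f_mhz:.0f} MHz")
```

Under that assumed capacitance the added time constant stays below a nanosecond across the quoted resistor range, which is at least consistent with logic signals passing through unaffected.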
I would bet that you might find the computer case wasn't really grounded to the motherboard (paint on the screw bosses), or the I/O board shield wasn't tied to the computer shield, or something like that.
I agree that ESD is the likely cause, and it MAY have been cheaper to just build a mud room outside the entrance, so the parkas could be left there... but you ran the chance that you'd still have unexplainable transient events, with improper grounding and shielding.
Static discharge certainly appears to be the initiator here but I don't agree that it is the cause. Static electricity is a normal and expected part of our environment and, while we could mitigate it somewhat by adding humidification, banning parkas on the premises, requiring the use of static dissipative devices, etc., none of these really addresses the true problem, which is that for whatever reason the multiplexer circuit was not designed to tolerate the higher level of static found in the arctic environment. What the author did by programming in error checking and automatic reset made the system fault tolerant, which is what the circuit designer should have done in the first place.
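To make the "error checking and automatic reset" idea concrete, here is a minimal sketch of one way a reference-channel check could look in software. It is not the author's actual code: `FakeMux`, `read_validated`, the channel numbers, and the tolerance are all hypothetical names and values invented for illustration. The idea is simply that a channel wired to a known value is read alongside the data channel, and a wrong reference reading triggers a reset and a retry.

```python
class FakeMux:
    """Hypothetical stand-in for the multiplexer hardware (not a real driver)."""

    def __init__(self, values, stuck_on=None):
        self.values = values        # channel number -> value present on that input
        self.stuck_on = stuck_on    # channel the mux is wrongly latched on, or None
        self.resets = 0

    def read(self, channel):
        # A glitched mux returns the latched channel instead of the requested one.
        actual = self.stuck_on if self.stuck_on is not None else channel
        return self.values[actual]

    def reset(self):
        # Re-initialize the channel selection logic.
        self.stuck_on = None
        self.resets += 1


def read_validated(mux, channel, ref_channel, ref_expected, tolerance, max_retries=3):
    """Read a channel, accepting it only if the reference channel reads correctly."""
    for _ in range(max_retries):
        value = mux.read(channel)
        ref = mux.read(ref_channel)
        if abs(ref - ref_expected) <= tolerance:
            return value
        mux.reset()  # reference was wrong: assume channel selection is corrupted
    raise RuntimeError("reference channel still wrong after retries")


# Example: mux starts latched on channel 2, as if a static discharge had hit it.
mux = FakeMux({0: 5.0, 1: 1.2, 2: 3.4}, stuck_on=2)
reading = read_validated(mux, channel=1, ref_channel=0, ref_expected=5.0, tolerance=0.05)
print(reading, mux.resets)
```

In the example, the first attempt sees the wrong value on the reference channel, the mux is reset once, and the second attempt returns the true channel 1 reading. A real system would also want to log each reset so the underlying fault stays visible.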
Incidentally, I would like to thank the author for the spelling lesson. The use of multiplexor (with an 'o') raised an eyebrow and caused me to look it up. Turns out multiplexor and multiplexer are both technically correct spellings.