I have seen this happen on PCs when the power supply starts to fail. Users assume hard drives, memory and even motherboards are failing. I have learned that when lots of unexpected things start happening to suspect the power supply.
This error is so understandable. The design works, so it gets passed down again and again. Why not? Why redesign? At a certain point developments in what the design is supposed to do outgrow the basic dependable design. This has got to be happening all the time.
From my experience, Rob - I think you are correct in it happening all of the time. I look back at some of the test equipment that I worked on that was 15 years old and still expected to meet current test needs. The engineer was expected to add to the design to meet new needs rather than redesign the test set - all in an effort to save money. Every once in awhile we could shake our heads and go "no way, this just won't work" and get to do a total redesign - but that didn't happen very often. If it actually compromised the integrity of the product then we would insist on a redesign, but of course with the economy and lay offs being a fact of life back then - we didn't complain too loudly if we could make it work and it was "good enough." We did the best we could "within the parameters we were given."
Interesting story, Nancy. I would imagine that in certain critical areas -- like power supply -- you still had to test to make sure the existing design would work with the new features. That's the odd part about of this story.
You hit it right in the head, Rob - but you would be surprised at how much one can "scavenge" in a company that has been operating for years and has had a lot of test equipment designed and built. I remember on more than one occasion swapping power supplies around for that very reason...or IEEE cards, or memory...or video cards...sometimes we played "ring around the test set" to get the hardware specs we needed.
Yours was a problem I saw all the time when I was in the development tools business. The only companies I've encountered where this issue isn't endemic are those where somebody takes the time, on a module by module basis, to write up a spec sheet for what the module in question will and won't do. Then there's a chance that somebody can produce the document that says, "This won't do that" when his boss says, "Just reuse this old design." That, I think, is more a comment on human psychology. But it seems critical that the entire conversation be reframed in the context of, "What changes do we have to make to the existing design in order to make it appropriate for the new design?" Asking that question assures that the matter is at least addressed, and if the answer is, "So many changes it'd be easier to start over with a clean sheet of paper," that message can be taken to management with some chance of success.
Nowadays I'm out of the development tools business, but it's not as though I don't reguarly see our potential client companies doing exactly what these heavy equipment people did -- largely because management gets it in its head that everything is infinitely reusable unless there's some internal procedure for triggering a review of "reusability" and some kind of document that says, "Ah... Come to think of it... No, not at all."
Note that above I say, "potential client companies," and not, "actual clients." Because where this kind of cavalier behavior is standard practice, it's usually because the company in question is penny wise and pound foolish. Running, as I do, a high-end design shop where we have a total *value* proposition instead of a message that "we compete strictly on price," I like for the self-selection algorithms for our customers to run in such a way that the guy who regularly reuses too much to "save money" is somebody else's client -- not ours.
It wasn't *that* long ago that we turned away a client who refused to spend the money to buy an extra week of engineering time in order to figure out how to make his system run better with about half as many components in it. Had he only been building a handful of units a year, not spending the NRE dollars would have made sense. But he was bulding 15,000 units a year, and 40 hours of engineering time would have saved $8 of BOM cost on every one of them. When you do the math, in one year, he'd have saved $120,000 (and had a more reliable end product, since it'd have fewer things in it to break). Or dividing it through anohter way, for NRE costs to have outweighed BOM cost savings, I'd have had to have been charging him more than $3,000 an hour for engineering time. And as much as I think Focus Embedded is a really good shop with the best product in the business, that rate is a tad high even for us... ;-)
I feel your pain, Eric. Sometimes it just doesn't make sense that people can't see the obvious...but it happens all the time. Effective cost reduction does not equal inferior quality and sometimes you have to pay more up front to save in the long run, but sometimes you just can't convince people of that. I am with you - I would prefer that they are someone else's customers!
Nancy, I agree some people will never see the obvious. As a senior engineer once told me "always remember pay me now or pay me later and if you pay me later it will cost you more". That was one of things I hated about some decision makers they just wanted to get it done and worry about fixing it later.
Gsmith120, I would think that the practice of getting it done now and fixing it later is an expensive path. How long can a manager get away with that behavior before it comes to the attention of upper management and bean counters?
This is the risk companies take when they force out senior engineers for less expensive junior ones. The loss of corporate knowledge can be critical. This is not just a case of experienced vs. inexperienced, but that of knowing a product line well and remembering the reasons for design expansions.
Companies end up answering the question "Why is the design this way?" with the answer "Because that's the way we've always done it" and have no basis to support the answer.
Good points, TJ. I believe that inexperienced engineers may be part of the problem. Another is the rapid pace of new product development. So may products get updated on a schedule (think Apple). Engineers are tasked with adding new features with each release. It makes perfect sense that they would keep adding new features to the existing design without revamping the entire product design.
A big chunk of what happens -- particularly in the modern world where "software guys" and "hardware guys" are rarely one in the same -- is that when somebody goofs up on one side (in this case a hardware guy undersizing a power supply), it's the guy on the other side of the wall who sees it first and diagnoses it through the lens of his own world view. And to a software guy, a piece of code that crashes repeatably on the same instruction every time smells like a bad tool, not a bad power supply.
At the same time, there seems to be this dividing line between the "analog circuits guy" (who in this case got the design responsibility for the switchmode power supply) and the "digital circuits guy" (who got the job of designing the processor core and attaching its peripherals). It was those two not talking with each other that produced the problem of "undersized supply" in the first place.
Where it's good to have an "old timer" around is where you encounter those problems that cross boundaries between "hardware" and "software" as well as between "analog" and "digital." Engineers like that were being minted 30+ years ago. Now they're not, and the only way they come into being is with the accumulated on-the-job experience of a few decades behind them. In theory, project managers should have enough visibility into their project's activities that they should also be able to spot cross-functional problems like this. But in the last few years there's been a trend for engineers who don't like the profession after a few years to go get the MBA and come back to the organization with a "management" hat on. And when that happens, you're about guaranteed to have projects run by people with neither the breadth or depth to solve thorny problems like this.
The good news in the United States is that at least there are a few engineers who put in the 25 years it takes to get good at the business before taking on project management responsibilities. There are countries in this world where societal pressure is such that you're perceived a failure if you're still "just an engineer" after 25 years and you haven't moved into management. And the net result there is that *nobody* in that society puts in the time to get really good.
I logged 25 years as a "pure engineer" myself before founding Focus Embedded. And the average age of employees here is 54. But we're something of an anomoly in that regard. That said, we also get all of the really hard problems that nobody else can solve because nobody has the breadth of experience to see the entire picture when a beast such as "all F's change to all 0's" raises its head.
As several commenters here have pointed out, this could easily be traced to a human (i.e., personnel) problem. I suspect that these kinds of software/hardware glitches are commonplace, largely because companies too often can't manage to hang on to the design engineers who know why a product was designed that way in the first place.
Yes, I think you're right, Chuck. The design engineering staff turns over and the new person simply adjust the existing design to accommodate (though not really accommodating) the new features. Then time-to-market come in as a pressure on design. The features keep getting added on to the old design until the design breaks down.
I do agree, but if the SOP was followed for developement, there is no excuse for this to happen. ISO certification works when the employees follow the deveopement paper trail. Cut and paste should be caught by the ISO plan. There is something to say about QC Controls.... It sounds like a total QC failure to me, and a slap on the hand for the engineers for not following the QC plan.
This was an excellent explanation of the problem and the fix. And I can tell you exactly why it happened. The root cause is most probably because the designers were not given enough time to test the design. The design completion date was not set based on proper design and testing, but more likely set due to a customer PO.
That makes perfect sense, Didymus7. With the ever increasing time-to-market pressures on design, I would imagaine this is a common problem. Each year (maybe six months), you take your product, add new competitive features, and shove it back out the door.
I worked with aerospace software design engineers that expected their software to execute without power. In fact they specifically designed software for this condition. When there is the loss of electrical power in an airplane, which can happen any time, for a number of reasons, the electronic box needs to sense this and the software needs to execute shutdwon code for the few seconds left on the caps. Probably all data should be in nonvolatile memory all the time (but usually is not done that way), not registers.
When the electonic box starts up it needs to determine stable staus as soon as possible. Some times knowing the last state is at least somewhat helpfull. But again the box will not know the state that it is being power up in. There are some boxes on the airplane that get enought of the right kind of information to determine the mode of the airplane; ground power, engine power, taxi, climb, cruise, decent and land.
I had a running debate of the importance of immediately determining the present non-intermitent state of the box inputs so valid condition may be reported. The project manager insited in following series flight mode "states". They are sill having trouble because of this false series progression.
It has been my experience as an end-user that the primary failure mode of most embedded microprocessors has been the power supply.
Back in the early 1990s it was fashionable to include calculations for MTBF with such gear. The numbers we were given were clearly ridiculous. I would routinely point out that these numbers were based upon heat, not component aging. Sure enough, about 12 years later we began seeing a very high high failure rate of our field devices. The cause was traced to... the power supplies.The electrolytic capacitors were failing.
Naturally, the were no longer in warranty, though the MTBF numbers would have suggested that they should have continued to be useful for many more years.
It's not just the issue of ripping designs from someone eles's homework. Power supplies are one of the weakest links in keeping a system working. Remember that electrolytic capacitor scandal from around 2003? The lowly power supply deserves a great deal more attention than most engineers are willing to admit.
Festo's BionicKangaroo combines pneumatic and electrical drive technology, plus very precise controls and condition monitoring. Like a real kangaroo, the BionicKangaroo robot harvests the kinetic energy of each takeoff and immediately uses it to power the next jump.
Design News and Digi-Key presents: Creating & Testing Your First RTOS Application Using MQX, a crash course that will look at defining a project, selecting a target processor, blocking code, defining tasks, completing code, and debugging.
Focus on Fundamentals consists of 45-minute on-line classes that cover a host of technologies. You learn without leaving the comfort of your desk. All classes are taught by subject-matter experts and all are archived. So if you can't attend live, attend at your convenience.