Cabe, that reminds me of a messed-up system clock due to an obscure faulty chip (I forget what it was--not something obvious like memory or processor), which was causing very odd system behavior on my laptop. We went through tons of trouble-shooting routines before an online forum gave us the clue.
I've had a similar issue with my desktop. After hours of trying to figure out why it would crash after start-up (testing every piece of hardware), it turned out to be a memory timing issue. Just goes to show that whenever there is a problem, all possibilities should be considered.
Thanks, I enjoyed this one a lot. I once worked in marketing at FORTH Inc., which designed the version of chipFORTH used as the RTOS/HLL/compiler/IDE for Federal Express' first handheld tracking device. I was amazed to discover that code could directly influence power usage and management in the HW. HW/SW integration is key.
In 1995 I was leading design of a PCMCIA modem based on a reference design and reference firmware. We made some modifications for safety, EMC, and to improve performance of the analog front end, none of which were related to the problem we encountered.
When we got prototype boards delivered we installed the latest firmware from the chip manufacturer. Once in a while the cards would boot completely but mostly they would hang when installed in a particular type of computer. The chip manufacturer would admit to no problems. I puzzled over this for a couple of weeks. One day, while I was taking a break from the lab to catch up on other tasks, my manager dropped by and asked "Why aren't you in the lab? I want you in the lab full time until this problem is solved. What is the problem, anyway?"
We went to the lab and I was going to demonstrate the problem--and the card booted successfully. I immediately powered it off and restarted it--and it hung, but exhibited a failure mode I had not previously seen. I immediately restarted it and it failed in the usual manner--several successive times.
This was my clue! I asked the manager to stay right there, walked to the next room and returned with a can of freeze spray. I heavily sprayed the modem chipset and the card booted successfully several times. At last I had identified a way to make the problem come and go. Definitely a race condition of some sort, but it wasn't even clear whether the cause (or the cure) was hardware or software.
Since the manufacturer of the chipset declined to release source code I had to get him to admit to the problem and address it. The first step was to send him a computer where the problem occurred. Then I had to get him to actually look at it. Ultimately this took the threat of "If we don't hear from you by Friday, we'll be on your doorstep Monday morning to help you with it." Late that Friday night I got a call at home. "Don't come! Don't come! We've fixed the problem and are sending you a software fix." It turned out that one of the initialization steps was to read an unused bi-directional I/O register. Unfortunately the default power-on state for this register was high-impedance, with the internal pullup resistor disabled until the instruction after the read instruction. The fix was to reverse the sequence of these two instructions.
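For anyone who wants to see the ordering problem in miniature, here is a rough sketch in Python. It is only a toy model, not the chipset's firmware (which was never released): the class `BidirectionalPin`, the `floating_level` parameter, and the functions `init_original` and `init_fixed` are all made-up names for illustration. It shows why reading a high-impedance pin before enabling its pull-up gives an unpredictable value, while swapping the two steps gives a defined one.

```python
class BidirectionalPin:
    """Toy model of one unused bidirectional I/O pin (hypothetical)."""

    def __init__(self, floating_level):
        # floating_level stands in for whatever an undriven pin happens to
        # read; on real hardware it can vary with temperature and noise.
        self.floating_level = floating_level
        # Assumed power-on default, as described in the comment above:
        # high-impedance with the internal pull-up disabled.
        self.pullup_enabled = False

    def enable_pullup(self):
        self.pullup_enabled = True

    def read(self):
        # Pull-up on: the undriven pin reads a defined high (1).
        # Pull-up off: the pin floats and returns an arbitrary level.
        return 1 if self.pullup_enabled else self.floating_level


def init_original(pin):
    """Read first, then enable the pull-up (the problematic order)."""
    value = pin.read()
    pin.enable_pullup()
    return value


def init_fixed(pin):
    """Enable the pull-up first, then read (the reversed order)."""
    pin.enable_pullup()
    return pin.read()


if __name__ == "__main__":
    for level in (0, 1):
        original = init_original(BidirectionalPin(level))
        fixed = init_fixed(BidirectionalPin(level))
        print(f"floating={level} original_read={original} fixed_read={fixed}")
```

In the model, the original order simply returns whatever the pin was floating at, and the fixed order always returns 1. How the real firmware reacted to a bad read is not something the sketch tries to reproduce.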
Shortly after I started at the company, the project manager asked me to sit in on a hardware design review. When I walked into the conference room, I was met with expressions of suspicion and contempt from the members of the hardware team. Apparently, software folks were not supposed to attend hardware design reviews. This was the complete opposite of my previous employer where hardware and software worked together from product inception to release. This tight working relationship, I believe, helped us avoid many integration problems. It also allowed us to build some extra "margin" into the design to deal with unexpected hardware or software problems.
Years ago, when computers came in all kinds of flavors, we worked with a rather enlightened software house. And of course there did arise failures to operate correctly. Our agreement with the software folks was that as soon as we could adequately describe the problem to them, they would start looking as we were looking, and whoever found the problem immediately phoned the other to halt their searching effort. Thus the time wasted was minimal, and no fingers were ever pointed until after the project was successful. And we did point out the faults in detail so that we could all learn from them. So all of us got better at what we did. And the wasted effort was kept to a minimum.
I think many of the problems I encountered were much more than configuration control issues.
The company produced (and still produces) immensely complicated processor chips and other supporting silicon. It also produced (and still produces) motherboards, server blades and other subassemblies for retail label manufacturers. I was working for this company during the 'Internet Boom' of the late nineties and early aughts. The company was making acquisitions left and right in an attempt to become a major player in the internet hardware space. While the company had excellent configuration control tools and systems in place, the adoption and use of these tools and systems by these newly acquired entities proved to be uneven at best. The company also had a difficult time fostering a culture of cooperation amongst many of its new acquisitions.
The division I worked for had been a medium sized company in a shrinking niche of internet technology. (I joined after the company had been acquired.) There was a lot of bad blood between the hardware and software groups which sadly was not properly addressed after the acquisition. The division which produced the problematic processor had been part of a once great technology company which was in steep decline. Morale issues as well as a culture clash with the new owners led, I believe, to communication failures.
This is as much a story of a business trying to rapidly enter a market by buying as many components of that market as possible and failing to integrate them as it is a story about insufficient or missing configuration control.
Thank you. There were many communication issues within the company. My group found out later that another division was using the exact same processor and was aware from the start of the memory timing issue. I don't know if that team was told about the problem or had discovered the modifications on the evaluation board before spinning their hardware. Ironically, the company prided itself on its flat management hierarchy leading to efficient communication.
In the projects I manage I have a rule that both software and hardware are guilty until proven innocent. Both groups are expected to work together to solve the problem. The SW guys can dismiss the HW team or vice versa if they feel the ball is in their own court, but one group can't claim the problem is on the other side and walk away on their own.
In the first week on this job I overheard some software guys discussing a problem. I went over and offered to help (it sounded like possibly a couple of shorted address lines). It turned out to actually be a software problem, but I gained a lot of respect from that incident. They were used to both sides pointing fingers at each other.
That's a good story! (and by a fellow Evans too...) I like how just noticing that it took a slightly longer time to fail from a cold start gave you the idea of how to narrow down the problem. I've never worked for a super-large company with all of those divisions, but I see how there could be communication issues.
At least all of your hardware came internally. I recall dealing with a hardware problem (not as severe as yours) when writing firmware, but the hardware was from an outside supplier who refused to accept that they had a problem. We eventually had to write a workaround in software. The old "can't you guys fix that in the software" solution.