I’m Puzzled and Bugged: An odd mouse click problem on Win7

Updated 12th October: Problem has vanished! Read below.

I’ve come across a very annoying and strange defect on my Windows 7 laptop. It really bugs me!

Here are the patterns I’m seeing:

  • Sometimes, the minimize/restore/close buttons don’t light up when I hover the mouse above them. Clicking them has no effect. Whenever this happens, Alt-Tab’ing to another window and than Alt-Tab’ing back again restores them and they work again.
  • In Firefox, this happens as soon as I have clicked once somewhere inside the window.
  • The same applies to Thunderbird, my e-mail application.
  • Mouse clicks in popup windows sometimes end up in the window below the pop up window. E.g. clicking the close button on a popup window does not close the window, but does something in the window below it.
  • Other applications don’t seem to suffer from the problem, but has other types of odd mouse behaviour.
  • In Libre Office, I can’t click-and-drag to select text any more. Text selection is only possible by keyboard.
  • The scroll wheel doesn’t work in Internet Explorer
  • Capture One Pro 6 (a photo editing application), dragging sliders (to control photo appearance) no longer works.

I’m puzzled and bugged. But I’ll find out what’s wrong.

Oh and, by the way: I have tried restarting. It doesn’t help.

— Update 23rd September:

As per Michael Boltons suggestion, I connected a new mouse: A Wacom tablet. It didn’t change anything.

I ran Windows update and Sony’s own update tool to update various drivers. It also didn’t change anything.

I’m pretty sure the problem started in the middle of the week. At first, I just thought it needed a restart and I was too lazy to do so, so it might have been there for a few days before I reacted. I haven’t installed any software, but I have run Windows update. The update log shows this:

I’ll try to uninstall the Silverlight update, as I’m suspecting it as ‘new and fancy technology’ might put in hooks in odd places in Windows.

— Update 12th October

I have a confession: The problem went and came back a few times, and now it’s gone completely. I feel like a lousy tester since I haven’t been able to dig out what caused it. Uninstalling software didn’t help, updating drivers didn’t help either, and even updating various software from Firefox, Thunderbird and Capture One Pro did not change mouse behaviour. I’m let with a well working laptop (that’s good!), the memory of a very annoying problem, and some lousy theories about what caused the problem.

Is this a case of a system error, which cannot be understood in the sense that there was a single cause, but has to be analyzed using a system perspective: A system of thousands of components interacting in complex ways? Like I suggest we analyse Black Swans in IT systems?

In that case, did something turn up the heat? I was a bit stressed myself at the time, but no – I don’t think my mental state affected this. And I can’t think of something.

My best guess is that a Firefox plugin for html debugging in combination with a specific version of the application and possibly some driver issues somehow ‘turned up the heat’ causing the problem to materialize. But it’s only a guess.

(By the way: I could have done like most would probably do: Reinstalled the pc or bought a new one. But just a few weeks before I did actually reinstall it completely with a new SSD and more RAM.)

Reklamer

Death by Virtual Memory

Every pc user knows that pc’s become slower and slower over time until the point where they are almost unusable. This is where upgrading RAM will usually help – until eventually you have to buy a new pc. Apparantly that’s just the way pc’s wear out.

Actually, pc’s don’t wear out – readers of my blog probably know that, but users (knowingly and unknowingly) add software which consumes resources of which memory is the most important one.

RAM used to be very expensive and therefore a scarce resource. Programmers used to do all sorts of tricks to fit their increasingly complex programs in memory. To help them focus on the programming task and not worry too much about resource scarcity, operating system designers invented something called virtual memory or swap memory.

Swap memory allowed the operating system to remove running processes from the (expensive and therefor scarce) RAM and store the state of the process on disk (‘swap’ it out – hence the name), from where it could later be restored into RAM and start running on the cpu. The technique is still employed by all modern operating systems, and while the amount of RAM has grown considerably to a level where lack of it is usually not a problem, virtual memory techniques are still useful with long running processes that only need to run once in a while and where it is not a problem if the initial response time is a second or more – and when they’re not running, the RAM can be used for caching file system data and other important things.

But what happens if load increases, e.g. if the number of users grow or the system becomes otherwise loaded and the processes running on the system start competing for memory? The good news is that functionally nothing changes: Virtual memory is transparant to the process, so the code will execute the same as it did before. But the bad news is that execution time increases rapidly when real memory become exhausted and the OS has to start using VM. If this only happens during nighttime or at other times when users or external systems are’nt depending on the system, all is probably okay, but if not then you can be in real trouble. In fact, the problem can be so bad that the system becomes useless.

In fact, with much more RAM and larger programs in today’s computers, the relative performance penalty is much higher than it used to be. This is because when the OS starts swapping, the amount of data that needs to be transferred in and out of the hard disk(s) is probably a factor of 10 higher than it would have been say 10 years ago. During that time, however, hard disk access speeds has only doubled, so overall, the damage you risk of hitting the virtual memory “wall” is much higher now than it used to be.

An interesting factor which I have found useful to look for is the fact that anti virus systems installed on your servers often make the virutal memory problem worse. They do so because they install hooks into applications running on the system, monitoring all i/o. This monitoring performs well as long as the anti virus system can keep its database and code in memory, but when memory starvation starts occurring, it can turn into a real bad situation. How can we detect that situation (except by performance dropping)?

I’m not aware of any really useful tools that can sit in the background automatically detecting (or better: predicting) memory starvation problems on running servers or test systems. But there are ways to look for it: On Windows, I’ll be looking at the running processes, particularly focusing on the Page Faults Delta column, looking for processes consistently experiencing high numbers here:

This is an important performance testing subject. And one which is too often overlooked.

Peugeot’s Black Swan at Le Mans 2010

This blog post is  about motorsport. What does motorsport have to do with software engineering, you ask? Read on!

I’m a big fan of motorsport and Le Mans 24 hours in particular.  Le Mans is a 24 hours motor race with about 50 race cars of four different classes competing in the same race. Le Mans is also a legend, run first time in the 1920’s. To run a race over 24 hours is very challenging for teams and machinery. An F1 race is only 2 hours and cars are only a bit faster. We’re 240,000 spectators, and about 40,000 danes travel the 1500 km to get there – including me and two of my boys, so it’s also a big, great party.

My son Aksel at Le Mans 2010
My son Aksel (very focused - and a bit tired from a long drive) at Le Mans 2010

 

But to me as an engineer, Le Mans is also intellectually inspiring. Le Mans is a reminder that while we can do a lot with technology, there’s also a lot that we can’t do and that the laws of physics will always set a limit on the track. In order to try to win, race car manufacturers and teams will constantly try to push that limit, but it will always be there.

When cars are withdrawn from an F1 race, its typically because of an accident – drivers making mistakes. While driver mistakes are unavoidable over such a long race, withdrawals are actually more common due to technical reasons: The equipment breaking down, engines blowing up, or just electrical gremlins pulling the plug. The fascinating part is that it has been like this since the very beginning.

So failures are more or less expected. 50 cars at the start line, and usually only some 25 at the finish. But Le Mans 2010 was a little different: It was Peugeots ”Black Swan Year”.

The Peugeot 908’s were again extremely fast, perfectly tuned, and ready to race. Audi had gone through a challenging development process with their new R15, which turned out to not be as fast as they had had hoped it would be in 2009, but was improved in 2010, so we all thought that 2010 was to be a year where Audi would be able to compete with Peugeot on speed. But Peugeot again set impressive lap times never before seen at Le Mans. Couple that with the fact that their team finally seemed to be a well working machine now (proved by the 2009 overall win), so it seemed that Audi could only hope for a podium.

Until a conrod broke on the leadning Peugeot at Tertre Rouge on Sunday morning. I was there with my camera, enjoying the early morning and the race, but I left that area only 10 minutes before so I didn’t have a chance to catch the action (aren’t you always in the wrong place at the right time?).

It came as a shock to everyone. I watched the TV pictures on the big screens around the track showing the team completely in shock about what had happened, and I looked down into the pit area where the Eurosport TV crew was trying to get comments from the team which seemed to be paralyzed. But it seemed to be a coincidence at the time. Until a few hours later when another Peugeot failed in a similar way. We started wondering what was going on? And with only one hour remaining of the race, the customer entered Peugeot 908 failed and the race was lost. Audi won 1-2-3 with their three R15+ cars.

It was devastating. The Peugeot Sport director was seen crying on TV. The french spectators and press went home early from the race. This was a nation loosing a battle with their negihbors.

Of course we didn’t know the technical reason why all Peugeots had failed at the race, but it seemed as if they had been ‘programmed’ to fail. About a month later, Peugeot released a statement that the three cars had suffered from the same failure and that the fourth car (which retired before the others due to a broken suspension) would have suffered the same problem if it had still been running during Sunday. Peugeot said that the breakdown came as a surprise. That they had tested the cars and engines and never expected this. I’m sure it was a surprise. I’m also sure their sports director didn’t expect this embarrasing disaster in front a whole nation of supporters. I’m sure they thought everything was Hunky-dory.

But at the same time, I’m not in doubt that the problem was rooted in history: That an engineer somewhere knew that there was a risk, but for reasons which are probably rooted in group thinking and organisational behaviour, kept the knowledge to himself – or simply chose to ignore it. Conrods have failed in cars since the first reciprocating engine was built, but engineers have learnt to handle this so today we have reliable engines that can easily do more than 300,000 km. When engineering has made something inherently unreliable reliable, people tend to forget about it. Management expect it to be under control.

This is true even for competetion engines, even though they are of course pushed much more and designed to be minimal and as light as possible in order to promote power output: I’m sure Peugeot management thought the conrod supplier had everything under control, which they might have had – but they could have worked to meet the wrong specification. We won’t know the details, and it’s not important either.

To win Le Mans you have to be running at the end. The Peugeots didn’t. They obviously forgot what it takes to make something inherently unreliable reliable: It takes focus on what can possibly go wrong. Software is not different: When software fails, it’s often also because someone forgot to raise and issue or because someone chose to ignore it. Many disasters in systems are rooted in history, which also means that they could have been prevented.

 This is where professional pessimists on a team can help. Where testers’ negative attitude can mean the difference between success and failure.

For me, Peugeot’s black swan event at Le Mans 2010 is a reminder that we all have shared responsibility for seeking out and communicating these details. By testing, careful inspection, talking to developers and users, and by constantly focusing on problems. We’re on a mission to prevent disaster by making the risks known so managers can take informed descisions.

Le Mans 2011 will be interesting in a new way since the cars will be technologically different with hybrid engines. This is new technology, so we should probably expect it to fail more at random or just affect performance during the race: Longer pitstops and the like. But lets see, it’s a big and long race. Anything can happen! I’ve got our tickets booked, so let me know if you’ll be at Le Mans in June – and we can meet up and enjoy the cars. And perhaps discuss engineering and testing?

Leading Peugeot 908 photographed at Tertre Rouge Sunday morning at Le Mans 24h 2010 - just prior to failing
Leading Peugeot 908 caught by me and my Nikon on Sunday morning only about 30 minutes before it failed at the same location

Skype’s first Black Swan

Skype went out for about 24 hours just before Christmas. Skype management is embarrassed and promise this will not happen again, which of course is true. The particuar situation is now prevented. However, the question is: Will Skype never be out again?

Skype’s CIO explains what went wrong in this post mortem of what I’d call Skype’s first Black Swan.

To summarise, it was a high load on Skypes infrastructure which triggered a bug in a certain version of the Windows client for Skype which again increased the load on the infrastructure, thereby rapidly taking the entire network down and making it almost impossible to get it up again.

The bug was always there of course, and it was probably already known internally at Skype. It is also possible that the risk of server overloading and service degradation had been identified, but obviously not in the context of making a complete system crash a likely possibility (if so, they would have prevented it). Further, I’m quite certain that the risk of the client bug affecting the server load had not been identified. Humans are positive thinkers, as Taleb documents in his book: The Black Swan: The Impact of the Highly Improbable.

So Skype’s challenge now is to prevent outages in general by identifying and preventing Black Swans in general. This will involve a cross organisational backwards thinking process, which the innovation driven company has probably not been focusing on at all until now. (I may actually be wrong here, Janus Friis, one of the founders of Skype, used to work in a support function of an ISP so he may have been involved in preventing problems, but generally, startups think very positively, and even if Skype has millions of users, it’s still a very young company – a startup.)

One may think that this is going to be extremely expensive for Skype since they will have to predict every possible way their system can go wrong. It does not have to be that expensive, although it will cost money.

When securing a nuclear facility, engineers don’t have to analyze every possible way a disaster can happen, instead they think: How can we prevent failure at every level? This is what I mean with “backwards thinking” – start assuming something is failing, then work backwards identifying ways to prevent it becoming worse.

This is done on multiple levels: On component level, asking what can go wrong here and how can we prevent a bug or incident from affecting the rest of the system? And on system level, assuming that disaster is happning, how can we prevent it from developing.

I assume that’s what Skype is doing now.

Wishing everyone a happy 2011!