Make it go faster


For the last couple of years I have been working on a large enterprise roll-out of a certain software product. With my infrastructure architect hat on, I spent a great deal of time with specialists from the software vendor, the implementation partner and our hardware partner to create a system that would handle the expected load of around 1200 concurrent users. In addition to this expected load, the architecture was created in such a way that, should one aspect of the system see higher than expected load, we could scale up or scale out “modules” of servers to cope. As we could not rely on the vendor benchmarks (they referred either to the previous version of the software or to a workload completely different from ours), a degree of contingency was built into the day-0 design. We also had the benefit that this was a phased deployment, allowing us to monitor capacity as the phases went live and to predict the point at which we might need to add more resources before anything ground to a halt. We were not expecting any issues around the capacity of the infrastructure in any shape or form...

Day 1: all seemed as expected, with the exception of a single process. It had been flagged as performance critical but had also been repeatedly redeveloped and, for one reason or another, was “sub-optimal”. Various crisis meetings were held to try to work around this single “performance issue” now that the system was in production, but the sub-optimal process was running as quickly as it ever could or had. The shouts for “more power” were heard, and even with my best Montgomery Scott attempts I could not get this process to go any faster (back over to the software boys, methinks). Over the course of 2 days I was repeatedly urged to go and buy more memory, servers, disk, high-caffeine drink or whatever would make it go faster, but no amount of infrastructure was going to make this process fast: there was a fundamental issue with the way the process was running. Despite being involved in many crisis calls I was unable to articulate this precise point, until I had a brainwave and came up with the following:

Simple Terms

We have a large pile of servers, storage and networking sat in the server room. Consider these as a large motorway (or freeway if you are thus inclined). Our motorway has the following:

  • Multiple lanes – this is the number of processes that can run concurrently on the servers. The more that run in parallel, the more are able to complete in a given time.
  • A speed limit – this is how quickly processes could potentially go. The quicker they can go, the more complete in a given time.
  • Vehicles – these are the processes that are running on the servers, of varying sizes and speeds.

Our motorway is mostly empty. It has a dozen lanes and maybe one or two of these are ever in use. Thus there is plenty of capacity.

Our speed limit is pretty high. These are, after all, modern servers on an optimised storage and network infrastructure.

We have a problem process (one of the vehicles). It is, compared with all the others on the motorway, perhaps a 2CV at best. It is slow and, on a busy system (not ours), might be clogging up one of the lanes. No matter how many more lanes (servers) we add to our motorway, and no matter how high we raise the speed limit, the 2CV is going as fast as it can possibly go. If we want to go faster, we need to look at getting a faster car.
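For those who prefer code to cars, the same argument can be written as a toy model. This is only a sketch of the analogy, not a description of the real system: the class names, the work units and every number below (including the vehicle that is 25x faster) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Motorway:
    lanes: int          # how many processes can run concurrently
    speed_limit: float  # fastest any process could go (work units per second)


@dataclass
class Vehicle:
    name: str
    top_speed: float    # fastest this particular process can go
    distance: float     # total amount of work the process has to do


def journey_time(road: Motorway, vehicle: Vehicle) -> float:
    """Seconds for one vehicle to finish on an uncongested road.

    The vehicle travels at its own top speed or the speed limit,
    whichever is lower. The number of lanes does not appear at all.
    """
    return vehicle.distance / min(vehicle.top_speed, road.speed_limit)


def throughput(road: Motorway, vehicle: Vehicle) -> float:
    """Identical vehicles completed per second with every lane in use."""
    return road.lanes / journey_time(road, vehicle)


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
current = Motorway(lanes=12, speed_limit=300.0)
upgraded = Motorway(lanes=24, speed_limit=600.0)   # "buy more hardware"
slow_car = Vehicle("2CV", top_speed=10.0, distance=1000.0)
fast_car = Vehicle("re-architected", top_speed=250.0, distance=1000.0)

for road_name, road in (("current", current), ("upgraded", upgraded)):
    for car in (slow_car, fast_car):
        print(f"{car.name} on {road_name} motorway: "
              f"{journey_time(road, car):.1f}s")
```

In this model, doubling the lanes and the speed limit doubles how many 2CVs could finish per second, but each individual 2CV still takes exactly as long as before. Only swapping the vehicle changes the journey time.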

Through this magical combination of words I was able to convince the captains that throwing money at me to buy new hardware (although nice) was not going to make a jot of difference and was certainly not a quick fix for the slow process. The issue ceased to be one for me and, with a little thought and understanding, the software guys re-architected their solution, making it run 20-30x quicker. They even managed to reduce the load on the servers running the process (which was not massive in the first place).

Conclusion

Finding the right words and putting them in the right order is the answer to making people understand. If you can closely map the problem onto something in the real world that is already understood, then the rest is just substitution.