The case for continuous improvement


Actual Experience

Managing user demand for communications services has changed and grown in complexity as our methods of communication have developed. Back in the day, when infrastructure was provided to handle just one service (e.g. telephony), utilisation metrics - based on averages over time periods such as an hour - were the right tool for the job. No service provider can predict the precise time of a user’s demand for a telephone conversation, but they can track aggregate metrics, identify trends over weeks/months, and estimate how much, and where, shared resources need upgrading in the infrastructure. Digital infrastructure facilitates communication using a plethora of applications and services. If every instance of communication were supported by dedicated infrastructure, the cost would be astronomical, making shared infrastructure an economic necessity. Finite shared resources need to be appropriately sized to handle user demand for all those applications and services.

So, we have two “givens”: finite shared resources, and uncertain user demand.

Demand migration

Even in a single service environment, an additional control was needed to regulate user demand: pricing. Peak and off-peak tariffs encouraged “demand migration” - users shifting their communications behaviour to different times of the day or week. Congestion charging happens in plenty of other areas of economic activity, too. Just try driving through Central London on a weekday! However, demand migration isn’t only prompted by pricing. If the experience is always poor in the mornings, users will shift their communications behaviour to the afternoons (assuming they have enough flexibility in their jobs to avoid “digital slow time”).

Transitivity of behaviour

To stay with the road analogy: if demand can’t be migrated to different times of the day or week, daily commuters will seek out the “rat runs” of additional shared resources - ones that aren’t on a direct path, but can help them skirt round the congested highway. This is an example of “transitivity of behaviour”: demand that is blocked in one place reappears somewhere else. If a congestion bottleneck shifts away from its regular location to a new location, there can be both winners (those who only have the regular location on their path) and losers (those who only have the new location on their path). Note that the bottleneck shift might be transient (emergency road repairs on a highway), or longer term (a week-long sports festival attracting extra traffic at a highway intersection).

To recap, we have two “givens”: finite shared resources, and uncertain user demand. Now we have two “patterns”: transitivity of behaviour, and demand migration.

So, having set the scene...

What does this mean for digital infrastructure?

Well, whose infrastructure these days provides only a single service or application? Who still uses metrics based on time averages to work out what’s going on in shared resources? Who thinks that making an improvement to resource capacity in one part of the digital supply chain will fix the problem?

From the data centre all the way to user devices, digital infrastructure supports a myriad of applications and services. That infrastructure comprises shared resources subject to transitivity of behaviour and demand migration. Conventional metrics may look fine, but they are averaging out the variability that impairs the human experience of those applications and services.

So, how can we keep digital infrastructure performing so well that humans don’t notice it?

First, we need to measure from the perspective of humans. Actual Experience’s analytics use the lens of human experience (HX) scores instead of single-metric time averages. Using patented technology, HX scores are derived by analysing multiple end-to-end metrics and are inherently non-linear in their behaviour. Each metric contributes to the characterisation of an application’s “experience cliff” - the level of impairment below which humans don’t notice, and above which they get progressively more frustrated.
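To make the “experience cliff” idea concrete, here is a minimal sketch. It is not Actual Experience’s patented scoring method: the function name `frustration`, the 200 ms cliff, and the quadratic growth above it are all assumptions chosen purely for illustration.

```python
def frustration(delay_ms: float, cliff_ms: float = 200.0, steepness: float = 2.0) -> float:
    """Toy penalty for a single metric (hypothetical, for illustration only).

    Below the cliff the penalty is zero: humans don't notice.
    Above the cliff it grows faster than linearly (assumed quadratic here).
    """
    if delay_ms <= cliff_ms:
        return 0.0
    excess = (delay_ms - cliff_ms) / cliff_ms  # how far past the cliff, as a ratio
    return excess ** steepness


if __name__ == "__main__":
    for delay in (50, 150, 200, 300, 400, 600):
        print(f"{delay:>4} ms -> frustration {frustration(delay):.2f}")
```

In this toy model, going from 50 ms to 150 ms changes nothing, while going from 400 ms to 600 ms quadruples the penalty. That asymmetry is what a single time average cannot express.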

Next, the HX correlation system identifies which parts of the digital supply chain are responsible for impairing the score, whether this arises in the office WiFi, WAN access, WAN, DC access, DC network, the servers, or the application itself. The system lists impairers in order of their impact on the score: some are bottlenecks (requiring more resources, whether in the data centre or in the network), and some are misconfigured devices (requiring reconfiguration rather than extra capacity).
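As a rough sketch of what an ordered impairer list might look like as data, consider the following. The `Impairer` record, its fields, and the impact figures are hypothetical; they are not taken from the HX correlation system itself.

```python
from dataclasses import dataclass


@dataclass
class Impairer:
    segment: str   # part of the digital supply chain, e.g. "DC access"
    kind: str      # "bottleneck" or "misconfiguration"
    impact: float  # assumed share of the score impairment (made-up numbers)


def rank_impairers(impairers):
    """Return impairers ordered from highest to lowest impact on the score."""
    return sorted(impairers, key=lambda item: item.impact, reverse=True)


# Illustrative values only.
observed = [
    Impairer("office WiFi", "misconfiguration", 0.15),
    Impairer("DC access", "bottleneck", 0.55),
    Impairer("WAN access", "bottleneck", 0.30),
]
ranked = rank_impairers(observed)

if __name__ == "__main__":
    for position, item in enumerate(ranked, start=1):
        print(f"{position}. {item.segment} ({item.kind}), impact {item.impact:.2f}")
```

Keeping the kind alongside the impact matters because the remedy differs: a bottleneck needs more resources, whereas a misconfigured device needs reconfiguring.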

Shifting bottlenecks necessitate continuous improvement

Suppose the DC access is at the top of the impairer list, with loss behaviour implicated. Second on the list is WAN access (in a couple of office locations), followed by office WiFi (in one location). Let’s assume that the DC access is upgraded, resulting in score improvements for most locations, but not throughout the whole of the working day. The DC access disappears from the impairer list, but a device in the DC network and the servers now appear in the list, and they are listed above the WAN access and office WiFi. What’s happening? These are examples of the shifting bottlenecks mentioned earlier, arising from uncertain user demand, transitivity of behaviour, and demand migration. In addition, the non-linearity (at the heart of what humans do/don’t notice) can mean the DC access had been partly masking impairment behaviour deeper into the DC infrastructure.
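The masking effect can be illustrated with a deliberately simple model, assuming traffic flows through a chain of stages and each stage passes on no more than its capacity. The stage names follow the example above; the capacities, the demand figure, and the function `overloaded_stages` are invented for illustration and ignore the non-linear scoring entirely.

```python
def overloaded_stages(demand, path):
    """Return the stages whose incoming load exceeds their capacity.

    `path` is an ordered list of (name, capacity) pairs. A stage that throttles
    traffic shields every stage behind it from the full demand.
    """
    overloaded = []
    load = demand
    for name, capacity in path:
        if load > capacity:
            overloaded.append(name)
        load = min(load, capacity)  # only this much reaches the next stage
    return overloaded


# Hypothetical capacities, in arbitrary units.
before = [("office WiFi", 120), ("WAN access", 110), ("DC access", 60),
          ("DC network", 80), ("servers", 70)]
# The same path after the DC access upgrade.
after = [(name, 150 if name == "DC access" else cap) for name, cap in before]

if __name__ == "__main__":
    print("before upgrade:", overloaded_stages(100, before))
    print("after upgrade: ", overloaded_stages(100, after))
```

Before the upgrade only the DC access shows up; afterwards the DC network and the servers appear, because the load they were previously shielded from now reaches them.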

So, fix and forget is simply not fit for purpose. We need a new approach: monitor and improve (continuously). A human experience audit is the first step (quantifying the ROI) on the never-ending journey of continuous improvement.

Which straw broke the camel’s back? In the world of digital infrastructure that’s a continuous question, and the non-linearities of human experience can surface some counter-intuitive outcomes. In fact, it is the reason for structuring Actual Experience’s analytics as a service. Continuous improvement is based on an iterative remediation cycle: monitor - diagnose - upgrade/reconfigure - repeat.

The service can be misused, of course (audit, fix, and forget), but you’ll soon have frustrated staff and customers suffering digital slow time again (and complaining). Why? Today’s digital infrastructures are subject to complex behaviours, arising from uncertain user demand, transitivity of behaviour, demand migration, and the non-linearities at the heart of what humans notice.

One particularly counter-intuitive behaviour in the above example is that the DC access, when looked at using conventional time-averaged utilisation metrics, didn’t indicate a problem. This is a common situation: conventional metrics average out the behaviour that humans notice, so the evidence seems to be missing.
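A small worked example shows how an hourly average can hide exactly the moments that people notice. The samples are invented: sixty one-minute utilisation readings for a hypothetical link, mostly quiet, with three saturated minutes.

```python
from statistics import mean

# Invented data: 57 quiet minutes at 30% utilisation, 3 minutes at 100%.
samples = [30.0] * 57 + [100.0] * 3

hourly_average = mean(samples)
saturated_minutes = sum(1 for value in samples if value >= 100.0)

if __name__ == "__main__":
    print(f"hourly average utilisation: {hourly_average:.1f}%")
    print(f"minutes at saturation:      {saturated_minutes}")
```

The hourly figure comes out at 33.5%, which looks comfortable, yet for three minutes of that hour the link had nothing left to give. Anyone using an application during those minutes would have noticed.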

The need to press on with remediation

Actual Experience’s analytics use the lens of HX scores to build the evidence base, and then to diagnose those shifting impairers. Upgrade/reconfigure the impairers at the top of the list and uncertain user demand, transitivity of behaviour and demand migration will shift the impairments elsewhere. Sometimes a remediation step doesn’t appear to result in any significant score uplift. It is important to press on with the iterative cycle rather than roll back the remediation step: the non-linearities may require the completion of multiple steps before real benefit is noticeable. Over time, applying the iterative cycle results in HX score improvements extending throughout the working day at all locations, and for all services and applications.
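The value of pressing on can be sketched with a toy version of the monitor - diagnose - upgrade - repeat cycle. This is an assumption-laden illustration rather than the HX analytics themselves: it models end-to-end delivery as the smallest capacity on the path, and the names `delivered` and `remediate`, the capacities, and the demand are all made up.

```python
def delivered(demand, capacities):
    """End-to-end delivery is limited by the tightest stage on the path."""
    return min([demand, *capacities.values()])


def remediate(demand, capacities, upgraded_capacity):
    """Repeat monitor - diagnose - upgrade until demand is met.

    Returns the delivered level observed before and after every step.
    """
    if upgraded_capacity < demand:
        raise ValueError("upgraded_capacity must be at least the demand")
    caps = dict(capacities)                      # leave the caller's data untouched
    history = [delivered(demand, caps)]          # monitor
    while delivered(demand, caps) < demand:
        worst = min(caps, key=caps.get)          # diagnose the top impairer
        caps[worst] = upgraded_capacity          # upgrade/reconfigure
        history.append(delivered(demand, caps))  # monitor again
    return history


# Hypothetical capacities, in arbitrary units.
capacities = {"WAN access": 110, "DC access": 60, "DC network": 62, "servers": 64}
history = remediate(100, capacities, upgraded_capacity=150)

if __name__ == "__main__":
    print(history)
```

With these numbers the first two steps barely move the result (60, then 62, then 64) and only the third step lifts it to 100. Rolling back after the first step would have thrown away a necessary part of the fix.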

Of course, the enterprise environment never stays still. New applications, new users (whether staff, partners or customers), changing business patterns - these all contribute to the moving target of today’s digital supply chains. This is why continuous improvement is essential - HX analytics minimises the amount of time wasted by people waiting for the digital world to respond.