Understand the experience of your IT infrastructure


Actual Experience

The word ‘Experience’ is bandied around ever more frequently these days: ‘User Experience’, ‘Customer Experience’, ‘End User Experience’, ‘Employee Experience’.

Across a range of products and services, this word can mean different things. In the world of product analytics tools such as Heap, it is an indicator of the design of a digital product, particularly one that is web based. It is a valuable, automated way of tracking how well a service meets the commercial needs of a business – e.g. how much of a website’s footfall leads to a purchase or service signup.

This gives developers valuable insight into how effective a service is at meeting its customers’ needs, and where elements of the application design can be optimised. However, what tools like Heap do not do is indicate what impact infrastructure, whether servers or network, has on the ability to deliver that service with consistency and reliability.

Let's ask people using Customer Response Systems

In the domain of reviews and customer feedback, customers are asked to complete a survey, and this forms the basis of well-known approaches such as Net Promoter Score (NPS) and Customer Satisfaction (CSAT). These are clearly valuable, especially to sales and marketing, as they give direct feedback straight from the customer on the quality of the service delivered. However, they are often slow to produce, and information of value may be out of date by the time it is published. Even where this is not the case, unless the poll is sophisticated – which will probably reduce the response rate – the results amount to a marker on a ‘Like’ / ‘Don’t Like’ scale; there will be little to indicate what it is that users do not like, e.g. service design, slowness of response and so on.


Deep-dive with infrastructure management tools

The other domain is that of infrastructure management. There are many tools available to provide details of resource utilisation and error statistics on routers, servers, firewalls and other elements used to create public and private infrastructure. These are entirely necessary for managing that infrastructure successfully, but they do not provide visibility of how usable the applications are to the user population.

Knowing that a circuit is 50% utilised is important for capacity management, but does not identify the level of performance a user is receiving when accessing a service over the link in question.

One outcome of relying purely on this type of information is that such a link might be upgraded because a pre-defined threshold has been reached; without knowing whether users were impacted, the cost may have been incurred unnecessarily early. Worse still, it might be incurred only after users were already impacted, and probably complaining.

So these types of parameters (interface utilisation, buffer drops and so on) are insufficient to describe a real user’s experience. Additionally, in the world of cloud services, where organisations do not have access to systems performance metrics, it is increasingly infeasible to rely on them. However, for owned infrastructure, once an organisation knows it has a problem – and, as importantly, knows where to look – these tools become invaluable.


You're not really looking at end user experience

Of course, infrastructure management and the associated operational teams do not rely purely on element management tools. There are also tools that provide detail on end-to-end network behaviour and on application behaviour. As we’ve mentioned in a previous blog, “It is time for IT to stop guessing”, these types of parameters provide a technical metric only; the values have little meaning in a business context. Knowing that the network latency to a service is 100ms, or that the time to first page load is half a second, gives no real indication of ‘subjective experience’ – and what happens if the latency in the first example increases to 120ms?

Consumers of such information have to try to infer the quality of digital services. This is not trivial: it is made more complex by the fact that more than one variable may be changing (for example, loss and delay), and the behaviour of applications is typically non-linear in the presence of different network conditions. Once it is known that there is an issue and where it exists, element management tools can then be targeted to provide deeper root cause information.
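To make that non-linearity concrete, here is an illustration that is not tied to any particular tool: the well-known Mathis et al. approximation for steady-state TCP throughput (simplified here, with the constant factor dropped since only the ratio matters). The figures are illustrative assumptions, but they show how a modest latency rise combined with a small rise in loss degrades throughput far more than either metric suggests on its own:

```python
from math import sqrt

def tcp_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Simplified Mathis et al. approximation:
    throughput ~ (MSS / RTT) * (1 / sqrt(p)), in bits per second."""
    return (mss_bytes * 8 / rtt_s) / sqrt(loss_rate)

base = tcp_throughput_bps(1460, 0.100, 0.0001)   # 100 ms latency, 0.01% loss
worse = tcp_throughput_bps(1460, 0.120, 0.001)   # 120 ms latency, 0.1% loss

# Latency rose only 20% and loss is still a 'small' number,
# yet throughput falls to roughly a quarter of its previous value.
print(f"{worse / base:.2f}")
```

Neither 120ms nor 0.1% loss looks alarming in isolation, which is exactly why Quality of Experience has to be measured rather than inferred from individual technical metrics.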


Existing attempts at scoring experience

A number of infrastructure management tools aim to provide an experience score based on the technical metrics that are gathered. As an example, Apdex is an open standard that generates a score between 0 and 1 as an indication of Quality of Experience. It requires one or more thresholds to be defined before the score can be calculated, which in turn means that implementation can be quite complex and is tied to exactly what is being measured. Response times are compared with a ‘Satisfactory’ threshold, which is user-defined, and a ‘Tolerable’ threshold, which is automatically set to four times the ‘Satisfactory’ threshold. Used correctly, Apdex scores are a way to gain insight into platform-wide trends as an operational tool, but they shouldn’t be used in isolation. From a business perspective, the score does not identify how impacted a user is, since response times to specific queries are compared against very simple thresholds. More importantly, it does not identify the action needed to resolve issues.
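The Apdex arithmetic itself is simple; a minimal sketch of the standard formula, with illustrative response times, is:

```python
def apdex(response_times_s, t_satisfactory):
    """Apdex = (satisfied + tolerating / 2) / total samples.
    Satisfied: response time <= T; Tolerating: T < time <= 4T; Frustrated: > 4T."""
    t_tolerable = 4 * t_satisfactory
    satisfied = sum(1 for rt in response_times_s if rt <= t_satisfactory)
    tolerating = sum(1 for rt in response_times_s if t_satisfactory < rt <= t_tolerable)
    return (satisfied + tolerating / 2) / len(response_times_s)

# Five sample response times against a user-defined 0.5 s 'Satisfactory'
# threshold: two satisfied, one tolerating, two frustrated.
print(apdex([0.3, 0.5, 1.2, 3.0, 9.0], 0.5))  # 0.5
```

The complexity the standard brings lies not in this arithmetic but in choosing a meaningful ‘Satisfactory’ threshold for each transaction being measured.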

Other tools are capable of producing similar experience scores. Some generate more granular values on their scales and do so using multiple parameters (user transaction times, packet loss, host CPU utilisation and so on). These are useful operational tools, although significant care has to be taken over the handling of Personally Identifiable Information. They provide operational insight into whether services are functioning better or worse, but no business-level insight into how productively staff are able to work. Additionally, knowing that service quality can be improved is useful, but of very limited value without data to highlight where it needs to be improved. User experience tools that measure host performance parameters (e.g. CPU load) or transaction times can highlight host- or server-side issues, but little is guaranteed beyond that.


It's subjectivity that makes up experience

The Actual Experience methodology is underpinned by years of academic research which addresses the two key business challenges highlighted above:

  1. quantifying the business efficiency of digital services, and
  2. identifying actionable data in the presence of degradation.

In common with some tools, a score is generated from the measurements made, so that Quality of Experience does not have to be inferred. The key element, however, is that the score values can be directly related to wasted time, and hence to the bottom-line impact of underperforming services.

Secondly, the measurement of Quality of Experience is twinned with the identification of the devices impacting employee productivity. This then provides actionable data to different stakeholders:

  • Operational staff. The devices where root cause analysis is required are highlighted, avoiding guesswork or extensive trial-and-error testing.
  • Supply chain managers. Where impairment is introduced by a third party, that third party is identified and evidence is provided to facilitate remediation discussions and avoid finger-pointing.
  • Service delivery managers. Supplier ‘performance’ can be fed into monthly service reviews.


Experience becomes the bridge between impairments and business efficiency

This fault identification is automated via the systematic correlation of node-by-node behaviour against end-to-end behaviour – i.e. problem behaviour on a data path is highlighted if it consistently coincides with impairment of the end-to-end behaviour. This addresses a number of potential problems:

  • Fault isolation does not have to be inferred from base metrics such as traceroute.
  • Rather than an operational user trying to decide whether the values shown by traceroute definitively reflect a fault at a time that users are affected, the decision-making is performed automatically, in a process supported by millions of calculations.
  • The instrumentation allows data to be gathered and processed over a period of time and across an entire corporate infrastructure, where manual intervention would be impossible. This means that digital services can be managed at scale.
  • Issues that result from reduced performance at more than one link or device become visible, even if individually those elements do not appear to be affected. This is one of the complex cases that arise in troubleshooting, and is an example of the watermelon problem: ‘Green on the outside, red on the inside’, or ‘All the lights are green but the users are still complaining’. The sophisticated correlation techniques employed by Actual Experience are the key to rapid resolution of these types of problems.
  • Mathematically driven correlation allows complex issues to be identified that would not be immediately obvious to someone manually trawling through systems data.
  • Even with a single major problem, looking at manual traceroute data will not necessarily make the impairment location obvious, especially when the problem is intermittent.
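The correlation idea can be sketched in miniature. This is a toy illustration with hypothetical hop names and made-up impairment figures, not Actual Experience’s actual algorithm: each hop’s per-interval impairment series is correlated against the end-to-end series, and the hop whose problems consistently coincide with end-to-end impairment is flagged:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# End-to-end impairment per measurement interval (illustrative units).
end_to_end = [0, 1, 0, 5, 8, 1, 0, 7]

# Per-hop impairment along the data path (hypothetical hops).
hops = {
    "hop-1": [2, 2, 1, 2, 2, 2, 2, 1],   # steady background noise, unrelated
    "hop-2": [0, 1, 0, 6, 9, 1, 0, 6],   # tracks the end-to-end problem
    "hop-3": [9, 0, 4, 0, 1, 8, 2, 0],   # varies, but uncorrelated
}

# Rank hops by how consistently their impairment matches the end-to-end series.
ranked = sorted(hops, key=lambda h: pearson(hops[h], end_to_end), reverse=True)
print(ranked[0])  # hop-2 is the most likely fault location
```

A real deployment correlates many more variables over far longer windows and across every data path at once, which is what makes the multi-element ‘watermelon’ cases tractable where a manual traceroute is not.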

Maintaining data over time allows Human Experience Management to quickly identify chronic or intermittent issues which might otherwise take months to track down.

Many tools give an output which identifies when elements or services are sub-optimal in some way. Human Experience Management is one of a subset of those tools which presents the business impact of the delivery of digital services, does so in near-real time, and in a way that is simple to consume. Near-real-time availability matters because stale information cannot drive timely action.

Furthermore, Human Experience Management presents actionable insights when improvement to digital services is required to enhance user productivity, even in complex cases; this is the answer to the question “What do I do about it?”. It allows operational teams and supply chain managers to understand where problems exist and implement remediation plans. Human Experience Management is therefore complementary to a range of standard tools – network element management, application performance management and desktop management – which will be needed for deep-dive root cause analysis once a fault location has been isolated.


In conclusion

Employing Human Experience Management across an enterprise’s entire digital estate provides near-real-time visibility of the productivity levels enabled by key business applications, and identifies the weaknesses that limit productivity. This is particularly important at present, as significant numbers of staff are expected to work remotely and the question arises: “Are our people enabled to work effectively from home?”. A single organisation will have multiple personas, each with digital services that enable them to fulfil their function. Maintaining Human Experience Management allows business leaders to be confident that those personas continue to be served as well as possible, and to understand what to do about it if that changes.