Some Thoughts on Test Strategy

At my last job, I found myself working on testing and test automation again, and the experience got me thinking about how to improve the process and prioritization of building and maintaining a testing effort. I wanted to capture that somewhere before it completely leaves my head, so I thought I’d put it here.

There are two main areas I came up with to frame the exploration of this topic. First, value: how do tests deliver value to the development team and the business? Second, cost: what are the ways that building and running tests cost the team? All this with an eye toward DORA (DevOps Research and Assessment) alignment, both looking at the testing effort as a self-standing development process and in terms of how it ties in to those metrics for the product, and where it touches on the points of value and cost. Finally, the metrics are designed with a bias toward concrete data points that can be measured with data available from existing tools, like ticketing systems and test result records, so the regime can be used practically to weigh priorities when deciding what to write, what to improve, and what to retire.

Value

The core factor in how tests deliver value is confidence. Confidence that if the tests pass, the system works, and if they don’t, the system is broken. Confidence that if it is broken, they tell us how, and how to fix it. From that premise, there are some metrics to be derived. These metrics apply to automated testing, but also to manual test documentation.

in service ratio – Out of every opportunity to generate a result from a test (roughly speaking, a test execution), what is the ratio of it generating a signal on the state of the product vs. not? This can further be broken down into explicit (a test is taken out of service for maintenance, or generates a test failure) and implicit (a test is too hard, flaky, or long-running to run frequently, or there’s an environment or data problem that precludes it running, etc.). This should be collectable from test result data. Version control data for the test repo may also be useful.
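
To make that concrete, here’s a minimal sketch of how the ratio might be computed from exported test result records. The record shape, the status values, and the skip_reason field are all assumptions about what a results store could provide, not any particular tool’s schema.

```python
from collections import Counter

# Hypothetical result records exported from a test results store.
# The "status" values and "skip_reason" categories are assumed, not from any real tool.
results = [
    {"test": "checkout_flow", "status": "passed"},
    {"test": "checkout_flow", "status": "failed"},  # still a signal about the product
    {"test": "checkout_flow", "status": "skipped", "skip_reason": "under_maintenance"},  # explicit
    {"test": "checkout_flow", "status": "not_run", "skip_reason": "env_unavailable"},    # implicit
]

def in_service_ratio(records):
    """Share of result opportunities that produced a pass/fail signal on the product."""
    statuses = Counter(r["status"] for r in records)
    signals = statuses["passed"] + statuses["failed"]
    return signals / len(records) if records else 0.0

print(f"in service ratio: {in_service_ratio(results):.2f}")  # 0.50 for the sample data
```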

product failure rate – Out of every potential execution of the test, how often does it surface a product issue, including when it is put out of service pending a product fix? Remember, this is a measure of the test’s value, not the product’s quality. It could also be weighted by the severity and/or priority of the issues identified. This should be collectable from ticketing/work management systems, with ticket linking and maybe a custom field or two on tickets.
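
Here’s a minimal sketch of the weighted version, assuming executions can be joined to their linked product tickets and that severity comes from a custom field; the weight table itself is a placeholder judgment call.

```python
# Assumed severity weights; these would be a team judgment call.
SEVERITY_WEIGHT = {"critical": 5, "major": 3, "minor": 1}

# Hypothetical join of test executions to linked product issues
# (e.g. via ticket links plus a severity custom field).
executions = [
    {"test": "checkout_flow", "product_issue": None},
    {"test": "checkout_flow", "product_issue": {"id": "BUG-101", "severity": "major"}},
    {"test": "checkout_flow", "product_issue": None},
    {"test": "checkout_flow", "product_issue": {"id": "BUG-117", "severity": "minor"}},
]

def product_failure_rate(execs, weighted=False):
    """Product issues surfaced per potential execution, optionally severity-weighted."""
    if not execs:
        return 0.0
    if weighted:
        hits = sum(SEVERITY_WEIGHT[e["product_issue"]["severity"]]
                   for e in execs if e["product_issue"])
    else:
        hits = sum(1 for e in execs if e["product_issue"])
    return hits / len(execs)

print(product_failure_rate(executions))                 # 0.5
print(product_failure_rate(executions, weighted=True))  # 1.0
```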

target functionality value – How important is the functionality the test targets to the business? This can be because it is critical path, or because it blocks a lot of other functionality/tests. This would have to be judged by the team, with some way of maintaining consistent meaning for ranks/scores.

Cost

On the opposite side of the equation, there is the cost of building and maintaining a test case and suite, as well as the indirect cost of things like maintaining data, test hooks/harnesses, and even product features that exist to support tests being run against the system, or that have restrictions placed on their implementation or change to support testing efforts. Expanding out from that, though, there are also the costs that reduce confidence. Since the value of tests is providing confidence in the correctness of the product, anything that reduces that confidence is a cost borne by the organization as a whole.

initial development cost – How much will/did it cost to develop the test definition, supporting data/processes, and automation if applicable? This should be available from ticketing systems and, for automation, version control systems.

execution time – How long does it take to run this test? This is relevant to manual tests, as it translates directly to work hours. It is also relevant to automation, as it determines how often the test can be run, and how far back in the delivery chain the test can practically be included (e.g. local unit/integration tests on every commit or file save, vs. run as part of CI, vs. run as part of a periodic sweep). This data should be available from a combination of test results, potentially CI logs, and ticketing systems.
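
One concrete way to use runtime data is to bucket tests against a time budget for each point in the delivery chain. A minimal sketch: the budgets and the median runtimes below are illustrative assumptions, not recommendations, and the duration data would come from test result or CI logs.

```python
# Illustrative time budgets (seconds) for where a test can practically live.
# These thresholds are assumptions for the sketch, not recommendations.
STAGE_BUDGETS = [
    ("pre-commit", 1.0),
    ("ci", 60.0),
    ("periodic sweep", float("inf")),
]

def stage_for(median_runtime_s):
    """Earliest delivery-chain stage whose time budget the test fits within."""
    for stage, budget in STAGE_BUDGETS:
        if median_runtime_s <= budget:
            return stage
    return "periodic sweep"

# Median runtimes pulled from (hypothetical) test result / CI log data.
runtimes = {"parse_config": 0.2, "checkout_flow": 45.0, "full_catalog_sync": 1800.0}
for test, seconds in runtimes.items():
    print(f"{test}: {stage_for(seconds)}")
```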

maintenance cost – How much time is put into maintaining the test, directly or indirectly? This includes updating manual definitions, adjusting tests that break, maintaining data the tests depend on, maintaining any automation, and maintaining any enabling hooks and test features in the product. This is a bit harder to track, since many maintenance activities have bearing on multiple tests and the cost has to be apportioned among them, but with a little effort and diligence, it should be doable within a ticketing system, with linking and a custom field or two.
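
For the apportionment problem, a simple (admittedly crude) approach is to split each maintenance ticket’s logged time evenly across the tests it links to. A minimal sketch, assuming hypothetical ticket fields for logged hours and linked tests:

```python
from collections import defaultdict

# Hypothetical maintenance tickets, with logged hours and links to affected tests.
tickets = [
    {"id": "TEST-42", "hours": 6.0, "tests": ["checkout_flow", "refund_flow"]},
    {"id": "TEST-57", "hours": 2.0, "tests": ["checkout_flow"]},
]

def maintenance_cost_per_test(maintenance_tickets):
    """Crude apportionment: split each ticket's hours evenly across its linked tests."""
    costs = defaultdict(float)
    for t in maintenance_tickets:
        if not t["tests"]:
            continue
        share = t["hours"] / len(t["tests"])
        for test in t["tests"]:
            costs[test] += share
    return dict(costs)

print(maintenance_cost_per_test(tickets))  # {'checkout_flow': 5.0, 'refund_flow': 3.0}
```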

out of service ratio – Slightly different from the in service ratio: rather than measuring actual executions against idealized possible executions, this metric measures actual executions against actual attempted executions. Basically, it measures how long a test spends in maintenance. There are two variations here. The first is based solely on how often a test is not executed, or fails, due to a test issue, and tells us how consistently we can rely on the coverage the test provides. The second also includes being out of service because of a blocking product issue. The second form is of course directly tied to the turnaround time for bug fixes in the product, but it can provide a signal on which tests may need to be revised to narrow their scope (for example, a test which unnecessarily couples features, so that a bug in either feature blocks coverage of both) or to change their approach to be more isolated around the targeted behavior. This data should be available from test result logs, provided they include failure categorization (e.g. product vs. test issue). For automated tests, it could also potentially make use of data from version control systems.
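
Here’s a minimal sketch of the two variations, assuming each attempted execution carries the failure categorization mentioned above (product issue, test issue, or a clean run); the category labels and field names are assumptions.

```python
# Hypothetical attempted executions with a failure categorization field.
# The "blocked_by" values ("test_issue", "product_issue", None) are assumed labels.
attempts = [
    {"test": "checkout_flow", "executed": True,  "blocked_by": None},
    {"test": "checkout_flow", "executed": False, "blocked_by": "test_issue"},
    {"test": "checkout_flow", "executed": False, "blocked_by": "product_issue"},
    {"test": "checkout_flow", "executed": True,  "blocked_by": None},
]

def out_of_service_ratio(records, include_product_issues=False):
    """Share of attempted executions lost to test issues (optionally product issues too)."""
    blocking = {"test_issue"}
    if include_product_issues:
        blocking.add("product_issue")
    blocked = sum(1 for r in records if r["blocked_by"] in blocking)
    return blocked / len(records) if records else 0.0

print(out_of_service_ratio(attempts))                               # 0.25: first variation
print(out_of_service_ratio(attempts, include_product_issues=True))  # 0.50: second variation
```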

uncaught issue rate/count – How often, or how many times, does a product bug get identified that falls within the coverage area of a given test but was not caught by it? This goes straight to confidence. There are also two variations here. The first concentrates only on issues caught downstream of the test in the development process, to illuminate holes in coverage. The second looks upstream of the test, to get a signal on whether a test may be a good candidate for improvement, or for porting to move it earlier in the development process. This should be collectable from a combination of the ticketing system, version control, and CI/CD logs/events.
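
A sketch of that split, under the assumption that each escaped bug is tagged with the pipeline stage where it was found and that the test itself is mapped to a stage; the stage ordering and field names are hypothetical.

```python
# Assumed ordering of pipeline stages, earliest first.
STAGE_ORDER = ["pre-commit", "ci", "staging", "production"]

# Hypothetical bugs within a test's coverage area, tagged with where they were found.
bugs = [
    {"id": "BUG-201", "found_in": "production"},
    {"id": "BUG-202", "found_in": "ci"},
    {"id": "BUG-203", "found_in": "staging"},
]

def uncaught_issue_counts(bug_records, test_stage):
    """Split escaped bugs into those found downstream vs upstream of the test's stage."""
    test_idx = STAGE_ORDER.index(test_stage)
    downstream = sum(1 for b in bug_records if STAGE_ORDER.index(b["found_in"]) > test_idx)
    upstream = sum(1 for b in bug_records if STAGE_ORDER.index(b["found_in"]) < test_idx)
    return {"downstream": downstream, "upstream": upstream}

print(uncaught_issue_counts(bugs, test_stage="staging"))  # {'downstream': 1, 'upstream': 1}
```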

time to fix, total and average – How long does it take to fix test issues for a given test? The total (and the total vs. the number of executions, or vs. the total amount of time the test has existed) tells us which tests need attention, either for reworking or for review/retirement. The average per test issue tells us how hard, roughly speaking, it is to fix a given test. Variants here are time from first failure to fix, and time from start of work on a fix to delivery. This can be collected from a combination of the ticketing system, version control, and test result data.
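
A minimal sketch of the two timing variants, assuming each fix can be reconstructed as a set of timestamps (first failure, start of work, delivery) from ticket and version control data; the field names are assumptions.

```python
from datetime import datetime
from statistics import mean

# Hypothetical fix records for one test, reconstructed from ticket + version control data.
fixes = [
    {"first_failure": datetime(2024, 3, 1, 9), "work_started": datetime(2024, 3, 2, 10),
     "delivered": datetime(2024, 3, 2, 16)},
    {"first_failure": datetime(2024, 4, 5, 14), "work_started": datetime(2024, 4, 5, 15),
     "delivered": datetime(2024, 4, 6, 11)},
]

def time_to_fix_hours(records, start_field="first_failure"):
    """Total and average hours from `start_field` to delivery of the fix."""
    durations = [(r["delivered"] - r[start_field]).total_seconds() / 3600 for r in records]
    return sum(durations), mean(durations)

total, avg = time_to_fix_hours(fixes)
print(f"from first failure: total {total:.1f}h, average {avg:.1f}h")
total, avg = time_to_fix_hours(fixes, start_field="work_started")
print(f"from start of work: total {total:.1f}h, average {avg:.1f}h")
```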

average time to isolation – How long it takes to root-cause a failure of a test. This can be sliced up a few ways as well. First, how long it takes to identify whether a test failure is a product or a test issue. Second, for product issues, how long it takes from first failure, or from start of work, until the reason for the failure is identified and work on a fix can start. Tests should be diagnostics that facilitate fixes, not just pass/fail gates, and this metric lets us optimize for that. This data can be made available from a combination of test results and ticketing system information.
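
And a minimal sketch of the triage and isolation slices, assuming each failure record carries a categorization timestamp (when it was determined to be a product vs. test issue) and, for product issues, a root-cause timestamp; again, all field names are assumptions.

```python
from datetime import datetime
from statistics import mean

# Hypothetical failure records with triage and root-cause timestamps.
failures = [
    {"failed_at": datetime(2024, 5, 1, 8), "categorized_at": datetime(2024, 5, 1, 10),
     "category": "test_issue", "root_caused_at": None},
    {"failed_at": datetime(2024, 5, 3, 9), "categorized_at": datetime(2024, 5, 3, 9, 30),
     "category": "product_issue", "root_caused_at": datetime(2024, 5, 3, 15)},
]

def hours(start, end):
    return (end - start).total_seconds() / 3600

# Average time to decide whether a failure is a product or test issue.
triage = mean(hours(f["failed_at"], f["categorized_at"]) for f in failures)
# For product issues, average time from first failure to an actionable root cause.
isolation = mean(hours(f["failed_at"], f["root_caused_at"])
                 for f in failures if f["category"] == "product_issue")

print(f"average triage time: {triage:.2f}h")            # 1.25h for the sample data
print(f"average time to isolation: {isolation:.2f}h")   # 6.00h
```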

DORA Explora

Development and test have a bidirectional relationship; development produces features and fixes for test to consume, but test also produces results that development consumes as work items and release approval steps. As a result, these metrics feed into DORA stats for both dev and test, though in different ways.

Deployment Frequency

On the test side, initial development and maintenance cost are going to impact this most. Runtime matters as well, mostly for manual testing and less so for automated, but it is a much smaller factor.

On the dev side, it’s reversed from test, in that test runtime is going to directly impact how quickly a build can move through environments. Isolation time is also going to have a fair bit of impact.

Change Lead Time

On the test side, the same metrics apply as for deployment frequency. Delivery speed for development and the product also plays in, but that is out of scope for this post.

On the dev side, again, it’s pretty much the same as deployment frequency, though test failure rate and time to fix are going to play in as well.

Change Failure Rate

For test, the in service and out of service ratios are going to be the main aligning metrics. Uncaught issue rate and count also track test deficiencies.

For dev, product failure rate is basically this entire metric.

Service Restoration Time

With testing, time to fix maps directly onto this.

Finally, in development, this will be impacted by fix and isolation time.

Conclusion

If some or all of these metrics are adopted, it becomes possible to have better-informed discussions about testing than the common metrics of test count and feature/code coverage allow. It becomes possible to talk about meaningful SLAs for test development and quality. Capacity planning and work prioritization start to move from vibes-based to empirical. Basically, the ROI of the testing effort can be measured and optimized much more effectively.

