Friday, June 10, 2016

Engineering Culture of stability: Flaky tests

 The scenario:

    We have an automation suite for our apps that is set to run on every commit to master and on every deploy to Prod, and for a long, long time (almost since the beginning) we've been struggling to make it reliable enough.

    The tests run on a CI server (TeamCity) using Selenium WebDriver/Grid. We know the tests work, because when we run them locally on our laptops (the team and I have tried it) they pass every single time.

    But when they fail, they don't always fail at the same spot. Sometimes it's a timeout while waiting for a web element; sometimes the test ends up on an error page it shouldn't have reached in the first place, and we have no idea how it got there... So yeah, it's frustrating.

    The team has tried a lot of different approaches to debug it: re-writing the setup and teardown of each test so that everything is cleared at the end of every single test and the next one starts with a clean workspace/cache; making Selenium take a screenshot every time a test fails, to see what happened; trying different versions of chromedriver/Chrome/Selenium; adding heavy logging of each action taken; running the tests several times in a row to look for a pattern...

The problem:

    Unfortunately, across our entire suite of tests, we see a steady rate of runs reporting a "flaky" result. We define a "flaky" test result as one where the same code exhibits both a passing and a failing outcome. The root causes of flaky results are many: parallel execution, reliance on non-deterministic or undefined behavior, flaky 3rd-party code, infrastructure problems, etc. Some tests are flaky because of how the test harness interacts with the UI: sync timing issues, handshaking, and extraction of SUT state.
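    The definition above ("both a passing and a failing result with the same code") is mechanical enough to compute from CI history. Here is a minimal Python sketch of that check; the test names, the `(test, revision, passed)` record shape, and `find_flaky_tests` are all hypothetical, not part of our actual tooling:

```python
from collections import defaultdict

def find_flaky_tests(results):
    """Given (test_name, revision, passed) records, report tests that
    exhibited both a pass and a fail on the same code revision."""
    outcomes = defaultdict(set)
    for test_name, revision, passed in results:
        outcomes[(test_name, revision)].add(passed)
    # Flaky = both outcomes observed for the same (test, revision) pair.
    return sorted({name for (name, _), seen in outcomes.items() if len(seen) == 2})

runs = [
    ("test_login",  "abc123", True),
    ("test_login",  "abc123", False),  # same code, different outcome -> flaky
    ("test_search", "abc123", True),
    ("test_search", "abc123", True),
]
print(find_flaky_tests(runs))  # ['test_login']
```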

    Even though we have invested a lot of effort in removing flakiness from tests, overall the insertion rate is about the same as the fix rate. This means we are stuck with a certain rate of tests that provide value but occasionally produce a flaky result.

Mitigation strategy:

    In my opinion, even after tons of effort to reduce such problematic tests, flaky tests are inevitable once the test conditions reach a certain level of complexity. We will always have a core set of problems that are only discoverable in an integrated end-to-end system, and those tests will be flaky. The main goal, then, is to manage them appropriately. I prefer to rely more on repetition, statistics, and runs that do not block the CI pipeline.

    Just tagging tests as flaky addresses the problem from the wrong direction, and it loses potentially valuable information about the root causes. However, I think there are some actions that can help us keep flaky tests to an acceptable minimum. Consider introducing some of the methods listed below in your own context. They are split by implementation difficulty, so you can plan your efforts accordingly:

[Easy] 
  • re-run only failed tests. A failed build should record those tests, mark them, and trigger a second build that executes only them. 
  • use a combination of exploratory testing and automation runs. One of the basics of automation is to pick appropriate candidates (features that are stable and not changed too often).
  • do NOT write many GUI system tests - they should be rare, used only when needed. You need to build a test pyramid; there is almost always a possibility to write the test at a lower level.
  • if you run tests in parallel, consider moving some (a few) tests into a single-threaded suite  
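The first item above (re-run only the failed tests in a second pass) can be sketched in a few lines of Python. This is an illustrative toy runner, not our TeamCity setup; `run_suite_with_failed_rerun` and the sample tests are hypothetical names:

```python
def run_suite_with_failed_rerun(tests):
    """Run every (name, fn) test once; collect failures; then run only
    the failed ones a second time, like triggering a second build."""
    status = {}
    failed = []
    for name, test in tests:
        try:
            test()
            status[name] = "passed"
        except AssertionError:
            status[name] = "failed"
            failed.append((name, test))
    # Second "build": execute only the tests that failed the first pass.
    for name, test in failed:
        try:
            test()
            status[name] = "passed on rerun"
        except AssertionError:
            status[name] = "failed twice"  # consistent failure, not flakiness
    return status

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    assert attempts["n"] > 1  # fails on the first attempt only

def stable():
    assert True

print(run_suite_with_failed_rerun([("stable", stable), ("flaky", flaky)]))
# {'stable': 'passed', 'flaky': 'passed on rerun'}
```

Note that a "passed on rerun" result is exactly the flaky signal from the problem section, so it is worth recording separately rather than folding it into a plain "passed".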
[Medium] 
  • re-run tests automatically when they fail during test execution. You can read the test status in the TearDown and, if it failed, start a new process to execute the test again. Some open-source testing frameworks/tools also have annotations to label flaky tests that should be re-run a few times upon failure (e.g. Android has @FlakyTest, Jenkins has @RandomFail via the flaky-test-handler-plugin, Ruby ZenTest has Autotest, and Spring has @Repeat).
  • a quarantine section (a separate suite/build job) that runs all newly added tests in a loop for a certain number of executions (a fitness function) to determine whether there is any flakiness in them; during that time they are not yet part of the critical CI path. Execute reliability runs of all your CI tests per build to generate consistency rates. Using those numbers, push product teams to move all tests that fall below a certain consistency level out of the CI tests.
  • consider advanced locator concepts, like a combination of XPath and look & feel
  • refactor toward the hermetic test pattern: avoid global/shared state or data, and rely on a random test run order to expose hidden dependencies
  • a proper test fixture strategy
[Advanced] 
  • a tool/process that monitors the flakiness rate of all tests and, if the flakiness is too high, automatically quarantines the test. Quarantining removes the test from the CI critical path and flags it for further investigation.
  • a tool/process that detects code changes and works to identify the one that caused a test's level of flakiness to change 
  • a test that monitors itself and what it does. If it fails, it looks at the root cause from the available log info. Then, depending on what failed (for example, an external dependency), it does a smart retry. Is the failure reproduced? Then fail the test.
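The core of the first advanced item (monitor flakiness, auto-quarantine above a threshold) can be sketched as a small Python class. All names, the window size, and the 90% threshold are illustrative assumptions, not a description of an existing tool:

```python
from collections import defaultdict, deque

class FlakinessMonitor:
    """Track recent results per test; auto-quarantine any test whose
    pass rate over the last `window` runs falls below `threshold`."""
    def __init__(self, window=10, threshold=0.9):
        self.window = window
        self.threshold = threshold
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.quarantined = set()

    def record(self, test_name, passed):
        runs = self.history[test_name]
        runs.append(passed)
        # Only judge once we have a full window of evidence.
        if len(runs) == self.window:
            rate = sum(runs) / self.window
            if rate < self.threshold:
                # Remove from the CI critical path; flag for investigation.
                self.quarantined.add(test_name)

monitor = FlakinessMonitor(window=10, threshold=0.9)
for i in range(10):
    monitor.record("test_checkout", passed=(i % 3 != 0))  # fails ~1 run in 3
    monitor.record("test_profile", passed=True)
print(sorted(monitor.quarantined))  # ['test_checkout']
```

Waiting for a full window before judging avoids quarantining a test on a single unlucky failure, which matches the "repetition and statistics" preference above.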

Conclusion:

    I know all of the above is far from a perfect or complete solution, but the truth is that you have to constantly invest in detecting, mitigating, tracking, and fixing test flakiness throughout your code base. 

