How do you compare samples and determine whether test results have improved or degraded?

I want to know your methods for comparing performance testing samples and determining whether the results have improved or degraded.

For example, sample A got a response time of 4.05 seconds and sample B got a response time of 4.20 seconds. Could I say that the result has degraded because sample B has a higher response time?

But the difference is only a fraction of a second, not 1, 2, 10, or 20 seconds. May I know your methods of comparison?



  • Hi cstrike, 

    This is an interesting question, but one with a difficult answer!

    I would say it depends on the application under test and the environment (that is, the technology stack). In my case, we test a lot of web navigation, web services, etc., and our infrastructure is very large and complex. If I came across your example, I would say there is no meaningful difference between the two runs; such a small degradation can be due to normal variation in the environment.
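
    One way to check whether a gap like 4.05 s vs 4.20 s is more than environmental noise is to repeat each test several times and compare the difference in means against the run-to-run spread. The following is only a minimal sketch: the function name `compare_runs`, the sample values, and the rule-of-thumb threshold of 2 standard errors are all assumptions for illustration, not a method taken from any particular tool.

```python
from math import sqrt
from statistics import mean, stdev


def compare_runs(baseline, candidate, threshold=2.0):
    """Classify candidate response times against baseline response times.

    The difference in means is divided by its standard error (Welch form).
    The threshold of 2.0 is an assumed rule of thumb, not a formal test.
    """
    diff = mean(candidate) - mean(baseline)
    se = sqrt(stdev(baseline) ** 2 / len(baseline)
              + stdev(candidate) ** 2 / len(candidate))
    if se == 0:
        # No run-to-run variation at all: any difference is a real one.
        if diff == 0:
            return "no clear difference"
        return "degraded" if diff > 0 else "improved"
    score = diff / se
    if score > threshold:
        return "degraded"   # response times went up by more than the noise
    if score < -threshold:
        return "improved"   # response times went down by more than the noise
    return "no clear difference"


# Hypothetical response times in seconds from five repeated runs per build.
sample_a = [4.05, 4.31, 3.92, 4.18, 4.10]
sample_b = [4.20, 4.02, 4.35, 4.11, 4.26]
print(compare_runs(sample_a, sample_b))
```

    With these made-up numbers the spread between repeated runs is larger than the gap between the two averages, so the sketch reports no clear difference.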

    Below, I copied two reports from two runs of the same script/application: the first is from before a change to the memory thread pool and virtualization; the second is from after the changes.


    In this case we can say that there is an improvement. I suggest you also use the Transactions Per Second metric and compare it between your test and the Production environment (scaling proportionally for the differences between the environments), and the Average Transaction Response Time graph to see how performance, in terms of response time, holds up under the Vuser load.

    Let me know if this helps with the issue.



  • In addition, two important things play a role when comparing tests:

    1a the response times of your critical transactions. Identify a few based on your test case. Most of the time those involve back-end processing.

    1b You can also separately classify transactions that only exercise the path from the browser to the web server, such as loading (static) content.

    2 the amount of work that your application performs during a certain time window. This can be the number of HTTP transactions or overall transactions per second. These numbers should be more or less stable when you use a fixed iteration time (and fail the test when a user does not complete its action within that time).
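
    As a rough illustration of point 2, the sketch below computes throughput for one time window and counts iterations that exceeded a fixed iteration time. The function name `summarize_window`, the 60-second window, the 5-second budget, and the durations are all hypothetical values chosen for the example.

```python
def summarize_window(durations, window_seconds, iteration_budget):
    """Return (transactions per second, number of over-budget iterations).

    durations: seconds each iteration took within the window.
    Iterations that exceed iteration_budget count as failed and are
    excluded from the throughput figure.
    """
    completed = [d for d in durations if d <= iteration_budget]
    failed = len(durations) - len(completed)
    tps = len(completed) / window_seconds
    return tps, failed


# Hypothetical iteration durations (seconds) observed in a 60-second window.
durations = [4.05, 4.20, 3.90, 6.50, 4.10]
tps, failed = summarize_window(durations, window_seconds=60, iteration_budget=5.0)
print(f"TPS: {tps:.3f}, over-budget iterations: {failed}")
```

    Comparing the throughput figure and the failure count across runs gives a second signal alongside the response times.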

    The values you get from the above are sometimes difficult to compare. You might consider using the harmonic mean to construct a single artificial number, so you can see the impact from run to run and observe trends over time. Express each number that goes into the harmonic mean so that it points the same way, e.g. an increase is bad. Management will be happy that they only have to look at a single number.
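
    A minimal sketch of the single-number idea, using `statistics.harmonic_mean` from the Python standard library. Normalizing each metric to a ratio against a baseline run is an assumption added here so that metrics with different units can be combined; the metric names and values are made up. Every ratio is arranged so that a value above 1.0 means "worse than baseline".

```python
from statistics import harmonic_mean


def run_score(baseline, current, higher_is_better=frozenset()):
    """Combine several metrics into one number; above 1.0 means worse.

    baseline, current: dicts mapping metric name to its measured value.
    higher_is_better: names of metrics (e.g. throughput) where a larger
    value is good, so their ratio is inverted to keep "increase is bad".
    """
    ratios = []
    for name, base_value in baseline.items():
        if name in higher_is_better:
            ratios.append(base_value / current[name])
        else:
            ratios.append(current[name] / base_value)
    return harmonic_mean(ratios)


# Hypothetical metrics: two response times in seconds and one throughput.
baseline = {"login_rt": 4.05, "search_rt": 2.00, "tps": 50.0}
current = {"login_rt": 4.20, "search_rt": 2.10, "tps": 48.0}
score = run_score(baseline, current, higher_is_better={"tps"})
print(f"run score: {score:.3f}")
```

    Tracking this score from run to run gives a trend line, though the individual metrics are still needed to explain why it moved.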