Scott Moore, Scott Moore Consulting
Live webinar: Performance Engineering Trends: An Industry Expert Discussion
March 2nd: 8 AM Pacific, 5 PM CET
Where are we going in 2021?
This year we are moving into a new chapter of IT that began in 2020. One where digital transformation has been shoved into high gear by a global pandemic, and it has changed the way we work and live. There is an even greater dependency on technology to enable us to get through our day. This means more software being built to enable us, and that means more testing. Performance and resiliency are key characteristics of successful software - now probably more than ever.
As the development method, cycles, and processes change, so must testing. How we approach performance testing and performance monitoring must evolve. I’ve observed several new trends and I want to share a few of them here so you can recognize them when you see them. If you are a performance engineer they will most likely affect your everyday work. They may also require you to acquire new skills to keep up with the pace of change.
For years, there has been talk of “shifting left” in the testing world. This is where testing happens earlier in the life cycle, preferably in parallel with development at the pace of Agile. Some have had more success than others, but those who have seen the most benefit have been those who understood that this approach takes time and discipline to implement. We have also heard about “shifting right”, as DevOps (or DevSecQAOps) attempts to finally blow away the silos from days of yore. Bi-directional sharing of performance data between the operations team (monitoring) and the performance/QA/dev group (testing) is a great thing, because it means fewer surprise outages due to load.
Now companies must get used to gracefully handling continual change. This means continuous software development and integration. It means continuous delivery of updates, fixes, and new features. It also should mean continuous testing and continuous monitoring, so performance testers and performance engineers must be part of the continuous cycle. This is basically getting “shift left” and “shift right” operating as a well-oiled machine. Like development, this will mean consuming and processing smaller chunks. It means testing smaller pieces, setting up smaller sets of monitors as you go, and putting all of this into a continuous pipeline. The key to including performance as part of that pipeline is to make sure there is ownership of performance at every level (code level, feature level, product level). Someone should be making sure that performance factors are considered at those levels, building those requirements into all user stories, and ensuring performance is a part of the definition of done.
Everything “as code”
The rise in the use of containers with Docker and Kubernetes in production has been exponential. According to the Cloud Native Computing Foundation, in 2016 only 23% of respondents saw any use of containers in production at their company. Last year (2020), that number was up to 84%.
These applications and services are deployed on containers by managing and provisioning machine-readable definition files (text files) instead of using interactive configuration tools. This is known as “infrastructure as code”.
As this process has matured, other areas are adopting the “as code” approach. This includes performance testing as code, monitoring as code, etc. At a basic level, it just means defining what you want to do in a text file (usually in a YAML or XML-like format) and running a command to do the work defined in that file. This is usually accomplished by executing a command in a command line interface (CLI). From there, it’s just a matter of automating the procedure to make it part of the continuous pipeline. While that’s an oversimplification, the goal is to reduce the amount of toil by automating as much as possible.
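As a minimal sketch of the “as code” idea, the Python below treats a performance test as a text definition plus a command that checks results against it. Everything named here is hypothetical and chosen for illustration: the test name, the keys in the definition, and the threshold values. JSON is used instead of YAML only so the sketch runs with the standard library; a real pipeline would read the definition from a version-controlled file and take results from whatever load tool is in use.

```python
import json

# Hypothetical test definition. In practice this text would live in a file
# under version control (often YAML; JSON keeps this sketch stdlib-only).
DEFINITION = """
{
  "name": "checkout-smoke",
  "virtual_users": 25,
  "duration_seconds": 60,
  "thresholds": {"p95_ms": 800, "error_rate": 0.01}
}
"""

REQUIRED_KEYS = ("name", "virtual_users", "duration_seconds", "thresholds")


def load_definition(text):
    """Parse the definition and confirm the required keys are present."""
    spec = json.loads(text)
    for key in REQUIRED_KEYS:
        if key not in spec:
            raise ValueError(f"missing required key: {key}")
    return spec


def evaluate(spec, results):
    """Return a list of threshold violations; an empty list means a pass."""
    failures = []
    for metric, limit in spec["thresholds"].items():
        observed = results[metric]
        if observed > limit:
            failures.append(f"{metric}: {observed} exceeds limit {limit}")
    return failures


spec = load_definition(DEFINITION)
# Placeholder numbers standing in for the output of a real load test run.
sample_results = {"p95_ms": 910, "error_rate": 0.002}
for problem in evaluate(spec, sample_results):
    print("FAIL", problem)
```

In a pipeline, a non-empty list of violations would typically be turned into a non-zero exit code so the build stops, which is what makes the definition file the single place where the performance requirement is recorded.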
For the performance engineer, this means getting used to dynamic and ever-changing states. Performance testing will need to account for infrastructure that is completely dynamic (autoscaling), and be able to test with a lab that is also dynamic to reduce the cost of running it in the cloud. It means paying attention to the individual cost of each service or even of each line of code. Cloud native means there is a financial cost to each execution and CPU cycle, and the meter is always running. It becomes more important than ever to tie the financial impact of poorly performing applications to the company’s bottom line.
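To make the “meter is always running” point concrete, here is a rough sketch of the arithmetic that ties throughput to cost. The prices, instance counts, and request rate are invented for illustration, and the 730 hours per month figure is an averaging assumption, not a billing rule from any particular cloud provider.

```python
HOURS_PER_MONTH = 730  # assumed average; actual billing periods vary


def cost_per_1000_requests(hourly_instance_cost, instances, requests_per_second):
    """Infrastructure cost of serving 1,000 requests at a steady request rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    requests_per_hour = requests_per_second * 3600
    return hourly_instance_cost * instances / requests_per_hour * 1000


def monthly_cost(hourly_instance_cost, instances):
    """Cost of keeping a fixed number of instances running for a month."""
    return hourly_instance_cost * instances * HOURS_PER_MONTH


# Hypothetical scenario: the same 200 requests/second served by 4 instances
# before tuning and by 2 instances after tuning, at $0.20 per instance-hour.
before = monthly_cost(0.20, 4)
after = monthly_cost(0.20, 2)
print(f"monthly cost before tuning: ${before:.2f}")
print(f"monthly cost after tuning:  ${after:.2f}")
print(f"cost per 1,000 requests before: ${cost_per_1000_requests(0.20, 4, 200):.6f}")
```

Numbers like these are simplistic, but they let a performance finding be reported in currency as well as in milliseconds.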
Observability over APM
Many companies are moving away from monolithic applications with just a couple of tiers, toward more complex and distributed systems made up of smaller microservices. I tend to think of it like building something with Lego blocks. An individual Lego brick can be updated, replaced, or removed without interrupting the overall application, which means better resiliency. As mentioned earlier, everything becomes more dynamic. How do you monitor that kind of system and understand the performance of it? Application Performance Monitoring solutions of the past won’t work. Enter the new buzzword of the day: Observability.
Observability is NOT the new APM, but it is an additional characteristic that is used with APM to more easily determine the source of an issue in these dynamic and complex systems. According to Wikipedia, “observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.”
In everyday lingo, it means you need to be looking at as many things as possible to determine what’s going on. That means you still need APM, which provides the monitoring, but it needs to be able to see more than in the past. This is why we are seeing a rise in viewing not only infrastructure metrics like CPU and memory usage, but pairing that with more detailed tracing, W3C metrics from the browser, and tying it all together with various logs from the system and application. OpenTelemetry and OpenTracing are examples of attempts to create standardized APIs that can be used to get metrics from these dynamic systems. Think of it as shooting a flare gun into the architecture path and displaying various metrics as it hops along, sending all of that monitoring data back for visualization. The same request a few minutes later may travel a completely different route as containers fail over or scale up and down. Tracing provides a way to follow that path, measure the time spent on each segment/hop of the trip, and visualize where a performance problem might be.
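The idea of timing each hop of a request can be sketched in a few lines of plain Python. This is not the OpenTelemetry or OpenTracing API; the Span class, the service names, and the timings are all made up for illustration, and the hops are assumed to run one after another rather than nested or in parallel.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One hop of a request: which service handled it and when."""
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def slowest_hop(spans):
    """Return the single hop that took the longest."""
    return max(spans, key=lambda s: s.duration_ms)


def time_by_service(spans):
    """Total time spent in each service across all hops of the trace."""
    totals = {}
    for span in spans:
        totals[span.service] = totals.get(span.service, 0.0) + span.duration_ms
    return totals


# Hypothetical trace of one request passing through three services.
trace = [
    Span("gateway", 0, 12),
    Span("cart-service", 12, 40),
    Span("pricing-service", 40, 310),
    Span("cart-service", 310, 325),
]

worst = slowest_hop(trace)
print(f"slowest hop: {worst.service} ({worst.duration_ms} ms)")
print(time_by_service(trace))
```

Real tracing tools record far more per hop (parent/child relationships, tags, errors), but the question they answer is the same: where along the path did the time go?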
Performance engineers need to understand the differences between observability and APM, and what to do with that data to help them make better decisions in optimizing a system.
Machine Learning and AI
Observability means monitoring more things at a more granular level. This means a lot more data than ever before. Once the data is available, it can be visualized. The biggest challenge becomes how to make sense of the data in a timely manner so that performance issues can be recognized and eliminated quickly.
This is where machine learning and AI can help to spot patterns from all the data being received much faster than a human. It’s time to let the machines figure out what’s going on with the machines. There will always be a need for someone to view the results at a higher level to verify the patterns and deal with unpredictable outages. Machine learning tools should be able to shorten the time to resolution considerably, and provide some level of predictive maintenance to avoid outages when possible.
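As a very small stand-in for what such tools do, the sketch below flags response times that deviate sharply from a trailing baseline. It is a simple statistical rule, not a machine learning model, and the window size, threshold, and sample data are assumptions picked for illustration. Commercial tools use far more sophisticated techniques, but the underlying question (is this value unusual compared with recent behavior?) is similar.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples that sit more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change at all as unusual.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one obvious spike.
response_times = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
for index in find_anomalies(response_times):
    print(f"sample {index} looks anomalous: {response_times[index]} ms")
```

One design note: because the spike itself enters the baseline for the next few samples, a rule this simple becomes less sensitive right after an incident, which is one of the reasons real tools go beyond fixed thresholds.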
Performance engineers should be learning as much as they can about how machine learning works, and be able to provide rule sets that can be used to make finding performance bottlenecks a breeze. We need to apply these to both performance testing, by looking at test results data, and to all of the APM monitoring and observability data. As we home in on the patterns and anti-patterns that cause us headaches today, this is one area that can make our lives easier.
Time for Change
As you see these things emerge and make their way into mainstream companies, be prepared to keep learning. The main thing all of these trends have in common is that they are a result of the move to highly dynamic environments and systems. Change is the only constant. Over the next year, I recommend that you engage and communicate with the development and operations teams like never before. Learn how these trends are going to impact you as a performance engineer. Beef up your knowledge and skills. That may mean learning Python or setting up your own Kubernetes cluster in a demo environment. It could mean writing some YAML and spinning up a few “Dockerized” load generators in your test lab. Get ahead by learning the skills you need to address the performance issues that will arise as these trends affect your organization.