May 01, 2019
AWS just released a new set of features to X-Ray that can make debugging life much easier, whether it is to better understand complex serverless architectures, user flow, or to spot high error rates and outliers.
* For those of you already familiar with X-Ray and distributed tracing, well, if you insist - you can skip the intro and jump straight to the "what new features are there?" paragraph.
Deciding to go serverless with your applications is definitely a good choice. The advantages are clear: you pay as you go, scalability isn't a concern, and you can focus almost entirely on your business logic. By the nature of things, when you use a Function-as-a-Service platform (e.g. AWS Lambda), the service distribution of your application increases, ideally with a set of functions per application domain. A well-distributed application has some clear benefits we are all already aware of, but its biggest downside is even clearer - understanding what's going on under the hood is VERY difficult.
Let's say, for instance - you run a data enrichment flow. Your flow isn't even very complex; there's a source input and about 50 enrichment attributes, each having different logic, implemented with a different set of functions. Some are enriched via static data sources, some run crawlers, some use external APIs and some even analyze a picture or two. Let's assume function number 26 failed. It failed because it relied on function 15's enriched attribute, and function 15 gave a wrong output because it relied on function 4's output, which was wrong because the input wasn't sanitized properly. Sounds like a pain? Well, it is. And finding that root cause isn't trivial at all when all you have are CloudWatch logs.
So, where is this article going? Well, the title probably gave a small hint - distributed tracing.
While there are many solutions out there, we will focus on AWS X-Ray, which was introduced about three years ago and has made troubleshooting such issues much easier. I can't say it's an easy task, but it has made the application monitoring experience much better.
Let's move on to the interesting analytics features AWS recently announced. These features elegantly answer questions that couldn't really be answered before, such as: which users experience unusually high error rates or latencies? How does a failing flow differ from a healthy one? How does a user's experience today compare with yesterday's?
Starting with the service map, which isn't a new thing at all, you can already get a feel for your architecture, response times, and error rates. A service map can be generated and filtered by different attributes and is definitely a good starting point. Not to mention that when using Lambda, X-Ray can be bundled in automatically. And with a bit of instrumentation in your functions, you can add some important metadata like source request ID, source input, username, country, or anything similar you would want to analyze and compare in the future. Go ahead and start collecting data so you can play around with the service map and find out relationships.
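To give a feel for that instrumentation step, here is a minimal, self-contained sketch of the annotation pattern. The `Recorder` class below is a hypothetical stand-in for the real X-Ray SDK for Python (`aws_xray_sdk`), whose `xray_recorder.put_annotation(key, value)` call works the same way; the field names are just examples of metadata worth collecting:

```python
# Minimal sketch: attaching searchable metadata to each invocation.
# "Recorder" is a hypothetical stand-in for the real X-Ray SDK, where
# xray_recorder.put_annotation(key, value) plays the same role.
class Recorder:
    def __init__(self):
        self.annotations = {}

    def put_annotation(self, key, value):
        # X-Ray annotations are indexed, so they can be used in
        # filter expressions and the analytics console later.
        self.annotations[key] = value

recorder = Recorder()

def handler(event, context=None):
    # Collect the attributes you will want to filter and compare on later
    recorder.put_annotation("source_request_id", event.get("request_id", "unknown"))
    recorder.put_annotation("username", event.get("username", "anonymous"))
    recorder.put_annotation("country", event.get("country", "unknown"))
    return {"status": "ok"}

handler({"request_id": "req-123", "username": "NuwebaIsAwesome", "country": "IL"})
```

The key point is to record the identifiers you will want to slice by later, not just what is convenient at coding time.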
Let's assume we found out that user NuwebaIsAwesome wasn't hallucinating after all and actually hit a lot of errors and some big latencies. How can we investigate this issue? How does the flow that produced the error differ from other users'? Or even from the same user's experience a day ago? All these questions are great, and they are much easier to answer now with X-Ray's new analytics feature.
First of all, we can start by filtering on things like response time and error count for that user only. This gives us a group of aggregated traces that can then be compared with whatever other group you come up with, in whatever time frame you choose. I believe you can see now how powerful this feature is and how much easier it makes answering the example questions we mentioned earlier.
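As a concrete (hypothetical) example, an X-Ray filter expression like the following would select only that user's failed or slow traces; `username` here is an annotation you would have added yourself during instrumentation, while `error` and `responsetime` are built-in filter keywords:

```
annotation.username = "NuwebaIsAwesome" AND (error = true OR responsetime > 5)
```

The resulting trace group can then be compared, in the analytics console, against another time range or another user's traces.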
But let's get back to our initial case, the data enrichment flow. How can these analytics features benefit us? First of all, troubleshooting the user experience is finally possible (you can read about this in AWS' blog post "Analyze and Debug Distributed Applications Interactively"), but can these features make troubleshooting easier for a serverless developer? Well, yes. They can. As long as you instrument your functions with the right application-specific metadata, you can answer your questions much more easily.
In the case described above, the previous input, next input, and the initial request ID should have been traveling through the whole flow. We could then filter all the functions for that specific root invocation. Once we have them, it isn't hard to notice that function number 26 failed. On the same screen we can also see that one of the attributes in function number 15's input was empty. And that input was actually the output of function 4, which turned out to contain our beloved extra apostrophe! ("'", right?) And now that the root cause is probably so clear to you - it just led to a new BUG ticket in your issue tracker :)
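To make that walkthrough concrete, here is a self-contained sketch in plain Python (with a fake in-memory trace store and hypothetical function numbers taken from the story above, not the real X-Ray API) of why propagating the root request ID matters: once every invocation is tagged with it, filtering by that single ID reconstructs the whole chain and exposes both the visible failure and the first bad output:

```python
# Hypothetical trace store: each entry is one function invocation,
# tagged with the root request ID that started the flow.
traces = []

def record(function_id, request_id, output, error=False):
    traces.append({"function": function_id, "request_id": request_id,
                   "output": output, "error": error})

# Simulate the flow from the article: function 4 emits a bad value
# (an unsanitized apostrophe), 15 passes it along, 26 finally fails.
req = "req-123"
record(4, req, "O'Brien")          # unsanitized apostrophe sneaks in
record(15, req, "O'Brien")         # wrong output, propagated downstream
record(26, req, None, error=True)  # the visible failure

# Filtering by the root request ID reconstructs the chain...
chain = [t for t in traces if t["request_id"] == req]

# ...so the failure and the earliest suspicious output both fall out.
failed = next(t for t in chain if t["error"])
suspects = [t for t in chain if t["output"] and "'" in t["output"]]
root_cause = suspects[0]

print(failed["function"], root_cause["function"])  # → 26 4
```

In real X-Ray terms, `request_id` would be an annotation and the "filter" would be a filter expression such as `annotation.source_request_id = "req-123"`, but the reasoning is the same.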
To sum it all up, this might sound a bit theoretical, but it does really make your debugging life easier. Especially when you're a serverless developer chaining lots of functions and triggers, which makes it much harder to track the order of things.
X-Ray isn't perfect but it's progressing, and the new analytical features can help us answer more complex questions than ever before. However, remember, you are the only one that understands which questions should be asked, and which questions need to be answered. Make sure the necessary data is collected in the right places, whether it's by instrumentation, automatically by X-Ray, or any other monitoring solution. Once it's there, data can be filtered properly, and pinpointing a bug or any other cause for a problem becomes a much easier task.
Just as a side note, here at Nuweba, we bundle monitoring into our serverless infrastructure, allowing network level analysis. By doing so, data collection per invocation is always there, but data collection for external services also becomes possible without bundling it to the deployment. We believe serverless monitoring still has a long way to go, but seeing those features pop up keeps it interesting and shows we are getting better every day.
Hope you enjoyed the article, may your debugging and monitoring adventures be easier, more fun, and as analytical as possible. 👾