Ido Neeman
August 12, 2019
Cold starts are one of the biggest challenges to serverless adoption. Regardless of the serverless platform, cold starts, or invocation overhead, can add seconds to the response time and create noticeable latency for users. As explained by Erwin van Eyk, Lead Developer on Fission, “a cold start is, in its essence, the worst-case time that a function execution will take.”
And despite the sometimes severe performance problems that they incur, van Eyk maintains that “cold starts are currently a fundamental characteristic of serverless computing.”
The nature of cold starts varies by FaaS provider, but there are several factors that affect all serverless platforms. Addressing these factors can help you improve the performance of your serverless applications.
When a function is invoked, the FaaS platform first has to deploy an environment for it to run in (i.e., a container or sandbox), then load the function code into it. This setup process adds latency and slows down the user experience, which can have a noticeable impact on user-facing applications such as eCommerce sites, ride-hailing services, and gaming platforms.
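To make the distinction concrete, here is a minimal Python sketch of a Lambda-style handler (the handler name and structure are illustrative, not tied to any one provider). Code at module level runs once, during the cold start; the handler body runs on every invocation of the warm sandbox.

```python
import time

# Module-level code executes while the sandbox is being set up,
# i.e. as part of the cold start. This is where imports, SDK clients,
# and other initialization costs accumulate.
INIT_TIME = time.time()
_cold = True

def handler(event, context):
    global _cold
    was_cold = _cold
    _cold = False  # later invocations reuse this already-warm sandbox
    return {
        "cold_start": was_cold,
        "seconds_since_init": round(time.time() - INIT_TIME, 3),
    }
```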
For example, 90% of online shoppers have abandoned an eCommerce site due to poor performance, and nearly 25% would not return to a slow site. The threshold is even lower for online games, where low latency is essential: in one preview of Google’s cloud-based streaming platform, Stadia, just 200ms of latency had a major impact on playability.
Now, let’s take a look at the main causes of cold starts.
Languages
The runtime is the environment that your function runs in, including the programming language and framework. In general, a compiled language like Go will start faster than languages like C# and Java, which can take significantly longer to start than JavaScript or Python. A major factor is the use of just-in-time (JIT) compilation. In the case of .NET languages, AWS Senior Developer Norm Johansen explains that “one of the most intensive tasks at startup is jitting your machine agnostic .NET assemblies into machine specific [assemblies]” and calls this “the primary culprit of longer cold starts”.
Function Size
Larger functions take longer to initialize, since the platform has to fetch and load more code. A 35MB JavaScript function takes an extra second to start on Azure compared to a 5MB function. This is problematic for applications that rely on large libraries, such as AI and machine learning applications.
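One common mitigation, sketched below in Python, is to load heavy dependencies lazily. This doesn’t shrink the deployment package, but it keeps the import cost off the cold start path for invocations that never touch the heavy code. The event fields and the use of numpy (standing in for a large ML library) are assumptions for illustration.

```python
def handler(event, context):
    # Only the inference path pays the import cost; a top-level
    # "import numpy" would instead run during every cold start.
    if event.get("needs_inference"):
        import numpy as np  # deferred import of a heavy dependency
        return {"result": float(np.mean(event["values"]))}
    return {"result": None}
```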
Function Chains
Functions calling other functions create a chain of requests, with each one dependent on the previous. This compounds invocation overhead across the chain, and if each link in the chain hits a cold start, the resulting latency can be very high. In addition, while cold start time won’t add to your operating costs, you are paying for the time your functions spend waiting for responses from other functions.
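Where the caller doesn’t need the callee’s response, one way to stop paying for that waiting time is to invoke the downstream function asynchronously. Here is a hedged sketch using boto3; the function name and payload fields are placeholders.

```python
import json

import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    # InvocationType="Event" queues the call and returns immediately,
    # so this function's billing clock stops here. A synchronous
    # "RequestResponse" call would keep it running (and billed) until
    # the downstream function, cold start included, finished.
    lambda_client.invoke(
        FunctionName="downstream-function",  # placeholder name
        InvocationType="Event",
        Payload=json.dumps({"order_id": event.get("order_id")}),
    )
    return {"status": "accepted"}
```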
Virtual Private Clouds (VPCs)
VPCs improve security by creating a network perimeter around resources. However, when running a Lambda function inside a VPC, each invocation creates an Elastic Network Interface (ENI), allocates an IP address for the ENI, then attaches the ENI to the function. This process adds as much as 10 seconds to cold start times.
The big three FaaS providers (AWS Lambda, Google Cloud Functions, and Azure Functions) vary in how they handle cold starts. Each one also has its own unique benefits and challenges, which we’ll include here when relevant. We’ll include two measurements for each provider: one from Mikhail Shilkov, who tested a variety of runtimes; and one from Serverless Benchmark, which tested the Node.js runtime. These measurements were retrieved on August 8, 2019.
AWS Lambda
Lambda is one of the faster FaaS platforms, and recently improved its performance, with Shilkov measuring most runtimes between 500–800ms. The exception is C#, which can take between 800ms and 5 seconds. Instances remain warm for between 25 and 60 minutes, and are almost always disposed of after 65 minutes. Serverless Benchmark places Lambda’s cold start times between 220ms and 4.6 seconds, with a median of 304ms.
Using a VPC significantly increases cold start times (by as much as 17 seconds), although the Lambda team is working to improve these speeds.
Google Cloud Functions (GCF)
GCF has noticeably slower cold starts than AWS. JavaScript functions take 1.5 seconds on average to start, roughly three times as long as AWS. This difference is compounded for larger functions, with a 35MB JavaScript function taking as much as 22 seconds compared to 5.5 seconds on AWS. However, GCF tends to keep instances around for much longer; more than 5 hours in most cases.
Serverless Benchmark, however, measures lower cold start times for GCF Node.js functions, ranging from 50ms to 14 seconds with a median of just 188ms.
Azure Functions
A typical cold start for a JavaScript function on Azure is between 2 and 12.5 seconds, which is longer than both AWS and GCF. Java performance is even worse, taking roughly 20 to 35 seconds. This is using Azure Functions V2; V1 offers a much faster time of just 1.9 seconds. Serverless Benchmark shows a similar increase, with times ranging from 3.8 seconds to over a minute and a median of 5.9 seconds. However, Azure disposes of its instances more quickly than other platforms, keeping them warm for at most 20 minutes.
Scaling HTTP-based Functions
Responding to HTTP requests is a common use case for functions, but cold starts can greatly affect response times. For user-facing applications like websites, this added latency may drive away users. In addition, each provider handles traffic surges in different ways, with AWS Lambda scaling much more consistently than either Azure Functions or GCF. The problem is made worse when using slower-starting runtimes like .NET and Java, or when running a function inside a VPC.
Databases
Connecting to a database adds to the cold start time, since the function must wait for the connection to initialize before it can send a query. Functions can reuse the connection for future invocations, but only until the function itself is disposed of. Some DBMSes, such as Amazon Aurora Serverless Database, support querying over HTTP. This lets you send queries from functions to a database without having to manage a direct connection. In AWS, this also lets you access VPC databases without the overhead of running your function in a VPC.
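As a rough sketch of what this looks like in practice, the snippet below queries an Aurora Serverless cluster through the RDS Data API using boto3. The ARNs, database name, and schema are placeholders.

```python
import boto3

rds_data = boto3.client("rds-data")

def handler(event, context):
    # execute_statement sends the query over HTTP; there is no
    # database connection for the function to open or keep alive.
    response = rds_data.execute_statement(
        resourceArn="arn:aws:rds:us-east-1:123456789012:cluster:my-cluster",
        secretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:my-db-secret",
        database="mydb",
        sql="SELECT id, name FROM customers WHERE id = :id",
        parameters=[{"name": "id", "value": {"longValue": 42}}],
    )
    return response["records"]
```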
No matter your FaaS platform, here are a few steps you can take to reduce your cold starts: