Having thousands, or millions, of real users on your application can be very exciting for a VP of Engineering. However, that excitement can quickly turn into fear as you realize that product adoption may soon outpace (or already has outpaced) the capacity of your infrastructure to handle it. Of course, if you are hosting on the cloud, spinning up new server instances is automated. So you are covered, right? Well, more server instances mean higher costs, which may create a budget constraint, and not all scaling concerns are solved by simply adding more servers. If you are provisioning your own hardware, then you definitely have to stay ahead of scaling requirements.

Conversations about scaling usually start with architecture approaches, and patterns like horizontal scaling and microservices emerge. I agree that these are important, but I think some teams overlook a more fundamental opportunity: ensuring that your back-end code is as performant as possible. For this purpose, I am focusing on the server side of your application and infrastructure. Assuming you are delivering a service over the Internet, your infrastructure likely receives a request from a set of client applications over HTTP, does back-end processing to assemble the requested data and then returns a response. That back-end processing will usually span a number of different service layers, data stores and external calls. While it may seem obvious, focusing first on performance tuning can slow the growth of back-end server resources as usage increases.

In this post, I’ll review some basic performance tuning approaches that my teams have applied in the past. These didn’t eliminate the need to apply sound architectural design patterns to our infrastructure. For coverage of those, there are a number of great references – one example is The Art of Scalability. I encourage you to check out those resources; I won’t try to cover them in this post.

Performance First

The amount of processing power available on a single server can be thought of in terms of its capacity.  If you are running a web-based service that receives requests from multiple client types over HTTP, then your architecture likely has a front-end tier of web servers (Apache, nginx, Tomcat, IIS, Tornado, etc.).  These web servers will parse the client request, apply some basic business logic, determine the set of back-end microservices or databases that need to be accessed, make calls to those, process the returned payloads and then assemble the overall response back to the client. It’s these web servers that become the visible choke point when scaling problems manifest.  Your clients will experience increasingly long response times, and ultimately time-outs.  At this point, your site or service will be considered “down”.

Let’s focus on the utilization pattern for these front-end web servers. Loosely, utilization for a web server is determined by the average time to process each client request multiplied by the number of requests. A thread-based web server will create more threads to process these requests in parallel, taking advantage of multiple CPUs, but will still be bounded by maximum utilization on each CPU. A key constraint of a thread-based web server is that it blocks processing on I/O calls. Event-driven web servers (nginx, node.js, etc.) don’t block on I/O the way the threaded model does; instead, they dispatch the I/O call and then process the results in a callback. In either case, as the number of requests to your web server increases, so does its utilization. Depending on how many CPUs and how much memory a web server has, its peak capacity will correspond to a certain level of utilization. A useful measure of utilization is server load, a common operating system metric that can be collected from your servers. As a rule of thumb, the load a server can handle is equal to the number of CPUs on the box: a server with 24 CPUs would have a target load value of 24. A server can exceed its max load value, but by that point performance has usually degraded noticeably.
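
If you want a quick read on where a box stands, here is a minimal sketch in Python (Unix-only, since it relies on os.getloadavg) that compares the 1-minute load average against the CPU count, per the rule of thumb above:

```python
import os

def load_headroom():
    # 1-, 5- and 15-minute load averages, as reported by the OS.
    load_1m, load_5m, load_15m = os.getloadavg()
    cpus = os.cpu_count()
    # Rule of thumb from above: target load equals the number of CPUs.
    print(f"1m load {load_1m:.2f} on {cpus} CPUs "
          f"({load_1m / cpus:.0%} of target)")
    return load_1m < cpus

if __name__ == "__main__":
    load_headroom()
```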

If a server’s capacity represents a peak utilization level, then you can generate more capacity by reducing utilization. Since you don’t want to throttle or turn away client requests, this means focusing on the average cost per client request. If you can performance-tune the processing of client requests and reduce the average time to process each one, then you are effectively increasing your server’s capacity.
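
To make the arithmetic concrete, here is a back-of-the-envelope sketch; the traffic numbers are made up, but the relationship is the one described above:

```python
def cpu_seconds_per_second(requests_per_sec, avg_seconds_per_request):
    # Utilization: how many CPU-seconds of work arrive per wall-clock second.
    return requests_per_sec * avg_seconds_per_request

# Hypothetical traffic: 500 requests/sec at 40ms each saturates 20 CPUs.
print(cpu_seconds_per_second(500, 0.040))  # 20.0
# Tuning the average request down to 20ms frees half that capacity.
print(cpu_seconds_per_second(500, 0.020))  # 10.0
```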

Using a performance analysis tool, like New Relic or AppDynamics, you can get a detailed view of all the client requests being processed on your web server. The tool will show you the number of calls of each type per second and the average time to process each. You can then quickly see which client request types are the most expensive in terms of processing resources (calls per second x average time to process). These are the client request types that you should focus on first. Your performance analysis tool will then allow you to view the call stack for each request. You can view time spent in each major code function and all calls made to systems outside the web server.  External calls include resources like databases, third party APIs, file servers, message queues, etc.  By examining the call stacks for the most expensive client requests, you can identify the bottlenecks in processing. These bottlenecks provide opportunities for performance tuning by your team.  By repeating an exercise of gathering performance data and tuning performance over time, you can gradually reduce the utilization of your servers. This will result in a net increase in capacity. I’ll share some areas where I have found performance tuning opportunities in the past.
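
The ranking itself is just that multiplication. A sketch with hypothetical profiling numbers (the request names and figures are invented for illustration):

```python
# (request type, calls per second, average processing time in seconds)
profile = [
    ("GET /feed",     300, 0.080),
    ("POST /comment",  40, 0.150),
    ("GET /profile",  200, 0.015),
]

# Cost = calls/sec x avg time: CPU-seconds consumed per wall-clock second.
for name, rate, avg in sorted(profile, key=lambda r: r[1] * r[2], reverse=True):
    print(f"{name:14s} {rate * avg:5.1f} CPU-s/s")
# GET /feed dominates at 24.0, so it is the first tuning target.
```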

Code Bloat

As you look at the most expensive client requests, use your profiling tool to drill into the individual functions being run. At a function level, you should generally see very fast runtimes (less than 1ms) for most functions. If a function doesn’t make any calls to another server (database, file server, etc.), it should easily execute in microseconds. If you find a function with a long runtime and no external calls, make note of it and ask a developer to review the logic. There is likely an opportunity to reduce the processing time through code simplification or a better algorithm for a complex operation. For example, the code might be using an inefficient sort or doing heavy string manipulation. Also, look at the amount of memory consumed by the process. If your code is loading a large number of external libraries, this can slow down processing. I have seen cases where a large class library was loaded in order to access a single function. You could consider extracting just that function’s code rather than loading the whole library. Of course, this breaks object-oriented design, but it might be worth it to reduce processing times for a heavily used code path. If your code makes use of file includes (this mainly applies to interpreted languages, like PHP), beware of cascading includes: a single file include references other file includes, and so on, until a huge amount of code has to be loaded into memory in order to run just a couple hundred lines.
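
As a concrete instance of the heavy string manipulation case, here is a sketch in Python of a common pattern: repeated concatenation in a loop, which can degrade to quadratic time in many languages, versus building the pieces and joining once:

```python
def build_report_slow(rows):
    # Repeated += can copy the whole accumulated string on every pass,
    # so this degrades to O(n^2) as the result grows.
    out = ""
    for row in rows:
        out += ",".join(row) + "\n"
    return out

def build_report_fast(rows):
    # Build the pieces, then join once at the end: O(n).
    return "".join(",".join(row) + "\n" for row in rows)
```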

Data Access

Your profiling tool will show you the points at which code execution makes calls out to databases. Examine two things here. First, how long on average are the database queries taking? If they seem long (more than 5-10ms is a good benchmark), then examine the query and its performance on the database itself. There might be an inefficient data access pattern, like a full table scan. Often, inefficient data access can be improved with a new index on the table, or even by changing the order of operators in the WHERE clause. If the query performs table joins, pay particular attention: make sure the join is done on a key, and that the tables being joined have a small number of columns. I try to avoid table joins in queries serving the front-end application, or limit them to two tables, with one table being small. To avoid a join, you could consider creating aggregate tables with overloaded columns. A single table look-up on an indexed column will give the best results. A developer or DBA can use the EXPLAIN statement on your database to examine the data access pattern for a query that is taking a long time; EXPLAIN output provides useful detail on how the database will actually run the query, including whether an index will be used.
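
As a runnable illustration, here is EXPLAIN in action using SQLite (the statement is EXPLAIN QUERY PLAN there; syntax and output differ a bit on MySQL or Postgres, and the exact plan text varies by version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, email TEXT, name TEXT)")

# No index yet: the plan's detail column reports a full scan ('SCAN users').
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE user_id = ?", (42,)
).fetchall())

conn.execute("CREATE INDEX idx_users_user_id ON users (user_id)")

# With the index, the plan switches to
# 'SEARCH users USING INDEX idx_users_user_id (user_id=?)'.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE user_id = ?", (42,)
).fetchall())
```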

Second, look at how many queries are made to the same database within a particular code path. If you see many instances of simple calls to the same database repeated, as in a loop, have the team consider whether the code could be reworked to reduce the number of calls. Often the query can be modified to retrieve all the necessary data in a single call, returning a larger set of results. Those results then need to be stored and decomposed on the web server, but this should reduce overall processing time, as a network call to a database will generally cost an order of magnitude more than processing results in local memory.
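
This is the classic “N+1” query pattern. A sketch of the rework, using a hypothetical users table:

```python
# Before: one network round trip per user ID (the "N+1" pattern).
def get_names_slow(conn, user_ids):
    names = {}
    for uid in user_ids:
        row = conn.execute(
            "SELECT name FROM users WHERE user_id = ?", (uid,)
        ).fetchone()
        names[uid] = row[0]
    return names

# After: one round trip for the whole batch, decomposed in local memory.
def get_names_fast(conn, user_ids):
    placeholders = ",".join("?" for _ in user_ids)
    rows = conn.execute(
        f"SELECT user_id, name FROM users WHERE user_id IN ({placeholders})",
        list(user_ids),
    ).fetchall()
    return dict(rows)
```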

Finally, review your use of caching for commonly accessed, but infrequently changing, data. For example, let’s say you have a user table that contains records with standard user data like name, email, address, etc., with an index on userID. If a common code path pulls down this user data by querying the database, consider caching it instead. You can set up a local cache on each web server, or make use of a distributed memory cache like memcached or redis. I have seen external calls to memcached execute in 1/10 the time that a database call would take.
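
The standard pattern here is cache-aside: check the cache, fall back to the database on a miss, and write the result back with a TTL. A sketch using an in-process dict as a stand-in for memcached or redis:

```python
import time

_cache = {}          # key -> (expires_at, value); stand-in for memcached/redis
TTL_SECONDS = 300    # user records change rarely, so a 5-minute TTL is safe

def get_user(conn, user_id):
    key = f"user:{user_id}"
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]  # cache hit: no database round trip at all
    # Cache miss: fall back to the database, then repopulate the cache.
    row = conn.execute(
        "SELECT name, email FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    _cache[key] = (time.time() + TTL_SECONDS, row)
    return row
```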

External Dependencies

If making calls from your web server to another resource (like a database) in your local data center or cloud availability zone is expensive, then a call to a resource on the Internet adds another order of magnitude. Modern web applications make use of an increasing number of third-party APIs – examples might be authenticating with Facebook or retrieving a place record from Foursquare. Calls out to the Facebook API can take anywhere from 100ms to 1 second, and with that kind of overhead you definitely want to minimize the number of these calls. If a code path has to retrieve data from an external source, make sure there is only one call per client request (I have seen cases where these calls are made in a loop). Most third-party APIs, like Facebook’s, allow you to batch data requests into a single call; make sure your code takes advantage of this. Finally, as with database calls, cache frequently accessed, but infrequently changing, data locally. A place record for a restaurant or popular destination is not going to change very often.
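
Here is a sketch of the loop-versus-batch difference; the endpoint and its parameters are hypothetical, but the point carries: one HTTP round trip instead of N:

```python
import requests

API = "https://api.example.com/places"  # hypothetical third-party endpoint

# Before: one 100ms-1s round trip per place ID.
def fetch_places_slow(place_ids):
    return [requests.get(f"{API}/{pid}", timeout=1).json() for pid in place_ids]

# After: a single batched request, if the API supports it (most do).
def fetch_places_fast(place_ids):
    resp = requests.get(API, params={"ids": ",".join(place_ids)}, timeout=1)
    return resp.json()
```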

Also, ensure you have a short time-out on these external calls; I have used 1 second in the past. If you don’t explicitly set a time-out, you might get the default for your code library or web server, which could be as long as 30 seconds – or no time-out at all. If the external service is down, as the Facebook API occasionally is, threads on your web server will quickly stack up waiting for responses to time out. On an event-driven web server the threads won’t block, but your users will eventually give up waiting on a response.
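
With the Python requests library, for example, there is no time-out unless you pass one explicitly; a sketch of failing fast and degrading gracefully:

```python
import requests

def fetch_profile(url):
    try:
        # requests waits indefinitely unless timeout= is passed explicitly.
        return requests.get(url, timeout=1.0).json()
    except requests.Timeout:
        return None  # fail fast and degrade gracefully
```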

Synchronous versus Asynchronous Processing

Another area to examine for optimization is whether any of the processing in heavily used code paths could be pushed off to a job and run asynchronously. An example might be sending an email to a user. Let’s say that you run a social media application that allows one user to comment on another user’s post. As part of your content posting functionality, users can opt to be notified that they received a comment through a number of different communication mechanisms, like email, mobile SMS, push notifications, etc. In the code that processes the submission of a comment, the synchronous (blocking) processing on the web server should include only the logic to update database state and construct a response to the submitting user. The processing that determines which other communication mediums to use, and the construction of the actual email/SMS/push content, should be pushed off to a job that runs asynchronously on another server.
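
In code, the synchronous path shrinks to a state update plus an enqueue. A sketch, where the db and queue clients are hypothetical stand-ins:

```python
import json

def handle_comment_submission(db, queue, comment, post):
    # Synchronous work: update database state and build the response only.
    comment_id = db.insert_comment(comment)

    # Everything notification-related becomes a job for the worker tier.
    queue.publish(json.dumps({
        "job": "comment_notification",
        "comment_id": comment_id,
        "post_id": post.id,
        "recipient_user_id": post.author_id,  # saves a look-up in the worker
    }))
    return {"status": "ok", "comment_id": comment_id}
```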

If you don’t already have a job processing system in place, you can create one with a few new components. First, you will need a high-performance message queue server, like RabbitMQ; if you are hosting on AWS, you can use the Simple Queue Service (SQS). The message contains the instructions for the job to be performed and the relevant metadata. In our example, this would be the comment notification job, the comment ID and the content post ID (and possibly the associated user IDs, to avoid database look-ups during job processing). This message is put on a queue maintained by your message queue server. Second, you set up a tier of servers running code to process the jobs. Each process is referred to as a worker, and you can run many worker processes in parallel on a worker server. They can be scheduled via cron or run in a continuous loop. The workers consume messages from the message queue server and perform the job requested in each message. Since the workers run on a separate set of servers, their processing cost is offloaded from your front-end web servers. AWS provides support for a worker environment in Elastic Beanstalk.
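
Here is a sketch of a continuous worker loop against SQS using boto3 (the queue URL is a placeholder and send_notifications is a hypothetical job handler):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

def run_worker():
    # Continuous loop: long-poll for messages, run the job, delete on success.
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            if job["job"] == "comment_notification":
                send_notifications(job)  # hypothetical job handler
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
```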

Client-Side Processing

Finally, consider whether any of the processing you are doing server-side could be done on the client. This might include string manipulation, formatting or value calculation – basically, processing that doesn’t require proximity to the resources inside your data center or cloud availability zone. Since you are using the computing resources of your clients, you get this scaling for free. Calls to third-party APIs can also be made from the client application, but consider the security implications.