VP Engineering Playbook

A practical guide for new leaders in software development

Category: Infrastructure

Migrating to a Microservices Based Architecture

In the last few years, there has been a proliferation of articles related to microservices.  Martin Fowler provided a helpful overview of the approach in a blog post in 2014.  Historically, microservices seem similar to past recommendations to adopt a service-oriented architecture. Fundamentally, these approaches advocate breaking up a large monolithic application into a set of discrete services, each running in its own process and supporting an interface based on a lightweight communications protocol, like HTTP.  These services usually run on an isolated server tier and can be deployed independently through automated processes. Each service normally supports a major functional area within an application, like search, fraud, messaging, etc.

In a previous article, I covered the challenges of maintaining a monolithic architecture.  Assuming you have made the decision to migrate to microservices, this article will provide some tactics to consider.

Getting Started

In planning your move to microservices, here is an initial set of recommendations:

  • Team. Designing back-end applications around a microservices approach works best if the team is organized in a similar way.  This means dedicating a set of individuals to each microservice.  As the number of microservices expands, you may find common groupings among them and assign the same team to own more than one microservice.  The supporting team should consist of several engineers, a product manager and a QA tester.
  • Start small. If your team is beginning its migration to microservices, you should start with a simple use case implemented as a single service. Building and supporting microservices will represent a new paradigm for your team. Selecting a single function to migrate initially will reduce the impact of learning mistakes. Create the first service and let it bake for a few weeks before migrating more.
  • Include common infrastructure components. For the first few services you build, you will want to validate your technology approach as part of a proof of concept. In order to make this proof of concept encompassing, ensure that your initial use case touches your major infrastructure components. These can include data storage, caching systems and message brokers.
  • Technology selection. It’s usually best to maintain a single technology framework for your microservice implementation.  This allows your team to develop expertise and share code across implementations.  It also simplifies DevOps and issue resolution.  If you find unique use cases that argue for more than one framework, at least try to maintain a minimal set.  Long term support of your applications shouldn’t be impacted by a few employee departures.

Overall Architecture

Most internet-based businesses deliver their functionality through some set of client apps (web, mobile, desktop). The data and business logic required to make these apps function is usually provided by a single back-end application, hosted at a central location on the internet. Apps will communicate over a secure, open communication protocol like HTTPS. The back-end application will generally expose its logic through a set of RESTful APIs. In most companies, the back-end starts as a monolithic application that runs on a set of web servers. Business logic and data access code for every supported function is included in this single monolithic application.

Once you decide to break up this application and migrate functions to microservices, you will still want to maintain a lightweight entry-point application, with the same RESTful API structure. This API layer will handle the overhead of client app to server communications: authentication, routing of client requests, session management and packaging of responses. After parsing a request, the API layer will determine which back-end microservices are needed to fulfill it and dispatch remote calls to these services. Ideally, this is done in parallel, to reduce end-to-end response times. The call to each microservice should include logic to handle failures and close the connection after a set time-out period. This failure logic ensures that the overall response to the client does not hang on a single slow microservice.
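
To make the parallel dispatch and time-out behavior concrete, here is a minimal sketch in Python using asyncio. The two service functions, their payloads and the time-out value are hypothetical stand-ins that I made up for illustration; a real API layer would issue HTTP requests to its own services at this point.

```python
import asyncio

# Hypothetical stand-ins for remote microservice calls. A real API layer
# would issue HTTP requests here instead of sleeping.
async def call_search(query):
    await asyncio.sleep(0.01)
    return {"results": [query.upper()]}

async def call_recommendations(user_id):
    await asyncio.sleep(5.0)  # simulates a slow or hung service
    return {"items": ["a", "b"]}

async def call_with_timeout(coro, timeout_s, default):
    """Await one service call; fall back to a default on time-out or connection failure."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return default

async def handle_request(query, user_id, timeout_s=0.5):
    # Both calls run concurrently, so total latency is bounded by the
    # slowest call or the time-out, not by the sum of all calls.
    search, recs = await asyncio.gather(
        call_with_timeout(call_search(query), timeout_s, {"results": []}),
        call_with_timeout(call_recommendations(user_id), timeout_s, {"items": []}),
    )
    return {"search": search, "recommendations": recs}

response = asyncio.run(handle_request("shoes", user_id=42))
```

In this sketch the slow recommendations call is cancelled at the time-out and replaced with an empty default, so the client still gets the search results promptly.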

Technologies

For your microservices implementation, you have a wide array of choices, much like selecting any back-end server infrastructure.  However, given that microservices are usually called as part of a response chain, rapid processing is important.  Therefore, you should select a technology that is highly performant with strong data retrieval capabilities.  Your choice boils down to a language and an associated framework with libraries that handle the overhead of processing HTTP requests/responses.

As you consider your choices, here are some guidelines to help. First, take your team’s experience into account. If your team has many members with expertise in a particular language, that should influence your decision, since ramp-up time will be shorter. If not a particular language, then at least consider experience with a class of languages: interpreted versus compiled, functional, JVM based, etc. Second, your language and framework should support the components of your infrastructure, like your database technology, caching systems, and any specialized functions like photo transcoding or statistical libraries. You wouldn’t want to select a language/framework and discover down the road that it has limited support for a critical use case. Finally, you should consider the future trajectory of the language/framework as part of your choice. Is there a lot of momentum behind it currently? Does it have comprehensive documentation and an active support forum?

As you consider your options, here are two popular choices for high scale microservice implementation:

  • Go programming language. Introduced by Google in 2009, Go is designed to make developers highly productive. Its syntax is expressive and clean. It has built-in support for concurrency, which allows your back-end service to take full advantage of multicore servers. Go is statically typed and compiles to machine code, allowing it to run very fast. While being statically typed, it was designed to be very readable, without too many mandatory keywords and repetition (concise variable declaration, for example), making it look more like a dynamic language. It also has some interesting additions, like allowing functions to return multiple values. Concurrency support is implemented through goroutines and channels. The goroutine convention allows a function to be started with the go keyword, causing it to run concurrently as a lightweight thread that the Go runtime schedules across operating system threads. This is ideal for remote operations, like retrieving data from a database, or background tasks. Channels provide support for sending messages between goroutines, allowing for synchronization and blocking. For frameworks, there are several that support a RESTful API implementation over HTTP. The space is continually evolving, but these frameworks have traction: Revel, Beego and Martini/Gin. As a technology choice for powering high-performance microservices, Go was recently mentioned on the Uber engineering blog.
  • Java with Play.  Java is a longtime choice for implementing back-end services.  There is a large developer community behind it, making recruiting straightforward. Java version 8 adds many of the features popular in functional languages (like Scala), including support for lambda expressions, default methods and parallelism.  The Play framework is built on top of Akka, which is an Actor-based runtime for building concurrent, distributed applications on top of the JVM.  Akka is geared towards making applications reactive – highly responsive, resilient to failure, elastic and message-driven. Play provides a web application framework on top of Akka.  It is stateless, supporting an asynchronous model with non-blocking I/O operations. It is also RESTful by default with full JSON support.  Play supports implementation in both Scala and Java.  The reason I advocate for Java is the lower developer learning curve. Recently, a few of the larger internet-based companies, including LinkedIn and Twitter, have expressed challenges with their migration to Scala and are moving some of that infrastructure back to Java.

Testing

Another advantage of a migration towards discrete microservices is testability.  Given the encapsulation of a microservice, your team can create a set of tests for each one in isolation.  Given that microservices typically do not involve a user interface, automation of testing should be straightforward.  You could employ a test automation framework geared towards generating RESTful API requests and checking responses. Some options include Chakram and Frisby.js. Automated tests can be integrated into your CI environment, allowing a set of tests to be verified against every code change.
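
As a rough illustration of this style of test, here is a sketch that uses only the Python standard library rather than Chakram or Frisby.js (it does not reflect either framework’s API). The stub handler, the /search path and the JSON payload are all hypothetical; in practice the test would point at a deployed instance of your service.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubSearchHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a search microservice."""

    def do_GET(self):
        if self.path.startswith("/search"):
            body = json.dumps({"results": ["widget"]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep test output quiet

def fetch(port, path):
    """Issue a GET request and return (status code, raw body bytes)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()

def test_search_endpoint():
    server = HTTPServer(("127.0.0.1", 0), StubSearchHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        status, body = fetch(port, "/search?q=widget")
        assert status == 200
        assert json.loads(body) == {"results": ["widget"]}
        status, _ = fetch(port, "/unknown")
        assert status == 404
    finally:
        server.shutdown()
        server.server_close()
```

A test function like this can be picked up by a test runner in CI and executed against every code change.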

Documentation

Ideally, coding against a microservice should be self-service for developers.  This is best accomplished through documentation. If an application function needs to retrieve data from a particular microservice, the developer should be able to read about it in a shared documentation repository. Even better, developers can reference a catalog of available services and decide what they need to utilize. Microservice documentation should include the inputs that can be sent to the service and expected outputs. Exception handling should be addressed as well.

Managing your Email Delivery

Reliable email delivery is critical for most companies that do business over the Internet. While users can opt to be contacted through other mediums, email is still the primary mechanism for providing transaction confirmations and the basis for account registrations. There are two categories of email. Transactional emails are those which a business sends to a single user in response to an action that the user performed, like a purchase, registration or friend request confirmation. In these cases, the user expects to receive the email. Transactional emails are contrasted with marketing emails, which are generally sent in bulk to a large number of users in order to promote something. Most marketing emails include a call to action that drives incremental business. If users aren’t receiving transactional emails, they will be missing important information or will be blocked from further action. If they aren’t receiving marketing emails, then they can’t generate incremental business for the sender.

Email deliverability is represented by the percentage of a sender’s email which makes it into the user’s inbox. Most consumers maintain a personal email account with one of the major email ISPs: Google (Gmail), Microsoft (Hotmail, Outlook), Yahoo (Yahoo Mail), Comcast (comcast.net), etc. These ISPs have developed sophisticated systems to examine all incoming email and determine which should be allowed into their users’ inboxes. With the widespread prevalence of email spam, this has become a necessity: only a small percentage of all email sent is actually delivered into users’ inboxes. Email not delivered to the inbox is either discarded entirely (blocked) or directed to the Spam folder (also known as Junk or Bulk). Additionally, ISPs will throttle the amount of email they allow into their network from a particular sender. If a sender is transmitting a large volume of email to an ISP and experiences throttling, then there will be a significant delay between the time their systems initiate the email send and the time the user actually receives it. As an example, this could impact new user registrations as users await their account validation email.
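
The arithmetic behind these measures is simple, and a small sketch may help. The function names and all of the figures below are hypothetical, chosen only to illustrate the calculation.

```python
def delivery_breakdown(sent, inbox, spam_folder, blocked):
    """Return each delivery outcome as a percentage of email sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return {
        "inbox": 100 * inbox / sent,          # this is the deliverability figure
        "spam_folder": 100 * spam_folder / sent,
        "blocked": 100 * blocked / sent,
    }

def hours_to_drain(queued, accepted_per_hour):
    """Estimate how long a throttled queue takes to clear at a fixed accept rate."""
    if accepted_per_hour <= 0:
        raise ValueError("accepted_per_hour must be positive")
    return queued / accepted_per_hour

# Hypothetical campaign: 10,000 sent, 8,700 reach the inbox.
breakdown = delivery_breakdown(sent=10_000, inbox=8_700, spam_folder=900, blocked=400)

# Hypothetical throttle: 600,000 queued emails, ISP accepts 50,000 per hour.
delay_hours = hours_to_drain(queued=600_000, accepted_per_hour=50_000)
```

With these made-up numbers, deliverability is 87% and the throttled queue would take 12 hours to clear, which is long enough to disrupt something like an account validation email.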

If your business depends on reliably delivering email to your users, then it is very important that you understand how to manage your email program. You can outsource this function to  a number of email service providers, like SendGrid, MailChimp or Amazon’s Simple Email Service.  These vendors will handle most of the work related to maximizing your email deliverability.  Even if you outsource this function, you should understand how email deliverability works in order to track the success of your email program.

This post will provide some background on the email distribution process and tips for maximizing deliverability of your company’s emails. While your peers in marketing will track the impact of email campaigns closely, the engineering team still needs to manage the mechanics of the email infrastructure.

Basics

In order for an email to be sent, it is transmitted by a mail server. Each mail server will have one or more associated IP addresses. The sending IP address is an important aspect of deliverability. For a particular sender, these IPs will be fixed and change infrequently.  This is because the sender IP address builds up a “reputation” over time.  Reputation can be thought of like a trust score, representing the aggregate of past email behavior for that IP. One email marketing vendor, ReturnPath, has established a Sender Score that is used by many ISPs to check the reputation of an IP address.  Sender scores represent a value between 0 and 100, where a higher value is better.  A Sender Score over 90 is considered good. Each ISP receiving email will build a profile of the source IP over time. They may check Sender Score initially and then supplement that with their own proprietary algorithms for reputation over time. The higher an IP’s reputation, the more likely email from it will be accepted at high volume by an ISP and delivered into users’ inboxes.

When a new IP address begins sending email to an ISP’s users, the ISP will have no history on this IP.  As a result, it will throttle the amount of email allowed from this IP.  This throttling will occur across all email sent from the source IP, regardless of the relationship between individual users and the company. Therefore, if you are just beginning to send emails to your users from a particular IP address, the allowed volume will be low. Alternately, if you have been sending a large amount of email from a particular IP historically and want to shift this traffic to another IP, you will need to “warm up” that new IP. Warming up an IP involves gradually increasing the volume of email sent from it, until you reach the normal send rate.
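
A warm-up plan can be sketched as a simple schedule of daily volumes. The doubling growth rate and the starting volume below are assumptions of mine for illustration; ISPs do not publish a fixed ramp, so treat the numbers as placeholders and adjust them based on the throttling you observe.

```python
def warmup_schedule(start_volume, target_volume, growth=2.0):
    """Daily send volumes that grow by a fixed factor until the target is reached."""
    if start_volume <= 0 or target_volume < start_volume:
        raise ValueError("need 0 < start_volume <= target_volume")
    if growth <= 1.0:
        raise ValueError("growth must be greater than 1")
    schedule = []
    volume = start_volume
    while volume < target_volume:
        schedule.append(volume)
        # Always step up by at least one email so the loop terminates.
        volume = max(volume + 1, int(volume * growth))
    schedule.append(target_volume)  # final day runs at the normal send rate
    return schedule

plan = warmup_schedule(start_volume=1_000, target_volume=20_000)
```

Here the plan ramps from 1,000 to 20,000 emails a day over six days, doubling each day until the normal send rate is reached.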

Most companies sending a large amount of email will maintain multiple source IPs. The number of IPs from which you send email can vary based on a few considerations. First, you should separate the IPs from which you send transactional versus marketing email. This is because you want your transactional email IPs to have the highest reputation and thereby the highest chance of landing in the inbox. Since transactional email is expected by your users, they are less likely to mark these emails as spam. Once you divide your sending IPs into two pools for transactional and marketing emails, you will want more than one IP in each pool. This is because a single IP can sometimes have a temporary issue with an ISP and be throttled. I have seen ISPs throttle only one IP out of a pool of 5 IPs seemingly at random, even when all IPs have similar reputation levels and send rates. When a single IP is throttled, your mail servers should automatically shift email sends to other IPs. Given this need for IP redundancy, each pool should have at least 2 IPs, and typically between 2 and 5. Going beyond 5 IPs should be driven by your overall email volume. You need enough email volume passing through the IPs to register with each ISP, including the smaller ones. I’d recommend at least 5 million emails a day on each IP. If you aren’t near these volumes, then limit your sending IP pool to the minimum (2-3 IPs).
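
The pool-and-failover idea can be sketched as follows. The IpPool class is a hypothetical illustration, not part of any mail server product, and the addresses come from a range reserved for documentation.

```python
class IpPool:
    """Round-robin over a pool of sending IPs, skipping any marked as throttled."""

    def __init__(self, ips):
        if len(ips) < 2:
            raise ValueError("a pool needs at least 2 IPs for redundancy")
        self._ips = list(ips)
        self._throttled = set()
        self._cursor = 0

    def mark_throttled(self, ip):
        self._throttled.add(ip)

    def mark_healthy(self, ip):
        self._throttled.discard(ip)

    def next_ip(self):
        # Try each IP at most once per call, starting from the cursor.
        for _ in range(len(self._ips)):
            ip = self._ips[self._cursor]
            self._cursor = (self._cursor + 1) % len(self._ips)
            if ip not in self._throttled:
                return ip
        raise RuntimeError("all IPs in the pool are throttled")

# Separate pools keep marketing sends from affecting transactional reputation.
transactional = IpPool(["192.0.2.10", "192.0.2.11", "192.0.2.12"])
marketing = IpPool(["192.0.2.20", "192.0.2.21"])

first = transactional.next_ip()
transactional.mark_throttled("192.0.2.11")
second = transactional.next_ip()  # skips the throttled IP
third = transactional.next_ip()
```

Round-robin also keeps volume roughly even across the healthy IPs, so no single IP sees a sudden jump when another is throttled.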

On the other end of the email send is the individual user. Every user has a relationship with each of the companies that send them email. ISPs will also track this relationship – whether they have observed email from your company (source IP) being delivered to each user in the past. As users form new relationships with companies, there is the risk that the ISP categorizes the new email as spam. This can be mitigated if the source IPs have a high reputation.  Beyond the reputation of the sender’s IPs, the ISP will track the engagement rate of each user with each sender’s emails.  They will track whether the user is opening the emails and clicking on any links inside.  This data is fed back into their algorithm for determining if a sender’s emails should be delivered to users’ inboxes consistently.

In addition to these influencers, there are a few other actions which will affect email deliverability:

  • Problems with the user’s email account.  The state of an individual’s email account can change over time. They can close the account, or their ISP closes it after a long period of inactivity. Alternately, the email account storage allocation can fill up, where there is no longer space available to receive new email. The ISP might even be experiencing a service issue with its email infrastructure. When there is an issue with an individual email account that prevents it from receiving new email, the ISP will generate a “bounce” and send that notification back to the sender. This bounce can be captured by your mail servers and processed. Bounces will have codes identifying the type.  As the sender, you can examine these codes and determine how to treat the email account in your system for future sends.
  • User unsubscribes from your email.  The unsubscribe action is usually initiated by the user as part of following a link on your email or site, where they are requesting to be unsubscribed from your email sends.  These unsubscribe requests should be processed as soon as possible.
  • User marks your email as spam. When a user marks your email as spam, it tells the ISP that the user considers the email unwanted and no longer wishes to receive it. Users have a wide range of definitions for spam, which makes handling of spam reports tricky. Generally, an ISP will examine all the spam reports for a particular sender IP in aggregate and compare that to the amount of email delivered to the inbox. Maintaining a low ratio of spam reports to delivered email is very important.

These actions can all be tracked as part of the feedback loop that you set up with each ISP. Through the feedback loop, the ISP will send notifications back to an address you specify.  Your mail servers will receive these feedback messages. You can write scripts to process them and make appropriate changes to the status of each user’s email account.
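
A processing script along these lines might look like the sketch below. The event dictionaries are a hypothetical, already-parsed shape that I am assuming for illustration; real notifications arrive as email messages in formats that vary by ISP and need to be parsed first.

```python
from collections import Counter

def process_feedback(events):
    """Fold parsed feedback events into a suppression list and bounce counts.

    Returns (suppressed, bounce_counts), where suppressed maps an address to
    the reason it should no longer receive email.
    """
    suppressed = {}
    bounce_counts = Counter()
    for event in events:
        email = event["email"].strip().lower()
        kind = event["type"]
        if kind in ("unsubscribe", "complaint"):
            # A spam complaint is the stronger signal, so never downgrade it.
            if suppressed.get(email) != "complaint":
                suppressed[email] = kind
        elif kind == "bounce":
            # Only counted here; how many bounces to tolerate is a policy choice.
            bounce_counts[email] += 1
    return suppressed, bounce_counts

events = [
    {"type": "unsubscribe", "email": "Ann@example.com"},
    {"type": "complaint", "email": "ann@example.com"},
    {"type": "bounce", "email": "bob@example.com"},
    {"type": "bounce", "email": "bob@example.com"},
    {"type": "complaint", "email": "cat@example.com"},
]
suppressed, bounce_counts = process_feedback(events)
```

Addresses are normalized to lower case so that the same account is not tracked twice under different capitalization.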

Maintaining your Email Reputation

After the relationship between your source IPs, the end users and their ISPs has been established, there are actions you can take to maintain or improve your deliverability. This guidance boils down to two areas. First, keep your email list clean.  Second, make sure that users want the emails you send.

  • Use double opt-in for gathering email addresses. Sending email to addresses which actually exist is an important contributor to maintaining the reputation of a sending IP. Conversely, if an ISP sees an IP sending email to addresses that don’t exist or are inactive, it is a strong indicator of spam activity. Confirming the validity of an email address can be accomplished through the double opt-in approach. Double opt-in is performed at the point of user registration, and involves requiring the user to click on a link in an email sent to the user’s email account. This confirms that the user has control of the email address. Confirmation of email serves two purposes. First, it indicates that the email address exists and wasn’t entered as gibberish by an unscrupulous user trying to get through the registration process quickly. Second, it prevents someone from registering with a different person’s email address. In both cases, future emails sent to that address would hurt IP reputation. Double opt-in eliminates these risks.
  • Process bounces. When the ISP generates a bounce for an email address, ensure it is processed. You can write scripts that examine each bounce and determine how to handle it. There are two types of bounces: soft and hard. Soft bounces indicate a temporary issue with the account, usually a problem with the ISP’s infrastructure or a full email account. Hard bounces represent an email address that has been deactivated or doesn’t exist. A hard bounce should disable future sends to that address immediately. An address that soft bounces can be retried a few times before sends to it are disabled. As a follow-up, you could present an alert to the user the next time they log into your application, indicating an issue with their email account. Continuing to send email to a problematic address that is generating bounces will hurt your IPs’ reputation.
  • Do not trust email addresses from another source. Email addresses for your core customer list must represent actual customers who have opted into your service. Do not import email addresses from another source into your primary email list. If you are acquiring another company and need to adopt their email list, then I would send test emails from another set of IP addresses, which request the users to confirm that they still want to be contacted for your company’s offerings.
  • Process unsubscribe requests. Make the unsubscribe link easy to locate and process requests immediately. While most ISPs allow a few days for unsubscribe requests to be processed, you should stop sending emails to unsubscribed recipients as soon as possible. If a user goes to the trouble of requesting an unsubscribe and you continue to send them email, their next step is likely to mark your emails as spam.
  • Authentication. Standards are available that allow an ISP to verify the authenticity of the sender when an email is received. These include DKIM and SPF. Many ISPs are now using authentication as an additional filter criterion. Make use of these standards in your email transmissions.
  • Provide easy update to an email address.  It may go without saying, but you should provide an account maintenance section within your application, where users can update the details of their account. Within this, make it easy to find the section for updating one’s email address.
  • Allow users to specify what types of emails they want. If your company sends different types of emails, then you should allow your users to opt out of some of them. This applies particularly to the distinction between transactional and marketing emails. They need to receive transactional emails, but may not want to receive your promotional emails. In a section of your application’s user account settings, list the types of emails a user can expect. Allow the user to deactivate certain emails. If you want to be more conservative, you can begin with all marketing emails deactivated and ask users to opt into the ones they want to receive. A good time to set expectations around email types and frequency is within an initial welcome email following registration. The key to these approaches is to make the user feel that they are in control, which will lower the chances that the user marks your email as spam.
  • Taper sending email to inactive users. Your application likely collects data on the last time a user logged in. You should track this usage and apply it to your email send frequency. If a user becomes inactive, then you should reduce the frequency of promotional email sends to them. For example, if you normally send an email daily with some sort of news summary or a set of recommended purchases, then you could reduce the frequency to every other day or less, as you see their usage decline. This reduction of frequency lowers the chance that an inactive user will mark your emails as spam, rather than unsubscribing.
  • Increase engagement levels. ISPs will measure the engagement level of your users with the emails you send. That engagement level will be applied to your reputation as a sender. In order to address this, you should try to maximize the likelihood that a recipient will open and interact with your emails. Therefore, you should carefully consider the content of each email you send to your users.  Is the content relevant for the recipient? Is it fresh on each send, or are you sending the same content repetitively, assuming the user hasn’t considered your offer in past emails?
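
The bounce-processing recommendation above could be scripted roughly as follows. The BounceTracker class and the limit of three soft bounces are assumptions of mine for illustration; pick a threshold that fits your own send frequency, and note that classifying a raw bounce as hard or soft is left out of this sketch.

```python
from collections import defaultdict

class BounceTracker:
    """Disable sends after one hard bounce or several consecutive soft bounces."""

    def __init__(self, max_soft=3):
        self.max_soft = max_soft  # hypothetical threshold, tune for your program
        self.soft_counts = defaultdict(int)
        self.disabled = set()

    def record_bounce(self, email, hard):
        if hard:
            # The address is deactivated or does not exist: stop immediately.
            self.disabled.add(email)
            return
        self.soft_counts[email] += 1
        if self.soft_counts[email] >= self.max_soft:
            self.disabled.add(email)

    def record_delivery(self, email):
        # A successful delivery means the temporary problem has cleared.
        self.soft_counts.pop(email, None)

    def can_send(self, email):
        return email not in self.disabled

tracker = BounceTracker(max_soft=3)
tracker.record_bounce("gone@example.com", hard=True)
for _ in range(2):
    tracker.record_bounce("full@example.com", hard=False)
```

After these calls, gone@example.com is disabled at once, while full@example.com stays sendable until it soft bounces a third time in a row.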

Obviously, some of these recommendations can cut into your business. For example, by sending fewer promotional emails, you will get less response. You should monitor the health of your email program closely. Try to maintain a balance between engaging your users through email and the reputation associated with your sender IPs.  If you are managing your email program yourself, there are vendors who will help you monitor your email sender reputation.  Mentioned earlier, ReturnPath collects data on the reputation of your mail server IPs from the ISPs directly.  This allows them to produce reports showing your IPs’ reputation by ISP.  Additionally, they can monitor the delivery of your email to a set of test accounts they maintain at all ISPs.  You can be alerted if delivery of a particular email type drops unexpectedly.  They also provide feedback on email design, as a means of improving user engagement with your email messaging.

Other Danger Areas

  • Spamtraps. A spamtrap is an email address that doesn’t represent a real user and is set up solely to identify spammers. Spamtraps are usually maintained by ISPs or third party spam-fighting organizations. The spamtrap email address is generated in one of two ways. First, it can be created by the ISP and then sprinkled across the web in locations where spammers would normally harvest email addresses through scraping (forums, social sites). Second, the address can represent a real account that has gone inactive. In this case, the ISP will close the account and then return bounces to senders for several months. After that, it will stop sending bounces and record any new sends to the email address. It is at this point that the spamtrap is active. Emailing a spamtrap is considered a grave offense. Even a single send out of millions of good emails can trigger blocking of the sending IP or cause a significant impact to reputation with that ISP. Spamtrap addresses will get onto your email list through either adding unverified email addresses or not processing bounces.
  • Back-up infrastructure. Most large Internet-based businesses have redundant infrastructure to handle failures. If you are maintaining this kind of redundancy, ensure that email send volume is balanced across all mail server infrastructure.  Practically, this means you want to avoid a situation where a failure in primary infrastructure results in a dramatic increase to email volume for the back-up (like from 0). This surge will likely be met with throttling by the ISPs, which will take several days to work through. It’s best to try to route equal amounts of email through all IPs that might be active at some point.

Challenges of Maintaining a Monolithic Back-end

As new Internet-based companies get started, they will often organize all back-end code into a single codebase. This code is deployed onto one tier of servers, representing their primary back-end infrastructure. Using a single application back-end is often referred to as a monolithic approach.  This approach is contrasted with a service-oriented architecture, where application logic is divided into a suite of services (more recently referred to as microservices). Each service is independent from the others, with a separate codebase and deployment target.

Maintaining a monolithic back-end can offer some advantages to a small team, but will create challenges as the application grows. In this post, we will examine some of the considerations for maintaining a monolithic back-end, versus segmenting out logic into discrete services.

Duplication of Logic

A single application often contains duplicate logic in many places.  As changes are planned to the application, those changes will need to be applied to all places where the duplicate logic exists. By breaking the application up into services, this duplicate logic can be consolidated into a single service. Changes to that logic will then need to be made in only one place. In a monolithic application, this code duplication can be reduced through an object-oriented approach, defining independent classes for each major type of functionality. However, the functionality provided by these shared classes will still need to be reviewed and understood before the class can be consumed. This creates overhead for the team, requiring each developer to maintain an understanding of a vast set of classes that can be incorporated into the main application.

Clear Contracts between Services

Migrating to a service-oriented approach requires the definition of clear contracts between services. With a well-defined interface, service functionality becomes intuitive. In order to allow other teams to interact with the service, the owning team will generally publish documentation specifying the interface.  This will be structured like an API, with function calls, inputs and outputs delineated. This interface documentation will be very useful for new team members, allowing them to learn about the service independently. Consuming teams can also ignore the inner workings of the service, as long as they maintain clear interactions through the service’s interface. This will allow development for consuming teams to proceed more quickly.
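
One way to picture such a contract is as a typed interface, sketched below in Python. The search service, its request and response fields, and the in-memory implementation are all hypothetical examples; the point is only that a consumer depends on the declared inputs and outputs, not on how the service works inside.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol

@dataclass(frozen=True)
class SearchRequest:
    query: str
    limit: int = 10

@dataclass(frozen=True)
class SearchResponse:
    results: List[str]
    total: int

class SearchService(Protocol):
    """The contract: what consumers may send and what they get back."""

    def search(self, request: SearchRequest) -> SearchResponse: ...

class InMemorySearch:
    """One possible implementation; consumers never need to read this code."""

    def __init__(self, documents):
        self._documents = list(documents)

    def search(self, request: SearchRequest) -> SearchResponse:
        needle = request.query.lower()
        matches = [d for d in self._documents if needle in d.lower()]
        return SearchResponse(results=matches[: request.limit], total=len(matches))

def top_hit(service: SearchService, query: str) -> Optional[str]:
    # Consumer code written purely against the contract.
    response = service.search(SearchRequest(query=query, limit=1))
    return response.results[0] if response.results else None

service = InMemorySearch(["Red shoes", "Blue hat", "red scarf"])
```

The owning team can replace InMemorySearch with any other implementation without changes to top_hit, as long as the request and response shapes stay the same.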

Speed of Product Evolution

Organizing around services allows small, nimble teams to be assigned to each service. These small teams usually consist of a few developers, a product manager, a devops engineer and a tester. A team this small can enhance their service very quickly. Vetting proposed changes to an application with all affected parties is usually the largest encumbrance to making rapid decisions. Decision making slows down as the organization involved becomes larger. Having a small, autonomous team assigned to each service will allow this decision making to proceed quickly, as the number of individuals needed to review and approve a decision is minimal. On a large monolithic application, the engineering team will usually be segmented into smaller sub-teams, like search, content management, community, etc. This does offer some separation of concerns. However, if all of these sub-teams are contributing to a common code base, then dependencies will still emerge and the decision making overhead will continue.

Code Sharing

Part of establishing a new service is to provide the boilerplate functionality to perform many common functions. These can include call marshaling, security, response formation, database access, etc. Once a single service is established, much of the service interface code can be templatized and re-used in other services. This will lower the initial roll-out cost for services over time. It is helpful to organize this shared code into a set of libraries or independent code modules by function. The shared code should be stored in a separate project within your code repository. You can also consider whether to assign a particular developer or team ownership over the shared code. Even if ownership is initially distributed, like an open source project, you should still assign a single individual to conduct code reviews of new contributions.

Testing

Testing is much easier on a set of individual services than within a large application. With a monolithic application, there will generally be more logic intertwined and linked dependencies between functions. The set of regression tests necessary to ensure that a change in the application hasn’t affected other parts of the functionality will be large. Your QA team will need to run through these regression tests before every major release. As the monolithic application grows in size, this regression testing overhead will increase. By breaking out the application into discrete services, code changes can be more isolated. A code change in a single service will then require only regression tests within that service’s codebase to be run.

Risks with Service Oriented Approach

One disadvantage of using a heavily service oriented approach is the processing overhead associated with making the remote calls to services from the application that provides the initial entry point into your back-end. A call to a service on a separate server tier will always take longer than running that code on the same server. Also, if a single service is down, that can impact performance of the entire back-end application, as incoming requests continually wait for a response from the slow service.

These risks can be mitigated.  First, the performance of each service should be measured and optimized as much as possible. Response times should be tracked over time, and flagged for refactoring when they cross a threshold. To mitigate the risk of cascading slowdowns stemming from a single service failure, the application receiving client requests should be designed with service failure logic. This would include a time-out on a service call and utilization of a default value in constructing the overall response to the client.
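
Tracking response times against a threshold can be sketched in a few lines. The choice of the 95th percentile, the 100 ms threshold and the sample data are assumptions of mine for illustration; use whatever statistic and limit your team already monitors.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def services_to_review(latencies_ms, threshold_ms, pct=95):
    """Names of services whose chosen percentile latency exceeds the threshold."""
    return sorted(
        name
        for name, samples in latencies_ms.items()
        if samples and percentile(samples, pct) > threshold_ms
    )

# Hypothetical response times in milliseconds, per service.
latencies = {
    "search": [20, 25, 30, 22, 28],
    "fraud": [40, 45, 300, 50, 48],
}
flagged = services_to_review(latencies, threshold_ms=100)
```

With this sample data only the fraud service is flagged, because one of its calls took 300 ms and pushed its 95th percentile over the threshold.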

Team Size

With a small team, it is easier to maintain a single monolithic application. As we discussed, the main cost of maintaining a single application is the coordination overhead required when making design changes. With just a few engineers, this coordination overhead is minimal. As the team expands, this communication overhead will require more time. I have seen large teams working on a monolithic application even have to schedule meetings between sub-teams to vet proposed design changes.

I think a team size under 5 engineers can easily maintain a monolithic application.  As the team expands beyond this, it’s better to start dividing the monolithic application into separate code bases, deployed as services. This allows different developers (and eventually teams) to be assigned to each service, and their work to be isolated from the others.
