VP Engineering Playbook

A practical guide for new leaders in software development

Establishing a Security Program

Security breaches and service interruptions have been abundant in the news over the past year.  As the leader of your growing technology organization, you need to take steps early to ensure that your customer data is safe and service continuity is maintained.  Leakage of sensitive data or a site outage due to an attack can quickly become an overwhelming issue.  In this post, I’ll outline a set of activities that should encompass a basic security program.  This is geared towards a small to mid-sized team (fewer than 30 engineers) supporting an application delivered over the Internet.

If you store or process credit card information through your application, you may already be familiar with PCI standards.  The PCI Security Standards Council is a body of companies from the payments industry.  They maintain a standard, called the PCI Data Security Standard (PCI DSS), which details security-related requirements for achieving compliance when processing credit card data.  At Zoosk, we maintained PCI Level 2 compliance.  I found the PCI DSS to be a useful reference for structuring a security program.  Whether or not you have to be PCI compliant, the PCI DSS details many industry best practices to secure sensitive data.  In the event you are breached, if you followed PCI DSS, it is fair to claim that you implemented reasonable security practices.  I recommend reading through the spec.  Where you see references to cardholder data, simply replace that with what you consider to be your set of sensitive customer data.

If you aren’t sure what customer data would be considered sensitive, it’s best to be inclusive.  The formal definition of personally identifiable information is determined by each state, as part of their security breach notification laws.  California provides a pretty encompassing definition:

(e) “Personal information” means any information that identifies, relates to, describes, or is capable of being associated with, a particular individual, including, but not limited to, his or her name, signature, social security number, physical characteristics or description, address, telephone number, passport number, driver’s license or state identification card number, insurance policy number, education, employment, employment history, bank account number, credit card number, debit card number, or any other financial information, medical information, or health insurance information. “Personal information” does not include publicly available information that is lawfully made available to the general public from federal, state, or local government records.

With that introduction, here is a list of considerations for structuring your security program, loosely modeled on the requirements outlined in the PCI DSS.

Ownership

First, while not delineated directly in the PCI spec, it is important that you establish clear ownership for the security program.  As the senior technology leader, you are ultimately responsible.  But you will not spend the majority of your time focused on security.  In a fast-growing start-up, where the priority is on shipping product, trying to distribute ownership of security or instructing all your engineers to use “good security practices” will not work.  I recommend designating an individual on the team as the owner of the security program.  Ideally, this represents a full-time job, in which case you are hiring a security lead.  If your team is small or budget-constrained, then you should assign this role to a software or devops engineer, and ensure that they are able to dedicate significant time to this effort.

I personally think that hiring a dedicated security professional onto your team is the best guarantee that security matters are addressed.  A team size of 10-20 engineers represents a good point to transition to a full-time security lead.  Of course, if your product is a SaaS application in the payment processing space or stores sensitive data for other companies, then this should be your first hire.

Secure the Network

The most basic element of your security posture is to secure the perimeter of the network which hosts your production systems.  This means examining all access points from the Internet to your production environment.  Create a detailed list of all the IPs, ports, protocols and services which are expected to be exposed to the Internet.  This exercise should also include outbound traffic.  Detail IPs, ports and protocols that would be used by your applications to connect to outside services.  This is important, as unusual outbound traffic is often an indicator of a compromised server (for data exfiltration, ping back to a botnet, etc.).

Once you have a detailed list of allowed traffic patterns, you will want to apply them to the device that controls access into and out of your network.  Usually, this is a dedicated firewall or router that allows authorized inbound/outbound traffic and blocks everything else.  If you are hosting through a PaaS or IaaS vendor, you will want to investigate what type of capabilities are available to accomplish the same function.
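
To keep that inventory auditable, it can help to record the allowed traffic patterns as structured data and check proposed connections or rule changes against them.  Below is a minimal Python sketch; the rule fields and example networks are hypothetical, not a standard firewall format.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass(frozen=True)
class Rule:
    direction: str   # "inbound" or "outbound"
    network: str     # CIDR block the rule applies to
    port: int
    protocol: str    # "tcp" or "udp"
    service: str     # human-readable label for audits

# Hypothetical allow-list: HTTPS in from anywhere, SMTP out to a mail relay.
ALLOWED = [
    Rule("inbound", "0.0.0.0/0", 443, "tcp", "public web traffic"),
    Rule("outbound", "203.0.113.0/24", 25, "tcp", "email relay"),
]

def is_allowed(direction: str, addr: str, port: int, protocol: str) -> bool:
    """Return True if the traffic matches any documented rule."""
    return any(
        r.direction == direction
        and r.port == port
        and r.protocol == protocol
        and ip_address(addr) in ip_network(r.network)
        for r in ALLOWED
    )

print(is_allowed("inbound", "198.51.100.7", 443, "tcp"))    # True: documented
print(is_allowed("outbound", "198.51.100.7", 6667, "tcp"))  # False: investigate
```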

PCI requires that your network topology be documented, along with this list of allowed traffic patterns.  Changes to firewall rules must be formally approved.  Also, you should harden your actual network equipment.  Limit access to the equipment, shut down unnecessary services and maintain strong passwords.  No matter how much you restrict traffic to your production network, if your firewall or border router is compromised, then nothing else matters.

Access Controls

Once you have mapped out all of your production systems and limited network access to them, determine who should have access to your production systems.  “Who” is usually defined as a set of people, but can also be automated processes.  Access can represent all forms of connections, including at the operating system level (e.g. SSH) and the services level (Apache, MySQL, LDAP).  It is helpful to organize your people by role (sysadmin, developer, release engineer, etc.).  Then, you can generalize access controls by role.

The goal of access controls is to limit access to systems/services to only those individuals who absolutely need it.  While this may sound counter to having an open and agile culture among your engineers, it dramatically reduces the risk surface area in case of compromised credentials.  For example, a data analyst who only needs access to a reporting server should not have a credential that could be used to access the database.  It’s not that you don’t trust your people.  The problem is that anyone can accidentally fall prey to a phishing scheme or type their credentials into a compromised machine.
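
One lightweight way to generalize access by role is to keep a role-to-resource mapping and derive each person’s access from their role.  Here is a minimal Python sketch with hypothetical roles, users and resource names; in practice this mapping would live in LDAP, IAM policies or configuration management rather than a script.

```python
# Hypothetical role-to-access mapping.
ROLE_ACCESS = {
    "sysadmin":         {"app-servers:ssh", "db:mysql", "firewall:admin"},
    "developer":        {"app-servers:ssh", "staging-db:mysql"},
    "release_engineer": {"app-servers:ssh", "deploy:jenkins"},
    "data_analyst":     {"reporting:read"},
}

# Hypothetical user-to-role assignments.
USERS = {
    "alice": "sysadmin",
    "bob": "developer",
    "carol": "data_analyst",
}

def can_access(user: str, resource: str) -> bool:
    """Check whether a user's role grants access to a resource."""
    role = USERS.get(user)
    return role is not None and resource in ROLE_ACCESS.get(role, set())

print(can_access("carol", "db:mysql"))        # False: analysts don't get the database
print(can_access("alice", "firewall:admin"))  # True
```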

An access control process also includes ensuring that your users have strong passwords that are changed frequently.  Consider mandating password vaults like 1Password for individuals or a shared repository like Onelogin.  Finally, define a process for granting access to new employees and shutting down credentials for departing employees.  I have been surprised at past companies to see active credentials for employees who had left the company long before.

Storage of Sensitive Data

Personally identifiable customer data should be stored in a manner that offers more security than other types of data.  For credit card data, PCI requires the following:

  • The data must be stored in an encrypted or one-way hashed form.
  • The key used to encrypt the data should be stored on a separate device from the database. Restrict access to the encryption key to the least number of personnel. Change the encryption key periodically.
  • Minimize the amount of sensitive data kept.  This primarily refers to data retention.  Maintain the fewest number of back-ups necessary for operational continuity and delete the rest.  There have been breaches in which sensitive customer data was pulled from a forgotten database back-up.
  • Isolate the database containing the sensitive customer data onto a separate network with a firewall access point.

Even if you don’t process credit cards, you should seriously consider taking similar approaches with the storage of your sensitive customer data.
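
If you only ever need to compare a sensitive value rather than read it back, the one-way storage requirement above can be approximated with a salted key-derivation hash.  Here is a minimal Python sketch using the standard library; the iteration count and salt handling are illustrative only, and reversible use cases would instead require proper encryption with the key held on a separate device.

```python
import hashlib
import os

def hash_sensitive(value: str, salt: bytes) -> bytes:
    """Derive a one-way hash of a sensitive value with PBKDF2-HMAC-SHA256."""
    # Iteration count is illustrative, not a vetted policy.
    return hashlib.pbkdf2_hmac("sha256", value.encode(), salt, 200_000)

# In a real deployment the salt/key material would be managed on a separate
# device (an HSM or secrets manager), per the key-separation requirement above.
salt = os.urandom(16)

stored = hash_sensitive("4111-1111-1111-1111", salt)

# Later: verify a candidate value without ever storing the plaintext.
candidate = hash_sensitive("4111-1111-1111-1111", salt)
print(candidate == stored)  # True
```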

Monitoring and Testing

Access to the database storing your sensitive data should be monitored and logged.  This includes both user and service access.  For each request, you should log the query that was run.  This log data can be a very useful indicator of suspicious activity.  For example, if the normal user pattern is to access only one record at a time, then a select query for all records in a table would be suspicious.  Also, monitor and log operating system or file system changes on the servers hosting your database, as well as the application servers providing a path to it.  Installation of new files, for example, can indicate an active hack on the server.  Network activity should also be monitored.  As mentioned previously, outgoing traffic is often the best indicator of a potential security event in progress.  Most system compromises will involve a call back to a server on the Internet to retrieve rootkit code or to exfiltrate data.  Monitoring should be automated, with logic to distinguish suspicious activity from normal activity.  Security events should be manually reviewed as soon as possible.
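
As a concrete example of the full-table select signal described above, a simple log review job can flag queries that touch far more rows than a normal request.  Here is a minimal Python sketch, assuming a hypothetical log of (user, query, rows returned) entries; your log format and thresholds will differ.

```python
import re

# Hypothetical access-log entries: (user, query, rows_returned).
QUERY_LOG = [
    ("app_user", "SELECT * FROM users WHERE id = 42", 1),
    ("app_user", "SELECT * FROM users WHERE id = 87", 1),
    ("report_svc", "SELECT email, ssn FROM users", 250_000),
]

ROW_THRESHOLD = 1_000  # far above any normal single-record lookup
UNBOUNDED_SELECT = re.compile(r"^\s*SELECT\b(?!.*\bWHERE\b)", re.IGNORECASE)

def suspicious(query: str, rows: int) -> bool:
    """Flag unbounded selects or unusually large result sets."""
    return rows > ROW_THRESHOLD or bool(UNBOUNDED_SELECT.search(query))

for user, query, rows in QUERY_LOG:
    if suspicious(query, rows):
        print(f"ALERT: {user} ran {query!r} returning {rows} rows")
```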

Your perimeter should be tested periodically.  This takes the form of penetration testing.  Basic automated scans can be run against your externally facing systems, looking for known vulnerabilities.  Usually, these automated scans perform a basic set of tests for service misconfigurations or open ports.  Many vendors, like McAfee, offer this service.  These automated scans can be supplemented by true penetration testing, in which a white hat attempts to break into your systems.  The white hat will initially run an automated scan across your IP space, followed by targeted attacks on any perceived vulnerabilities.  I have used Rapid7 for this type of penetration testing in the past with great results.  The output of penetration testing will be a list of possible exploits that you need to have your team address.

Vulnerability Management

You should have a process in place for tracking and acting upon vulnerabilities that are announced for the software packages that you are using.  A well-known example is Heartbleed, the OpenSSL vulnerability.  When a vulnerability like Heartbleed is publicized, your team should be able to quickly address it through system patches.  Beyond closely monitoring security newsfeeds, you can automate this vulnerability detection with a vendor tool.  Products like AlienVault USM will scan all your production systems from the inside, looking for known vulnerabilities in the software versions you are running.  This differs somewhat from penetration testing, in that the focus for vulnerability management is from the inside of your network.
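
Vendor scanners do this at scale, but the underlying check is simple: compare the software versions you are running against an advisory feed.  Here is a minimal Python sketch with a hand-maintained, hypothetical advisory list; a real program would pull the inventory from configuration management and the advisories from a CVE/NVD feed or a tool like AlienVault USM.

```python
# Hypothetical inventory of installed packages per host.
INSTALLED = {
    "web-01": {"openssl": "1.0.1f", "nginx": "1.9.4"},
    "db-01":  {"openssl": "1.0.2k", "mysql": "5.7.30"},
}

# Versions known to be vulnerable; OpenSSL 1.0.1f was affected by Heartbleed.
ADVISORIES = {
    ("openssl", "1.0.1f"): "CVE-2014-0160 (Heartbleed)",
    ("nginx", "1.9.4"): "example advisory for illustration",
}

def findings():
    """Yield (host, package, version, advisory) for every vulnerable match."""
    for host, packages in INSTALLED.items():
        for name, version in packages.items():
            advisory = ADVISORIES.get((name, version))
            if advisory:
                yield host, name, version, advisory

for host, name, version, advisory in findings():
    print(f"{host}: {name} {version} -> {advisory}")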

Incident Response

If a security event is reported, the incident response process kicks off.  Incident response encompasses the activities taken to assess and address the security issue.  It involves a designated team of individuals, who follow a formal sequence of steps.  Those steps should be documented in advance in the Incident Response Plan.  An Incident Response Plan can include contact information for key department heads, protocols for communications and methods for collecting evidence.

Other Considerations

  • Distributed Denial of Service (DDoS).  A DDoS attack represents a condition where large amounts of network traffic are directed at your production systems in an effort to overwhelm them and take down your services.  Mitigating a DDoS attack involves analyzing all incoming network traffic and filtering out the attack traffic.  This can only be accomplished at scale by one of the DDoS mitigation services, like Akamai or Imperva.  In these cases, you will route your network traffic through their scrubbing center.  The clean traffic will then be sent back to your systems for processing.  This return path is usually through a private network link.  Set-up of this configuration requires several days.  If your company is hit by a DDoS attack, the last thing you want is to start the process of engaging with a DDoS vendor while your site is down.  Do not assume that your data center bandwidth provider or cloud vendor will handle this.  Often, they will simply blackhole your IP space (route to null) to get the traffic to stop.
  • Keep Perspective.  After you set up your security program, you will likely surface potential security events frequently.  In my experience, a high percentage of security events (over 90%) will be false positives.  However, they are all very scary.  So, it’s important to do the investigation work, but maintain calm until you have a full understanding of the event.

Elements of Application Performance Monitoring

In this post, we’ll cover how to plan your application performance monitoring program.  An effective application performance monitoring program allows you to do the following:

  • Get insight into how your applications are performing on your users’ devices.  Locate problem areas to address.
  • Measure the performance of your applications on your back-end infrastructure.  Identify bottlenecks for optimization.
  • Get alerted quickly if performance degrades, allowing for fast time to resolution.

All of these items ultimately contribute to a better user experience, by delivering a problem-free, responsive application.  The principles here primarily apply to your production systems, but can be used to monitor the performance of your development or test environments.  They are relevant whether you are hosting in a dedicated data center or on the cloud.

Key Business Metrics

First and foremost, you should have defined a set of metrics that reflect the business performance of your product.  These are generally the actions that a user is taking as they engage with your product, whether on the web or a mobile app.  Examples of user actions are registrations, logins, messages sent, product SKU views, photos uploaded, searches performed, items added to shopping cart, sales, etc.  These metrics are usually the type of interactions that a product manager would care about.  If you are wondering why the Engineering team would want to track these, it’s because they are the metrics that determine whether your business will still be open in a year, and they will be discussed at Exec team meetings.  They provide a clear view of whether the product is performing as expected.

The reason you want to track as many of these metrics as possible is that they contribute to the overall view of the product’s performance.  The actions selected shouldn’t just be at the end of the sales funnel, like making a payment.  They should capture every step in a set of user interactions.  If you track just the final outcomes, like sales, you might not be able to figure out why sales have suddenly dropped.  For example, sales could drop if users can’t register on the site, indicating an issue with your registration system (maybe the Facebook API is down, blocking registration with Facebook credentials).  However, sales might also drop if your payment provider is offline (while registration counts are still in line).  Tracking as many business metrics as possible allows you to quickly make these distinctions.

These metrics need to be collected and graphed.  Don’t just rely on the summary BI dashboards that the product managers examine.  This data is usually aggregated hourly or daily, and is too delayed to identify a problem in real time.  You should collect these business metrics continuously (per minute, ideally).  The key to graphing granularity is that you should be able to see a very distinct point in time where the metric changed.  This will aid immensely in troubleshooting, as you can associate the change with a planned event, like a code release or a scheduled maintenance.  Your graphs of these metrics can be generated in the same system you use for operational metrics (I have used Ganglia in the past for this).  Just find out from the product managers or analysts which fields in your transactional database are being used as the source for each metric in their report.  Your graphing system can query the same fields.
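
A minimal version of that per-minute collection is a scheduled count over the same transactional table the BI report uses, pushed into your graphing system.  The Python sketch below uses an in-memory SQLite stand-in and Graphite’s plaintext protocol; the table, column and Graphite endpoint are hypothetical placeholders.

```python
import sqlite3
import socket
import time

# Stand-in transactional database; in production you would query the same
# table/column the BI report uses (names here are hypothetical).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE registrations (id INTEGER, created_at REAL)")
db.execute("INSERT INTO registrations VALUES (1, ?)", (time.time(),))

def registrations_last_minute(conn) -> int:
    """Count registrations created in the last 60 seconds."""
    cutoff = time.time() - 60
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM registrations WHERE created_at >= ?", (cutoff,)
    ).fetchone()
    return count

def push_to_graphite(metric: str, value: float, host="graphite.internal", port=2003):
    """Send one datapoint using Graphite's plaintext protocol (host is a placeholder)."""
    line = f"{metric} {value} {int(time.time())}\n"
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode())

count = registrations_last_minute(db)
print(f"registrations in the last minute: {count}")
# push_to_graphite("business.registrations.per_minute", count)  # run from cron every minute
```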

Client Application Monitoring

Next, you should have a view into the performance of your applications from the client’s perspective.  This is usually accomplished by simulating the client application through a set of emulators.  These emulators can represent different browsers or mobile device runtimes, which are then loaded with scripts of transactions that mimic the path a user would take through your application.  With these emulators, the load times and responsiveness of the application are measured and recorded.  This approach to measuring client application performance is called synthetic monitoring.

The simplest way to get a view of this kind of data is to run a tool within your own browser.  An example is Chrome Developer tools, which provides an easy way to view web content loading on demand from your location.  In early development, this provides an easy way to start benchmarking and optimizing your content load times.

On a live product with many users, you want to gather this kind of performance data continuously from many geographically dispersed points.  This is where synthetic monitoring vendors can help.  Two examples are Dynatrace and New Relic.  These provide the ability to set up scripts of common user interactions or paths through your application.  Then, you can program the paths to be repeated by the emulator at some interval, say every 5 minutes.  You can also choose geographic locations across the globe where you want the emulators to run.  Good vendors will have more than 100 nodes to choose from.  However, in my experience you can glean enough insight from 5-10 nodes.  Depending on the distribution of your audience, you would want at least two nodes in the US, then one on each continent.  While the temptation would be to max out the count of nodes and test frequency, most vendors will charge you based on the number of measures.  So, more nodes and higher frequency will generate more cost.


Once the data is collected, these services will provide aggregate reports showing overall load/traverse times for your user experiences, with the ability to drill into any one to troubleshoot problem areas.  This is usually accomplished through interactive reports with rich UI.  You can also set up monitors to alert you if a particular event or sequence exceeds a maximum expected time.  The idea behind this is to notify you if your application is down or severely degraded.  It may take some time to tune your monitors, though, and expect some false alerts at first.

If the vendor solutions don’t fully meet your needs or you want to track a more customized action, you can also accomplish this kind of monitoring by embedding code into your client applications that makes callbacks to a central server.  The callbacks are usually made over HTTP and carry a payload of data indicating the user event, time, and other metadata associated with the interaction.  The central server will then parse this payload and write it to some sort of log file.  These log files can be streamed to your data analytics system (like Hadoop), where they can be processed and aggregated for reports.  If you don’t want to set up your own callback collection server and Hadoop integration, you could send callbacks to a third-party analytics system, like Google Analytics.
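
A bare-bones version of that central collection endpoint can be a small HTTP handler that appends each JSON payload to a log file for later processing.  Here is a minimal Python sketch using only the standard library; the port, log path and event fields are placeholders, not a specific vendor protocol.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_PATH = "client_events.log"  # would be rotated and shipped to Hadoop/analytics

class EventCollector(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            event = json.loads(body)  # e.g. {"event": "photo_upload", "ms": 412, ...}
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps(event) + "\n")  # append as newline-delimited JSON
        self.send_response(204)                # no response body needed by the client
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), EventCollector).serve_forever()
```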

Application Servers

Monitoring your application servers is sometimes referred to as APM, or Application Performance Management.  This type of monitoring is focused on gathering data about two things – the transactions running on your app servers and the amount of time it takes to complete each transaction.  Transactions here are usually defined as an HTTP request and its subsequent response.  The transactions are then aggregated by type, usually the name of the script called to process the request, and response time is averaged.  This data is graphed, allowing an engineer to view server performance by transaction type.  Similar to client application monitoring, this approach also allows the viewer to drill into an individual transaction and see processing times by component.  Components can be code execution by function, database calls or other external dependencies.

Since this type of data is available in your app server’s logs, you could roll your own simple solution to this need.  I have seen teams write basic Unix scripts that tail an app server’s logs, parse out the individual script names from URLs, retrieve the run time and then aggregate in a simple graphing tool.  This approach can get you off the ground, but will require a lot of maintenance and can become unwieldy quickly.
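
For illustration, the core of such a script is just parsing the response time out of each log line and averaging it by endpoint.  Here is a minimal Python sketch that assumes, purely for the example, that the request path and response time in microseconds are the last two fields of each line; your actual log format will differ.

```python
from collections import defaultdict

# Sample access-log lines; assumes path and response time (microseconds)
# are the final two space-separated fields.  Adjust to your LogFormat.
LOG_LINES = [
    '10.0.0.1 - - [12/Jan/2016:10:56:00] "GET /search HTTP/1.1" 200 /search 182000',
    '10.0.0.2 - - [12/Jan/2016:10:56:01] "GET /login HTTP/1.1" 200 /login 95000',
    '10.0.0.3 - - [12/Jan/2016:10:56:02] "GET /search HTTP/1.1" 200 /search 240000',
]

totals = defaultdict(lambda: [0.0, 0])  # path -> [total_ms, request_count]

for line in LOG_LINES:
    parts = line.split()
    path, micros = parts[-2], float(parts[-1])
    bucket = totals[path]
    bucket[0] += micros / 1000.0  # convert to milliseconds
    bucket[1] += 1

for path, (total_ms, count) in sorted(totals.items()):
    print(f"{path}: {total_ms / count:.1f} ms average over {count} requests")
```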

I have used two different vendors for server-side APM, New Relic and AppDynamics.  New Relic is a cloud-based service with code plug-ins for most server types that is easy to install and configure.  They provide very rich, intuitive reporting interfaces.  You can view average response times aggregated for the entire application and also broken out by transaction type.  For each transaction, you can further expand the run times into individual components.  This allows you to quickly identify bottlenecks, whether in code execution, database access or external services.  The data and graph updates occur in near real-time.  My team always kept the New Relic dashboard in view during deployments and other planned maintenance to allow us to quickly see if the event generated an impact on application performance.  This tool was invaluable in determining whether a release was stable or needed to be rolled back.  AppDynamics provides a similar set of features.  It does require a bit more configuration than New Relic, as we used the collector/aggregator software installed on our own servers in our data center.  However, we were also able to get a very useful view in AppDynamics that showed a visual map of all back-end servers (app, database, file system, etc.) with links between them.  This allowed for quick diagnosis of application performance issues that were tied to a dependency on a particular degraded system.

In our configuration at Zoosk, we used New Relic to monitor performance on our front-end app servers, which were a combination of Apache and node.js. With AppDynamics, we monitored performance of services behind the front-end app servers. These ran on Tomcat and nginx.  Here are two screenshots from New Relic and AppDynamics, showing examples of different views you can get.

[Screenshot: New Relic view]

[Screenshot: AppDynamics view]

Other System Monitoring

You should also have performance monitors in place for other major components of your back-end infrastructure.  This measurement can be performed using APM tools as well in some cases, or be accomplished with individual solutions, depending on the system.  I have measured the performance of back-end systems in the past using Ganglia for data gathering and graphing.  It has a client for most major Unix-based systems.  The client runs on each system component, gathering data and forwarding it to a central Ganglia server.  The Ganglia server aggregates the data and then uses RRDtool for data storage and visualization.  There are other solutions for this as well, and most cloud providers offer a performance monitoring service integrated into their stack (CloudWatch on AWS).

Here is a list of some example back-end systems that can be measured.  Basic hardware and OS level metrics should be collected on all servers.  This includes CPU utilization, load, network throughput and disk usage; a minimal collection sketch for these host-level metrics follows the list.

  • Messaging systems.  If you are using message queues to pass work items between systems, you want to track how many messages are in each queue awaiting processing.  This will give you a sense for throughput.  If the number of messages in a queue continues to increase, this can indicate a capacity issue or an error condition.
  • Databases.  You definitely want to collect and graph common database processing activities.  This would include all query types – select, update, insert, delete.  Also, a view of slow queries helps identify when a particular query needs optimization or database performance on the whole is starting to degrade.  There are many more metrics associated with each individual data storage system that you can collect and graph.
  • Email servers.  If you run your own mail servers for sending customer emails, you will want to graph the outgoing email queues.  This is particularly insightful if split by ISP (Gmail, Yahoo, Hotmail, etc.).  Major email ISPs will sometimes throttle incoming mail and you will see this reflected in the outbound queue for that ISP. This will alert you that email to a particular customer set may be delayed.
  • Network equipment.  Most network hardware vendors allow you to pull basic performance data from their equipment.  If so, it is useful to graph network throughput and performance of hardware subsystems, like CPU.  This applies to your switches and routers.  As an example, graphing network traffic on a 10 Gb switch can show whether you are approaching saturation.
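
As referenced above, here is a minimal Python sketch of host-level metric collection using only the standard library.  In practice an agent like Ganglia’s gmond or your cloud provider’s monitoring agent gathers these; the metric names are placeholders.

```python
import os
import shutil
import time

def collect_host_metrics() -> dict:
    """Gather a few basic host metrics (Unix-only; load averages need os.getloadavg)."""
    load1, load5, load15 = os.getloadavg()   # 1/5/15-minute load averages
    disk = shutil.disk_usage("/")            # bytes total/used/free on the root volume
    cpu_count = os.cpu_count() or 1
    return {
        "load.1min": load1,
        "load.normalized": load1 / cpu_count,          # load relative to core count
        "disk.root.pct_used": 100.0 * disk.used / disk.total,
        "timestamp": int(time.time()),
    }

if __name__ == "__main__":
    # A real collector would forward these to Ganglia/Graphite/CloudWatch on a schedule.
    print(collect_host_metrics())
```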

This represents just a subset of the type of systems that could be running in your back-end.  The basic idea here is to collect and graph a set of metrics that are relevant to measuring the performance of that system.  If performance issues are manifesting further up in the application stack, you have the ability to trace back to a particular back-end system that may be overloaded.

Alerting

Beyond graphing, you should have a system in place that is monitoring these same health metrics and will generate a warning if a metric crosses a certain threshold. A traditional solution for this is Nagios.  Nagios can monitor basic host health checks and network availability.  It also has a suite of plug-ins for most open source servers that allows for service checks relevant to that server’s function.  Your Ops personnel can set thresholds for generating an alert and how that alert is sent.  Alerts can be organized into levels, like warning and critical.  Alerts should be sent to your Ops personnel via SMS and email.  They can also be injected into your team collaboration software, like Slack or IRC.  There are helpful vendor services for managing the distribution of these alerts to an on-call rotation of Ops personnel.  The most popular of these is PagerDuty.
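
If you end up writing a custom check, Nagios plugins follow a simple contract: print a one-line status and exit 0 for OK, 1 for WARNING, 2 for CRITICAL (3 for UNKNOWN).  Here is a minimal Python sketch in that style; the queue-depth source and thresholds are hypothetical.

```python
import sys

# Nagios plugin convention: exit 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
WARN_THRESHOLD = 1_000
CRIT_THRESHOLD = 10_000

def get_queue_depth() -> int:
    """Placeholder: a real check would query your message broker here."""
    return 1_500

def main() -> int:
    depth = get_queue_depth()
    if depth >= CRIT_THRESHOLD:
        print(f"CRITICAL - queue depth {depth}")
        return 2
    if depth >= WARN_THRESHOLD:
        print(f"WARNING - queue depth {depth}")
        return 1
    print(f"OK - queue depth {depth}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```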