The Autopilot Pattern and the rise of app-centric micro-orchestration

A distinguishing feature of Autopilot Pattern applications is that they can be used almost anywhere you can run Docker containers. To make sense of how it’s possible to make a complex, even stateful application so portable, we need to start with the relationship between the scheduler and the application (a single Docker container). Consider the minimum requirements for each:

Scheduler:
Start containers
Stop containers

Application:
Register itself in a service catalog for use by other apps
Look to the service catalog to find the apps it depends on
Configure itself when the container starts, and reconfigure itself over time

You’re looking for more, including discovery, configuration, orchestration, load balancing, monitoring and support for rolling upgrades? Read on...

Many schedulers do a lot more than simply start and stop instances. They attempt to manage all aspects of the application, including discovery, configuration, state management, networking, and load balancing. These schedulers need to be tightly integrated with the applications they orchestrate and create a complex set of interdependencies that are cumbersome to manage. Schedulers that attempt loose integration—such as only automated service discovery and load balancing—can be even worse, because without tight integration they force a lowest common denominator solution that imposes its own frustrations and risks. How can the scheduler’s load balancer recover from instance failures mid-transaction, for example? This scheduler-centric approach to orchestration can be made to work, but the complexity and maintenance costs are prohibitive for most users.

The Autopilot Pattern offers an alternative to that complexity by moving the application orchestration into the container itself. Many of the same tasks are being done, but they’re easier to do and maintain inside the container where we have easy access to the application, rather than outside the container where we have to conform everything to the scheduler’s framework and pipe the details through the scheduler’s interface (and every scheduler implements a different interface and framework for that). By making our applications effectively stateless to the scheduler, we can make even the most complex applications simple enough to manage just by setting the scale of their components.

The result is that all an application really needs of the scheduler is that it can start and stop containers. Let’s walk through a few production examples to understand that better.

Scaling up

Let’s explore how this works using our Autopilot Pattern example application, made up of the following components:

Nginx, the front-end proxy and static content server
Customers: a microservice for customer information
Sales: a microservice for sales information

Figure: An example Autopilot Pattern application

Those three components are all individual containers that can be deployed, scaled, and upgraded independently. Let’s say we need to scale the Sales microservice up to meet demand. We need the scheduler to start additional instances of the Sales microservice. This can be as simple as using docker run:

docker run -d -p 80 -e CONSUL=consul.local myimages/sales

We can repeat that docker run as many times as needed to meet current demand. We could also use Docker Compose. The following example sets the total number of instances to five (which could be more or fewer than are currently running):

docker-compose scale sales=5

Assuming we don’t already have five or more instances of that service running, Docker Compose in this case is effectively doing the same thing as the docker run command above: starting new containers.
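
Setting a scale, as opposed to starting containers one at a time, comes down to comparing the desired count with the running count. The following is a minimal sketch of that arithmetic, not a description of how Docker Compose is implemented; the function name `scale_actions` is hypothetical.

```python
def scale_actions(running: int, desired: int) -> dict:
    """Return how many containers to start or stop to reach `desired`.

    Hypothetical helper: it only does the arithmetic a scheduler needs
    before it starts or stops containers.
    """
    if running < 0 or desired < 0:
        raise ValueError("instance counts cannot be negative")
    delta = desired - running
    return {"start": max(delta, 0), "stop": max(-delta, 0)}


# Two Sales instances running, five wanted: start three more.
print(scale_actions(running=2, desired=5))
# Seven running, five wanted: stop two.
print(scale_actions(running=7, desired=5))
```

Either way, the scheduler's part is limited to starting or stopping that many containers.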

Both Marathon and Kubernetes can run Docker containers. The scale button in the Marathon web UI (and API) and Kubernetes replication controllers make it possible to scale the number of instances of each application up, but at heart, they’re just starting new instances in much the same way as we did with the docker run example above.

The problem is that not every scheduler can configure the new containers to connect to existing containers or update the existing containers to connect to the new containers. And those that can are all incompatible with each other. Worse, very few developers are running large, complex schedulers on their laptops for local development, increasing the risk that the containers will behave differently as they move from development to production and disconnecting the developer from the operational details that are required for production.

The Autopilot Pattern, on the other hand, works across all schedulers because it takes advantage of the common interface they provide to the application: the scheduler starts the container, the application takes it from there.

So, what’s going on in the containers as we start additional instances of the Sales service?

The Sales microservice depends on the Customers microservice, so when the newly launched Sales containers start, they’ll look to the service catalog to determine what instances of the Customers service are available and what their IP addresses are. And, because other services need to connect to the Sales microservice, each of the newly launched instances of the service will register itself with the service catalog. Just as importantly, the Sales microservice is monitoring its own health and reporting that to the service catalog as well.

We didn’t scale the Nginx containers, but they’re watching the service catalog for any changes in the Sales and Customers services that they proxy requests for. As the new Sales instances register themselves, the Nginx instances will automatically update their configuration. Looking to the service discovery catalog at startup, monitoring it during the life of the application, and reconfiguring as instances are added, removed, or become unhealthy is called active discovery.
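
To make active discovery concrete, here is a minimal sketch that uses an in-memory stand-in for the service catalog. `ServiceCatalog` and `BackendWatcher` are hypothetical names invented for illustration; a real deployment would talk to Consul or a similar catalog, and this is not ContainerPilot's implementation.

```python
from typing import Callable, Dict, FrozenSet, List, Set


class ServiceCatalog:
    """In-memory stand-in for a discovery catalog (hypothetical API)."""

    def __init__(self) -> None:
        self._instances: Dict[str, Set[str]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, set()).add(address)

    def deregister(self, service: str, address: str) -> None:
        self._instances.get(service, set()).discard(address)

    def instances(self, service: str) -> FrozenSet[str]:
        return frozenset(self._instances.get(service, set()))


class BackendWatcher:
    """Polls one backend and calls on_change only when its instances change."""

    def __init__(self, catalog: ServiceCatalog, service: str,
                 on_change: Callable[[List[str]], None]) -> None:
        self.catalog = catalog
        self.service = service
        self.on_change = on_change
        # Look to the catalog at startup to learn the initial state.
        self._last = catalog.instances(service)

    def poll(self) -> bool:
        current = self.catalog.instances(self.service)
        if current == self._last:
            return False  # nothing changed, no reconfiguration needed
        self._last = current
        self.on_change(sorted(current))  # e.g. rewrite config and reload
        return True


catalog = ServiceCatalog()
catalog.register("sales", "10.0.0.5:80")
reconfigurations: List[List[str]] = []
watcher = BackendWatcher(catalog, "sales", reconfigurations.append)

catalog.register("sales", "10.0.0.6:80")  # a new Sales instance registers itself
watcher.poll()                             # the watcher notices and reconfigures
```

In a running container `poll` would be called on a timer; the addresses above are made up.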

The Customers microservice also has a dependency on the Sales service. Just like the Nginx containers, the Customers instances are watching the discovery catalog for any changes to the services they depend on, and updating their configuration to accommodate the new Sales instances.

But my application isn't built to do this?

The Node.js microservices in the scaling example above are written with an awareness of service discovery and automatic configuration when scaling them, but relatively few applications are designed this way. How can we orchestrate all applications, including the trusted apps we're already using?

ContainerPilot is a small, open source helper that goes inside the container to automate all the discovery and configuration for an application according to the Autopilot Pattern. It uses a simple configuration file that specifies how the application should interact with other applications. Here’s the configuration file for our Nginx container in our scaling example:

{
  "consul": "http://{{ .CONSUL }}:8500",
  "preStart": [
    "consul-template", "-once", "-consul", "{{ .CONSUL }}:8500", "-template",
    "/etc/containerpilot/nginx.conf.ctmpl:/etc/nginx/nginx.conf"
  ],
  "services": [
    {
      "name": "nginx",
      "port": 80,
      "health": "/usr/bin/curl --fail -s http://localhost/health",
      "poll": 10,
      "ttl": 25
    }
  ],
  "backends": [
    {
      "name": "sales",
      "poll": 3,
      "onChange": [
        "consul-template", "-once", "-consul", "{{ .CONSUL }}:8500", "-template",
        "/etc/containerpilot/nginx.conf.ctmpl:/etc/nginx/nginx.conf:nginx -s reload"
      ]
    },
    {
      "name": "customers",
      "poll": 4,
      "onChange": [
        "consul-template", "-once", "-consul", "{{ .CONSUL }}:8500", "-template",
        "/etc/containerpilot/nginx.conf.ctmpl:/etc/nginx/nginx.conf:nginx -s reload"
      ]
    }
  ]
}

Consider the four main sections of that config file:

consul: This is the information about what discovery catalog is being used and how to connect to it; ContainerPilot also supports etcd, and there's community interest in supporting ZooKeeper
preStart: This is an event in the lifecycle of the container; the preStart event defines the command to run before starting the main application; additional user-defined events include preStop, postStop, health, and periodic tasks
services: These are the services provided by the container, and the configuration here defines how to tell if they're healthy
backends: These are the services this container depends on; the configuration specifies how to monitor them and what to do in the application if they change
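
One way to read those sections is to load the file and list what the container advertises and what it watches. The sketch below uses a trimmed copy of the config (health and onChange commands omitted) and a simplified substitution for the `{{ .CONSUL }}` placeholder. ContainerPilot does its own interpolation; the `render` and `summarize` helpers here are hypothetical and only illustrate the structure.

```python
import json
import re

# Trimmed copy of the Nginx config above, for illustration only.
CONFIG_TEMPLATE = """
{
  "consul": "http://{{ .CONSUL }}:8500",
  "services": [
    {"name": "nginx", "port": 80, "poll": 10, "ttl": 25}
  ],
  "backends": [
    {"name": "sales", "poll": 3},
    {"name": "customers", "poll": 4}
  ]
}
"""


def render(template: str, env: dict) -> str:
    """Replace {{ .NAME }} placeholders with values from env (simplified)."""
    return re.sub(r"\{\{\s*\.(\w+)\s*\}\}", lambda m: env[m.group(1)], template)


def summarize(config: dict) -> dict:
    """List the catalog address, advertised services, and watched backends."""
    return {
        "catalog": config["consul"],
        "advertises": [s["name"] for s in config.get("services", [])],
        "watches": {b["name"]: b["poll"] for b in config.get("backends", [])},
    }


config = json.loads(render(CONFIG_TEMPLATE, {"CONSUL": "consul.local"}))
summary = summarize(config)
print(summary)
```

The `consul.local` value matches the CONSUL environment variable passed to docker run earlier.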

Take a look at the Autopilot Pattern applications tutorial for more information about how ContainerPilot automates Nginx.

But Nginx is a nearly stateless app with no persistent data, you say. “How can we do this with a complex app?” you ask.

Handling persistent data

The Autopilot Pattern and ContainerPilot even work for complex, stateful applications with persistent data, like MySQL. Our Autopilot Pattern MySQL implementation uses ContainerPilot to manage state and persistent data inside the container. The MySQL implementation supports running a single primary with multiple replicas, and automatically bootstraps the replicas when they’re started.

Using the same preStart, health, and onChange events as our Nginx example above, ContainerPilot triggers code that automates the operational details and error recovery for the application: The primary (master) MySQL instance automatically backs up its data to an object store. As additional instances are launched, they download the data from the object store to bootstrap themselves. If the primary should fail, the replicas will elect and promote a new primary from amongst themselves.

By moving the orchestration into the application, the Autopilot Pattern reduces the complexity of the interface between the application and scheduler, and makes it portable to any scheduler

We had to develop the automation code for this: it’s a Python script that lives in the container, and it has methods to support all the ContainerPilot events we depend on. The script takes advantage of its unlimited access to the MySQL app in the same container to read and manage its state. In addition to all our orchestration code running inside the container, all the MySQL containers use in-instance storage, so the scheduler doesn’t need to do any additional coordination with data stored elsewhere; it just needs to start the MySQL container. The first container up becomes the primary, and any additional containers the scheduler starts become replicas, making it easy to scale the service to meet demand.
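
The "first container up becomes the primary" rule can be sketched as an atomic claim on a shared record. The source does not describe how the Autopilot Pattern MySQL script does this, so the `PrimaryLock` class below is an assumption: an in-memory stand-in for whatever atomic operation the service catalog offers.

```python
import threading
from typing import Optional


class PrimaryLock:
    """In-memory stand-in for an atomic claim in the catalog (hypothetical)."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._holder: Optional[str] = None

    def claim(self, node: str) -> bool:
        """Record `node` as primary if the role is free; report whether it holds it."""
        with self._mutex:
            if self._holder is None:
                self._holder = node
            return self._holder == node

    def release(self, node: str) -> None:
        """Free the role, e.g. when the primary's health check expires."""
        with self._mutex:
            if self._holder == node:
                self._holder = None


def choose_role(lock: PrimaryLock, node: str) -> str:
    """Decide at startup (or after a failure) which role this container takes."""
    return "primary" if lock.claim(node) else "replica"


lock = PrimaryLock()
roles = {name: choose_role(lock, name) for name in ["mysql-1", "mysql-2", "mysql-3"]}

# If the primary fails, the role is freed and the first replica to claim it is promoted.
lock.release("mysql-1")
promoted = choose_role(lock, "mysql-3")
```

The scheduler never sees any of this; it only started three identical containers.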

We could have done this with a very sophisticated scheduler, but doing so would require interfacing the MySQL state to the scheduler and possibly coordinating with data stored outside the container¹. At best, this is just shifting complexity from the container to the scheduler, but in practice it typically adds complexity when trying to communicate details from the app to the scheduler. And that complexity is multiplied by all the schedulers that need to be supported, and multiplied again for every application. Optimizations made in scheduler A may not be easily implemented in scheduler B, and the orchestrator implementation for app X likely won't work for app Y, for example.

Scaling down

Scaling down the stateless components of an application is straightforward: the scheduler sends the SIGTERM signal to the container to gracefully stop it. When ContainerPilot receives that signal, it de-registers the services it's advertising for the container. For our example application from above, all the Customers and Nginx instances will recognize the change in the service catalog when we scale down the Sales microservice and reconfigure themselves to direct requests to the remaining instances.
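
Here is a minimal sketch of de-registering on a graceful stop, assuming the stop signal is SIGTERM (what docker stop sends by default). The `Registrar` class and the plain set used as a catalog are hypothetical stand-ins, not ContainerPilot code.

```python
import signal


class Registrar:
    """Advertises one service instance and withdraws it on a stop signal."""

    def __init__(self, catalog: set, service_id: str) -> None:
        self.catalog = catalog        # stand-in for the service catalog
        self.service_id = service_id
        self.stopping = False
        catalog.add(service_id)       # register at startup

    def handle_stop(self, signum, frame) -> None:
        # De-register first so other services stop routing requests here,
        # then flag the main loop to finish in-flight work and exit.
        self.catalog.discard(self.service_id)
        self.stopping = True

    def install(self) -> None:
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle_stop)


catalog = set()
registrar = Registrar(catalog, "sales-1")
registrar.install()
```

Once the entry is gone from the catalog, watchers in the other containers pick up the change on their next poll.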

Sometimes, however, scaling events are unexpected and less than graceful. Instances, physical compute nodes, and other equipment can fail. The health checks for each app can detect these failures, and if an app fails to report its health status to the service catalog, the other Autopilot Pattern components will automatically stop sending requests to it using the same updating mechanisms we use for planned scale downs.
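
The "poll" and "ttl" values in the Nginx config hint at how a missed report turns into a removal: an instance stays in the healthy set only while its last passing check is newer than the TTL. The sketch below models that idea with explicit timestamps. `TTLCheck` is a hypothetical class, and a real catalog's TTL semantics may differ in detail.

```python
from typing import Dict, List


class TTLCheck:
    """Tracks the last passing health report for each instance."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._last_pass: Dict[str, float] = {}

    def heartbeat(self, instance: str, now: float) -> None:
        """Record a passing health check at time `now` (seconds)."""
        self._last_pass[instance] = now

    def healthy(self, now: float) -> List[str]:
        """Instances whose last report is no older than the TTL."""
        return sorted(
            name for name, seen in self._last_pass.items()
            if now - seen <= self.ttl
        )


checks = TTLCheck(ttl=25)
checks.heartbeat("sales-1", now=0)
checks.heartbeat("sales-2", now=0)
checks.heartbeat("sales-1", now=10)   # sales-1 keeps reporting every 10 seconds
# sales-2 has failed silently and never reports again.
still_healthy = checks.healthy(now=30)
```

With a 10 second poll and a 25 second TTL, an instance can miss one report without being dropped, but not two in a row.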

Some applications, however, require more orchestration when scaling them down. Scaling down a Couchbase cluster (used in our recent blueprint) requires marking the node for removal, then rebalancing the data to remaining nodes. The preStop event in ContainerPilot can trigger that exact behavior².

Deploying upgrades

The features of containers that have made them so developer-friendly also make deployments easier. The dream of immutable infrastructure that was so hard to achieve without containers is within reach with them, now that the challenges of managing dependencies for application components and delivering them to the server with each update can be wrapped up and isolated in each container. Rather than upgrading servers and dealing with the configuration management of those servers and all the libraries the applications depend on, we're simply starting new containers running the newest image.

The Autopilot Pattern automates the configuration of the application as we deploy and scale it, so upgrading Autopilot Pattern software, whether it's a stateless service or a persistent database, is simply a matter of starting new instances of the container and stopping old instances. In this case, we do need some coordination with the scheduler, as we need the scheduler to scale up instances with the new image and scale down the instances running the old image. This is still a matter of starting and stopping containers, but it adds scheduler responsibility for tracking what image version is running in each container instance.

Rolling deploys are easy in this context: scale up the new while scaling down the old. Canary deploys can be approached by adding a few new instances, then waiting before replacing all of them. If the canaries don't fare well, you can simply stop them.
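
As a rough illustration of "scale up the new while scaling down the old", the sketch below lists the steps of a rolling deploy that starts new instances before stopping old ones, so capacity never drops below the original count. The function name and the batch parameter are hypothetical; real schedulers expose their own controls for this.

```python
from typing import List, Tuple


def rolling_plan(total: int, batch: int = 1) -> List[Tuple[str, int, int]]:
    """Return (action, old_running, new_running) after each step of a rolling deploy."""
    if total < 0 or batch < 1:
        raise ValueError("total must be >= 0 and batch must be >= 1")
    steps: List[Tuple[str, int, int]] = []
    old, new = total, 0
    while old > 0:
        n = min(batch, old)
        new += n                      # start new-image instances first
        steps.append(("start new", old, new))
        old -= n                      # then stop the same number of old ones
        steps.append(("stop old", old, new))
    return steps


for step in rolling_plan(total=2):
    print(step)
```

A canary deploy is the same plan paused after the first "start new" step; if the canaries fare badly, those new instances are stopped instead of the old ones.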

Should you want to do a blue/green deploy, you can do that by adding a fleet of new (green) instances with the new version of the container image. You can then stop, but not delete, the fleet of old (blue) container instances. If it turns out things didn't work out as you'd hoped, you can restart the old/blue instances and stop the new/green ones. Through all these ups and downs, the Autopilot Pattern automation is connecting and reconnecting all the components of the application.

Upgrading a stateful service, like the MySQL and Couchbase examples above, is just as straightforward. Let's say you want to upgrade from Couchbase 3.x to 4.x. Launch one or more new containers running version 4.x. As they start up, they'll join the Couchbase cluster and rebalance the data to the new nodes. Once they've stabilized, you can mark the old nodes for removal and rebalance the data off them. This allows for zero-downtime upgrades for true continuous delivery.

No, really, deploying upgrades

You might wonder if the above is just theory, so it's worth considering how it works in practice with various schedulers. The following will explain how to deploy and scale Nginx instances. Nginx in this context could be part of our example from the top, or our recent Autopilot Pattern WordPress implementation.

Docker Compose

Docker Compose is a simple and developer-friendly way to start a handful of containers as a single app. It doesn't do any cluster management on its own, but in conjunction with Docker Swarm or Joyent Triton, it can get the job done. Here's the workflow:

Start Nginx and the larger application with docker-compose up -d
Scale the Nginx service with docker-compose scale nginx=3
Change the Nginx image tag in the Compose manifest from :1.0.0 to :1.0.1
Apply the update with docker-compose up -d

The result is that Compose will update the instances of Nginx one by one, for a seamless rolling deploy when there’s more than one instance of the service.

Mesosphere Marathon

Many of the earliest Docker production success stories were actually running on the open source dynamic duo of Mesos and Marathon. Mesos handled the cluster management while Marathon handled the scheduling of Docker containers. Other Mesos frameworks, like Mesos Elasticsearch, can do more complex orchestration, but the Marathon framework is best known for its ability to schedule stateless containers. Thanks to the Autopilot Pattern, it's now possible to manage stateful applications on Marathon. The team at Container Solutions demonstrated this on Mantl, a Mesos+Marathon distribution sponsored by Cisco.

Here's how to deploy, scale, and update Nginx on Marathon:

Register and start an Nginx service using curl -X POST http://<Marathon host>/v2/apps -d @<Marathon service manifest for Nginx>.json -H 'Content-type: application/json'
It will automatically scale to the number of instances specified in the manifest
Change the Nginx image tag in the Marathon service manifest
Update the app with curl -X PUT http://<Marathon host>/v2/apps -d @<Marathon service manifest for Nginx>.json -H 'Content-type: application/json'

This will result in a rolling deploy using the upgrade strategy in the manifest. That upgrade strategy offers control over how the update is rolled out, but the result with a good Autopilot Pattern application will be a seamless rollout.
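
The same create and update steps can be sketched in Python by building the HTTP requests without sending them. Everything specific here is an assumption for illustration: the host `marathon.example.com:8080`, the app id `/nginx`, the image name `myimages/nginx`, and the upgrade strategy values. The update uses the per-app URL `/v2/apps/<app id>`, which is a different form from the collection URL in the steps above.

```python
import copy
import json
import urllib.request

MARATHON = "http://marathon.example.com:8080"  # hypothetical host

# Hypothetical service manifest; the values are illustrative only.
manifest = {
    "id": "/nginx",
    "instances": 3,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "myimages/nginx:1.0.0"},
    },
    "upgradeStrategy": {"minimumHealthCapacity": 1.0, "maximumOverCapacity": 0.5},
}


def with_image(app: dict, image: str) -> dict:
    """Return a copy of the manifest that points at a different image tag."""
    updated = copy.deepcopy(app)
    updated["container"]["docker"]["image"] = image
    return updated


def build_request(method: str, url: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


create = build_request("POST", f"{MARATHON}/v2/apps", manifest)
update = build_request(
    "PUT",
    f"{MARATHON}/v2/apps{manifest['id']}",
    with_image(manifest, "myimages/nginx:1.0.1"),
)
# Sending either request would be: urllib.request.urlopen(create)
```

The only difference between the two request bodies is the image tag; the rollout itself is left to the scheduler's upgrade strategy.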

Google Kubernetes

Kubernetes handles cluster management, container scheduling, and scheduler-centric orchestration as a tightly-coupled system. Like Marathon, it was designed for stateless services. There are efforts to overcome that limitation within Kubernetes and with PersistentVolumeClaims, but the Autopilot Pattern offers a portable solution for both stateless and stateful applications now.

Here's how to deploy, scale, and update Nginx as we did with the examples above:

Register and launch an Nginx replica set with kubectl create -f ./<Nginx manifest>.yaml
Expose that Nginx replica set with kubectl expose deployment <deployment name>
Scale the service with kubectl scale deployment <deployment name> --replicas=3
Edit the replica set configuration with the new image tag and trigger a rolling update of that change with kubectl edit deployment <deployment name> (not kubectl rolling-update..., as you might assume from the docs)

Here again, the result is a seamless rolling deploy of the updated Nginx image.

As you can see, the different schedulers each require different service manifests and syntax, but they all basically do the same thing when deploying, scaling, and rolling out updates to a containerized service. The schedulers do vary in the level of control they offer over how to start and stop containers as well as their complexity and learning curve, but they all support Autopilot Pattern applications.

App-centric micro-orchestration FTW

Entangling the application and scheduler, as is required for scheduler-centric orchestration, links them together in ways that complicate everything from uptime to upgrades. By moving the orchestration details into the application container, we simplify both the scheduler and application.

Consider these additional benefits of app-centric micro-orchestration:

It's visible and self-documenting to all developers and operators
It's portable so that it works on (almost) any infrastructure, including developer and operator laptops
It's testable and repeatable everywhere the application goes, making it easy to bring up isolated staging environments for every change

This simplification allows us to focus our attention where it belongs: on our apps. And it helps us get started improving our apps today, without waiting to remake the world to fit a scheduler-centric orchestration scheme. And, in this space of rapidly changing schedulers, the Autopilot Pattern allows us to select the scheduler with the best scheduling features, rather than being locked into a scheduler because our orchestration requires it.

¹ To be clear: this MySQL implementation is connecting to an object store, but it's orchestrated by the application, not the scheduler. Connecting to services in an application is easy and completely avoids the complication of coordinating external storage services via the scheduler.
² It must be said, this isn't implemented in the Autopilot Pattern Couchbase image yet, but autopilotpattern/couchbase#14 is tracking the feature request.

Post written by Casey Bisson