Minimize Enterprise Risk with Service Mesh

Any organization working in the cloud has to manage multiple, complex challenges to security and reliability, while keeping a tight rein on costs. Your brand’s reputation depends on managing these challenges with aplomb, ensuring that you handle threats and failures quickly, transparently, and efficiently. Increasingly, organizations are choosing an open-source service mesh to help avoid downtime—while driving potentially game-changing business benefits, including drastic reductions in cloud spend.

Service mesh

Eliminate Downtime

The shift to cloud-native technologies has fundamentally changed application development from a world where applications run on hardware and networks completely controlled by the developing organization to a world where that control is traded for lower costs and speed in the development cycle. In turn, this tradeoff requires the organization to embrace new cloud-native patterns, like microservices, Kubernetes, and the use of a service mesh, so that the application still has needed security and resilience.

These new patterns allow shifting the security boundary entirely from physical data-center security to application security, including ensuring that all data is encrypted both at rest and in transit. The service mesh plays a critical role in this shift, by adding security, reliability, and observability to the application in a way that minimizes developer involvement.

For example, Linkerd, the first open source service mesh to achieve the “graduated” status in the Cloud Native Computing Foundation, uses sophisticated techniques such as mutual TLS to safeguard both confidentiality (encryption) and authenticity (identity validation) of both sides of the connection for all traffic within an application. Linkerd does this completely transparently, without the application needing to change.

Additionally, the mesh’s observability features can allow the operations staff to see problems on a Friday night before they become an emergency. And its reliability features can prevent needing to call in a developer team to work the weekend—instead, the operations staff can simply configure the mesh for automatic retries, preserving the user experience and leaving the more intense problem hunting for Monday.

Realize Cost Savings

Service meshes offer direct and secondary cost savings. The mesh can reduce direct cloud costs by allowing organizations to get rid of load balancers in the cloud and reduce some network traffic. In some cases, organizations have been able to eliminate hundreds of paid IP addresses for microservices.

In cases where the organization is running clusters spanning multiple availability zones, some service meshes like Linkerd can even further reduce costs by carefully routing traffic (or handling outages) so that traffic stays within a zone, which costs less than traffic between zones. This can bring dramatic reductions in cloud network spend (millions of dollars a year) while still retaining the failure-resistant properties of multi-zone deployments, as in the case of Entain Australia, which 10x’d throughput and saved thousands of dollars a day by deploying Linkerd.

Service meshes also offer secondary cost savings resulting from the increased efficiency of developers. By delivering critical platform features like mutual TLS, latency-aware load balancing, retries, success rate instrumentation, transparent traffic shifting, and more, service mesh frees developers of these tasks, allowing them to focus on the business logic that drives the organization. These savings are significant. Critical platform maintenance can be incredibly difficult to get right in a large distributed system, placing undue pressure on application developers.

Protect your Reputation

Operational continuity—and the availability of online services—leaves organizations with no slack. Users have come to expect instant access at any time of the day. In a distributed system, IT outages that start as partial failures in one area can quickly escalate into major operational disruptions that impact the customer experience. Issues, errors, or delays reflect on the organization or the brand.

A service mesh delivers a sophisticated set of distributed system reliability features that can help prevent escalation in the first place, including request-level load balancing, timeouts, retries, rate limiting, circuit breaking, and traffic shifting. Some service meshes even provide powerful features like latency-based load balancing and retry budgets to tamp down on partial failures before they escalate.

Linkerd service mesh
image credit: blog

Service Mesh for All?

What kind of organizations can benefit from service mesh? Use cases suggest that nearly every organization creating cloud apps in Kubernetes could benefit—including small start-ups. Not only does the service mesh provide operational simplicity, but it can also enhance progress for application developers at every stage of an organization’s growth.

Some open-source service meshes have a reputation for complexity. Others were designed to be operationally simple yet powerful, allowing organizations to see immediate benefits. Choosing the right mesh, one that provides critical features “out of the box,” frees the engineering team to focus on fundamental applications that power the business, providing a competitive advantage.

Compare Notes

Uses cases for service mesh come from industry leaders that include Microsoft, Plaid, and Adidas. These companies, all with global users, have realized the business benefits of creating scalable systems with resilient infrastructure that includes automatic retries, circuit breakers to isolate faults, and seamless restore functionality. Service mesh helps them detect where failures are happening with advanced observability and provides Zero Trust security system-wide.

  • Microsoft’s Xbox, a gaming system, uses service mesh to enhance consistency across the platform, allowing multiplayer games in the Xbox Network.
  • Plaid, a global financial services provider, uses service mesh to accelerate their production and implement changes in as little as 30 minutes, an unheard-of speed in the financial world.
  • Adidas, a global athletic brand, uses service mesh for system redundancy, security, and automated prioritization of network traffic.

Measure The Impact

The business impacts of using service mesh can be seen in an organization’s overall uptime (increased), overall spending in networking and engineering (decreased), and developer/engineer productivity (increased). Other changes are more subtle but still measurable, including employee satisfaction for those in networking or development and positive shifts in the organization’s development philosophy.

And, of course, there’s one other important metric: how many problems your customers notice. When customers and users are unaware of issues because service mesh has things covered, your organization is meeting expectations, building trust, and enhancing your reputation.