As a wise cloud architect once said, “I’ve got 99 problems and the cloud ain’t one” (props to Jay-Z). The cloud made running applications and services on a massive scale much easier. Yet cloud computing brings its own problems.
For one, back in the on-premises days, runaway code would merely cause performance degradation or an outage. Now AWS will turn out your pockets, pick you up, flip you upside down, and shake you until every last dime is gone—the bill for your bug.
Meanwhile, it is all too easy to use Amazon Kinesis or Azure Cosmos DB or Google Cloud Bigtable, but any one of them is a Hotel California where you can check out any time you like, but you can never leave. And while the pricing of raw infrastructure services has decreased over time, cloud provider pricing overall has stayed stable (and incomprehensible).
And, good gosh, amid all that complexity and a sprawl of instances, you are supposed to keep things stable and secure? And why is my Kubernetes config so dang long?
I could go on. Instead, I asked the people responsible for running some of the Internet’s most critical cloud-based services what problems they have faced, and how they solve or mitigate them.
1 - Cost management
Remember when people thought AWS was cheap? “When you actually have hardware that sits on-premises, you use it. It’s yours. You paid for it. You pay for electricity, but then you use it as much as you want,” Marc Sanfaçon, senior vice president of technology and co-founder of Coveo, told me.
“But when you have a company like ours with more than 200 developers,” Sanfaçon continued, “there are some policies in the company where they have to ask for authorisation to buy a new phone, or a desk, or a chair. But they can actually turn around and go into our Amazon Web Services [AWS] console and spin up a new machine that will cost the company 25 bucks an hour, and they leave that running for a month. At the end of the month, you’re like, oh my god, that’s a lot of money.”
Now Coveo turns off clusters or instances when no one is working, for example, 8pm to 6am and on the weekends. However, they have to make allowances for that developer who wakes up with inspiration at 2am and starts working.
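That scheduling policy boils down to a simple predicate that a cron job or scheduled Lambda could evaluate before stopping development instances. The hours below match the article's example, but the function and the keep-alive idea are a hypothetical sketch, not Coveo's actual tooling:

```python
from datetime import datetime

# Assumed off-hours policy: all weekend, plus 8pm-6am on weekdays.
OFF_START_HOUR = 20  # 8pm
OFF_END_HOUR = 6     # 6am

def is_off_hours(now: datetime) -> bool:
    """Return True when dev instances should be stopped under this policy."""
    if now.weekday() >= 5:  # Saturday (5) or Sunday (6)
        return True
    return now.hour >= OFF_START_HOUR or now.hour < OFF_END_HOUR
```

A scheduled job could call this and stop any instance lacking, say, a `keep-alive` tag—which is how you make allowances for the 2am developer: they tag their instance and the reaper skips it.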
Coveo already has someone working 75 per cent of their day on cloud cost optimisation. However, Sanfaçon notes a fledgling field of FinOps companies whose products help manage and optimise costs. Sanfaçon mentioned Cloudability and CloudHealth as examples of tools you can use to control cloud spending.
2 - Maintaining independence from cloud-specific services
Sanfaçon shared another cloud problem that Coveo has grappled with: keeping Coveo’s services functioning when AWS services fail.
“Just before Black Friday, AWS had two major incidents with Kinesis, which is one of the services that [Coveo is] using, but also one of the services that is the backbone of a lot of other services within AWS,” Sanfaçon noted. This outage didn’t affect Coveo’s main services but did affect their ability to onboard new organisations and record some types of events. Coveo is a search company, and the weeks around Black Friday are “go time” for many e-commerce customers.
Sanfaçon considered hosting Coveo’s own streaming service, but as troubling as the Amazon Kinesis outage was, he questioned whether Coveo could cost-effectively run a better messaging service with more uptime than AWS. Even if Coveo could, would that be an effective use of resources?
Another consideration: While there are many benefits to just consuming a service from a cloud service provider, it means they cannot just move to another provider like Google Cloud or Microsoft Azure, Sanfaçon noted.
A possible solution that splits the difference is to use the managed Kafka from AWS. Then Coveo could move over to Azure’s managed Kafka or Confluent’s managed Kafka on Google Cloud if there is a problem.
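The portability argument can be made concrete with a thin, provider-neutral interface. The sketch below is hypothetical, not Coveo's design: application code depends only on the `EventStream` abstraction, so moving from one managed Kafka to another means swapping the implementation, not rewriting the application. An in-memory implementation doubles as a test harness:

```python
from abc import ABC, abstractmethod
from collections import defaultdict

class EventStream(ABC):
    """Provider-neutral streaming interface (hypothetical)."""

    @abstractmethod
    def publish(self, topic, event):
        ...

    @abstractmethod
    def consume(self, topic):
        ...

class InMemoryStream(EventStream):
    """Test double. A production implementation would wrap a Kafka client
    pointed at AWS's, Azure's, or Confluent's managed Kafka endpoint."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, event):
        self._topics[topic].append(event)

    def consume(self, topic):
        return list(self._topics[topic])
```

Because the Kafka wire protocol is the same everywhere, only the connection details change between providers; the interface keeps even those out of the application code.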
There is indeed a cost to cloud independence, as running Amazon Kinesis is cheaper than running Amazon’s managed Kafka. Still, there are also benefits—especially when something goes down before Black Friday, during a pandemic, and you are the search backbone for many e-commerce sites.
Saravana Krishnamurthy, the vice president of SkySQL product management at MariaDB, likewise recommended against relying on anything cloud-specific. “If you have a REST API built into your solution or any other API, ensure all communication is through those APIs that are cloud-independent,” Krishnamurthy said. “So that way, when you move from Amazon to Google or Azure, you actually have a better way of moving your applications and data.”
3 - Cloud provider differences for multi-cloud
Jim Walker, vice president of product marketing at Cockroach Labs, noted the challenges posed by the cloud providers all doing things a little differently. Cockroach Labs built out its CockroachCloud database service on both AWS and Google Cloud and learned a lot about those differences.
“They are basically completely different and create significant work for us to get the experience ‘right’ in each,” Walker said. “Containers and Kubernetes have definitely helped us simplify some of the complexity, but we still needed to think about the two platforms very differently.” He offered some details:
For instance, the Kubernetes managed service is very different in each cloud, and networking complexities are totally different. The way we work with load balancers across each is not the same. Further, one allows us to customise and set IOPS and the other doesn’t. When we deliver a feature like VPC [Virtual Private Cloud] peering for our customers, the approach within each (AWS PrivateLink vs. vanilla) is also completely different. The cloud providers are of huge value, but we do have a lot to do with each.
4 - Cloud security
MariaDB’s Krishnamurthy also underscored the importance of network security in the cloud. “We don’t want one customer’s traffic to interfere with another customer,” Krishnamurthy said. “So when a customer requires a Virtual Private Cloud, where they want to isolate the traffic from the public network and from other customers, we provide the VPC as a way to isolate them.”
However, this can be complicated when someone has standardised on, say, Active Directory and authenticates between VPCs. That can require some arduous configuration and mapping policies to roles between systems.
5 - Complexity, configuration, and compliance
Configuring even a few servers and keeping them consistent is a challenge. Devops promised to simplify our operations and deployment issues, but configurations drift. Further, it is hard to see “who” changed the configuration when it exists in a series of scripts and applies to potentially hundreds of servers. For some industries, especially financial services, this lack of an audit trail is a real problem for compliance purposes.
A new set of technologies and a methodology called GitOps provide a solution. As the name implies, GitOps combines the version control tool Git with devops. However, GitOps is more than that. It also makes configuration declarative and detects drift from it. Moreover, Git maintains an audit trail. So who turned security off? You can answer that question by looking at the repo.
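The drift-detection half of GitOps boils down to diffing the desired state (committed to Git) against the actual state (read from the running system). A minimal sketch, with hypothetical setting names:

```python
def find_drift(desired, actual):
    """Return {key: (desired, actual)} for every setting that has drifted."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            drift[key] = (want, have)
    return drift

# Hypothetical example: the committed config versus live cluster state.
desired = {"replicas": 3, "tls_enabled": True}   # checked into the repo
actual = {"replicas": 3, "tls_enabled": False}   # observed in production
```

Here `find_drift` flags `tls_enabled`—and because the desired value lives in Git, `git log` answers who last changed it and when. Real GitOps tools run this reconciliation loop continuously and can revert the live system to match the repo.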
To quote a notorious cloud architect, “Mo servers mo problems.” Still, you can stay cost-effective with FinOps, fight complexity with GitOps, prevent your software from succumbing to a single-vendor outage by keeping it multi-cloud, and maintain your system’s security and privacy by isolating your services in your own VPC.
Gone are the days of feeling special because you use CVS to check in your Unix config files—and every Unix administrator who did that felt special. In this cloudy world, we have mo servers mo problems but also better tools.