Marcel Levy
February 24, 2025
The average enterprise manages many more secrets for machines than it does for humans – either 20 times (Enterprise Strategy Group) or 45 times (GitGuardian) as many. The number keeps growing. 52% of organizations expect their total number of non-human identities to increase by more than 20% over the next 12 months. This would just be an interesting statistic, except that most of those surveyed in these reports also indicated that they had suffered a breach due to compromised non-human credentials. According to the ESG report, a little over a third of those breaches received detailed attention from the board of directors.
You’d be right to wonder what’s driving this. It’s not because your development and operations teams care little about security. Far from it! Bear with me, and let me explain.
Amazon was my second act. Well, maybe even my third. After starting my career in newspaper journalism, I had returned to school to study computer science. Getting a job as a software engineer at Amazon was, for me, a waking dream.
As always, it was the people that made the difference. One of the first people I met was Mark, my tech lead and mentor. His first rule: "Don't break prod."
It's easy to see how a single minute of downtime can cost millions for a company like Amazon. But the true cost of downtime is greater than any company can calculate, in terms of lost trust, missed opportunities, and downstream impacts on other business functions.
"Don't break prod." Easier said than done. But what Mark meant was that we should exercise great care before touching the systems that our customers relied on. We have to do this even though our job was to design and implement new features, fix bugs, and scale systems to meet new demands. It's the central paradox of distributed systems operations: Our changes to the systems are the primary cause of failure, and yet the systems will eventually fail without our constant changes.
None of this is news to anyone who's worked on distributed systems. Customers, internal and external, rely on these systems to be there when they're needed. So once the system is launched, we surround it with guardrails of all kinds -- testing, monitoring, automated updates and deployments. And we shy away from risky changes.
There's nothing quite so risky as a change to the secret material that a system depends on for access to its critical dependencies. That material could be database passwords, cloud service secret access keys, or the private keys for X.509 client certificates. There are a lot of changes that can _degrade_ a system, but an authentication failure is a complete outage. If you have the wrong database password, you don't have a database.
As you might imagine, this sets up some... tension with strong security practices. These include practices like frequent key rotation, or ensuring that a particular access key has least privilege. Once we've set up the access and everything works, it's a lot of work to modify it or scope it down. And _of course_ that should have been done in the first place, and automation should have been added from the start. In a perfect world, this is just something everyone does when they build the system.
I'm here to tell you that the world is not a perfect place. It's not that security is an afterthought (although it sadly still is sometimes), but we are imperfect beings building novel systems with a limited amount of time. "Out of the crooked timber of humanity, no straight thing was ever made," Immanuel Kant said more than two centuries ago, and he wasn't saying anything new back then.
So, we build with static secrets. We build using identities with administrative or root access. We launch something that will require a rotation event in a year, and we figure we'll take care of it before then. Maybe the rotation even requires a coordinated effort across two different teams in the same company. Maybe it even requires coordination with a vendor. In effect, we schedule future outages. We schedule more of them every time we add new dependencies with different identities.
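To see how those scheduled outages pile up, here is a minimal Python sketch of the kind of inventory a team ends up keeping by hand. Everything in it is hypothetical: the `StaticSecret` record, the `due_for_rotation` helper, and the secret names, owners, and dates are illustrations, not taken from any real system.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class StaticSecret:
    """One long-lived credential that someone must remember to rotate."""
    name: str          # hypothetical identifier for the secret
    owner: str         # team (or vendor) that has to take part in the rotation
    expires_on: date   # the day this secret stops working


def due_for_rotation(secrets, today, window_days=30):
    """Return secrets that are expired or expire within the window, soonest first."""
    cutoff = today + timedelta(days=window_days)
    due = [s for s in secrets if s.expires_on <= cutoff]
    return sorted(due, key=lambda s: s.expires_on)


if __name__ == "__main__":
    # Hypothetical inventory: every entry is a future outage with a date on it.
    inventory = [
        StaticSecret("orders-db-password", "orders-team", date(2025, 3, 10)),
        StaticSecret("payments-vendor-api-key", "vendor", date(2025, 6, 1)),
        StaticSecret("metrics-client-cert", "platform-team", date(2025, 2, 20)),
    ]
    for secret in due_for_rotation(inventory, today=date(2025, 2, 24)):
        print(f"{secret.expires_on}  {secret.name}  (owner: {secret.owner})")
```

Each new dependency adds a row, and each row needs a person, a calendar reminder, and often a second team on the other end.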
Secret management resembles Hogwarts' caretaker Argus Filch, perpetually jingling a massive ring of keys, each one a unique point of failure. It's medieval, and we deserve better. I don't want my applications to have to explicitly manage a collection of disparate secrets, passwords, signing keys, certificates, bearer tokens, API keys, access keys, macaroons, and tickets.
This is called "secret sprawl," and that’s how we get to 20 to 45 times as many machine secrets as human ones. On a personal level, we understand this all too well. It's why there's a market for password managers, and why there's a push to move toward technologies like passkeys.
Dear reader, I don't bring this up to make you feel bad about what your systems look like. I've done all of these things myself. But I'm here to add my voice to the growing chorus crying, "Enough!"
Identity and authentication should not be something we bolt on to our applications later. They should be baked in, like network access. Writing packets over the wire is something I leave to the internals of SDKs and service frameworks. I want my application's identity and authentication details handled in the same way.
We have better solutions now, like SPIFFE (Secure Production Identity Framework for Everyone). It's an open-source standard that allows systems to identify each other in a way that's cryptographically sound and has a strong root of trust. Clearly, I am biased when it comes to SPIFFE, but over the past fifteen years I've found it to be the best hope for getting out of our mess. The IETF's WIMSE (Workload Identity in Multi System Environments) working group is going even further to distill the experience of global experts into a set of standards.
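If you haven't seen one, a SPIFFE ID is a URI of the form `spiffe://<trust domain>/<path>`. The sketch below parses that shape in Python. It is a simplified check under stated assumptions, not a full implementation of the SPIFFE specification: the `SpiffeId` type, the `parse_spiffe_id` helper, and the example ID are hypothetical, and a real deployment would rely on a SPIFFE library rather than hand-rolled parsing.

```python
import re
from dataclasses import dataclass

_PREFIX = "spiffe://"
# Simplifying assumption: lowercase letters, digits, dots, dashes, underscores.
_TRUST_DOMAIN = re.compile(r"[a-z0-9._-]+")
# Simplifying assumption: letters, digits, dots, dashes, underscores.
_PATH_SEGMENT = re.compile(r"[A-Za-z0-9._-]+")


@dataclass(frozen=True)
class SpiffeId:
    trust_domain: str  # who vouches for the workload
    path: str          # which workload it is: "" or "/segment/segment"


def parse_spiffe_id(value: str) -> SpiffeId:
    """Split a SPIFFE ID into trust domain and path, rejecting malformed input."""
    if not value.startswith(_PREFIX):
        raise ValueError("SPIFFE ID must start with 'spiffe://'")
    trust_domain, slash, rest = value[len(_PREFIX):].partition("/")
    if not _TRUST_DOMAIN.fullmatch(trust_domain):
        raise ValueError(f"invalid trust domain: {trust_domain!r}")
    if not slash:
        # No path at all: the ID names the trust domain itself.
        return SpiffeId(trust_domain, "")
    for segment in rest.split("/"):
        # Empty segments (including a trailing slash) fail the regex check.
        if segment in (".", "..") or not _PATH_SEGMENT.fullmatch(segment):
            raise ValueError(f"invalid path segment: {segment!r}")
    return SpiffeId(trust_domain, "/" + rest)


if __name__ == "__main__":
    # Hypothetical workload identity for illustration only.
    print(parse_spiffe_id("spiffe://example.org/ns/prod/sa/billing"))
```

The ID itself is not a secret. In SPIFFE it travels inside a short-lived, signed document (an SVID), so there is no long-lived password for the application to store or rotate.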
If you're not familiar with SPIFFE, take some time to investigate it. It's a natural outgrowth of our shift to container-based deployment, which abstracts away details so that applications can focus on business logic. A SPIFFE implementation, like SPIRL, does the same for workload identities. SPIRL's designers took their experience running SPIFFE for critical business needs, and built a system that simplifies identity management.
For development teams, it means they can stop spending time and effort on secret inventory and rotation. For architects, it means that their systems can interact in ways that are foundationally secure. For CISOs, it means reduced risk and an easier time meeting compliance goals.