Ryan Hurst
September 13, 2024
Windows uses Kerberos to authenticate machines and users. In this system, machine credentials are cryptographically random passwords, essentially secret keys, similar to those used in encryption protocols. By default these passwords are changed automatically every 30 days, and a single setting controls that interval. Machines use these credentials to get "tickets" for each service they communicate with, and they get new tickets every 10 hours.
This contrasts with legacy certificate management approaches, which are designed around manual processes. In these cases, certificates are typically valid for up to a year, leaving them exposed to theft far longer than a Kerberos credential. These legacy solutions often fail to integrate directly with the workloads that use the certificate, and they rely on complex manual procedures to ensure credentials get replaced on time. They focus on only part of the problem: delivering the certificate to the machine. It then becomes the customer’s responsibility to enable workloads to use the new certificate. That may require running multiple instances of the application and manually swapping keys and certificates on one instance at a time to ensure continuous availability. It may also mean accepting the risk of downtime by manually restarting the application to load the updated keys and certificates. The fragility of these approaches has been a significant source of downtime.
The lesson learned is that certificate rotation should be routine and occur regularly, just as it does in Kerberos. Modern approaches to certificate management do just this. A great example is Caddy, a web server that integrates certificate lifecycle management directly into the workload. Other systems use proxies like Envoy to offload the use of the certificate from the workload and dynamically apply the new certificate without downtime. This shift from managing certificates and keys to managing the entire lifecycle of the credential ensures that certificate management is no longer a source of downtime or toil.
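To make "apply the new certificate without downtime" concrete, here is a minimal sketch of the general idea: notice that the certificate and key on disk have changed and hand the new pair to the running process instead of restarting it. This is not how Caddy or Envoy implement it. The class name CertReloader, its check method, and the apply callback are hypothetical names introduced only for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


class CertReloader:
    """Calls `apply(cert_path, key_path)` whenever the files' contents change."""

    def __init__(self, cert_path: str, key_path: str,
                 apply: Callable[[str, str], None]) -> None:
        self.cert_path = cert_path
        self.key_path = key_path
        self.apply = apply  # hypothetical hook that loads the pair into the workload
        self._fingerprint: Optional[str] = None

    def _current_fingerprint(self) -> str:
        # Hash both files together so a change to either one is detected.
        digest = hashlib.sha256()
        for path in (self.cert_path, self.key_path):
            data = Path(path).read_bytes()
            digest.update(len(data).to_bytes(8, "big"))  # length prefix separates the files
            digest.update(data)
        return digest.hexdigest()

    def check(self) -> bool:
        """Apply the pair if it changed since the last check; return True if applied."""
        fingerprint = self._current_fingerprint()
        if fingerprint == self._fingerprint:
            return False
        self.apply(self.cert_path, self.key_path)
        # Recorded only after apply succeeds, so a failed load is retried next time.
        self._fingerprint = fingerprint
        return True
```

A process would call check() on a timer or from a file-change notification. In a Python TLS server, one plausible apply callback is the load_cert_chain method of an existing ssl.SSLContext, on the assumption that handshakes started afterwards use the new pair; that behavior is worth verifying for your runtime. The sketch also assumes the certificate and key are replaced together, so a real deployment would want to write them atomically.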
In the case of SPIFFE and Kubernetes, the consumer of the certificate and key is usually Envoy, but SPIFFE works with a wide range of alternatives. The question then becomes how long these certificates should be valid and how often they should be renewed. The answer always depends on the business needs and capabilities of an organization, but I often recommend starting with certificates that have a 7-day validity period and rotating them every day. If a renewal fails, the certificate already in use still has six days left, so you have six days to detect and respond to an outage. With this baseline, you can monitor certificate issuance for failures and then, as you gain confidence, lower the validity period.
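The arithmetic behind that recommendation can be sketched in a few lines. This is an illustration only: RotationPolicy and its methods are hypothetical names, not part of SPIFFE, Envoy, or any issuance tool, and the sketch assumes the renewal clock starts at the certificate's issuance time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RotationPolicy:
    validity: timedelta      # how long each issued certificate is valid
    rotate_every: timedelta  # how often a replacement certificate is requested

    def response_window(self) -> timedelta:
        """Validity left on the current certificate when its first renewal comes due."""
        return self.validity - self.rotate_every

    def renewal_due(self, issued_at: datetime, now: datetime) -> bool:
        """True once the certificate is at least `rotate_every` old."""
        return now - issued_at >= self.rotate_every

    def expired(self, issued_at: datetime, now: datetime) -> bool:
        """True once the certificate's validity period has fully elapsed."""
        return now >= issued_at + self.validity


# The starting point suggested above: 7-day certificates, rotated daily.
policy = RotationPolicy(validity=timedelta(days=7), rotate_every=timedelta(days=1))
print(policy.response_window())  # 6 days, 0:00:00

issued_at = datetime(2024, 9, 13, tzinfo=timezone.utc)
print(policy.renewal_due(issued_at, issued_at + timedelta(days=1)))  # True
```

Lowering the validity period later shrinks response_window() by the same amount, which is why it makes sense to do so only after monitoring shows that issuance is reliable.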
The key takeaway here is that the processes we automate completely and perform frequently are the ones we don’t have to worry about.