Key Highlights:
- Staleness Issue: Kubernetes controllers may act incorrectly due to outdated cache data, leading to delayed or incorrect actions.
- Kubernetes 1.36 Improvements: Introduces new features to mitigate staleness, including atomic FIFO processing in client-go and staleness checks in key controllers (DaemonSet, StatefulSet, ReplicaSet, Job).
- New Capabilities: Controllers now check cache resource versions before acting, ensuring they have the latest cluster state information.
- Observability Enhancements: New metrics (stale_sync_skips_total, store_resource_version) help monitor controller health and cache staleness.
- Future Plans: Expanding these features to more controllers and integrating with controller-runtime for broader adoption.
Staleness in Kubernetes controllers is a common but often overlooked issue that can lead to unexpected behavior. Many controller authors discover the problem only after a production controller has taken an incorrect action based on outdated assumptions. Staleness can manifest in several ways: controllers acting on stale data, failing to act when they should, or delaying actions unnecessarily. The good news? Kubernetes v1.36 introduces new features to help mitigate these issues while improving observability into controller behavior.
What Is Staleness?
Staleness occurs when a controller’s internal cache holds an outdated view of the cluster state. To optimize performance, controllers maintain a local cache populated by watching the Kubernetes API server for relevant object changes. When a controller reconciles an object, it reads the current state from this cache rather than querying the API server directly, so its decisions are only as accurate as the cache.
However, certain scenarios can leave the cache outdated. For example:
- After a controller restart, it must rebuild its cache, leaving it temporarily unable to act.
- If the API server goes down, the cache won’t update, potentially causing missed actions.
These are just a few cases where staleness can creep in.
What’s New in Kubernetes 1.36
Version 1.36 brings improvements to client-go and to key controllers in kube-controller-manager, which build on the client-go changes.
Client-Go Improvements
The update introduces atomic FIFO processing (enabled via the AtomicFIFO feature gate), building on the existing FIFO queue implementation. This ensures the queue remains consistent even when processing batched operations—like the initial object list an informer uses to populate its cache. Previously, events were processed in arrival order, which could lead to cache inconsistencies.
With this change, client-go users can now verify cache freshness using the new LastStoreSyncResourceVersion() function in the Store interface. This serves as the foundation for staleness mitigation in kube-controller-manager.
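The idea of applying a batch atomically and remembering which resource version it corresponds to can be illustrated with a toy store. This is only a sketch under stated assumptions, not client-go's implementation: toyStore, applyBatch, and view are hypothetical names, and resource versions are modeled as plain integers for simplicity.

```go
package main

import (
	"fmt"
	"sync"
)

// toyStore is a hypothetical stand-in for an informer cache.
// It is NOT a client-go type.
type toyStore struct {
	mu         sync.RWMutex
	items      map[string]string // object key -> object (simplified to a string)
	lastSyncRV uint64            // resource version the contents correspond to
}

func newToyStore() *toyStore {
	return &toyStore{items: make(map[string]string)}
}

// applyBatch writes every item and the new resource version under a single
// lock, so a reader never observes a half-applied batch.
func (s *toyStore) applyBatch(batch map[string]string, rv uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for key, obj := range batch {
		s.items[key] = obj
	}
	s.lastSyncRV = rv
}

// view returns the item count together with the resource version it matches.
func (s *toyStore) view() (int, uint64) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.items), s.lastSyncRV
}

func main() {
	s := newToyStore()
	s.applyBatch(map[string]string{
		"default/a": "pod-a",
		"default/b": "pod-b",
	}, 42)
	n, rv := s.view()
	fmt.Println(n, rv) // prints "2 42"
}
```

Because the contents and the version change together, a caller that reads the version can trust that the store reflects at least that point in the cluster's history.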
Kube-Controller-Manager Updates
Four high-contention controllers now leverage this capability by default:
- DaemonSet controller
- StatefulSet controller
- ReplicaSet controller
- Job controller
These controllers now check the cache’s latest resource version before acting. If the cache version is older than what the controller last wrote to the API server, the controller holds off instead of making decisions based on stale data. You can disable this check for an individual controller by setting its feature gate to false (e.g., StaleControllerConsistencyDaemonSet=false).
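The comparison described above can be sketched in a few lines. This is a minimal illustration, not the kube-controller-manager code: cacheIsStale and syncOnce are hypothetical helpers, and resource versions are modeled as plain integers, which is a simplification.

```go
package main

import "fmt"

// cacheIsStale reports whether the cache is behind the controller's own
// last write. Both arguments are resource versions, modeled here as integers
// (an assumption made for this sketch).
func cacheIsStale(cacheRV, lastWrittenRV uint64) bool {
	return cacheRV < lastWrittenRV
}

// syncOnce is a hypothetical wrapper around a reconcile function. It skips
// the reconcile when the cache has not yet caught up and reports whether
// the reconcile ran.
func syncOnce(cacheRV, lastWrittenRV uint64, reconcile func()) bool {
	if cacheIsStale(cacheRV, lastWrittenRV) {
		return false // a real controller would retry later
	}
	reconcile()
	return true
}

func main() {
	calls := 0
	reconcile := func() { calls++ }

	fmt.Println(syncOnce(100, 105, reconcile)) // false: cache is behind our write
	fmt.Println(syncOnce(105, 105, reconcile)) // true: cache has caught up
	fmt.Println(calls)                         // 1
}
```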
For Informer Authors
Client-go’s new ConsistencyStore interface helps informer authors implement staleness checks with three key functions:
- WroteAt: Records the latest resource version after API server writes.
- EnsureReady: Checks if the cache is up-to-date before reconciliation.
- Clear: Removes deleted objects from tracking.
This system uses UIDs to distinguish between objects with identical names (e.g., after deletion and recreation), ensuring accurate version tracking.
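The three operations and the role of the UID can be sketched with a small in-memory tracker. The method names follow the list above, but the signatures, the objectKey type, and the integer resource versions are assumptions made for illustration; the actual client-go interface may look different.

```go
package main

import (
	"fmt"
	"sync"
)

// objectKey is a hypothetical key type. Including the UID keeps a deleted
// object and a recreated object with the same name apart.
type objectKey struct {
	Namespace string
	Name      string
	UID       string
}

// consistencyTracker is a toy tracker, not the client-go ConsistencyStore.
type consistencyTracker struct {
	mu      sync.Mutex
	written map[objectKey]uint64 // last resource version written per object
}

func newConsistencyTracker() *consistencyTracker {
	return &consistencyTracker{written: make(map[objectKey]uint64)}
}

// WroteAt records the resource version returned by a successful write.
func (c *consistencyTracker) WroteAt(key objectKey, rv uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if rv > c.written[key] {
		c.written[key] = rv
	}
}

// EnsureReady reports whether a cache at cacheRV has seen our last write.
func (c *consistencyTracker) EnsureReady(key objectKey, cacheRV uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return cacheRV >= c.written[key]
}

// Clear stops tracking an object, for example after it has been deleted.
func (c *consistencyTracker) Clear(key objectKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.written, key)
}

func main() {
	tracker := newConsistencyTracker()
	old := objectKey{Namespace: "default", Name: "web", UID: "uid-a"}
	recreated := objectKey{Namespace: "default", Name: "web", UID: "uid-b"}

	tracker.WroteAt(old, 210)
	fmt.Println(tracker.EnsureReady(old, 200))       // false: cache is behind
	fmt.Println(tracker.EnsureReady(recreated, 200)) // true: different UID, no writes yet

	tracker.Clear(old)
	fmt.Println(tracker.EnsureReady(old, 200)) // true: no longer tracked
}
```

Keying on the UID means a write recorded for the old "web" object never blocks reconciliation of a new object that reuses the name.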
Enhanced Observability
Kubernetes 1.36 also adds new alpha metrics to monitor staleness:
- stale_sync_skips_total: Tracks syncs skipped due to stale caches (per controller).
- store_resource_version: Shows the latest resource version per informer, useful for comparing against the API server’s state.
Looking Ahead
The Kubernetes SIG API Machinery team plans to expand these features to more controllers and collaborate with controller-runtime to bring these benefits to its ecosystem. Your feedback is welcome—share your thoughts via GitHub issues or discussions!
