How to Use Kubernetes PDBs

Published: 2025-02-04 Author: 千家信息网 editor

千家信息网 last updated: 2025-02-04

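This article explains how to use Kubernetes PodDisruptionBudgets (PDBs). Many people are unsure how to apply them in day-to-day operations, so the editor has gone through various sources and put together a simple, practical walkthrough that should help clear up the confusion. Let's get started.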

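When to Use a PDB

The PodDisruptionBudget object (PDB for short) was added around Kubernetes 1.4 and promoted to Beta in 1.5, yet it was still Beta when 1.9 was released (it eventually went GA as policy/v1 in Kubernetes 1.21). Never mind that for now; let's first think about what problem PDBs are meant to solve. The feature has been around for more than a year, but I had never looked into it because I had no use case. Recently I have been working on the design of a Kubernetes-based ElasticSearch as a Service (ESaaS) project, which must ensure as far as possible that every ElasticSearch cluster always has at least one healthy, available ES client pod, ES master pod, and ES data pod. Many readers will think: a Deployment lets you set maxUnavailable, so isn't that enough? Besides, isn't the RS Controller there to manage replicas?

Hold on! When does a Deployment's maxUnavailable apply? It guarantees a minimum number of serving replicas while an application deployed with a Deployment is going through a rolling update. And the RS Controller? It is just one of the replica controllers. It does not guarantee that the cluster always has a certain number of replicas; its job is to bring the actual replica count in line with the desired count as quickly as possible, and it does not care what the actual count is at any moment in between. This is where Kubernetes PDBs come in: they protect application high availability by setting budgets for voluntary disruptions.

Since voluntary disruptions have come up, let's sort out what a voluntary disruption is and what an involuntary disruption is.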

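Involuntary Disruptions and How to Handle Them

An involuntary disruption is one caused by external factors that are beyond control (or currently hard to control), for example:

A hardware failure or kernel crash on a server takes the node down.
If containers run in VMs, a VM is deleted by mistake or the hypervisor fails.
The cluster suffers a network partition (split-brain). (Kubernetes handles network partitions through the NodeController, but it still does not take application high availability into account when evicting pods.) For an in-depth analysis of the NodeController, see my posts:

Kubernetes Node Controller Source Code Analysis: Execution
Kubernetes Node Controller Source Code Analysis: Creation
Kubernetes Node Controller Source Code Analysis: Configuration
Kubernetes Node Controller Source Code Analysis: Taint Controller

A node runs short of compute resources because of unreasonable overcommitment; the kubelet eviction this triggers does not take application high availability into account either. For an in-depth analysis of kubelet eviction, see my posts:

Kubernetes Eviction Manager Source Code Analysis
How the Kubernetes Eviction Manager Works

PDBs do not address involuntary disruptions. So how can we mitigate their impact on application high availability when using Kubernetes?

Deploy applications with replica controllers such as Deployment, RS, or StatefulSet wherever possible, with replicas greater than 1.
Set resource requests on the application's containers so that they still have enough resources even when the node is under heavy pressure.
Also consider HA at the physical level as far as possible, for example spreading an application's replicas across servers, across cabinets and racks, and across switches.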

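PDBs Protect Application High Availability During Voluntary Disruptions

The opposite of an involuntary disruption is naturally a voluntary disruption: one triggered by a user or cluster administrator and under Kubernetes' control, for example:

Deleting the controllers that manage the Pods, such as a Deployment, RS, RC, or StatefulSet.
Triggering a rolling update of an application.
Deleting Pods directly in bulk.
Running kubectl drain on a node (taking a node offline, scaling the cluster down).

PDBs are designed for voluntary disruptions, which fall within what Kubernetes can control; they are not designed for involuntary disruptions.

Once the Kube-Node project goes live, it can integrate with cloud providers such as OpenStack, AWS, and GCE to manage Nodes automatically, so HNA (Horizontal Node Autoscaler) events may occur frequently. That workflow includes logic similar to draining a node, so PDBs are needed to protect application HA.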

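Using PDBs: Usage and Caveats

Usage Notes and Caveats

Every app deployed on Kubernetes can have a corresponding PDB object that limits either the maximum number of replicas that may be down or the minimum number that must remain available during voluntary disruptions, thereby protecting the application's high availability.

A PDB can protect applications managed by Kubernetes' built-in controllers, in which case the PDB selector should be the same as the selector of the controller object: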

Deployment
ReplicationController
ReplicaSet
StatefulSet

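It can also protect a set of Pods chosen purely by the PDB's own selector, with two restrictions:

Only .spec.minAvailable can be configured; maxUnavailable cannot be used.
.spec.minAvailable must be an integer, not a percentage.

Either way, the set of Pods a PDB affects is chosen by its own selector, so make sure that different PDB objects in the same namespace do not use overlapping selectors.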
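To make the overlap problem concrete, here is a minimal Go sketch (standard library only) that counts how many PDBs select a given Pod. The `pdb` type and both function names are hypothetical and exist only for this illustration; they are not part of any Kubernetes API. The sketch models matchLabels only and ignores matchExpressions.

```go
package main

import "fmt"

// pdb is a hypothetical, stripped-down stand-in for a PodDisruptionBudget:
// just a name and a matchLabels-style selector.
type pdb struct {
	name        string
	matchLabels map[string]string
}

// matches reports whether every key/value pair of the selector is present
// in the Pod's labels. In this sketch an empty selector matches every Pod.
func matches(selector, podLabels map[string]string) bool {
	for k, v := range selector {
		got, ok := podLabels[k]
		if !ok || got != v {
			return false
		}
	}
	return true
}

// selectingPDBs returns the names of all PDBs whose selector matches the Pod.
func selectingPDBs(pdbs []pdb, podLabels map[string]string) []string {
	var names []string
	for _, p := range pdbs {
		if matches(p.matchLabels, podLabels) {
			names = append(names, p.name)
		}
	}
	return names
}

func main() {
	pdbs := []pdb{
		{name: "zk-pdb", matchLabels: map[string]string{"app": "zookeeper"}},
		{name: "tier-pdb", matchLabels: map[string]string{"tier": "backend"}},
	}
	pod := map[string]string{"app": "zookeeper", "tier": "backend"}

	hits := selectingPDBs(pdbs, pod)
	if len(hits) > 1 {
		fmt.Printf("pod is selected by %d PDBs %v: selectors overlap\n", len(hits), hits)
	}
}
```

A check along these lines could be run over the PDBs of a namespace before creating a new one, so that no Pod ends up selected by more than one budget.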

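When using a PDB, you need to be clear about the type of your application and how you want disruptions handled:

Stateless application: for example, you want at least 60% of replicas to be available.

Solution: create a PDB object with minAvailable set to 60%, or maxUnavailable set to 40%.

Single-instance stateful application: customers must be notified and must agree before the instance is terminated.

Solution: create a PDB object with maxUnavailable set to 0 so that Kubernetes blocks voluntary eviction of the instance. After notifying users and obtaining their consent, delete the PDB to lift the block, then recreate the instance. A rolling update of a single-instance StatefulSet always involves downtime, so avoid single-instance StatefulSets in production.

Multi-instance stateful application: the number of available instances must not fall below some number N (for example, because of the election mechanism in Raft-style applications).

Solution: set maxUnavailable=1 or minAvailable=N. The former allows only one instance to be disrupted at a time; the latter allows up to expected_replicas - minAvailable instances at a time.

Batch Job: the Job needs a Pod to eventually complete the task successfully.

The Job Controller has its own mechanism to ensure this, so no PDB is needed.
For an in-depth look at the Job Controller, see my post: Kubernetes Job Controller Source Code Analysis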
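The arithmetic behind these scenarios can be sketched in a few lines of Go. This is an illustration under one stated assumption: percentages are rounded up to whole Pods, which is my understanding of how the disruption controller resolves them. All function names are made up for the example. Note that minAvailable=60% and maxUnavailable=40% only agree when the replica count divides evenly.

```go
package main

import "fmt"

// ceilPercent returns ceil(total * percent / 100) using integer math.
// Rounding up is an assumption of this sketch (see the text above).
func ceilPercent(total, percent int) int {
	return (total*percent + 99) / 100
}

// desiredHealthyFromMinAvailable returns how many Pods must stay healthy
// when minAvailable is given as a percentage of replicas.
func desiredHealthyFromMinAvailable(replicas, percent int) int {
	return ceilPercent(replicas, percent)
}

// desiredHealthyFromMaxUnavailable returns how many Pods must stay healthy
// when maxUnavailable is given as a percentage of replicas.
func desiredHealthyFromMaxUnavailable(replicas, percent int) int {
	d := replicas - ceilPercent(replicas, percent)
	if d < 0 {
		d = 0
	}
	return d
}

// quorum returns the majority size of a Raft-style cluster with n members,
// a natural candidate for the N in minAvailable=N.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, replicas := range []int{10, 7} {
		fmt.Printf("replicas=%d: minAvailable=60%% keeps %d, maxUnavailable=40%% keeps %d\n",
			replicas,
			desiredHealthyFromMinAvailable(replicas, 60),
			desiredHealthyFromMaxUnavailable(replicas, 40))
	}
	fmt.Printf("quorum of a 5-member cluster: %d\n", quorum(5))
}
```

With 7 replicas the two settings differ by one Pod under this rounding assumption, so it is worth working the numbers for the actual replica count rather than treating the two percentages as interchangeable.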

Defining a PDB Object

Having thought this through and decided to create a PDB, let's look at how a PodDisruptionBudget is defined. Here is a sample:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper

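A PDB definition has just three key fields:

.spec.selector selects the backing set of Pods; the best practice is to keep it identical to the selector of the application's Deployment or StatefulSet.
.spec.minAvailable is the number or percentage of Pods that must remain available during voluntary disruptions.
.spec.maxUnavailable is the maximum number or percentage of Pods that may be unavailable during voluntary disruptions. It requires Kubernetes version >= 1.7 and can only be used for Pods belonging to a Deployment, RS, RC, or StatefulSet. Prefer .spec.maxUnavailable where possible.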

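Note:
A single PDB object cannot define both .spec.minAvailable and .spec.maxUnavailable.
As mentioned earlier, Pod deletion and unavailability during an application's rolling update also count as voluntary disruptions, but rolling updates are governed by their own strategy (maxSurge and maxUnavailable), so the PDB does not interfere with that process.
A PDB can only guarantee the replica count during voluntary disruptions. For example, if evictions have brought the application to exactly the .spec.minAvailable or .spec.maxUnavailable limit and a previously healthy Pod then dies because its node goes down (an involuntary disruption), the actual number of Pods drops below what the PDB requires. PDBs are not a cure-all.

In practice, setting .spec.minAvailable to 100% or .spec.maxUnavailable to 0% blocks Pod eviction entirely (rolling updates of Deployments and StatefulSets excepted).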
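The two rules above (the fields are mutually exclusive, and a zero budget blocks every eviction) can be expressed as a small validation sketch in Go. `budgetSpec`, `desiredHealthy`, and `intPtr` are hypothetical names for this example, the spec is simplified to absolute Pod counts, and rejecting an empty spec is a choice of the sketch rather than a statement about Kubernetes.

```go
package main

import (
	"errors"
	"fmt"
)

// budgetSpec is a hypothetical, simplified PDB spec using absolute counts.
// A nil pointer means the field is not set.
type budgetSpec struct {
	minAvailable   *int
	maxUnavailable *int
}

var (
	errBothSet = errors.New("minAvailable and maxUnavailable cannot both be set")
	errNoneSet = errors.New("sketch-only rule: set minAvailable or maxUnavailable")
)

// desiredHealthy validates the spec and returns how many of the expected
// Pods must stay healthy during voluntary disruptions.
func desiredHealthy(s budgetSpec, expected int) (int, error) {
	switch {
	case s.minAvailable != nil && s.maxUnavailable != nil:
		return 0, errBothSet
	case s.minAvailable != nil:
		return *s.minAvailable, nil
	case s.maxUnavailable != nil:
		d := expected - *s.maxUnavailable
		if d < 0 {
			d = 0
		}
		return d, nil
	default:
		return 0, errNoneSet
	}
}

// intPtr is a small helper for building specs in the examples.
func intPtr(v int) *int { return &v }

func main() {
	d, _ := desiredHealthy(budgetSpec{maxUnavailable: intPtr(0)}, 3)
	fmt.Printf("maxUnavailable=0 with 3 Pods: %d must stay healthy, %d may be evicted\n", d, 3-d)

	_, err := desiredHealthy(budgetSpec{minAvailable: intPtr(2), maxUnavailable: intPtr(1)}, 3)
	fmt.Println("both fields set:", err)
}
```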

Creating a PDB Object

Create the PDB object with kubectl apply -f zk-pdb.yaml:

$ kubectl get poddisruptionbudgets
NAME      MIN-AVAILABLE   ALLOWED-DISRUPTIONS   AGE
zk-pdb    2               1                     7s

View it with kubectl get pdb zk-pdb -o yaml:

$ kubectl get poddisruptionbudgets zk-pdb -o yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: 2017-08-28T02:38:26Z
  generation: 1
  name: zk-pdb
...
status:
  currentHealthy: 3
  desiredHealthy: 2
  disruptedPods: null
  disruptionsAllowed: 1
  expectedPods: 3
  observedGeneration: 1

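How PDBs Work: Source Code Analysis

A PDB object defines the state the user expects during voluntary disruptions. That state is maintained by the Disruption Controller, a controller managed by kube-controller-manager.

The Disruption Controller mainly watches Pods and PDBs. When it sees an Add/Del/Update event for a pod or pdb, it puts the corresponding pdb object into a rate-limited queue for a worker to process. The worker's main job is to compute the PodDisruptionBudgetStatus fields currentHealthy, desiredHealthy, expectedCount, and disruptedPods, and then call the API to update the PDB status.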

pkg/controller/disruption/disruption.go:498

func (dc *DisruptionController) trySync(pdb *policy.PodDisruptionBudget) error {
    pods, err := dc.getPodsForPdb(pdb)
    if err != nil {
        dc.recorder.Eventf(pdb, v1.EventTypeWarning, "NoPods", "Failed to get pods: %v", err)
        return err
    }
    if len(pods) == 0 {
        dc.recorder.Eventf(pdb, v1.EventTypeNormal, "NoPods", "No matching pods found")
    }

    expectedCount, desiredHealthy, err := dc.getExpectedPodCount(pdb, pods)
    if err != nil {
        dc.recorder.Eventf(pdb, v1.EventTypeWarning, "CalculateExpectedPodCountFailed", "Failed to calculate the number of expected pods: %v", err)
        return err
    }

    currentTime := time.Now()
    disruptedPods, recheckTime := dc.buildDisruptedPodMap(pods, pdb, currentTime)
    currentHealthy := countHealthyPods(pods, disruptedPods, currentTime)
    err = dc.updatePdbStatus(pdb, currentHealthy, desiredHealthy, expectedCount, disruptedPods)

    if err == nil && recheckTime != nil {
        // There is always at most one PDB waiting with a particular name in the queue,
        // and each PDB in the queue is associated with the lowest timestamp
        // that was supplied when a PDB with that name was added.
        dc.enqueuePdbForRecheck(pdb, recheckTime.Sub(currentTime))
    }
    return err
}

Here is the definition of PodDisruptionBudgetStatus:

pkg/apis/policy/types.go:48

type PodDisruptionBudgetStatus struct {
    // Most recent generation observed when updating this PDB status. PodDisruptionsAllowed and other
    // status informatio is valid only if observedGeneration equals to PDB's object generation.
    // +optional
    ObservedGeneration int64 `json:"observedGeneration,omitempty" protobuf:"varint,1,opt,name=observedGeneration"`

    // DisruptedPods contains information about pods whose eviction was
    // processed by the API server eviction subresource handler but has not
    // yet been observed by the PodDisruptionBudget controller.
    // A pod will be in this map from the time when the API server processed the
    // eviction request to the time when the pod is seen by PDB controller
    // as having been marked for deletion (or after a timeout). The key in the map is the name of the pod
    // and the value is the time when the API server processed the eviction request. If
    // the deletion didn't occur and a pod is still there it will be removed from
    // the list automatically by PodDisruptionBudget controller after some time.
    // If everything goes smooth this map should be empty for the most of the time.
    // Large number of entries in the map may indicate problems with pod deletions.
    DisruptedPods map[string]metav1.Time `json:"disruptedPods" protobuf:"bytes,2,rep,name=disruptedPods"`

    // Number of pod disruptions that are currently allowed.
    PodDisruptionsAllowed int32 `json:"disruptionsAllowed" protobuf:"varint,3,opt,name=disruptionsAllowed"`

    // current number of healthy pods
    CurrentHealthy int32 `json:"currentHealthy" protobuf:"varint,4,opt,name=currentHealthy"`

    // minimum desired number of healthy pods
    DesiredHealthy int32 `json:"desiredHealthy" protobuf:"varint,5,opt,name=desiredHealthy"`

    // total number of pods counted by this disruption budget
    ExpectedPods int32 `json:"expectedPods" protobuf:"varint,6,opt,name=expectedPods"`
}

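The most important fields of PodDisruptionBudgetStatus are DisruptedPods and PodDisruptionsAllowed:

DisruptedPods: holds Pods whose eviction has been processed by the apiserver's pod eviction subresource but has not yet been observed by the PDB Controller. It is a map whose key is the Pod name and whose value is the time the apiserver accepted the eviction request. Each entry has a 2-minute timeout: if the Pod still has not been deleted after 2 minutes, it is removed from the map.
PodDisruptionsAllowed: the number of Pods that may currently be disrupted.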
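The 2-minute expiry of DisruptedPods entries can be illustrated with a small standard-library Go sketch. It is a simplified model, not the controller's actual buildDisruptedPodMap: `pruneDisrupted` and `demoPrune` are hypothetical names, the map uses plain time.Time values, and the real controller takes more Pod state into account than this does.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// deletionTimeout models the 2-minute timeout described above.
const deletionTimeout = 2 * time.Minute

// pruneDisrupted drops entries whose eviction was accepted more than
// deletionTimeout ago. It returns the surviving entries and the earliest
// time at which one of them will expire (zero time if none survive).
func pruneDisrupted(disrupted map[string]time.Time, now time.Time) (map[string]time.Time, time.Time) {
	kept := make(map[string]time.Time)
	var recheck time.Time
	for pod, evictedAt := range disrupted {
		expiry := evictedAt.Add(deletionTimeout)
		if !expiry.After(now) {
			// Timed out: the Pod was never deleted, so stop treating it as disrupted.
			continue
		}
		kept[pod] = evictedAt
		if recheck.IsZero() || expiry.Before(recheck) {
			recheck = expiry
		}
	}
	return kept, recheck
}

// demoPrune runs pruneDisrupted on two example entries and returns the
// surviving Pod names plus how long until the next entry expires.
func demoPrune() ([]string, time.Duration) {
	now := time.Date(2017, 8, 28, 2, 40, 0, 0, time.UTC)
	disrupted := map[string]time.Time{
		"zk-0": now.Add(-3 * time.Minute),  // accepted 3 minutes ago: timed out
		"zk-1": now.Add(-30 * time.Second), // accepted 30 seconds ago: still pending
	}
	kept, recheck := pruneDisrupted(disrupted, now)
	names := make([]string, 0, len(kept))
	for name := range kept {
		names = append(names, name)
	}
	sort.Strings(names)
	return names, recheck.Sub(now)
}

func main() {
	names, wait := demoPrune()
	fmt.Printf("still disrupted: %v, recheck in %s\n", names, wait)
}
```

The returned recheck time plays the role that recheckTime plays in trySync: it tells the caller when the status should be looked at again even if no new event arrives.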

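The Disruption Controller's main job is to update PDB.Status. So who actually enforces maxUnavailable or minAvailable when Pods are evicted during a voluntary disruption?

Note again that the PDB Controller only deals with Pods that are requested through the pod eviction subresource, so the answer lies in the Pod's EvictionREST.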

pkg/registry/core/pod/storage/eviction.go:81

// Create attempts to create a new eviction.  That is, it tries to evict a pod.
func (r *EvictionREST) Create(ctx genericapirequest.Context, obj runtime.Object, createValidation rest.ValidateObjectFunc, includeUninitialized bool) (runtime.Object, error) {
    eviction := obj.(*policy.Eviction)

    obj, err := r.store.Get(ctx, eviction.Name, &metav1.GetOptions{})
    if err != nil {
        return nil, err
    }
    pod := obj.(*api.Pod)
    var rtStatus *metav1.Status
    var pdbName string
    err = retry.RetryOnConflict(EvictionsRetry, func() error {
        pdbs, err := r.getPodDisruptionBudgets(ctx, pod)
        if err != nil {
            return err
        }

        if len(pdbs) > 1 {
            rtStatus = &metav1.Status{
                Status:  metav1.StatusFailure,
                Message: "This pod has more than one PodDisruptionBudget, which the eviction subresource does not support.",
                Code:    500,
            }
            return nil
        } else if len(pdbs) == 1 {
            pdb := pdbs[0]
            pdbName = pdb.Name
            // Try to verify-and-decrement

            // If it was false already, or if it becomes false during the course of our retries,
            // raise an error marked as a 429.
            if err := r.checkAndDecrement(pod.Namespace, pod.Name, pdb); err != nil {
                return err
            }
        }
        return nil
    })
    if err == wait.ErrWaitTimeout {
        err = errors.NewTimeoutError(fmt.Sprintf("couldn't update PodDisruptionBudget %q due to conflicts", pdbName), 10)
    }
    if err != nil {
        return nil, err
    }

    if rtStatus != nil {
        return rtStatus, nil
    }

    // At this point there was either no PDB or we succeded in decrementing

    // Try the delete
    _, _, err = r.store.Delete(ctx, eviction.Name, eviction.DeleteOptions)
    if err != nil {
        return nil, err
    }

    // Success!
    return &metav1.Status{Status: metav1.StatusSuccess}, nil
}

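When a pod eviction is requested through EvictionREST, it checks that the pod is covered by at most one PDB and returns an error otherwise. For details on using the Eviction API, see The Eviction API; only a simple sample is given here: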

{
  "apiVersion": "policy/v1beta1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}

$ curl -v -H 'Content-type: application/json' http://127.0.0.1:8080/api/v1/namespaces/default/pods/quux/eviction -d @eviction.json
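The same request can be issued from Go with only the standard library. The sketch below is illustrative: `evict`, `evictionPath`, and `demo` are hypothetical names, and to keep it runnable without a cluster it talks to a fake in-process server that accepts one eviction and then answers 429, which is the status the eviction handler uses when the budget is exhausted. The fake server's 200 response and the missing authentication and TLS are assumptions of the sketch, not a description of a real API server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// eviction mirrors the JSON body shown above.
type eviction struct {
	APIVersion string            `json:"apiVersion"`
	Kind       string            `json:"kind"`
	Metadata   map[string]string `json:"metadata"`
}

// evictionPath builds the eviction subresource path for a Pod.
func evictionPath(namespace, pod string) string {
	return "/api/v1/namespaces/" + namespace + "/pods/" + pod + "/eviction"
}

// evict POSTs an Eviction object for the Pod and returns the HTTP status code.
func evict(baseURL, namespace, pod string) (int, error) {
	body, err := json.Marshal(eviction{
		APIVersion: "policy/v1beta1",
		Kind:       "Eviction",
		Metadata:   map[string]string{"name": pod, "namespace": namespace},
	})
	if err != nil {
		return 0, err
	}
	resp, err := http.Post(baseURL+evictionPath(namespace, pod), "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// demo evicts the same Pod twice against a fake server whose budget allows
// exactly one disruption, and returns both status codes.
func demo() (first, second int) {
	var mu sync.Mutex
	allowed := 1
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		if r.Method != http.MethodPost || r.URL.Path != evictionPath("default", "quux") {
			w.WriteHeader(http.StatusNotFound)
			return
		}
		if allowed == 0 {
			w.WriteHeader(http.StatusTooManyRequests) // budget exhausted
			return
		}
		allowed--
		w.WriteHeader(http.StatusOK) // fake success response
	}))
	defer srv.Close()

	first, _ = evict(srv.URL, "default", "quux")
	second, _ = evict(srv.URL, "default", "quux")
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println("first eviction:", first, "second eviction:", second)
}
```

A client that receives 429 would normally wait and retry, since the budget may open up again once replacement Pods become healthy.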

checkAndDecrement then checks whether the PDB's maxUnavailable or minAvailable is satisfied; if so, it decrements pdb.Status.PodDisruptionsAllowed by 1.
If checkAndDecrement succeeds, the corresponding Pod is actually deleted.

// checkAndDecrement checks if the provided PodDisruptionBudget allows any disruption.
func (r *EvictionREST) checkAndDecrement(namespace string, podName string, pdb policy.PodDisruptionBudget) error {
    if pdb.Status.ObservedGeneration < pdb.Generation {
        // TODO(mml): Add a Retry-After header.  Once there are time-based
        // budgets, we can sometimes compute a sensible suggested value.  But
        // even without that, we can give a suggestion (10 minutes?) that
        // prevents well-behaved clients from hammering us.
        err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
        err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{Type: "DisruptionBudget", Message: fmt.Sprintf("The disruption budget %s is still being processed by the server.", pdb.Name)})
        return err
    }
    if pdb.Status.PodDisruptionsAllowed < 0 {
        return errors.NewForbidden(policy.Resource("poddisruptionbudget"), pdb.Name, fmt.Errorf("pdb disruptions allowed is negative"))
    }
    if len(pdb.Status.DisruptedPods) > MaxDisruptedPodSize {
        return errors.NewForbidden(policy.Resource("poddisruptionbudget"), pdb.Name, fmt.Errorf("DisruptedPods map too big - too many evictions not confirmed by PDB controller"))
    }
    if pdb.Status.PodDisruptionsAllowed == 0 {
        err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
        err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{Type: "DisruptionBudget", Message: fmt.Sprintf("The disruption budget %s needs %d healthy pods and has %d currently", pdb.Name, pdb.Status.DesiredHealthy, pdb.Status.CurrentHealthy)})
        return err
    }

    pdb.Status.PodDisruptionsAllowed--
    if pdb.Status.DisruptedPods == nil {
        pdb.Status.DisruptedPods = make(map[string]metav1.Time)
    }
    // Eviction handler needs to inform the PDB controller that it is about to delete a pod
    // so it should not consider it as available in calculations when updating PodDisruptions allowed.
    // If the pod is not deleted within a reasonable time limit PDB controller will assume that it won't
    // be deleted at all and remove it from DisruptedPod map.
    pdb.Status.DisruptedPods[podName] = metav1.Time{Time: time.Now()}
    if _, err := r.podDisruptionBudgetClient.PodDisruptionBudgets(namespace).UpdateStatus(&pdb); err != nil {
        return err
    }

    return nil
}

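checkAndDecrement mainly checks that pdb.Status.PodDisruptionsAllowed is greater than 0 and that DisruptedPods holds no more than 2000 Pods (the Disruption Controller's performance may not cope with more).
If the checks pass, it decrements pdb.Status.PodDisruptionsAllowed by 1 and adds the Pod to the DisruptedPods map, with the current time (when the apiserver accepted the eviction request) as the value.
It then updates the PDB. Because the PDB Controller watches PDB Update events, this triggers the controller's logic, which recomputes the PDB status.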
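To tie the two halves together, the following Go sketch simulates the loop in memory: the eviction side decrements the budget, and the controller side recomputes it once evicted Pods are gone and replacements become healthy. It is a toy model under stated assumptions, not the real implementation: the `cluster` type and its methods are hypothetical, timestamps and API calls are left out, and the allowed count is taken to be currentHealthy minus desiredHealthy, floored at zero.

```go
package main

import "fmt"

// cluster is a hypothetical in-memory model of one PDB and its Pods.
type cluster struct {
	desiredHealthy int             // derived from minAvailable / maxUnavailable
	healthy        int             // current number of healthy Pods
	allowed        int             // models PodDisruptionsAllowed
	disrupted      map[string]bool // models DisruptedPods (timestamps omitted)
}

// evict models the eviction side: it succeeds only while the budget is
// positive, then decrements it and records the Pod as disrupted.
func (c *cluster) evict(pod string) bool {
	if c.allowed <= 0 {
		return false // the real handler answers 429 here
	}
	c.allowed--
	c.disrupted[pod] = true
	return true
}

// sync models the controller side: evicted Pods are observed as gone,
// `replaced` new Pods have become healthy, and the budget is recomputed.
func (c *cluster) sync(replaced int) {
	c.healthy = c.healthy - len(c.disrupted) + replaced
	c.disrupted = map[string]bool{}
	c.allowed = c.healthy - c.desiredHealthy
	if c.allowed < 0 {
		c.allowed = 0
	}
}

// demo drains two Pods of a 3-replica application with minAvailable=2
// and returns the outcome of each eviction attempt in order.
func demo() []bool {
	c := &cluster{desiredHealthy: 2, healthy: 3, allowed: 1, disrupted: map[string]bool{}}
	var results []bool
	results = append(results, c.evict("zk-0")) // budget 1 -> accepted
	results = append(results, c.evict("zk-1")) // budget 0 -> rejected
	c.sync(0)                                  // zk-0 gone, no replacement yet
	results = append(results, c.evict("zk-1")) // still rejected
	c.sync(1)                                  // replacement Pod became healthy
	results = append(results, c.evict("zk-1")) // budget restored -> accepted
	return results
}

func main() {
	fmt.Println(demo())
}
```

The sequence shows why a drain can appear to stall: the second eviction is refused until a replacement Pod is healthy and the controller has recomputed the budget.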