
How Kubernetes Preempts Resources for Critical Pods

Published: 2025-02-04  Author: 千家信息网 editor

Last updated by 千家信息网: 2025-02-04


This article explains how Kubernetes preempts resources on behalf of critical pods, a situation many operators run into in practice. It walks through the relevant source code step by step.

How kubelet preempts resources for critical pods during Predicate Admit

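During the Predicate Admit flow, kubelet runs a series of predicate admission checks on each pod, including GeneralPredicates, which verifies that the node has enough cpu, mem, and gpu resources. If GeneralPredicates fails, a non-critical pod is rejected immediately. A CriticalPod instead triggers kubelet preemption: kubelet kills some pods according to a set of rules to free resources, and if the preemption succeeds the pod is admitted.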

The flow starts in kubelet initialization.

pkg/kubelet/kubelet.go:315

// NewMainKubelet instantiates a new Kubelet object along with all the required internal modules.
// No initialization of Kubelet and its modules should happen here.
func NewMainKubelet(...) (*Kubelet, error) {
    ...
    criticalPodAdmissionHandler := preemption.NewCriticalPodAdmissionHandler(klet.GetActivePods, killPodNow(klet.podWorkers, kubeDeps.Recorder), kubeDeps.Recorder)
    klet.admitHandlers.AddPodAdmitHandler(lifecycle.NewPredicateAdmitHandler(klet.getNodeAnyWay, criticalPodAdmissionHandler, klet.containerManager.UpdatePluginResources))
    // apply functional Option's
    for _, opt := range kubeDeps.Options {
        opt(klet)
    }
    ...
    return klet, nil
}

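When NewMainKubelet initializes kubelet, it registers a predicate admit handler through AddPodAdmitHandler and passes criticalPodAdmissionHandler to it as the admission failure handler. The special treatment of CriticalPod admission lives in criticalPodAdmissionHandler.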

Next, look at kubelet's predicateAdmitHandler to see what happens after GeneralPredicates fails.

pkg/kubelet/lifecycle/predicate.go:58

func (w *predicateAdmitHandler) Admit(attrs *PodAdmitAttributes) PodAdmitResult {
    ...
    fit, reasons, err := predicates.GeneralPredicates(podWithoutMissingExtendedResources, nil, nodeInfo)
    if err != nil {
        message := fmt.Sprintf("GeneralPredicates failed due to %v, which is unexpected.", err)
        glog.Warningf("Failed to admit pod %v - %s", format.Pod(pod), message)
        return PodAdmitResult{
            Admit:   fit,
            Reason:  "UnexpectedAdmissionError",
            Message: message,
        }
    }
    if !fit {
        fit, reasons, err = w.admissionFailureHandler.HandleAdmissionFailure(pod, reasons)
        if err != nil {
            message := fmt.Sprintf("Unexpected error while attempting to recover from admission failure: %v", err)
            glog.Warningf("Failed to admit pod %v - %s", format.Pod(pod), message)
            return PodAdmitResult{
                Admit:   fit,
                Reason:  "UnexpectedAdmissionError",
                Message: message,
            }
        }
    }
    ...
    return PodAdmitResult{
        Admit: true,
    }
}

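When predicateAdmitHandler runs GeneralPredicates to check cpu, mem, and gpu resources and the pod does not fit, it calls HandleAdmissionFailure for additional handling. As noted above, kubelet registered criticalPodAdmissionHandler as the admissionFailureHandler during initialization, so that is the HandleAdmissionFailure implementation invoked here.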
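The fallback described above can be sketched in a few lines. This is a minimal model, not kubelet code: the Pod struct, the FailureHandler type, and the admit and criticalOnly functions are hypothetical names, and the sketch assumes a single resource (CPU in millicores) to keep the control flow visible.

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down stand-in for v1.Pod.
type Pod struct {
	Name     string
	Critical bool
	CPUMilli int64 // requested CPU in millicores
}

// FailureHandler plays the role of admissionFailureHandler: it gets one
// chance to rescue a pod that failed the resource check.
type FailureHandler func(pod Pod, shortfall int64) bool

// admit models the control flow of the predicate admit step for a single
// resource: check the fit first, and consult the handler only on failure.
func admit(pod Pod, freeCPUMilli int64, onFailure FailureHandler) bool {
	if pod.CPUMilli <= freeCPUMilli {
		return true
	}
	return onFailure(pod, pod.CPUMilli-freeCPUMilli)
}

// criticalOnly rescues critical pods and rejects everything else. A real
// handler would evict pods to free the shortfall; this sketch only
// reports the decision.
func criticalOnly(pod Pod, shortfall int64) bool {
	return pod.Critical
}

func main() {
	free := int64(500)
	fmt.Println(admit(Pod{Name: "web", CPUMilli: 800}, free, criticalOnly))
	fmt.Println(admit(Pod{Name: "dns", Critical: true, CPUMilli: 800}, free, criticalOnly))
	fmt.Println(admit(Pod{Name: "small", CPUMilli: 200}, free, criticalOnly))
}
```

Only the pod that fails the fit check reaches the handler, which is why non-critical pods that fit are unaffected by the critical-pod logic.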

The CriticalPodAdmissionHandler struct is defined as follows:

pkg/kubelet/preemption/preemption.go:41

type CriticalPodAdmissionHandler struct {
    getPodsFunc eviction.ActivePodsFunc
    killPodFunc eviction.KillPodFunc
    recorder    record.EventRecorder
}

The HandleAdmissionFailure method of CriticalPodAdmissionHandler is where the CriticalPod-specific logic lives.

pkg/kubelet/preemption/preemption.go:66

// HandleAdmissionFailure gracefully handles admission rejection, and, in some cases,
// to allow admission of the pod despite its previous failure.
func (c *CriticalPodAdmissionHandler) HandleAdmissionFailure(pod *v1.Pod, failureReasons []algorithm.PredicateFailureReason) (bool, []algorithm.PredicateFailureReason, error) {
    if !kubetypes.IsCriticalPod(pod) || !utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) {
        return false, failureReasons, nil
    }
    // InsufficientResourceError is not a reason to reject a critical pod.
    // Instead of rejecting, we free up resources to admit it, if no other reasons for rejection exist.
    nonResourceReasons := []algorithm.PredicateFailureReason{}
    resourceReasons := []*admissionRequirement{}
    for _, reason := range failureReasons {
        if r, ok := reason.(*predicates.InsufficientResourceError); ok {
            resourceReasons = append(resourceReasons, &admissionRequirement{
                resourceName: r.ResourceName,
                quantity:     r.GetInsufficientAmount(),
            })
        } else {
            nonResourceReasons = append(nonResourceReasons, reason)
        }
    }
    if len(nonResourceReasons) > 0 {
        // Return only reasons that are not resource related, since critical pods cannot fail admission for resource reasons.
        return false, nonResourceReasons, nil
    }
    err := c.evictPodsToFreeRequests(admissionRequirementList(resourceReasons))
    // if no error is returned, preemption succeeded and the pod is safe to admit.
    return err == nil, nil, err
}

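If the pod is not a CriticalPod, or the ExperimentalCriticalPodAnnotation feature gate is disabled, the handler returns false and admission fails.
Otherwise it splits failureReasons into predicates.InsufficientResourceError entries and everything else. If any non-resource reason exists, admission still fails and only those reasons are returned. If every reason is a resource shortfall, it calls evictPodsToFreeRequests to trigger kubelet preemption. Note that this preemption is different from scheduler preemption; do not confuse the two.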

evictPodsToFreeRequests implements the resource preemption itself. Its core is a call to getPodsToPreempt, which selects the pods to kill (podsToPreempt).

pkg/kubelet/preemption/preemption.go:121

// getPodsToPreempt returns a list of pods that could be preempted to free requests >= requirements
func getPodsToPreempt(pods []*v1.Pod, requirements admissionRequirementList) ([]*v1.Pod, error) {
    bestEffortPods, burstablePods, guaranteedPods := sortPodsByQOS(pods)

    // make sure that pods exist to reclaim the requirements
    unableToMeetRequirements := requirements.subtract(append(append(bestEffortPods, burstablePods...), guaranteedPods...)...)
    if len(unableToMeetRequirements) > 0 {
        return nil, fmt.Errorf("no set of running pods found to reclaim resources: %v", unableToMeetRequirements.toString())
    }
    // find the guaranteed pods we would need to evict if we already evicted ALL burstable and besteffort pods.
    guarateedToEvict, err := getPodsToPreemptByDistance(guaranteedPods, requirements.subtract(append(bestEffortPods, burstablePods...)...))
    if err != nil {
        return nil, err
    }
    // Find the burstable pods we would need to evict if we already evicted ALL besteffort pods, and the required guaranteed pods.
    burstableToEvict, err := getPodsToPreemptByDistance(burstablePods, requirements.subtract(append(bestEffortPods, guarateedToEvict...)...))
    if err != nil {
        return nil, err
    }
    // Find the besteffort pods we would need to evict if we already evicted the required guaranteed and burstable pods.
    bestEffortToEvict, err := getPodsToPreemptByDistance(bestEffortPods, requirements.subtract(append(burstableToEvict, guarateedToEvict...)...))
    if err != nil {
        return nil, err
    }
    return append(append(bestEffortToEvict, burstableToEvict...), guarateedToEvict...), nil
}

During kubelet preemption, the pods to kill are selected as follows:

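If the amount of some resource that must be freed (the shortfall reported by InsufficientResourceError) exceeds the total requests for that resource across all bestEffortPods, burstablePods, and guaranteedPods, podsToPreempt is nil, meaning no set of pods can free enough.
If freeing all bestEffortPods and burstablePods is still not enough, guaranteedPods are selected as well (guarateedToEvict). The selection rules are:

Rule 1: evict as few pods as possible;
Rule 2: free as few resources as possible;
Rule 1 takes precedence over Rule 2;

If freeing all bestEffortPods plus guarateedToEvict is still not enough, burstablePods are selected as well (burstableToEvict), using the same rules.
If freeing burstableToEvict plus guarateedToEvict is still not enough, bestEffortPods are selected as well (bestEffortToEvict), using the same rules.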

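In other words, pods in a lower QoS class are preempted first, and within a single QoS level pods are selected by these rules:

Rule 1: evict as few pods as possible;
Rule 2: free as few resources as possible;
Rule 1 takes precedence over Rule 2;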

How the Priority Admission Controller treats CriticalPods

First, consider the special, system-reserved kinds of CriticalPod:

ClusterCriticalPod: a pod whose PriorityClass name is system-cluster-critical.
NodeCriticalPod: a pod whose PriorityClass name is system-node-critical.

If the Priority Admission Controller is enabled among the admission controllers, the priority check performed at pod creation also treats CriticalPods specially.

The main job of the Priority Admission Controller is to resolve the PriorityClassName specified on a pod into the corresponding numeric Spec.Priority value.

plugin/pkg/admission/priority/admission.go:138

// admitPod makes sure a new pod does not set spec.Priority field. It also makes sure that the PriorityClassName exists if it is provided and resolves the pod priority from the PriorityClassName.
func (p *priorityPlugin) admitPod(a admission.Attributes) error {
    operation := a.GetOperation()
    pod, ok := a.GetObject().(*api.Pod)
    if !ok {
        return errors.NewBadRequest("resource was marked with kind Pod but was unable to be converted")
    }
    // Make sure that the client has not set `priority` at the time of pod creation.
    if operation == admission.Create && pod.Spec.Priority != nil {
        return admission.NewForbidden(a, fmt.Errorf("the integer value of priority must not be provided in pod spec. Priority admission controller populates the value from the given PriorityClass name"))
    }
    if utilfeature.DefaultFeatureGate.Enabled(features.PodPriority) {
        var priority int32
        // TODO: @ravig - This is for backwards compatibility to ensure that critical pods with annotations just work fine.
        // Remove when no longer needed.
        if len(pod.Spec.PriorityClassName) == 0 &&
            utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
            kubelettypes.IsCritical(a.GetNamespace(), pod.Annotations) {
            pod.Spec.PriorityClassName = scheduling.SystemClusterCritical
        }
        if len(pod.Spec.PriorityClassName) == 0 {
            var err error
            priority, err = p.getDefaultPriority()
            if err != nil {
                return fmt.Errorf("failed to get default priority class: %v", err)
            }
        } else {
            // Try resolving the priority class name.
            pc, err := p.lister.Get(pod.Spec.PriorityClassName)
            if err != nil {
                if errors.IsNotFound(err) {
                    return admission.NewForbidden(a, fmt.Errorf("no PriorityClass with name %v was found", pod.Spec.PriorityClassName))
                }
                return fmt.Errorf("failed to get PriorityClass with name %s: %v", pod.Spec.PriorityClassName, err)
            }
            priority = pc.Value
        }
        pod.Spec.Priority = &priority
    }
    return nil
}
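The resolution order in admitPod can be condensed into a standalone sketch. The names here are hypothetical: PodSpec is a reduced stand-in for the pod spec, CriticalAnnotation replaces the kubelettypes.IsCritical check, the classes map replaces the PriorityClass lister, and both feature gates are assumed to be enabled. The two numeric values in main are the upstream defaults for the built-in classes; the default priority of 0 is illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

const systemClusterCritical = "system-cluster-critical"

// PodSpec is a hypothetical, reduced stand-in for the pod spec.
type PodSpec struct {
	PriorityClassName  string
	Priority           *int32
	CriticalAnnotation bool // assumed result of the legacy annotation check
}

// resolvePriority mirrors the order of checks in admitPod: reject a
// client-set priority, map the legacy annotation to a class name, then
// resolve the class name (or fall back to the default priority).
func resolvePriority(spec *PodSpec, classes map[string]int32, defaultPriority int32) error {
	if spec.Priority != nil {
		return errors.New("the integer value of priority must not be provided in pod spec")
	}
	if spec.PriorityClassName == "" && spec.CriticalAnnotation {
		spec.PriorityClassName = systemClusterCritical
	}
	priority := defaultPriority
	if spec.PriorityClassName != "" {
		value, ok := classes[spec.PriorityClassName]
		if !ok {
			return fmt.Errorf("no PriorityClass with name %v was found", spec.PriorityClassName)
		}
		priority = value
	}
	spec.Priority = &priority
	return nil
}

func main() {
	classes := map[string]int32{
		"system-cluster-critical": 2000000000,
		"system-node-critical":    2000001000,
	}

	legacy := &PodSpec{CriticalAnnotation: true}
	if err := resolvePriority(legacy, classes, 0); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(legacy.PriorityClassName, *legacy.Priority)

	unknown := &PodSpec{PriorityClassName: "does-not-exist"}
	fmt.Println(resolvePriority(unknown, classes, 0))
}
```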

When all of the following conditions hold, the pod's Spec.PriorityClassName is set to system-cluster-critical, that is, the pod is treated as a ClusterCriticalPod.