cluster-proportional-autoscaler Source Code Analysis and How to Solve the KubeDNS Performance Bottleneck
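How It Works
cluster-proportional-autoscaler is one of the Kubernetes incubator projects. It scales a target in a given namespace up or down according to the size of the cluster. Only RC, RS, and Deployment targets are supported; StatefulSet is not supported yet. Two autoscale modes are currently available, linear and ladder, and because the code interfaces are clean it is easy to develop a custom mode.
The mechanism is simple. At a fixed interval (set with --poll-period-seconds, 10s by default) cluster-proportional-autoscaler repeats the following steps: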
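Count the SchedulableNodes and SchedulableCores in the cluster;
Fetch the latest ConfigMap data from the apiserver;
Parse the ConfigMap parameters according to the configured autoscale mode;
Compute the new expected replica count according to that mode;
If the expected count differs from the target's current replica count, call the Scale interface to trigger the scaling.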
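The loop above can be sketched in a few lines of Go. This is only an illustration under assumptions: `runLoop` and `simulate` are hypothetical names, the use of `time.Ticker` is a guess at one reasonable way to drive the loop rather than the project's actual code, and the step function stands in for one pass of the five steps. The one behavior taken from the source is that a failed round simply returns and the next round starts from scratch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// runLoop waits for each tick of period and then runs step, for a fixed
// number of rounds. A failing round is counted and abandoned; the next
// tick starts over, much as pollAPIServer returns early on any error.
func runLoop(period time.Duration, rounds int, step func() error) (failures int) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for i := 0; i < rounds; i++ {
		<-ticker.C
		if err := step(); err != nil {
			failures++
		}
	}
	return failures
}

// simulate runs three short rounds in which the second one fails.
func simulate() (calls, failures int) {
	step := func() error {
		calls++
		if calls == 2 {
			return errors.New("apiserver unavailable")
		}
		return nil
	}
	failures = runLoop(time.Millisecond, 3, step)
	return calls, failures
}

func main() {
	calls, failures := simulate()
	fmt.Printf("rounds=%d failures=%d\n", calls, failures)
}
```

A failed round has no lasting effect in this sketch: the third round still runs even though the second one returned an error.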
Configuration
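cluster-proportional-autoscaler has the following six flags:
--namespace: the namespace of the object to autoscale;
--target: the object to autoscale; only deployment/replicationcontroller/replicaset are supported, case-insensitive;
--configmap: a ConfigMap created in advance that stores the mode to use and its parameters; concrete examples follow later;
--default-params: if the ConfigMap named by --configmap does not exist or is deleted later, this value is used to create a new ConfigMap; setting it is recommended;
--poll-period-seconds: the check interval, 10s by default;
--version: print the version and exit.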
Source Code Analysis
pollAPIServer
pkg/autoscaler/autoscaler_server.go:82

func (s *AutoScaler) pollAPIServer() {
    // Query the apiserver for the cluster status --- number of nodes and cores
    clusterStatus, err := s.k8sClient.GetClusterStatus()
    if err != nil {
        glog.Errorf("Error while getting cluster status: %v", err)
        return
    }
    glog.V(4).Infof("Total nodes %5d, schedulable nodes: %5d", clusterStatus.TotalNodes, clusterStatus.SchedulableNodes)
    glog.V(4).Infof("Total cores %5d, schedulable cores: %5d", clusterStatus.TotalCores, clusterStatus.SchedulableCores)

    // Sync autoscaler ConfigMap with apiserver
    configMap, err := s.syncConfigWithServer()
    if err != nil || configMap == nil {
        glog.Errorf("Error syncing configMap with apiserver: %v", err)
        return
    }

    // Only sync updated ConfigMap or before controller is set.
    if s.controller == nil || configMap.ObjectMeta.ResourceVersion != s.controller.GetParamsVersion() {
        // Ensure corresponding controller type and scaling params.
        s.controller, err = plugin.EnsureController(s.controller, configMap)
        if err != nil || s.controller == nil {
            glog.Errorf("Error ensuring controller: %v", err)
            return
        }
    }

    // Query the controller for the expected replicas number
    expReplicas, err := s.controller.GetExpectedReplicas(clusterStatus)
    if err != nil {
        glog.Errorf("Error calculating expected replicas number: %v", err)
        return
    }
    glog.V(4).Infof("Expected replica count: %3d", expReplicas)

    // Update resource target with expected replicas.
    _, err = s.k8sClient.UpdateReplicas(expReplicas)
    if err != nil {
        glog.Errorf("Update failure: %s", err)
    }
}
GetClusterStatus
GetClusterStatus counts the SchedulableNodes and SchedulableCores in the cluster; both values are used later to compute the new expected replica count.
pkg/autoscaler/k8sclient/k8sclient.go:142

func (k *k8sClient) GetClusterStatus() (clusterStatus *ClusterStatus, err error) {
    opt := metav1.ListOptions{Watch: false}

    nodes, err := k.clientset.CoreV1().Nodes().List(opt)
    if err != nil || nodes == nil {
        return nil, err
    }
    clusterStatus = &ClusterStatus{}
    clusterStatus.TotalNodes = int32(len(nodes.Items))
    var tc resource.Quantity
    var sc resource.Quantity
    for _, node := range nodes.Items {
        tc.Add(node.Status.Capacity[apiv1.ResourceCPU])
        if !node.Spec.Unschedulable {
            clusterStatus.SchedulableNodes++
            sc.Add(node.Status.Capacity[apiv1.ResourceCPU])
        }
    }

    tcInt64, tcOk := tc.AsInt64()
    scInt64, scOk := sc.AsInt64()
    if !tcOk || !scOk {
        return nil, fmt.Errorf("unable to compute integer values of schedulable cores in the cluster")
    }
    clusterStatus.TotalCores = int32(tcInt64)
    clusterStatus.SchedulableCores = int32(scInt64)
    k.clusterStatus = clusterStatus
    return clusterStatus, nil
}
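When counting nodes, Unschedulable nodes are excluded.
When counting cores, the Capacity of Unschedulable nodes is excluded as well.
Note that cores are counted from each Node's Capacity, not from its Allocatable.
In my view, Allocatable would be the better choice.
The difference between the two becomes visible in large clusters. For example, if each Node's Allocatable is 1c4g (1 core, 4 GB of memory) less than its Capacity, then in a 2K-node cluster the totals differ by 2000c8000g, which makes the computed number of target replicas differ considerably.
Some readers may ask: how do Node Allocatable and Capacity differ?
Capacity is the total resource the Node provides at the hardware level, such as the amount of memory and the number of CPU cores installed in the server; it is determined by the hardware.
Allocatable is Capacity minus the resources reserved through the kubelet flags --kube-reserved and --system-reserved (and the hard eviction threshold); it is what Kubernetes can actually allocate to applications.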
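To make the Capacity versus Allocatable gap concrete, here is a small Go sketch. Everything in it is an assumption for illustration: the `node` struct is a simplified stand-in for the Kubernetes Node object, the per-node figures (32 cores of Capacity, 31 Allocatable) and the value 256 for coresPerReplica are made up, and only whole cores are modeled.

```go
package main

import "fmt"

// node is a simplified, hypothetical stand-in for a Kubernetes Node object.
type node struct {
	unschedulable  bool
	capacityCPU    int // whole cores provided by the hardware
	allocatableCPU int // cores left after the kubelet reservations
}

// schedulableCores sums CPU over schedulable nodes, using either Capacity
// (what GetClusterStatus does) or Allocatable (the alternative discussed).
func schedulableCores(nodes []node, useAllocatable bool) int {
	total := 0
	for _, n := range nodes {
		if n.unschedulable {
			continue
		}
		if useAllocatable {
			total += n.allocatableCPU
		} else {
			total += n.capacityCPU
		}
	}
	return total
}

// ceilDiv returns ceil(a / b) for positive integers.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// buildCluster returns count identical schedulable nodes.
func buildCluster(count, capacity, allocatable int) []node {
	nodes := make([]node, count)
	for i := range nodes {
		nodes[i] = node{capacityCPU: capacity, allocatableCPU: allocatable}
	}
	return nodes
}

func main() {
	nodes := buildCluster(2000, 32, 31)
	byCapacity := schedulableCores(nodes, false)
	byAllocatable := schedulableCores(nodes, true)
	fmt.Println(byCapacity, byAllocatable, byCapacity-byAllocatable) // 64000 62000 2000
	// With an assumed coresPerReplica of 256 the replica counts diverge.
	fmt.Println(ceilDiv(byCapacity, 256), ceilDiv(byAllocatable, 256)) // 250 243
}
```

With these made-up numbers the 2000-core gap translates into seven replicas; the size of the effect depends entirely on the per-replica ratio chosen.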
syncConfigWithServer
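syncConfigWithServer fetches the latest ConfigMap data from the apiserver. Note that it does not watch the ConfigMap; it issues a get once per --poll-period-seconds (10s by default), so by default a change may take up to 10s to be picked up.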
pkg/autoscaler/autoscaler_server.go:124

func (s *AutoScaler) syncConfigWithServer() (*apiv1.ConfigMap, error) {
    // Fetch autoscaler ConfigMap data from apiserver
    configMap, err := s.k8sClient.FetchConfigMap(s.k8sClient.GetNamespace(), s.configMapName)
    if err == nil {
        return configMap, nil
    }
    if s.defaultParams == nil {
        return nil, err
    }
    glog.V(0).Infof("ConfigMap not found: %v, will create one with default params", err)
    configMap, err = s.k8sClient.CreateConfigMap(s.k8sClient.GetNamespace(), s.configMapName, s.defaultParams)
    if err != nil {
        return nil, err
    }
    return configMap, nil
}
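If the ConfigMap named by --configmap already exists in the cluster, the latest version is fetched from the apiserver and returned;
If the ConfigMap named by --configmap does not exist in the cluster, a ConfigMap is created from the contents of --default-params and returned;
If the ConfigMap named by --configmap does not exist in the cluster and --default-params is not set either, nil is returned, which means failure and ends the whole round. Keep this in mind when using it!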
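Always set --default-params. The ConfigMap named by --configmap may be deleted by an administrator or user, deliberately or by accident. If --default-params is not set when that happens, pollAPIServer stops right there, which means the target is no longer autoscaled, and the worst part is that you may not even know the cluster is in this state.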
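The three outcomes can be reproduced with a tiny in-memory model. This is a sketch under assumptions, not the project's code: `fakeAPI` is a hypothetical map-backed stand-in for the apiserver, and `syncConfig` only mirrors the control flow of syncConfigWithServer (fetch, fall back to defaults, or fail).

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("configmap not found")

// fakeAPI is a hypothetical in-memory stand-in for the apiserver,
// mapping a ConfigMap name to its raw data.
type fakeAPI map[string]string

func (a fakeAPI) fetch(name string) (string, error) {
	data, ok := a[name]
	if !ok {
		return "", errNotFound
	}
	return data, nil
}

// syncConfig returns the stored ConfigMap if it exists, otherwise creates
// one from defaults, and fails only when there are no defaults either.
func syncConfig(api fakeAPI, name string, defaults *string) (string, error) {
	data, err := api.fetch(name)
	if err == nil {
		return data, nil
	}
	if defaults == nil {
		return "", err
	}
	api[name] = *defaults
	return *defaults, nil
}

func main() {
	defaults := `{"linear":{"nodesPerReplica":10,"min":1}}`

	api := fakeAPI{"kube-dns-autoscaler": `{"linear":{"nodesPerReplica":5}}`}
	fmt.Println(syncConfig(api, "kube-dns-autoscaler", &defaults)) // existing data wins

	delete(api, "kube-dns-autoscaler") // simulate an accidental deletion
	fmt.Println(syncConfig(api, "kube-dns-autoscaler", &defaults)) // recreated from defaults

	delete(api, "kube-dns-autoscaler")
	fmt.Println(syncConfig(api, "kube-dns-autoscaler", nil)) // no defaults: error
}
```

The last call is the silent-failure case described above: nothing is recreated, and every later round fails the same way until someone restores the ConfigMap.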
EnsureController
EnsureController creates the Controller that matches the controller type configured in the ConfigMap and parses its parameters.
pkg/autoscaler/controller/plugin/plugin.go:32

// EnsureController ensures controller type and scaling params
func EnsureController(cont controller.Controller, configMap *apiv1.ConfigMap) (controller.Controller, error) {
    // Expect only one entry, which uses the name of control mode as the key
    if len(configMap.Data) != 1 {
        return nil, fmt.Errorf("invalid configMap format, expected only one entry, got: %v", configMap.Data)
    }

    for mode := range configMap.Data {
        // No need to reset controller if control pattern doesn't change
        if cont != nil && mode == cont.GetControllerType() {
            break
        }
        switch mode {
        case laddercontroller.ControllerType:
            cont = laddercontroller.NewLadderController()
        case linearcontroller.ControllerType:
            cont = linearcontroller.NewLinearController()
        default:
            return nil, fmt.Errorf("not a supported control mode: %v", mode)
        }
        glog.V(1).Infof("Set control mode to %v", mode)
    }

    // Sync config with controller
    if err := cont.SyncConfig(configMap); err != nil {
        return nil, fmt.Errorf("Error syncing configMap with controller: %v", err)
    }
    return cont, nil
}
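Check that the ConfigMap data contains exactly one entry; if not, the ConfigMap is invalid and the round ends;
Check that the controller type is either linear or ladder and call the matching constructor to create the Controller, otherwise return an error (if the mode is unchanged, the existing Controller is reused):
linear --> NewLinearController
ladder --> NewLadderController
Call the Controller's SyncConfig to parse the parameters in the ConfigMap data and store them, together with the ConfigMap ResourceVersion, in the Controller object.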
GetExpectedReplicas
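The linear and ladder Controllers each implement their own GetExpectedReplicas method, which computes the replica count expected for the cluster status observed in the current round. See the Linear Controller and Ladder Controller sections below for details.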
UpdateReplicas
UpdateReplicas takes the expected replica count computed by GetExpectedReplicas and calls the Scale interface of the target (rc/rs/deploy); the Scale subresource then scales the target up or down.
pkg/autoscaler/k8sclient/k8sclient.go:172

func (k *k8sClient) UpdateReplicas(expReplicas int32) (prevRelicas int32, err error) {
    scale, err := k.clientset.Extensions().Scales(k.target.namespace).Get(k.target.kind, k.target.name)
    if err != nil {
        return 0, err
    }
    prevRelicas = scale.Spec.Replicas
    if expReplicas != prevRelicas {
        glog.V(0).Infof("Cluster status: SchedulableNodes[%v], SchedulableCores[%v]", k.clusterStatus.SchedulableNodes, k.clusterStatus.SchedulableCores)
        glog.V(0).Infof("Replicas are not as expected : updating replicas from %d to %d", prevRelicas, expReplicas)
        scale.Spec.Replicas = expReplicas
        _, err = k.clientset.Extensions().Scales(k.target.namespace).Update(k.target.kind, scale)
        if err != nil {
            return 0, err
        }
    }
    return prevRelicas, nil
}
The following sections analyze the implementations of the Linear Controller and the Ladder Controller.
Linear Controller
First, the parameters of the linear Controller:
pkg/autoscaler/controller/linearcontroller/linear_controller.go:50

type linearParams struct {
    CoresPerReplica           float64 `json:"coresPerReplica"`
    NodesPerReplica           float64 `json:"nodesPerReplica"`
    Min                       int     `json:"min"`
    Max                       int     `json:"max"`
    PreventSinglePointFailure bool    `json:"preventSinglePointFailure"`
}
When writing the ConfigMap, use the following as a reference:
kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-autoscaler
  namespace: default
data:
  linear: |-
    {
      "coresPerReplica": 2,
      "nodesPerReplica": 1,
      "preventSinglePointFailure": true,
      "min": 1,
      "max": 100
    }
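The other parameters need little explanation; the one worth highlighting is PreventSinglePointFailure. As the name suggests, it guards against a single point of failure. It is a bool that the code never initializes explicitly, so it defaults to false. You can set "preventSinglePointFailure": true in the ConfigMap data or in --default-params. Once it is true and schedulableNodes > 1, the target is guaranteed at least 2 replicas, which prevents the target from being a single point of failure.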
pkg/autoscaler/controller/linearcontroller/linear_controller.go:101

func (c *LinearController) GetExpectedReplicas(status *k8sclient.ClusterStatus) (int32, error) {
    // Get the expected replicas for the currently schedulable nodes and cores
    expReplicas := int32(c.getExpectedReplicasFromParams(int(status.SchedulableNodes), int(status.SchedulableCores)))
    return expReplicas, nil
}

func (c *LinearController) getExpectedReplicasFromParams(schedulableNodes, schedulableCores int) int {
    replicasFromCore := c.getExpectedReplicasFromParam(schedulableCores, c.params.CoresPerReplica)
    replicasFromNode := c.getExpectedReplicasFromParam(schedulableNodes, c.params.NodesPerReplica)
    // Prevent single point of failure by having at least 2 replicas when
    // there are more than one node.
    if c.params.PreventSinglePointFailure &&
        schedulableNodes > 1 &&
        replicasFromNode < 2 {
        replicasFromNode = 2
    }

    // Returns the results which yields the most replicas
    if replicasFromCore > replicasFromNode {
        return replicasFromCore
    }
    return replicasFromNode
}

func (c *LinearController) getExpectedReplicasFromParam(schedulableResources int, resourcesPerReplica float64) int {
    if resourcesPerReplica == 0 {
        return 1
    }
    res := math.Ceil(float64(schedulableResources) / resourcesPerReplica)
    if c.params.Max != 0 {
        res = math.Min(float64(c.params.Max), res)
    }
    return int(math.Max(float64(c.params.Min), res))
}
Compute replicasFromCore from schedulableCores and the CoresPerReplica value in the ConfigMap (if CoresPerReplica is 0 or unset, the result is 1):
replicasFromCore = ceil( schedulableCores * 1/CoresPerReplica )
Compute replicasFromNode from schedulableNodes and the NodesPerReplica value in the ConfigMap (if NodesPerReplica is 0 or unset, the result is 1):
replicasFromNode = ceil( schedulableNodes * 1/NodesPerReplica )
If min or max is set in the ConfigMap, each result is clamped to the range between min and max:
replicas = min(replicas, max)
replicas = max(replicas, min)
If preventSinglePointFailure is true and schedulableNodes > 1, apply the single-point-of-failure guard described earlier, so that replicasFromNode is at least 2:
replicasFromNode = max(2, replicasFromNode)
Return the larger of replicasFromNode and replicasFromCore as the expected replica count.
In summary, the linear controller computes replicas as follows:
replicas = max( ceil( cores * 1/coresPerReplica ) , ceil( nodes * 1/nodesPerReplica ) )
replicas = min(replicas, max)
replicas = max(replicas, min)
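As a worked example, the calculation can be re-implemented in a standalone form and run against a few cluster sizes. The sketch below follows the logic of the quoted source but uses plain functions instead of methods; the function names `replicasFromParam` and `expectedReplicas`, and all of the cluster sizes, are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// linearParams mirrors the fields of the struct quoted above.
type linearParams struct {
	CoresPerReplica           float64
	NodesPerReplica           float64
	Min, Max                  int
	PreventSinglePointFailure bool
}

// replicasFromParam computes ceil(resources / perReplica), clamped by
// Max (only when non-zero) and then by Min. A zero ratio yields 1.
func replicasFromParam(p linearParams, resources int, perReplica float64) int {
	if perReplica == 0 {
		return 1
	}
	res := math.Ceil(float64(resources) / perReplica)
	if p.Max != 0 {
		res = math.Min(float64(p.Max), res)
	}
	return int(math.Max(float64(p.Min), res))
}

// expectedReplicas returns the larger of the core-based and node-based
// results, after applying the single-point-of-failure guard.
func expectedReplicas(p linearParams, nodes, cores int) int {
	fromCore := replicasFromParam(p, cores, p.CoresPerReplica)
	fromNode := replicasFromParam(p, nodes, p.NodesPerReplica)
	if p.PreventSinglePointFailure && nodes > 1 && fromNode < 2 {
		fromNode = 2
	}
	if fromCore > fromNode {
		return fromCore
	}
	return fromNode
}

func main() {
	// Parameters from the nginx-autoscaler ConfigMap shown earlier.
	nginx := linearParams{CoresPerReplica: 2, NodesPerReplica: 1, Min: 1, Max: 100, PreventSinglePointFailure: true}
	fmt.Println(expectedReplicas(nginx, 4, 16))   // 8: cores dominate
	fmt.Println(expectedReplicas(nginx, 80, 640)) // 100: capped by max

	// A nodes-only configuration: coresPerReplica is left at 0.
	nodesOnly := linearParams{NodesPerReplica: 10, Min: 1, Max: 50, PreventSinglePointFailure: true}
	fmt.Println(expectedReplicas(nodesOnly, 1, 16))   // 1
	fmt.Println(expectedReplicas(nodesOnly, 3, 48))   // 2: single-point-of-failure guard
	fmt.Println(expectedReplicas(nodesOnly, 45, 720)) // 5
}
```

One detail that the summary formula hides: the guard is applied after the clamp, so with preventSinglePointFailure set and more than one schedulable node, the result is 2 even when max is 1.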
Ladder Controller
Here is the parameter structure of the ladder Controller:
pkg/autoscaler/controller/laddercontroller/ladder_controller.go:66

type paramEntry [2]int
type paramEntries []paramEntry
type ladderParams struct {
    CoresToReplicas paramEntries `json:"coresToReplicas"`
    NodesToReplicas paramEntries `json:"nodesToReplicas"`
}
When writing the ConfigMap, use the following as a reference:
kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-autoscaler
  namespace: default
data:
  ladder: |-
    {
      "coresToReplicas":
      [
        [ 1,1 ],
        [ 3,3 ],
        [256,4],
        [ 512,5 ],
        [ 1024,7 ]
      ],
      "nodesToReplicas":
      [
        [ 1,1 ],
        [ 2,2 ],
        [100, 5],
        [200, 12]
      ]
    }
Below is the ladder Controller's method for computing the expected replica count.
func (c *LadderController) GetExpectedReplicas(status *k8sclient.ClusterStatus) (int32, error) {
    // Get the expected replicas for the currently schedulable nodes and cores
    expReplicas := int32(c.getExpectedReplicasFromParams(int(status.SchedulableNodes), int(status.SchedulableCores)))
    return expReplicas, nil
}

func (c *LadderController) getExpectedReplicasFromParams(schedulableNodes, schedulableCores int) int {
    replicasFromCore := getExpectedReplicasFromEntries(schedulableCores, c.params.CoresToReplicas)
    replicasFromNode := getExpectedReplicasFromEntries(schedulableNodes, c.params.NodesToReplicas)

    // Returns the results which yields the most replicas
    if replicasFromCore > replicasFromNode {
        return replicasFromCore
    }
    return replicasFromNode
}

func getExpectedReplicasFromEntries(schedulableResources int, entries []paramEntry) int {
    if len(entries) == 0 {
        return 1
    }
    // Binary search for the corresponding replicas number
    pos := sort.Search(
        len(entries),
        func(i int) bool {
            return schedulableResources < entries[i][0]
        })
    if pos > 0 {
        pos = pos - 1
    }
    return entries[pos][1]
}
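Find the range defined by CoresToReplicas in the ConfigMap that schedulableCores falls into, and take the replica count preset for that range.
Find the range defined by NodesToReplicas in the ConfigMap that schedulableNodes falls into, and take the replica count preset for that range.
Return the larger of the two as the expected replica count.
Note:
In ladder mode there is no setting to prevent a single point of failure, so take care of this yourself when writing the ConfigMap;
In ladder mode, if NodesToReplicas or CoresToReplicas is missing or empty, the corresponding replica count is set to 1;
For example, with the ConfigMap shown above, if the cluster has schedulableCores=400 (expected replicas 4) and schedulableNodes=120 (expected replicas 5), the final expected replica count is 5.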
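The example can be checked with a standalone copy of the lookup. The sketch below reuses the `sort.Search` logic from the quoted source; the names `entry`, `lookup`, `expected`, `exampleCores`, and `exampleNodes` are hypothetical, and the extra inputs (0 cores, 2000 cores) are assumptions added to show the edges of the table.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a {threshold, replicas} pair, as in the ladder ConfigMap.
type entry [2]int

// The two tables from the example ConfigMap above.
var exampleCores = []entry{{1, 1}, {3, 3}, {256, 4}, {512, 5}, {1024, 7}}
var exampleNodes = []entry{{1, 1}, {2, 2}, {100, 5}, {200, 12}}

// lookup finds the last entry whose threshold is <= resources and returns
// its replica count. An empty table yields 1; a value below the first
// threshold still maps to the first entry.
func lookup(resources int, entries []entry) int {
	if len(entries) == 0 {
		return 1
	}
	pos := sort.Search(len(entries), func(i int) bool {
		return resources < entries[i][0]
	})
	if pos > 0 {
		pos--
	}
	return entries[pos][1]
}

// expected returns the larger of the node-based and core-based lookups.
func expected(nodes, cores int, nodeTable, coreTable []entry) int {
	fromCore := lookup(cores, coreTable)
	fromNode := lookup(nodes, nodeTable)
	if fromCore > fromNode {
		return fromCore
	}
	return fromNode
}

func main() {
	fmt.Println(lookup(400, exampleCores))                        // 4
	fmt.Println(lookup(120, exampleNodes))                        // 5
	fmt.Println(expected(120, 400, exampleNodes, exampleCores))   // 5
	fmt.Println(lookup(0, exampleCores))                          // 1: below the first threshold
	fmt.Println(lookup(2000, exampleCores))                       // 7: beyond the last threshold
}
```

The binary search assumes the entries are sorted by threshold in ascending order, as they are in the example ConfigMap.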
Using kube-dns-autoscaler to Solve the KubeDNS Performance Bottleneck
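Create the kube-dns-autoscaler Deployment and ConfigMap with the YAML below. kube-dns-autoscaler recomputes the replica count once per poll period (the manifest does not set --poll-period-seconds, so the 10s default described earlier applies) and may trigger a scaling operation.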
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: |
    {
      "nodesPerReplica": 10,
      "min": 1,
      "max": 50,
      "preventSinglePointFailure": true
    }
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        k8s-app: kube-dns-autoscaler
    spec:
      imagePullSecrets:
      - name: harborsecret
      containers:
      - name: autoscaler
        image: registry.vivo.xyz:4443/bigdata_release/cluster_proportional_autoscaler_amd64:1.0.0
        resources:
          requests:
            cpu: "50m"
            memory: "100Mi"
        command:
        - /cluster-proportional-autoscaler
        - --namespace=kube-system
        - --configmap=kube-dns-autoscaler
        - --target=Deployment/kube-dns
        - --default-params={"linear":{"nodesPerReplica":10,"min":1}}
        - --logtostderr=true
        - --v=2
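Summary and Outlook
The cluster-proportional-autoscaler code is short and its mechanism is straightforward. We want to use it to scale KubeDNS dynamically with cluster size, in order to solve the large-scale DNS resolution performance problem in our TensorFlow on Kubernetes project.
At present it can autoscale only on SchedulableNodes and SchedulableCores. In AI scenarios cluster resources are squeezed to the limit, and the number of svc and pod objects a cluster carries fluctuates widely, so we may later develop a controller that autoscales kubedns according to the number of services.
In addition, I am considering isolating the KubeDNS deployment from the AI training servers. Training often drives server CPU usage above 95%, and if KubeDNS runs on the same server its performance is bound to suffer.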