Saturday, April 18, 2020

HPA based on external metrics in CloudWatch

CA (Cluster Autoscaler)

EKS worker nodes in general:


One thing to note is that while Managed Node Groups provide a managed experience for the provisioning and lifecycle of EC2 instances, they do not configure horizontal or vertical auto-scaling. This means you still need a service like the Kubernetes Cluster Autoscaler to auto-scale the underlying ASG.
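
EKS managed node groups tag their ASGs for Cluster Autoscaler auto-discovery (k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<cluster-name>). A quick way to verify the tags are in place (the region here is an example):

aws autoscaling describe-auto-scaling-groups --region us-east-1 \
  --query "AutoScalingGroups[].{Name:AutoScalingGroupName,Tags:Tags[?starts_with(Key,'k8s.io/cluster-autoscaler')]}"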

EKS managed node groups for Cluster Autoscaler:


Enable CA for managed node group setup (v1, works):


Enable CA for managed node group setup (following its instructions and using the code from v1 below):

Working code:
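
A minimal sketch of the standard auto-discovery setup (the manifest URL is the upstream example from the kubernetes/autoscaler repo; pin the cluster-autoscaler image to your Kubernetes minor version and substitute your cluster name in the tag below):

kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

# Then edit the deployment so auto-discovery matches your cluster's ASG tags:
#   --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
kubectl -n kube-system edit deployment cluster-autoscaler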




HPA (Horizontal Pod Autoscaler)

This is the best link I have found so far on HPA and custom metrics:


Limitations

Terraform does not support external metrics for HPA:


MSK does not support consumer group lag in CloudWatch as of Dec 2019:
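
Workaround: compute consumer group lag yourself (e.g., with a small job using the Kafka admin API) and publish it to CloudWatch as a custom metric. A sketch using the namespace and dimensions from this post (the value is a placeholder):

aws cloudwatch put-metric-data \
  --region us-east-1 \
  --namespace co-ec-eks-cluster-vpc-05b52a0b999999999 \
  --metric-name KafkaTopicTotalLag \
  --dimensions QueueName=task,metric_type=counter \
  --unit None \
  --value 1211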




Install Metrics Server:


# Grab the latest metrics-server release tarball URL and derive the version tag
DOWNLOAD_URL=$(curl -Ls "https://api.github.com/repos/kubernetes-sigs/metrics-server/releases/latest" | jq -r .tarball_url)
DOWNLOAD_VERSION=$(grep -o '[^/v]*$' <<< $DOWNLOAD_URL)
# Download and unpack the release
curl -Ls $DOWNLOAD_URL -o metrics-server-$DOWNLOAD_VERSION.tar.gz
mkdir metrics-server-$DOWNLOAD_VERSION
tar -xzf metrics-server-$DOWNLOAD_VERSION.tar.gz --directory metrics-server-$DOWNLOAD_VERSION --strip-components 1
# Apply the manifests to the cluster
kubectl apply -f metrics-server-$DOWNLOAD_VERSION/deploy/1.8+/

LT-2018-9999:custom_hpa_metrics jzeng$ kubectl get pod -n kube-system
NAME                              READY   STATUS    RESTARTS   AGE
aws-node-2m8s5                    1/1     Running   0          5d2h
aws-node-bdbzq                    1/1     Running   0          5d2h
aws-node-drcs5                    1/1     Running   0          5d2h
coredns-74dd858ddc-64jl6          1/1     Running   0          5d2h
coredns-74dd858ddc-f9s8t          1/1     Running   0          5d2h
kube-proxy-2snmp                  1/1     Running   0          5d2h
kube-proxy-q8qfz                  1/1     Running   0          5d2h
kube-proxy-w9vr6                  1/1     Running   0          5d2h
metrics-server-7fcf9cc98b-9fqx9   1/1     Running   0          41s
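
Once the metrics-server pod is Running, resource metrics should start flowing (it can take a minute):

kubectl top nodes
kubectl top pods -n kube-system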


External Metrics


AWS CloudWatch Metrics Adapter for K8s (k8s-cloudwatch-adapter) for external metrics
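
The adapter needs IAM permission to call cloudwatch:GetMetricData from the worker nodes (or via IRSA). The install is a single manifest (path per the project's README; verify against the repo):

kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/deploy/adapter.yaml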



Create the ExternalMetric resource:

LT-2018-9999:dev jzeng$ cat metrics_kafka_lag.yaml
apiVersion: metrics.aws/v1alpha1
kind: ExternalMetric
metadata:
  name: metrics-kafka-lag
spec:
  name: metrics-kafka-lag
  resource:
    resource: "deployment"
  queries:
    - id: metrics_kafka_lag
      metricStat:
        metric:
          namespace: "co-ec-eks-cluster-vpc-05b52a0b999999999"
          metricName: "KafkaTopicTotalLag"
          dimensions:
              - name: QueueName
                value: "task"
              - name: metric_type
                value: "counter"
        period: 600
        stat: Average
        unit: None
      returnData: true
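
Apply it; registration with the external metrics API can be verified as shown in Troubleshooting below:

kubectl apply -f metrics_kafka_lag.yaml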


Create an HPA targeting this external metric:

LT-2018-9999:dev jzeng$ cat hpa_kafka_lag.yaml
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: external-kafka-lag-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: executor-co-ec-eks-cluster-vpc-05b52a0b999999999
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: metrics-kafka-lag
      targetAverageValue: 3
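
Apply and watch the HPA pick up the CloudWatch value (TARGETS should show the current average against 3 once the adapter responds):

kubectl apply -f hpa_kafka_lag.yaml
kubectl get hpa external-kafka-lag-scaler --watch
kubectl describe hpa external-kafka-lag-scaler   # events show scaling decisions and metric errors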


Custom metrics with Prometheus:



Multiple metrics for the same pod's HPA:
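
When several metrics are listed, the HPA computes a desired replica count for each metric and scales to the largest. A sketch combining the Kafka lag metric above with a CPU target (the CPU utilization number is an assumption):

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: multi-metric-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: executor-co-ec-eks-cluster-vpc-05b52a0b999999999
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: metrics-kafka-lag
      targetAverageValue: 3
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 60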



Troubleshooting:

Check external metrics:

LT-2018-9999:dev jzeng$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" |jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "metrics-kafka-lag",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "hello-queue-length",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
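
To see the actual value the adapter returns (assuming the ExternalMetric was created in the default namespace):

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/metrics-kafka-lag" | jq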

List metrics from CloudWatch:

LT-2018-9999:dev jzeng$ aws cloudwatch list-metrics --region us-east-1 --namespace co-ec-eks-cluster-vpc-05b52a0b999999999
{
    "Metrics": [
        {
            "Namespace": "co-ec-eks-cluster-vpc-05b52a0b999999999",
            "Dimensions": [
                {
                    "Name": "metric_type",
                    "Value": "counter"
                },
                {
                    "Name": "QueueName",
                    "Value": "task"
                }
            ],
            "MetricName": "KafkaTopic#TotalLag"
        },
        {
            "Namespace": "co-ec-eks-cluster-vpc-05b52a0b999999999",
            "Dimensions": [
                {
                    "Name": "metric_type",
                    "Value": "counter"
                },
                {
                    "Name": "QueueName",
                    "Value": "task"
                }
            ],
            "MetricName": "KafkaTopicTotalLag"
        }
    ]
}

Query a metric (if --unit is not passed, data collected with any unit is returned):

LT-2018-9999:dev jzeng$ aws cloudwatch get-metric-statistics  --start-time 2020-04-15T04:00:00Z --end-time 2020-04-19T04:00:00Z --region us-east-1 --namespace co-ec-eks-cluster-vpc-05b52a0b999999999 --metric-name KafkaTopicTotalLag --period 300 --statistics Average --dimensions Name=QueueName,Value=task Name=metric_type,Value=counter
{
    "Datapoints": [
        {
            "Timestamp": "2020-04-18T20:55:00Z",
            "Average": 1211.0,
            "Unit": "None"
        },
        {
            "Timestamp": "2020-04-18T05:20:00Z",
            "Average": 1211.0,
            "Unit": "None"
        },

Or:

LT-2018-9999:dev jzeng$ cat cwquery.json
[
    {
        "Id": "metrics_kafka_lag",
        "MetricStat": {
            "Metric": {
                "Namespace": "co-ec-eks-cluster-vpc-05b52a0b999999999",
                "MetricName": "KafkaTopicTotalLag",
                "Dimensions": [
                    {
                        "Name": "QueueName",
                        "Value": "task"
                    },
                    {
                        "Name": "metric_type",
                        "Value": "counter"
                    }
                ]
            },
            "Period": 600,
            "Stat": "Sum",
            "Unit": "None"
        },
        "ReturnData": true
    }
]

LT-2018-9999:dev jzeng$ aws cloudwatch get-metric-data --metric-data-queries file://./cwquery.json --start-time 2020-04-18T04:00:00Z --end-time 2020-04-18T05:00:00Z --region us-east-1
{
    "Messages": [],
    "MetricDataResults": [
        {
            "Timestamps": [
                "2020-04-18T04:50:00Z",
                "2020-04-18T04:40:00Z",
                "2020-04-18T04:30:00Z",
                "2020-04-18T04:20:00Z",
                "2020-04-18T04:10:00Z",
                "2020-04-18T04:00:00Z"
            ],
            "StatusCode": "Complete",
            "Values": [
                12110.0,
                12110.0,
                12110.0,
                12110.0,
                12110.0,
                12110.0
            ],
            "Id": "metrics_kafka_lag",
            "Label": "KafkaTopicTotalLag"
        }
    ]
}





Important thing to remember:

The maximum number of data points returned from a single call is 1,440. If you request more than 1,440 data points, CloudWatch is documented to return an error (in practice it returns nothing rather than an error). To reduce the number of data points, narrow the time range and make multiple requests across adjacent ranges, or increase the period. For example, the get-metric-statistics call above covers 4 days at a 300-second period, i.e. 4 × 288 = 1,152 data points, just under the limit.









