CA (Cluster Autoscaler)
EKS worker node in general:
One thing to note is that while Managed Node Groups provide a managed experience for provisioning and lifecycle of EC2 instances, they do not configure horizontal or vertical auto-scaling. This means you still need a service like the Kubernetes Cluster Autoscaler to implement auto-scaling of the underlying ASG.
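As a sketch of what that means in practice: Cluster Autoscaler discovers the node group's ASG via tags, so the ASG must carry the standard auto-discovery tags. The ASG and cluster names below are placeholders; EKS managed node groups are supposed to add these tags automatically, while self-managed groups need them added by hand:

```shell
# Tag the node group's ASG so Cluster Autoscaler's
# --node-group-auto-discovery flag can find it. Names are placeholders.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-nodegroup-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=my-nodegroup-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=true"
```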
EKS managed node groups for Cluster Autoscaler:
Enable CA for managed node group setup (v1, works):
Enable CA for managed node group setup (following its instructions and using the code from v1 below):
Working code:
HPA (horizontal pod autoscaler)
This is the best link I have so far talking about HPA and custom metrics
Limitations
Terraform does not support external metrics for HPA:
MSK does not support consumer group lag in CloudWatch as of Dec 2019:
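Since MSK doesn't expose it, a workaround is to compute the lag yourself (e.g. in a sidecar or cron job) and publish it to a custom namespace. A hedged sketch — the offset values are placeholders, and the namespace/dimension names mirror the ExternalMetric defined below:

```shell
# Consumer-group lag = latest broker offset - committed consumer offset.
# Offsets here are placeholders; in practice read them from
# kafka-consumer-groups.sh --describe or the Kafka admin API.
LATEST_OFFSET=5000
COMMITTED_OFFSET=3789
LAG=$(( LATEST_OFFSET - COMMITTED_OFFSET ))

# Publish to the custom namespace that the CloudWatch adapter will query.
aws cloudwatch put-metric-data \
  --region us-east-1 \
  --namespace "co-ec-eks-cluster-vpc-05b52a0b999999999" \
  --metric-name KafkaTopicTotalLag \
  --dimensions QueueName=task,metric_type=counter \
  --value "$LAG" --unit None
```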
Install Metrics server:
DOWNLOAD_URL=$(curl -Ls "https://api.github.com/repos/kubernetes-sigs/metrics-server/releases/latest" | jq -r .tarball_url)
DOWNLOAD_VERSION=$(grep -o '[^/v]*$' <<< $DOWNLOAD_URL)
curl -Ls $DOWNLOAD_URL -o metrics-server-$DOWNLOAD_VERSION.tar.gz
mkdir metrics-server-$DOWNLOAD_VERSION
tar -xzf metrics-server-$DOWNLOAD_VERSION.tar.gz --directory metrics-server-$DOWNLOAD_VERSION --strip-components 1
kubectl apply -f metrics-server-$DOWNLOAD_VERSION/deploy/1.8+/
LT-2018-9999:custom_hpa_metrics jzeng$ kubectl get pod -n kube-system
NAME                              READY   STATUS    RESTARTS   AGE
aws-node-2m8s5                    1/1     Running   0          5d2h
aws-node-bdbzq                    1/1     Running   0          5d2h
aws-node-drcs5                    1/1     Running   0          5d2h
coredns-74dd858ddc-64jl6          1/1     Running   0          5d2h
coredns-74dd858ddc-f9s8t          1/1     Running   0          5d2h
kube-proxy-2snmp                  1/1     Running   0          5d2h
kube-proxy-q8qfz                  1/1     Running   0          5d2h
kube-proxy-w9vr6                  1/1     Running   0          5d2h
metrics-server-7fcf9cc98b-9fqx9   1/1     Running   0          41s
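Once the metrics-server pod is Running, a quick way to confirm the resource-metrics API is actually wired up (assuming kubectl is pointed at this cluster):

```shell
# Confirm the resource-metrics APIService is registered and Available=True.
kubectl get apiservice v1beta1.metrics.k8s.io
# If it is, node and pod metrics should be served:
kubectl top nodes
kubectl top pods -n kube-system
```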
External Metrics
AWS CloudWatch Metrics Adapter for K8s (k8s-cloudwatch-adapter) for external metrics
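The adapter is installed with a single manifest from the awslabs repo (the URL below is what its README documented at the time; verify it still exists), and the node role or IRSA role must be allowed to call cloudwatch:GetMetricData:

```shell
# Deploy the CloudWatch adapter; it lands in the custom-metrics namespace.
kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/deploy/adapter.yaml
kubectl get pods -n custom-metrics
```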
LT-2018-9999:dev jzeng$ cat metrics_kafka_lag.yaml
apiVersion: metrics.aws/v1alpha1
kind: ExternalMetric
metadata:
  name: metrics-kafka-lag
spec:
  name: metrics-kafka-lag
  resource:
    resource: "deployment"
  queries:
    - id: metrics_kafka_lag
      metricStat:
        metric:
          namespace: "co-ec-eks-cluster-vpc-05b52a0b999999999"
          metricName: "KafkaTopicTotalLag"
          dimensions:
            - name: QueueName
              value: "task"
            - name: metric_type
              value: "counter"
        period: 600
        stat: Average
        unit: None
      returnData: true
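Assuming the adapter is installed, the ExternalMetric is applied like any other namespaced resource, and the adapter's CRD makes it listable:

```shell
kubectl apply -f metrics_kafka_lag.yaml
kubectl get externalmetrics
```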
LT-2018-9999:dev jzeng$ cat hpa_kafka_lag.yaml
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: external-kafka-lag-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: executor-co-ec-eks-cluster-vpc-05b52a0b999999999
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metricName: metrics-kafka-lag
        targetAverageValue: 3
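After applying, the HPA's view of the metric can be checked; TARGETS shows current/target once the adapter answers, and describe surfaces fetch errors if the external metric can't be resolved:

```shell
kubectl apply -f hpa_kafka_lag.yaml
kubectl get hpa external-kafka-lag-scaler
kubectl describe hpa external-kafka-lag-scaler
```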
Custom metrics with Prometheus:
Multiple metrics for same pod’s HPA:
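The metrics field in autoscaling/v2beta1 is a list, and the HPA scales to the highest replica count proposed by any entry. A hypothetical sketch combining CPU with the external Kafka-lag metric (the deployment name is a placeholder):

```yaml
# Hypothetical HPA combining a resource metric and an external metric;
# the controller takes the max of the replica counts each metric proposes.
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: multi-metric-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: executor          # placeholder deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 70
    - type: External
      external:
        metricName: metrics-kafka-lag
        targetAverageValue: 3
```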
Troubleshooting:
Check external metrics:
LT-2018-9999:dev jzeng$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" |jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "metrics-kafka-lag",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "hello-queue-length",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
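Individual metric values (not just the listing) can be read from the same API; the namespace segment is the Kubernetes namespace the ExternalMetric lives in (default assumed here):

```shell
kubectl get --raw \
  "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/metrics-kafka-lag" | jq
```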
List metrics from CloudWatch:
LT-2018-9999:dev jzeng$ aws cloudwatch list-metrics --region us-east-1 --namespace co-ec-eks-cluster-vpc-05b52a0b999999999
{
    "Metrics": [
        {
            "Namespace": "co-ec-eks-cluster-vpc-05b52a0b999999999",
            "Dimensions": [
                {
                    "Name": "metric_type",
                    "Value": "counter"
                },
                {
                    "Name": "QueueName",
                    "Value": "task"
                }
            ],
            "MetricName": "KafkaTopic#TotalLag"
        },
        {
            "Namespace": "co-ec-eks-cluster-vpc-05b52a0b999999999",
            "Dimensions": [
                {
                    "Name": "metric_type",
                    "Value": "counter"
                },
                {
                    "Name": "QueueName",
                    "Value": "task"
                }
            ],
            "MetricName": "KafkaTopicTotalLag"
        }
    ]
}
Query a metric (if --unit is not passed as a parameter, data collected with any unit is returned):
LT-2018-9999:dev jzeng$ aws cloudwatch get-metric-statistics --start-time 2020-04-15T04:00:00Z --end-time 2020-04-19T04:00:00Z --region us-east-1 --namespace co-ec-eks-cluster-vpc-05b52a0b999999999 --metric-name KafkaTopicTotalLag --period 300 --statistics Average --dimensions Name=QueueName,Value=task Name=metric_type,Value=counter
{
    "Datapoints": [
        {
            "Timestamp": "2020-04-18T20:55:00Z",
            "Average": 1211.0,
            "Unit": "None"
        },
        {
            "Timestamp": "2020-04-18T05:20:00Z",
            "Average": 1211.0,
            "Unit": "None"
        },
Or:
LT-2018-9999:dev jzeng$ cat cwquery.json
[
    {
        "Id": "metrics_kafka_lag",
        "MetricStat": {
            "Metric": {
                "Namespace": "co-ec-eks-cluster-vpc-05b52a0b999999999",
                "MetricName": "KafkaTopicTotalLag",
                "Dimensions": [
                    {
                        "Name": "QueueName",
                        "Value": "task"
                    },
                    {
                        "Name": "metric_type",
                        "Value": "counter"
                    }
                ]
            },
            "Period": 600,
            "Stat": "Sum",
            "Unit": "None"
        },
        "ReturnData": true
    }
]
LT-2018-9999:dev jzeng$ aws cloudwatch get-metric-data --metric-data-queries file://./cwquery.json --start-time 2020-04-18T04:00:00Z --end-time 2020-04-18T05:00:00Z --region us-east-1
{
    "Messages": [],
    "MetricDataResults": [
        {
            "Timestamps": [
                "2020-04-18T04:50:00Z",
                "2020-04-18T04:40:00Z",
                "2020-04-18T04:30:00Z",
                "2020-04-18T04:20:00Z",
                "2020-04-18T04:10:00Z",
                "2020-04-18T04:00:00Z"
            ],
            "StatusCode": "Complete",
            "Values": [
                12110.0,
                12110.0,
                12110.0,
                12110.0,
                12110.0,
                12110.0
            ],
            "Id": "metrics_kafka_lag",
            "Label": "KafkaTopicTotalLag"
        }
    ]
}
Important thing to remember:
The maximum number of data points returned from a single call is 1,440. If you request more than 1,440 data points, CloudWatch returns an error (in practice it returns nothing rather than an error). To reduce the number of data points, you can narrow the specified time range and make multiple requests across adjacent time ranges, or you can increase the specified period.
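The limit is easy to check up front: data points = time range ÷ period. For the get-metric-statistics call above (4 days at a 300-second period):

```shell
# 2020-04-15T04:00Z .. 2020-04-19T04:00Z is exactly 4 days.
RANGE_SECONDS=$(( 4 * 24 * 3600 ))
PERIOD=300
POINTS=$(( RANGE_SECONDS / PERIOD ))
echo "$POINTS"   # 1152 -- under the 1,440 limit, so the query is safe
```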