Saturday, April 18, 2020

HPA based on external metrics in CloudWatch

CA (Cluster Autoscaler)

EKS worker nodes in general:


One thing to note is that while Managed Node Groups provide a managed experience for the provisioning and lifecycle of EC2 instances, they do not configure horizontal or vertical auto-scaling. This means you still need a service like the Kubernetes Cluster Autoscaler to auto-scale the underlying ASG.
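
EKS managed node groups tag their ASGs for Cluster Autoscaler auto-discovery (k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<cluster-name>). A quick way to verify the tags are in place (the region here is an example):

aws autoscaling describe-auto-scaling-groups --region us-east-1 \
  --query "AutoScalingGroups[].{Name:AutoScalingGroupName,Tags:Tags[?starts_with(Key,'k8s.io/cluster-autoscaler')]}"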

EKS managed node groups for Cluster Autoscaler:


Enable CA for managed node group setup (v1, works):


Enable CA for managed node group setup (following its instructions and using the code from v1 below):

Working code:
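
A minimal sketch of the standard auto-discovery setup (the manifest URL is the upstream example from the kubernetes/autoscaler repo; pin the cluster-autoscaler image to your Kubernetes minor version and substitute your cluster name in the tag below):

kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

# Then edit the deployment so auto-discovery matches your cluster's ASG tags:
#   --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
kubectl -n kube-system edit deployment cluster-autoscaler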




HPA (Horizontal Pod Autoscaler)

This is the best link I have found so far on HPA and custom metrics:


Limitations

Terraform does not support external metrics for HPA:


MSK does not support consumer group lag in CloudWatch as of Dec 2019:
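
Workaround: compute consumer group lag yourself (e.g., with a small job using the Kafka admin API) and publish it to CloudWatch as a custom metric. A sketch using the namespace and dimensions from this post (the value is a placeholder):

aws cloudwatch put-metric-data \
  --region us-east-1 \
  --namespace co-ec-eks-cluster-vpc-05b52a0b999999999 \
  --metric-name KafkaTopicTotalLag \
  --dimensions QueueName=task,metric_type=counter \
  --unit None \
  --value 1211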




Install Metrics Server:


# Grab the latest metrics-server release tarball URL and derive the version tag
DOWNLOAD_URL=$(curl -Ls "https://api.github.com/repos/kubernetes-sigs/metrics-server/releases/latest" | jq -r .tarball_url)
DOWNLOAD_VERSION=$(grep -o '[^/v]*$' <<< $DOWNLOAD_URL)
# Download and unpack the release
curl -Ls $DOWNLOAD_URL -o metrics-server-$DOWNLOAD_VERSION.tar.gz
mkdir metrics-server-$DOWNLOAD_VERSION
tar -xzf metrics-server-$DOWNLOAD_VERSION.tar.gz --directory metrics-server-$DOWNLOAD_VERSION --strip-components 1
# Apply the manifests to the cluster
kubectl apply -f metrics-server-$DOWNLOAD_VERSION/deploy/1.8+/

LT-2018-9999:custom_hpa_metrics jzeng$ kubectl get pod -n kube-system
NAME                              READY   STATUS    RESTARTS   AGE
aws-node-2m8s5                    1/1     Running   0          5d2h
aws-node-bdbzq                    1/1     Running   0          5d2h
aws-node-drcs5                    1/1     Running   0          5d2h
coredns-74dd858ddc-64jl6          1/1     Running   0          5d2h
coredns-74dd858ddc-f9s8t          1/1     Running   0          5d2h
kube-proxy-2snmp                  1/1     Running   0          5d2h
kube-proxy-q8qfz                  1/1     Running   0          5d2h
kube-proxy-w9vr6                  1/1     Running   0          5d2h
metrics-server-7fcf9cc98b-9fqx9   1/1     Running   0          41s
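
Once the metrics-server pod is Running, resource metrics should start flowing (it can take a minute):

kubectl top nodes
kubectl top pods -n kube-system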


External Metrics


AWS CloudWatch Metrics Adapter for K8s (k8s-cloudwatch-adapter) for external metrics
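
The adapter needs IAM permission to call cloudwatch:GetMetricData from the worker nodes (or via IRSA). The install is a single manifest (path per the project's README; verify against the repo):

kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/deploy/adapter.yaml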



Create the ExternalMetric resource:

LT-2018-9999:dev jzeng$ cat metrics_kafka_lag.yaml
apiVersion: metrics.aws/v1alpha1
kind: ExternalMetric
metadata:
  name: metrics-kafka-lag
spec:
  name: metrics-kafka-lag
  resource:
    resource: "deployment"
  queries:
    - id: metrics_kafka_lag
      metricStat:
        metric:
          namespace: "co-ec-eks-cluster-vpc-05b52a0b999999999"
          metricName: "KafkaTopicTotalLag"
          dimensions:
              - name: QueueName
                value: "task"
              - name: metric_type
                value: "counter"
        period: 600
        stat: Average
        unit: None
      returnData: true
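
Apply it; registration with the external metrics API can be verified as shown in Troubleshooting below:

kubectl apply -f metrics_kafka_lag.yaml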


Create an HPA targeting this external metric:

LT-2018-9999:dev jzeng$ cat hpa_kafka_lag.yaml
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: external-kafka-lag-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: executor-co-ec-eks-cluster-vpc-05b52a0b999999999
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: metrics-kafka-lag
      targetAverageValue: 3
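
Apply and watch the HPA pick up the CloudWatch value (TARGETS should show the current average against 3 once the adapter responds):

kubectl apply -f hpa_kafka_lag.yaml
kubectl get hpa external-kafka-lag-scaler --watch
kubectl describe hpa external-kafka-lag-scaler   # events show scaling decisions and metric errors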


Custom metrics with Prometheus:



Multiple metrics for the same pod's HPA:
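
When several metrics are listed, the HPA computes a desired replica count for each metric and scales to the largest. A sketch combining the Kafka lag metric above with a CPU target (the CPU utilization number is an assumption):

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: multi-metric-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: executor-co-ec-eks-cluster-vpc-05b52a0b999999999
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: metrics-kafka-lag
      targetAverageValue: 3
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 60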



Troubleshooting:

Check external metrics:

LT-2018-9999:dev jzeng$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" |jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "metrics-kafka-lag",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "hello-queue-length",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
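
To see the actual value the adapter returns (assuming the ExternalMetric was created in the default namespace):

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/metrics-kafka-lag" | jq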

List metrics from CloudWatch:

LT-2018-9999:dev jzeng$ aws cloudwatch list-metrics --region us-east-1 --namespace co-ec-eks-cluster-vpc-05b52a0b999999999
{
    "Metrics": [
        {
            "Namespace": "co-ec-eks-cluster-vpc-05b52a0b999999999",
            "Dimensions": [
                {
                    "Name": "metric_type",
                    "Value": "counter"
                },
                {
                    "Name": "QueueName",
                    "Value": "task"
                }
            ],
            "MetricName": "KafkaTopic#TotalLag"
        },
        {
            "Namespace": "co-ec-eks-cluster-vpc-05b52a0b999999999",
            "Dimensions": [
                {
                    "Name": "metric_type",
                    "Value": "counter"
                },
                {
                    "Name": "QueueName",
                    "Value": "task"
                }
            ],
            "MetricName": "KafkaTopicTotalLag"
        }
    ]
}

Query a metric (if --unit is not passed, data collected with any unit is returned):

LT-2018-9999:dev jzeng$ aws cloudwatch get-metric-statistics  --start-time 2020-04-15T04:00:00Z --end-time 2020-04-19T04:00:00Z --region us-east-1 --namespace co-ec-eks-cluster-vpc-05b52a0b999999999 --metric-name KafkaTopicTotalLag --period 300 --statistics Average --dimensions Name=QueueName,Value=task Name=metric_type,Value=counter
{
    "Datapoints": [
        {
            "Timestamp": "2020-04-18T20:55:00Z",
            "Average": 1211.0,
            "Unit": "None"
        },
        {
            "Timestamp": "2020-04-18T05:20:00Z",
            "Average": 1211.0,
            "Unit": "None"
        },

Or:

LT-2018-9999:dev jzeng$ cat cwquery.json
[
    {
        "Id": "metrics_kafka_lag",
        "MetricStat": {
            "Metric": {
                "Namespace": "co-ec-eks-cluster-vpc-05b52a0b999999999",
                "MetricName": "KafkaTopicTotalLag",
                "Dimensions": [
                    {
                        "Name": "QueueName",
                        "Value": "task"
                    },
                    {
                        "Name": "metric_type",
                        "Value": "counter"
                    }
                ]
            },
            "Period": 600,
            "Stat": "Sum",
            "Unit": "None"
        },
        "ReturnData": true
    }
]

LT-2018-9999:dev jzeng$ aws cloudwatch get-metric-data --metric-data-queries file://./cwquery.json --start-time 2020-04-18T04:00:00Z --end-time 2020-04-18T05:00:00Z --region us-east-1
{
    "Messages": [],
    "MetricDataResults": [
        {
            "Timestamps": [
                "2020-04-18T04:50:00Z",
                "2020-04-18T04:40:00Z",
                "2020-04-18T04:30:00Z",
                "2020-04-18T04:20:00Z",
                "2020-04-18T04:10:00Z",
                "2020-04-18T04:00:00Z"
            ],
            "StatusCode": "Complete",
            "Values": [
                12110.0,
                12110.0,
                12110.0,
                12110.0,
                12110.0,
                12110.0
            ],
            "Id": "metrics_kafka_lag",
            "Label": "KafkaTopicTotalLag"
        }
    ]
}





Important thing to remember:

The maximum number of data points returned from a single call is 1,440. If you request more than 1,440 data points, CloudWatch is documented to return an error (in practice it returns nothing rather than an error). To reduce the number of data points, narrow the time range and make multiple requests across adjacent ranges, or increase the period. For example, the get-metric-statistics call above covers 4 days at a 300-second period, i.e. 4 × 288 = 1,152 data points, just under the limit.









