Troubleshooting

You will find troubleshooting info for Data on Amazon EKS (DoEKS) installation issues below.

Error: local-exec provisioner error

If you encounter the following error during the execution of the local-exec provisioner:

Error: local-exec provisioner error
with module.eks-blueprints.module.emr_on_eks["data_team_b"].null_resource.update_trust_policy,
on .terraform/modules/eks-blueprints/modules/emr-on-eks/main.tf line 105, in resource "null_resource" "update_trust_policy":
│ 105: provisioner "local-exec" {
│
│ Error running command 'set -e
│ aws emr-containers update-role-trust-policy \
│ --cluster-name emr-on-eks \
│ --namespace emr-data-team-b \
│ --role-name emr-on-eks-emr-eks-data-team-b

Issue Description:

The error message indicates that the emr-containers command is not available in the version of the AWS CLI being used. This command was added in AWS CLI version 2.0.54.

Solution:

To resolve the issue, update your AWS CLI to version 2.0.54 or later by running the following command:

pip install --upgrade awscliv2

By updating the AWS CLI version, you will ensure that the necessary emr-containers command is available and can be executed successfully during the provisioning process.
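To confirm that the upgrade took effect, you can check the installed CLI version and that the emr-containers command group is now recognized. A quick check, assuming the aws binary on your PATH is the upgraded one:

# Verify the AWS CLI version is 2.0.54 or later
aws --version
# Verify the emr-containers command group is available
aws emr-containers help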

If you continue to experience any issues or require further assistance, please consult the AWS CLI GitHub issue for more details or contact our support team for additional guidance.

Timeouts during Terraform Destroy

Issue Description:

Customers may experience timeouts during the deletion of their environments, specifically when VPCs are being deleted. This is a known issue related to the vpc-cni component.

Symptoms:

ENIs (Elastic Network Interfaces) remain attached to subnets even after the environment is destroyed. The EKS managed security group associated with the ENI cannot be deleted by EKS.

Solution:

To overcome this issue, follow the recommended solution below:

Run the `cleanup.sh` script included in the blueprint. It handles the removal of any lingering ENIs and the associated security groups.
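If ENIs still linger after running the script, you can also locate and remove them manually with the AWS CLI. A rough sketch, where <VPC_ID> and <ENI_ID> are placeholders for the blueprint's VPC and the leftover interface:

# List ENIs remaining in the blueprint's VPC
aws ec2 describe-network-interfaces \
  --filters "Name=vpc-id,Values=<VPC_ID>" \
  --query 'NetworkInterfaces[].NetworkInterfaceId' --output text

# Delete a leftover ENI by ID (detach it first if it is still in use)
aws ec2 delete-network-interface --network-interface-id <ENI_ID>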

Error: could not download chart

If you encounter the following error while attempting to download a chart:

│ Error: could not download chart: failed to download "oci://public.ecr.aws/karpenter/karpenter" at version "v0.18.1"
│
│ with module.eks_blueprints_kubernetes_addons.module.karpenter[0].module.helm_addon.helm_release.addon[0],
│ on .terraform/modules/eks_blueprints_kubernetes_addons/modules/kubernetes-addons/helm-addon/main.tf line 1, in resource "helm_release" "addon":
│ 1: resource "helm_release" "addon" {
│

Follow the steps below to resolve the issue:

Issue Description:

The error message indicates that the specified chart could not be downloaded. This can occur due to a bug in Terraform during the installation of Karpenter.

Solution:

To resolve the issue, you can try the following steps:

Authenticate with ECR: Run the following command to authenticate with Amazon ECR Public, where the chart is hosted:

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws

Re-run terraform apply: Execute the terraform apply command again with the --auto-approve flag to reapply the Terraform configuration:

terraform apply --auto-approve

By authenticating with ECR and re-running the terraform apply command, you will ensure that the necessary chart can be downloaded successfully during the installation process.
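If the apply still fails, you can verify outside of Terraform that the chart is reachable. A quick check with the Helm CLI (Helm 3.8 or later is assumed for OCI support; the chart URL and version come from the error above):

# Log the Helm client in to ECR Public and pull the chart directly
aws ecr-public get-login-password --region us-east-1 | helm registry login --username AWS --password-stdin public.ecr.aws
helm pull oci://public.ecr.aws/karpenter/karpenter --version v0.18.1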

Terraform apply/destroy fails to authenticate with the EKS Cluster

ERROR:
╷
│ Error: Get "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth": dial tcp [::1]:80: connect: connection refused
│
│ with module.eks.kubernetes_config_map_v1_data.aws_auth[0],
│ on .terraform/modules/eks/main.tf line 550, in resource "kubernetes_config_map_v1_data" "aws_auth":
│ 550: resource "kubernetes_config_map_v1_data" "aws_auth" {
│
╵

Solution: In this situation Terraform is unable to refresh its data resources and authenticate with the EKS cluster. See the discussion here

Try this approach first, using the exec plugin in the Kubernetes provider:

provider "kubernetes" {
host = module.eks_blueprints.eks_cluster_endpoint
cluster_ca_certificate = base64decode(module.eks_blueprints.eks_cluster_certificate_authority_data)

exec {
api_version = "client.authentication.k8s.io/v1beta1"
command = "aws"
args = ["eks", "get-token", "--cluster-name", module.eks_blueprints.eks_cluster_id]
}
}
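
You can also confirm that the token command used by the exec plugin works outside of Terraform. A quick check, where <EKS_CLUSTER_NAME> and <CLUSTER_REGION> are placeholders for your cluster:

# Should print a short-lived authentication token for the cluster
aws eks get-token --cluster-name <EKS_CLUSTER_NAME> --region <CLUSTER_REGION>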


If the issue still persists even after the above change, you can use the alternative approach of a local kubeconfig file. NOTE: This approach might not be ideal for production; it lets you apply/destroy clusters with your local kubeconfig.

  1. Create a local kubeconfig for your cluster:
aws eks update-kubeconfig --name <EKS_CLUSTER_NAME> --region <CLUSTER_REGION>
  2. Update the providers.tf file with the configuration below, using only config_path:
provider "kubernetes" {
config_path = "<HOME_PATH>/.kube/config"
}

provider "helm" {
kubernetes {
config_path = "<HOME_PATH>/.kube/config"
}
}

provider "kubectl" {
config_path = "<HOME_PATH>/.kube/config"
}

EMR Containers Virtual Cluster (dhwtlq9yx34duzq5q3akjac00) delete: unexpected state 'ARRESTED'

If you encounter an error message stating "waiting for EMR Containers Virtual Cluster (xwbc22787q6g1wscfawttzzgb) delete: unexpected state 'ARRESTED', wanted target ''. last error: %!s(nil)", you can follow the steps below to resolve the issue:

Note: Replace <REGION> with the appropriate AWS region where the virtual cluster is located.

  1. Open a terminal or command prompt.
  2. Run the following command to list the virtual clusters in the "ARRESTED" state:
aws emr-containers list-virtual-clusters --region <REGION> --states ARRESTED \
--query 'virtualClusters[0].id' --output text

This command retrieves the ID of the virtual cluster in the "ARRESTED" state.

  3. Run the following command to delete the virtual cluster:
aws emr-containers list-virtual-clusters --region <REGION> --states ARRESTED \
--query 'virtualClusters[0].id' --output text | xargs -I{} aws emr-containers delete-virtual-cluster \
--region <REGION> --id {}

This command pipes the ID of the ARRESTED virtual cluster from the list command directly into the delete command, so there is no need to substitute the ID manually.

By executing these commands, you will be able to delete the virtual cluster that is in the "ARRESTED" state. This should resolve the unexpected state issue and allow you to proceed with further operations.
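If the deletion still fails, you can inspect the virtual cluster's state directly before retrying. A quick check, where <VIRTUAL_CLUSTER_ID> is the ID returned by the list command above:

# Show the current state and details of the virtual cluster
aws emr-containers describe-virtual-cluster --region <REGION> --id <VIRTUAL_CLUSTER_ID>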

Terminating namespace issue

If you encounter the issue where a namespace is stuck in the "Terminating" state and cannot be deleted, you can use the following command to remove the finalizers on the namespace:

Note: Replace <namespace> with the name of the namespace you want to delete.

NAMESPACE=<namespace>
kubectl get namespace $NAMESPACE -o json | sed 's/"kubernetes"//' | kubectl replace --raw "/api/v1/namespaces/$NAMESPACE/finalize" -f -

This command retrieves the namespace details in JSON format, removes the "kubernetes" finalizer, and performs a replace operation to remove the finalizer from the namespace. This should allow the namespace to complete the termination process and be successfully deleted.
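To see which finalizers are still blocking deletion, you can inspect the namespace before (or after) running the command above:

# Print the finalizers currently set on the namespace
kubectl get namespace $NAMESPACE -o jsonpath='{.spec.finalizers}'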

Please ensure that you have the necessary permissions to perform this operation. If you continue to experience issues or require further assistance, please reach out to our support team for additional guidance and troubleshooting steps.