
EMR on EKS with CDK blueprint

Introduction

In this post, we will learn how to use the EMR on EKS AddOn and Teams in the cdk-eks-blueprints library to deploy an infrastructure on EKS for submitting Spark jobs. The cdk-eks-blueprints library allows you to deploy an EKS cluster and enable it for use with the EMR on EKS service with minimal setup. The architecture below shows a conceptual view of the infrastructure you will deploy through this blueprint.

EMR on EKS CDK

Deploying the Solution

In this example, you will provision the following:

  • Creates an EKS cluster control plane with a public endpoint (for demo purposes only)
  • Two managed node groups
    • Core node group spanning 3 AZs for running system-critical pods, e.g., Cluster Autoscaler, CoreDNS, logging, etc.
    • Spark node group in a single AZ for running Spark jobs
  • Enables EMR on EKS and creates one data team (emr-data-team-a)
    • Creates a new namespace for the team
    • Creates a Kubernetes role and role binding (emr-containers user) for the above namespace
    • Creates a new IAM role used as the team's job execution role
    • Updates the aws-auth config map with the emr-containers user and the AWSServiceRoleForAmazonEMRContainers role
    • Creates a trust relationship between the job execution role and the identity of the EMR managed service account
  • EMR Virtual Cluster for emr-data-team-a
  • IAM policy for emr-data-team-a
  • Deploys the following Kubernetes add-ons (see the sketch after this list)
    • Managed add-ons
      • VPC CNI, CoreDNS, KubeProxy, AWS EBS CSI Driver
    • Self-managed add-ons
      • Metrics Server with HA, Cluster Autoscaler, Cert Manager and AWS Load Balancer Controller
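
For reference, the add-ons above correspond to add-on classes in the cdk-eks-blueprints library. The snippet below is only a sketch of how such a list is typically declared with the @aws-quickstart/eks-blueprints package (default options assumed); the actual wiring for this blueprint lives in lib/emr-eks-blueprint-stack.ts and may differ.

import * as blueprints from '@aws-quickstart/eks-blueprints';

// Illustrative add-on list; the stack may configure these differently.
const addOns: Array<blueprints.ClusterAddOn> = [
  new blueprints.addons.VpcCniAddOn(),
  new blueprints.addons.CoreDnsAddOn(),
  new blueprints.addons.KubeProxyAddOn(),
  new blueprints.addons.EbsCsiDriverAddOn(),
  new blueprints.addons.MetricsServerAddOn(),
  new blueprints.addons.ClusterAutoScalerAddOn(),
  new blueprints.addons.CertManagerAddOn(),
  new blueprints.addons.AwsLoadBalancerControllerAddOn(),
  new blueprints.addons.EmrEksAddOn(),
];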

This blueprint can also take an EKS cluster that you have defined using the cdk-eks-blueprints library.

Prerequisites

Ensure that you have installed the following tools on your machine.

  1. aws cli
  2. kubectl
  3. CDK

NOTE: You need an AWS account and a region that have been bootstrapped with AWS CDK.
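
If your account and region have not been bootstrapped yet, you can do so with the CDK CLI (the account ID and region below are placeholders):

cdk bootstrap aws://<YOUR-ACCOUNT-ID>/<YOUR-REGION> --profile YOUR-AWS-PROFILE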

Customize

The entry point for this CDK blueprint is bin/emr-eks.ts, which instantiates a stack defined in lib/emr-eks-blueprint-stack.ts. This stack must be provided with a VPC, a list of EMR on EKS team definitions, and the role that will administer the EKS cluster. It can also optionally take an EKS cluster defined through the cdk-eks-blueprints library and the EKS cluster name.

The properties that are passed to the EMR on EKS blueprint stack are defined as follows:

export interface EmrEksBlueprintProps extends StackProps {
  clusterVpc: IVpc,
  clusterAdminRoleArn: ArnPrincipal,
  dataTeams: EmrEksTeamProps[],
  eksClusterName?: string, // Default: eksBlueprintCluster
  eksCluster?: GenericClusterProvider,
}

In this example, we define a VPC in lib/vpc.ts and instantiate it in bin/emr-eks.ts. We also define a team called emr-data-team-a, which has an execution role called myBlueprintExecRole. By default, the blueprint deploys an EKS cluster with the managed node groups described in the section Deploying the Solution.
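
As an illustration only, the sketch below shows how bin/emr-eks.ts might wire these pieces together. The construct name EmrEksBlueprint, the VpcStack wrapper, the placeholder ARNs, and the team fields are assumptions (the team shape comes from EmrEksTeamProps in the cdk-eks-blueprints library); check the repository for the exact definitions.

import { App } from 'aws-cdk-lib';
import { ArnPrincipal } from 'aws-cdk-lib/aws-iam';
import { EmrEksBlueprint } from '../lib/emr-eks-blueprint-stack'; // assumed export name
import { VpcStack } from '../lib/vpc';                            // assumed export name

const app = new App();

// Network stack assumed to expose the VPC it creates (see lib/vpc.ts).
const network = new VpcStack(app, 'emr-eks-vpc');

new EmrEksBlueprint(app, 'emr-eks-blueprint', {
  clusterVpc: network.vpc,
  // Replace with the ARN of the role that will administer the EKS cluster.
  clusterAdminRoleArn: new ArnPrincipal('arn:aws:iam::<YOUR-ACCOUNT-ID>:role/<YOUR-ADMIN-ROLE>'),
  dataTeams: [{
    name: 'emr-data-team-a',
    virtualClusterName: 'emr-data-team-a',
    virtualClusterNamespace: 'batchjob',
    createNamespace: true,
    executionRoles: [{
      executionRoleName: 'myBlueprintExecRole',
      // plus the IAM policy for this role; see EmrEksTeamProps in
      // @aws-quickstart/eks-blueprints for the exact policy field name
    }],
  }],
  eksClusterName: 'eksBlueprintCluster', // optional, this is the default
});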

Deploy

Before you run the solution, you MUST change the clusterAdminRoleArn of the props object in lib/emr-eks.ts. This role allows you to manage the EKS cluster and must be allowed at least the IAM action eks:AccessKubernetesApi.

Clone the repository

git clone https://github.com/awslabs/data-on-eks.git

Navigate into one of the example directories and run cdk synth

cd analytics/cdk/emr-eks
npm install
cdk synth --profile YOUR-AWS-PROFILE

Deploy the pattern

cdk deploy --all

Enter yes to deploy.

Verify the resources

Let’s verify the resources created by cdk deploy.

Verify the Amazon EKS Cluster

aws eks describe-cluster --name eksBlueprintCluster # Update the cluster name if you supplied your own

Verify the EMR on EKS namespace batchjob and the pod status for Metrics Server and Cluster Autoscaler.

aws eks --region <ENTER_YOUR_REGION> update-kubeconfig --name eksBlueprintCluster # Creates the kubeconfig file to authenticate with the EKS cluster. Update the cluster name if you supplied your own

kubectl get nodes # Output shows the EKS Managed Node group nodes

kubectl get ns | grep batchjob # Output shows batchjob

kubectl get pods --namespace=kube-system | grep metrics-server # Output shows Metrics Server pod

kubectl get pods --namespace=kube-system | grep cluster-autoscaler # Output shows Cluster Autoscaler pod

Execute Sample Spark job on EMR Virtual Cluster

Execute a sample Spark job using the AWS CLI command below.

Once you deploy the blueprint, the EMR Virtual Cluster ID is included in the CDK output. You can use this ID, together with the execution role for which you supplied a policy, to submit jobs. Below is an example of a job you can submit with the AWS CLI.

export EMR_ROLE_ARN=arn:aws:iam::<YOUR-ACCOUNT-ID>:role/myBlueprintExecRole

aws emr-containers start-job-run \
  --virtual-cluster-id=<VIRTUAL-CLUSTER-ID-IN-CDK-OUTPUT> \
  --name=pi-2 \
  --execution-role-arn=$EMR_ROLE_ARN \
  --release-label=emr-6.8.0-latest \
  --job-driver='{
    "sparkSubmitJobDriver": {
      "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.executor.memory=2G --conf spark.executor.cores=1 --conf spark.driver.cores=1 --conf spark.kubernetes.node.selector.app=spark"
    }
  }'

Verify the job execution

kubectl get pods --namespace=batchjob -w
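
You can also track the job run from the EMR console or with the AWS CLI, for example:

aws emr-containers list-job-runs --virtual-cluster-id <VIRTUAL-CLUSTER-ID-IN-CDK-OUTPUT>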

Cleanup

To clean up your environment, run the command below. This will destroy the Kubernetes add-ons, the EKS cluster with its node groups, and the VPC.

cdk destroy --all
Caution

To avoid unwanted charges to your AWS account, delete all the AWS resources created during this deployment.