AI/ML on EKS | 🌥️ The GitOps Platform for Data Analytics & AI/ML 💎

📄️ Introduction

Running AI/ML platforms on Kubernetes can greatly simplify and automate the deployment, scaling, and management of these complex applications. A number of popular tools and technologies have emerged to support this use case, including TensorFlow, PyTorch, Ray, and MLflow.

📄️ Ray on EKS

This blueprint is experimental and should only be used for proofs of concept.

📄️ JupyterHub on EKS

Note: We are actively enhancing this blueprint with additional functionality to make it more enterprise-ready.

📄️ EMR NVIDIA Spark-RAPIDS

The NVIDIA RAPIDS Accelerator for Apache Spark builds on NVIDIA CUDA®, a parallel computing platform for accelerating computation on NVIDIA's GPU architecture. RAPIDS, a project developed by NVIDIA, is a suite of open-source libraries built on CUDA that enables GPU-accelerated data science workflows.

📄️ Trainium on EKS

AWS Trainium is an ML accelerator built for high-performance deep learning (DL) training. Trn1 instances, powered by AWS Trainium chips, are purpose-built for DL training of 100B+ parameter models. They are designed specifically for training popular Natural Language Processing (NLP) models on AWS and offer up to 50% cost savings compared to GPU-based EC2 instances. This cost efficiency makes them an attractive option for data scientists and ML practitioners who want to reduce training costs without compromising performance.