📄️ Introduction
Running AI/ML platforms on Kubernetes can greatly simplify and automate the deployment, scaling, and management of these complex applications. A number of popular tools and technologies have emerged to support this use case, including TensorFlow, PyTorch, Ray, and MLflow.
📄️ Ray on EKS
This blueprint is experimental and should be used only for proofs of concept.
📄️ JupyterHub on EKS
Note: We are actively working on enhancing this blueprint with additional functionality to make it more enterprise-ready.
📄️ EMR NVIDIA Spark-RAPIDS
The NVIDIA RAPIDS Accelerator for Apache Spark builds on NVIDIA CUDA®, a parallel computing platform for accelerating computation on NVIDIA GPUs. RAPIDS, a project developed by NVIDIA, is a suite of open-source libraries built on CUDA that enables GPU-accelerated data science workflows.
📄️ Trainium on EKS
AWS Trainium is an ML accelerator built for high-performance deep learning (DL) training. Trn1 instances, powered by AWS Trainium chips, are purpose-built for training models with 100B+ parameters. They target popular Natural Language Processing (NLP) models on AWS and offer up to 50% cost savings compared to GPU-based EC2 instances. This cost efficiency makes them an attractive option for data scientists and ML practitioners who want to lower training costs without compromising performance.