
Essential Kubeflow
Engineering ML Workflows on Kubernetes
- 1st Edition - May 1, 2026
- Latest edition
- Authors: Prashanth Josyula, Sonika Arora, Anant Kumar, Jivitesh Poojary
- Language: English
- Paperback ISBN: 978-0-443-45254-3
- eBook ISBN: 978-0-443-45255-0
Essential Kubeflow: Engineering ML Workflows on Kubernetes equips readers with the tools to transform ML workflows from experimental notebooks to production-ready platforms with t…

- Provides readers with a comprehensive step-by-step guide to building reliable ML pipelines with automated workflows, testing, and deployment using Kubeflow's pipeline components (a minimal illustrative sketch follows this list).
- Includes clear strategies for monitoring ML workloads, managing resources, handling multi-user environments, and maintaining production platforms at scale.
- Presents proven solutions and architectural patterns drawn from actual production deployments, showing readers how to avoid common pitfalls and accelerate ML initiatives.
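To give a concrete flavor of the pipeline components referenced above, here is a minimal, hypothetical sketch using the Kubeflow Pipelines v2 SDK (kfp); the component and pipeline names are illustrative and are not taken from the book.

```python
# Minimal, illustrative Kubeflow Pipelines v2 sketch (names are hypothetical).
from kfp import dsl, compiler

@dsl.component(base_image="python:3.11")
def preprocess(message: str) -> str:
    """Toy preprocessing component: returns an upper-cased copy of the input."""
    return message.upper()

@dsl.component(base_image="python:3.11")
def train(data: str) -> str:
    """Toy training component: stands in for a real training step."""
    return f"model trained on: {data}"

@dsl.pipeline(name="hello-kubeflow")
def hello_pipeline(message: str = "hello kubeflow"):
    # Chain the two components; the output of one feeds the next.
    prep_task = preprocess(message=message)
    train(data=prep_task.output)

if __name__ == "__main__":
    # Compile to the IR YAML that a Kubeflow Pipelines instance can run.
    compiler.Compiler().compile(hello_pipeline, "hello_pipeline.yaml")
```

Chapters 3 and 4 expand patterns like this into reusable components, parameterized runs, and CI/CD-integrated pipelines.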
1. Kubernetes Essentials for ML Engineers
1.2. Container Fundamentals and Docker Basics
1.3. Kubernetes Architecture Overview
1.4. Key Concepts: Pods, Deployments, Services
1.5. Resource Management and Scheduling
1.6. StatefulSets and Persistent Storage
1.7. Networking and Service Discovery
2. Getting Started with Kubeflow
2.1. Understanding ML Platforms and MLOps
2.2. Kubeflow Architecture and Components
2.3. Installation and Environment Setup
2.4. Multi-user Management Basics
2.5. Platform Security Fundamentals
Part II: Building ML Workflows
3. Understanding Kubeflow Pipelines
3.1. Pipeline Architecture Fundamentals
3.2. The Pipeline SDK and DSL
3.3. Building Your First Pipeline
3.4. Pipeline Components and Artifacts
3.5. Pipeline Execution and Debugging
4. Advanced Pipeline Development
4.1. Designing Reusable Components
4.2. Managing Pipeline Parameters
4.3. Error Handling Strategies
4.4. Pipeline Versioning and Storage
4.5. CI/CD Integration Patterns
5. Experimentation with Notebooks
5.1. JupyterHub in Kubeflow
5.2. Managing Notebook Servers
5.3. Resource Allocation and Quotas
5.4. Persistent Storage Configuration
5.5. From Notebooks to Pipelines
Part III: Model Development and Training
6. Training at Scale
6.1. Understanding Training Operators
6.2. Distributed Training Basics
6.3. TensorFlow Training on Kubeflow
6.4. PyTorch Training on Kubeflow
6.5. Resource Management for Training
7. Hyperparameter Tuning with Katib
7.1. Experiment Configuration
7.2. Defining Search Spaces
7.3. Understanding Search Algorithms
7.4. Managing Training Trials
7.5. Analyzing Experiment Results
Part IV: Model Deployment
8. Serving Models with KServe
8.1. KServe Architecture Overview
8.2. Model Server Deployment
8.3. Inference Service Configuration
8.4. Model Updates and Versioning
8.5. Performance Monitoring
9. Production Operations
9.1. Monitoring ML Workloads
9.2. Resource Management
9.3. Security Best Practices
9.4. Platform Maintenance
9.5. Troubleshooting Guide
Part V: Enterprise Implementation
10. Production Best Practices
10.1. Building Enterprise ML Platforms
10.2. Multi-tenant Architecture Design
10.3. Scaling Strategies and Patterns
10.4. Cost Optimization Techniques
10.5. Team Collaboration Models
11. Platform Integration and Ecosystem
11.1. Integrating with Data Lakes
11.2. CI/CD Pipeline Integration
11.3. Monitoring Stack Integration
11.4. External Model Registry Systems
11.5. Cloud Provider Integrations
Prashanth Josyula
Prashanth Josyula is a seasoned IT professional based in San Francisco, USA, with over 16 years of industry experience spanning enterprise software engineering, artificial intelligence, and cloud-native infrastructure. He specializes in AI/ML systems, Kubernetes, MLOps, and service mesh technologies, and has consistently contributed to building intelligent, scalable, and resilient platforms that power next-generation applications.
In his current role as a Principal Member of Technical Staff (PMTS) at Salesforce, Prashanth is at the forefront of architecting cloud-native solutions that seamlessly integrate AI-driven automation, real-time data processing, and large-scale distributed systems. His work spans platform services, ML infrastructure, and enterprise-grade deployments, enabling cross-functional teams to build, deploy, and manage intelligent applications with speed and reliability.
Prashanth is also an active thought leader and speaker, regularly presenting at industry-leading conferences. His talks focus on advanced topics such as ML/AI Ops, Retrieval-Augmented Generation (RAG), AI Agents, Responsible AI, and Time-Series Forecasting, where he shares practical insights drawn from real-world enterprise experience. With a strong passion for both innovation and knowledge-sharing, he combines deep technical expertise with a commitment to advancing the field through mentorship, public speaking, authorship, and contributions to research and open-source communities.
Sonika Arora
Sonika Arora is a seasoned software engineer with over a decade of experience building scalable, resilient, and intelligent distributed systems. She currently serves as a Lead Member of Technical Staff at Salesforce, where she architects and delivers complex microservice-based platforms that power machine learning workflows at scale. At Salesforce, Sonika has played a pivotal role in designing orchestration platforms that seamlessly integrate ML compute services such as training, prediction, and modeling jobs. By leveraging technologies like AWS Lambda, DynamoDB Streams, Kubernetes, and Terraform, she has led initiatives that ensure concurrency, reliability, and observability across distributed architectures.
Prior to Salesforce, she made significant contributions at PayPal, where she helped build real-time monitoring systems and QR code payment infrastructure, delivering solutions optimized for scale, fault tolerance, and performance. Sonika's strength lies in fusing backend engineering with system-level thinking to create cloud-native systems enriched with automation, monitoring, and intelligent orchestration. She remains passionate about advancing AI-powered platforms, stream processing, and high-throughput infrastructure.
Anant Kumar
Anant Kumar is a seasoned technology leader at Salesforce, where he leads the Data Lake team within the Einstein AI Platform. With over 20 years of experience in distributed systems, AI/ML infrastructure, and cloud-native architectures, he architects enterprise-scale Apache Spark services and data lake solutions that power Salesforce’s predictive and generative AI.
His technical expertise includes building scalable Spark services on Kubernetes, developing cloud-native data pipelines processing billions of events, and designing secure, high-performance infrastructure for AI/ML workloads. He holds multiple U.S. patents in network visibility and security, and his innovations have earned him industry recognition.
Anant is a passionate advocate for responsible AI, contributing to IEEE conferences, peer-reviewed journals, and academic reviews. He mentors emerging researchers and students through nonprofit organizations and serves as a technical reviewer for publishers such as O'Reilly, Packt, and Manning, as well as for the journal PLOS ONE.
He is an alumnus of the Stanford Graduate School of Business Ignite Program and actively supports interdisciplinary collaboration across AI, cloud infrastructure, and data science. Recognized for his leadership, mentorship, and commitment to ethical innovation, Anant continues to shape the future of enterprise AI platforms.
Jivitesh Poojary
Jivitesh Poojary works as a Lead Machine Learning Engineer at a leading Fortune 100 telecom organization. He has over 11 years of experience building large-scale AI/ML systems for enterprises, and his cross-functional skills in data science, data engineering, and DevOps enable him to look at AI problems holistically. Beyond technical skills, he collaborates across departments to align ML strategies with business goals, advocate for data-driven decision-making, and establish robust MLOps practices. He holds a Master's in Data Science from Indiana University Bloomington and is active in the AI/ML community, writing research papers, giving conference talks, and contributing to open-source projects.