☁️ Azure MLOps in Production

Running ML in production is less about choosing a model than making every change safe to ship and easy to explain. This is the operating manual for how the CellarTracker team used Azure Machine Learning for a wine platform with more than a million members, across three real models: recommendations, drinkability scoring, and a personalized "will I like this wine" score. The through line is simple: MLOps is a systems discipline, not a tooling checklist. Data, features, training, evaluation, deployment, and monitoring form one versioned, reproducible, auditable machine; Azure ML was the control plane that held it together.

Thesis: What MLOps Actually Is

Before any Azure service, the definition has to be right. MLOps is the engineering discipline that makes machine learning reproducible, governable, and continuously improvable in production. The deliverable is never a single model file; it is a promotable artifact with lineage and a system that can retrain it, evaluate it, ship it safely, and roll it back.

MLOps is not "DevOps with notebooks." It is the discipline of treating data, features, training, evaluation, deployment, and monitoring as one versioned, reproducible, and auditable system.
Azure Machine Learning is the control plane that unifies those concerns: one workspace ties together data assets, compute, environments, jobs, the model registry, managed endpoints, and monitoring under a single identity and RBAC boundary.
The unit of progress is not a model file. It is a promotable artifact with lineage: which data version, which environment image, which component graph, which metrics, and which approval produced the thing now serving traffic.
A model that cannot be retrained on a schedule, rolled back in one command, and explained to a stakeholder is a liability wearing the costume of an asset.
We judged the platform by a single question, repeated weekly: can we prove that the change we shipped made the product better, and can we undo it before lunch if it did not?
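The idea of a promotable artifact with lineage can be made concrete with a small record type. What follows is a minimal sketch, not the team's actual schema: the class `LineageRecord`, its field names, and the helper `is_promotable` are all hypothetical, chosen only to mirror the lineage facts listed above (data version, environment image, component graph, metrics, approval).

```python
from dataclasses import dataclass, field


@dataclass
class LineageRecord:
    """Hypothetical lineage record for one candidate model version."""
    model_name: str
    model_version: str
    data_version: str          # which data version the model was trained on
    environment_image: str     # which pinned environment image ran the job
    pipeline_version: str      # which component graph produced the model
    metrics: dict = field(default_factory=dict)
    approved_by: str = ""      # stays empty until an approval is recorded


# Lineage facts that must be non-empty strings (assumed list, for illustration).
REQUIRED_FIELDS = ("data_version", "environment_image", "pipeline_version")


def missing_lineage(record: LineageRecord) -> list:
    """Return the names of the lineage facts that are still missing."""
    missing = [name for name in REQUIRED_FIELDS if not getattr(record, name)]
    if not record.metrics:
        missing.append("metrics")
    if not record.approved_by:
        missing.append("approved_by")
    return missing


def is_promotable(record: LineageRecord) -> bool:
    """A version is promotable only when every lineage fact is present."""
    return not missing_lineage(record)
```

The design choice is that promotion fails closed: a version with an unknown data version or no recorded approval is simply not promotable, whatever its metrics say.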

The MLOps Maturity Model, Applied Honestly

Microsoft publishes an MLOps maturity model from Level 0 (manual notebooks) to Level 4 (fully automated, monitored, self-triggering operations). The professional move is not to claim Level 4 everywhere; it is to state, per model, where it genuinely sits and why. Each of the three CellarTracker models lived at a different, deliberately chosen maturity, matched to its risk and its blast radius.

The point of the maturity model is not to score points. It is to be honest about where each model actually lives.
The drinkability-score model ran at high maturity: scheduled pipelines, automated batch scoring across every cellar, drift baselines, and registry-gated promotion. It was a metronome, not a heroic event.
The recommendations model lived at mid-to-high maturity: automated training and offline evaluation gates, blue/green rollout on a managed endpoint, but a human still read the engagement trend before opening the traffic valve.
The "will I like this wine" model was the most operationally demanding because it shows a calibrated percentage to a human, so it earned the strictest gates: calibration monitoring, segment fairness checks, and champion/challenger before any cutover.
Naming the maturity level per model is itself an act of engineering maturity. Pretending every model is Level 4 is how teams ship silent regressions with confidence.
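The difference between "automated gates" and "a human still opens the traffic valve" can be sketched as a tiny rollout policy. On a managed online endpoint, traffic is expressed as a map from deployment name to percentage; everything else here is an assumption for illustration: the deployment names `blue` and `green`, the ramp steps in `RAMP`, and the function `next_split` are hypothetical, not the team's actual release tooling.

```python
# Hypothetical ramp steps for the share of traffic sent to the new deployment.
RAMP = [0, 10, 50, 100]


def traffic_split(green_percent: int) -> dict:
    """Return a blue/green traffic map whose percentages sum to 100."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    return {"blue": 100 - green_percent, "green": green_percent}


def next_split(current_green: int, gates_passed: bool,
               human_approved: bool, ramp=RAMP) -> dict:
    """Decide the next traffic map for one rollout step.

    - Failed automated gates: roll everything back to blue.
    - Gates passed but no human approval: hold at the current split.
    - Both present: advance to the next ramp step, if there is one.
    """
    if not gates_passed:
        return traffic_split(0)
    if not human_approved:
        return traffic_split(current_green)
    larger = [step for step in ramp if step > current_green]
    return traffic_split(larger[0] if larger else current_green)
```

A model held to a lower-risk policy could pass `human_approved=True` unconditionally; a model with stricter gates would fold its extra checks into `gates_passed` before any step is taken.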

The Azure MLOps Reference Architecture We Operated

These are the platform building blocks, each a versioned, governed asset rather than a loose script. The pattern is deliberately repetitive: every important thing has a version, an owner, and a lineage. That is what turns a pile of models into an operable system.

Azure ML Workspace

The control plane and identity boundary

SDK v2 and CLI v2

Authoring in Python, promoting in YAML

MLflow Tracking and Model Packaging

Open standards under Azure orchestration

Components and Pipeline Jobs

Typed, versioned, reusable steps in a DAG

Compute: Clusters, Spot, and Serverless

Capacity that exists only when a job runs

Environments and ACR

Reproducible, pinned execution contexts

Data Assets, Datastores, and MLTable

Versioned data is half of reproducibility

Managed Feature Store

One feature definition, two materializations

Model Registry and Registries

Promotion as a metadata transition

Managed Online Endpoints

Real-time serving with blue/green built in

Batch Endpoints

Score the entire universe on a schedule

Model Monitoring

Drift, data quality, and outcome degradation

Responsible AI Dashboard

Error analysis, interpretability, fairness

Security and Governance

Identity, RBAC, Key Vault, private networking

CI/CD and Infrastructure-as-Code

GitHub Actions, Azure DevOps, Bicep, Terraform

Azure AI Foundry: The GenAI Half

Prompt flow, LoRA, evaluations, one governance boundary

Three Models in Production

Architecture only earns its keep when it carries real load. These three models made three different demands: the recommender needed hybrid batch-plus-online serving; drinkability needed massive scheduled batch scoring; and the preference model needed low-latency, calibrated inference. One platform, three genuinely different operational shapes.

1. Recommendations Model

The recommendations model answers a deceptively simple member question: "what should I open or buy next?" Underneath, it is a two-stage retrieve-and-rank system. A candidate-generation stage proposes a few hundred plausible wines from a catalog of millions, and a ranking stage orders them for this specific member in this specific moment. Conflating those stages is the classic mistake; separating them is what lets the system be both fast and personalized.
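The separation of the two stages can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production recommender: the dot-product similarity, the in-memory `catalog` dictionary, and the pluggable `score_fn` ranker are all stand-ins chosen to show the shape of the system.

```python
import heapq


def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(member_vec, catalog, k):
    """Stage 1: cheap similarity over the whole catalog, keep the top-k wine ids.

    `catalog` maps wine id -> embedding vector (hypothetical layout).
    """
    return heapq.nlargest(
        k, catalog, key=lambda wine_id: dot(member_vec, catalog[wine_id])
    )


def rank(candidates, score_fn):
    """Stage 2: run the more expensive scorer on the shortlist only."""
    return sorted(candidates, key=score_fn, reverse=True)


def recommend(member_vec, catalog, score_fn, k=200, n=10):
    """Retrieve a shortlist of k candidates, then return the n best after ranking."""
    shortlist = retrieve(member_vec, catalog, k)
    return rank(shortlist, score_fn)[:n]
```

The point of the split is cost: `retrieve` touches every wine but does almost no work per wine, while `score_fn` can be arbitrarily expensive because it only ever sees `k` candidates.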

Data and Features

Structured holdings meet unstructured tasting language

Training Pipeline

Candidate generation, ranking, and gated registration

Serving and Deployment

Batch precompute plus online re-ranking

Monitoring

Prediction drift, feature drift, engagement, fairness

Azure Lessons

What the platform specifically made possible

2. Drinkability Score Model

The drinkability-score model answers "is this bottle in its window right now, and when is its peak?" It is fundamentally a temporal regression over an aging curve: drinkability is not a static label but a function of wine, age, and storage. The engineering challenge is producing physically plausible curves at the scale of every bottle in every member's cellar, recomputed as time marches forward and the wines quietly age in the dark.
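One way to read "physically plausible" is that a bottle's curve should rise to a single peak and then decline, never wobble. The sketch below enforces that shape on raw per-age scores and derives a drinking window from it. It is an illustrative post-processing heuristic, not the team's training-time constraint: the functions `enforce_rise_then_fall` and `drinking_window`, and the idea of a single score threshold defining the window, are assumptions.

```python
def enforce_rise_then_fall(scores):
    """Reshape raw per-age scores into a curve with one peak.

    Left of the peak a running maximum makes the curve non-decreasing;
    right of the peak a running maximum taken from the right makes it
    non-increasing. The peak value itself is left unchanged.
    """
    if not scores:
        return []
    peak = max(range(len(scores)), key=scores.__getitem__)
    shaped = list(scores)
    for i in range(1, peak):
        shaped[i] = max(shaped[i], shaped[i - 1])
    for i in range(len(shaped) - 2, peak, -1):
        shaped[i] = max(shaped[i], shaped[i + 1])
    return shaped


def drinking_window(ages, scores, threshold):
    """Return the first age, peak age, and last age at or above the threshold."""
    shaped = enforce_rise_then_fall(scores)
    in_window = [age for age, s in zip(ages, shaped) if s >= threshold]
    if not in_window:
        return None
    peak_age = ages[max(range(len(shaped)), key=shaped.__getitem__)]
    return {"open_from": in_window[0], "peak": peak_age, "open_until": in_window[-1]}
```

Because the reshaped curve has a single peak, the ages above the threshold always form one contiguous window, which is what makes an answer like "open between these years, best around this one" well defined.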

Data and Features

Aging priors, storage physics, and temporal signals

Training Pipeline

Monotonic constraints and curve calibration

Serving and Deployment

Nightly batch across every cellar

Monitoring

Seasonal drift and late-arriving labels

Azure Lessons

Batch primitives and physical credibility

3. Will I Like This Wine Model

The "will I like this wine" model answers the most personal question on the platform: given this specific member's palate, what is the probability they will love this specific bottle? It is a calibrated, per-member preference model, and it is distinct from the population recommender. The recommender asks "what is good for people like you," while this model asks "what is good for you," and it has to express that as an honest percentage a human will trust.
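"An honest percentage" has a measurable meaning: among the bottles scored near 80%, roughly 80% should actually be liked. A common way to quantify the gap is expected calibration error (ECE), sketched below in plain Python. The equal-width binning and the default of ten bins are assumptions for illustration; the source does not say which calibration metric the team tracked.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between predicted probability and observed rate.

    probs    -- predicted probabilities in [0, 1]
    outcomes -- observed labels, 1 for "liked" and 0 otherwise
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        return 0.0

    # Assign each prediction to an equal-width probability bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_prob - observed_rate)
    return ece
```

A value near zero means the displayed percentages match reality; tracking the same number per member segment over time is one way to turn "calibration drift" into something a release gate can check.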

Data and Features

Personal signals, cold start, and honest negatives

Training Pipeline

Two towers and a calibration layer

Serving and Deployment

Low-latency online scoring with champion/challenger

Monitoring

Calibration drift is the metric that matters

Azure Lessons

The data flywheel and the honest percentage

The Pros of Azure MLOps

An honest assessment starts with where the platform genuinely earns its keep. These are not marketing bullets; they are the properties that changed how the team worked day to day across three production models.

Unified control plane: one workspace ties data, compute, environments, jobs, the registry, endpoints, and monitoring together, so lineage and governance are properties of the system instead of a spreadsheet someone maintains by hand.
MLflow-native and open at the edges: tracking and model packaging follow open standards, so experiment history and model artifacts stay portable even while orchestration is Azure-idiomatic.
Components plus registries give genuine reuse and clean dev-to-prod promotion: the exact validated artifact moves across workspaces and regions without a rebuild.
Managed online and batch endpoints remove the operational toil of hosting models, and blue/green traffic splitting plus mirror traffic make safe rollout a first-class feature, not a homegrown script.
The managed feature store attacks training-serving skew at the root with point-in-time-correct offline features and low-latency online features defined exactly once.
Enterprise security and governance are native: Entra identity, RBAC, managed identities, Key Vault, and private networking are defaults, not integrations you bolt on after an audit finding.
Built-in model monitoring and the Responsible AI dashboard make drift, error analysis, interpretability, and fairness part of the platform rather than a separate stack.
A coherent GenAI story through Azure AI Foundry and prompt flow means LLM and retrieval work lives under the same identity, governance, and evaluation discipline as the classic models.
Real cost controls: scale-to-zero compute clusters, spot and low-priority nodes, and batch endpoints that pay for throughput instead of idle capacity.
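The phrase "point-in-time-correct" in the feature store bullet is worth unpacking, because it is the mechanism that prevents label leakage. The sketch below shows the rule in plain Python: a training row may only see the feature value that was known at the moment of the labeled event. It illustrates the concept only and does not use the managed feature store API; the data layout and the function names `feature_as_of` and `build_training_rows` are hypothetical.

```python
from bisect import bisect_right


def feature_as_of(history, event_time):
    """Return the latest feature value recorded at or before event_time.

    history -- list of (timestamp, value) pairs sorted by timestamp
    Returns None when no value existed yet at event_time.
    """
    timestamps = [ts for ts, _ in history]
    position = bisect_right(timestamps, event_time)
    if position == 0:
        return None
    return history[position - 1][1]


def build_training_rows(label_events, feature_history):
    """Join each labeled event to the feature value that was valid at that time.

    label_events    -- iterable of (member_id, event_time, label)
    feature_history -- dict of member_id -> sorted (timestamp, value) list
    """
    rows = []
    for member_id, event_time, label in label_events:
        value = feature_as_of(feature_history.get(member_id, []), event_time)
        rows.append({
            "member_id": member_id,
            "event_time": event_time,
            "feature": value,
            "label": label,
        })
    return rows
```

A naive join on the latest feature value would hand the model information from after the event; serving can never reproduce that, which is exactly how training-serving skew starts.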

The Cons and Sharp Edges

Credibility comes from naming the friction, not hiding it. A team that cannot tell you where a platform hurts has not shipped on it under real constraints. These are the rough edges we actually hit, and the fact that we shipped anyway is the point.

The SDK and CLI v1-to-v2 migration is a real tax. Mixed-version examples are everywhere online, and adopting v2 idioms wholesale is the only way to stop subtle, version-mismatched breakage.
YAML schema sprawl is steep: components, pipelines, environments, and endpoints each have verbose specifications, and the ramp for a new engineer is non-trivial.
Managed online endpoints can be slow to deploy. Image builds in ACR and cold starts add latency to the iteration loop, and debugging a failed deployment is too often an exercise in log archaeology.
Quota and SKU friction is real: GPU quota requests, regional SKU availability, and vCPU caps can block a Friday plan until a quota increase clears.
Newer surfaces (managed feature store, model monitoring) move fast and occasionally show preview-grade rough edges and shifting APIs, which is the cost of living near the frontier.
Cost can surprise the careless: an idle compute instance, an always-on online endpoint, or accumulating ACR images quietly bill while proving nothing.
The platform is opinionated. Adopting its mental model is productive; fighting its conventions is painful, and "do it our way" is the implicit contract.
Managed virtual networks and private endpoints are powerful but fiddly, and aggressive egress restrictions will break a naive pip install until the networking is done properly.
Vendor gravity is real: MLflow models are portable, but a deeply Azure-idiomatic pipeline, feature store, and endpoint topology are not something you lift to another cloud over a weekend.

Monitoring, Evaluation, and the Release Bar

Deploying a model is the beginning of operations, not the finish line. We tied Azure ML monitoring and the Responsible AI dashboard to the evaluation discipline the team built for CellarChat, so promotion decisions for tabular models and agentic surfaces rested on evidence stakeholders could actually read. Drift and degradation drove retraining triggers and release gates instead of sitting in charts nobody opened. The evaluations dashboard was the live surface where quality trends, version comparisons, and regressions became legible enough to govern releases.

The evaluations dashboard the team built, where model and agent quality trends gated production releases.
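To make "drift drove retraining triggers" concrete, here is a minimal sketch of one widely used drift measure, the population stability index (PSI), combined with a trigger that fires only when drift and a measured quality drop agree. The choice of PSI, both threshold defaults, and the function `should_retrain` are assumptions for illustration; the source does not state which drift statistic or thresholds the team used.

```python
import math


def psi(baseline_counts, current_counts, eps=1e-6):
    """Population stability index between two binned distributions.

    Both arguments are per-bin counts over the same bins. Shares are
    floored at eps so that empty bins do not produce log(0).
    """
    if len(baseline_counts) != len(current_counts):
        raise ValueError("both distributions must use the same bins")
    baseline_total = sum(baseline_counts)
    current_total = sum(current_counts)
    if baseline_total == 0 or current_total == 0:
        raise ValueError("distributions must not be empty")

    score = 0.0
    for b, c in zip(baseline_counts, current_counts):
        b_share = max(b / baseline_total, eps)
        c_share = max(c / current_total, eps)
        score += (c_share - b_share) * math.log(c_share / b_share)
    return score


def should_retrain(psi_value, metric_drop, psi_threshold=0.2, drop_threshold=0.02):
    """Trigger retraining only when drift and a quality drop both cross a threshold.

    The thresholds are illustrative defaults, not recommended values.
    """
    return psi_value >= psi_threshold and metric_drop >= drop_threshold
```

Requiring both signals is a deliberate design choice in this sketch: drift alone can be harmless seasonality, and a metric drop alone can be noise, so neither by itself justifies the cost of a retrain.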

Azure MLOps, As We Would Teach It

These are the questions we use to separate someone who has clicked through the Azure ML portal from someone who has operated real models on it under production pressure. Each answer is the reasoning the team actually applied across the three CellarTracker models, not a recital of service names.

How do you decide between a managed online endpoint and a batch endpoint?

Explain how Azure Machine Learning uses MLflow, and why that matters.

Why build component-based pipelines instead of one training script?

How do you prevent training-serving skew?

How do you execute a safe model rollout on a managed online endpoint?

How do you trigger retraining, and how do you avoid retraining for no reason?

How do you monitor a model whose labels arrive late?

Where does Azure AI Foundry fit relative to Azure Machine Learning?

How do you keep Azure MLOps costs from quietly exploding?

Closing: The Discipline, Not the Dashboard

No single Azure service makes MLOps mature. The value is a coherent system for data, features, training, evaluation, deployment, and monitoring, one that lets a small team operate a hybrid recommender, a scheduled batch regressor, and a low-latency calibrated classifier without inventing a separate operating model for each. Maturity is not reciting service names. It is choosing the right primitive for a model's real shape, naming its maturity honestly, monitoring the signal that matters, and keeping every change reversible. That was the standard we held the platform to, and it is the standard we would bring to yours.
