RYAN ZERNACH

Senior AI Systems Engineer

☁️ Azure MLOps in Production

This is the operating manual for how the CellarTracker team ran production MLOps on Azure Machine Learning for a wine platform with more than a million members. It is organized around three real models we built and operated end to end: a recommendations model, a drinkability-score model, and a personalized "will I like this wine" model. The thesis throughout is that MLOps is a systems discipline, not a tooling checklist: data, features, training, evaluation, deployment, and monitoring are one versioned, reproducible, auditable machine, and Azure ML is the control plane that holds that machine together.

Azure Machine Learning was the control plane for three production models serving CellarTracker, the #1-rated wine app.

Thesis: What MLOps Actually Is

Before any Azure service, the definition has to be right. MLOps is the engineering discipline that makes machine learning reproducible, governable, and continuously improvable in production. The deliverable is never a single model file; it is a promotable artifact with lineage and a system that can retrain it, evaluate it, ship it safely, and roll it back. We held the platform to one weekly question: can we prove the change we shipped improved the product, and can we undo it before lunch if it did not?

  • MLOps is not "DevOps with notebooks." It is the discipline of treating data, features, training, evaluation, deployment, and monitoring as one versioned, reproducible, and auditable system.
  • Azure Machine Learning is the control plane that unifies those concerns: one workspace ties together data assets, compute, environments, jobs, the model registry, managed endpoints, and monitoring under a single identity and RBAC boundary.
  • The unit of progress is not a model file. It is a promotable artifact with lineage: which data version, which environment image, which component graph, which metrics, and which approval produced the thing now serving traffic.
  • A model that cannot be retrained on a schedule, rolled back in one command, and explained to a stakeholder is a liability wearing the costume of an asset.
  • We judged the platform by a single question, repeated weekly: can we prove that the change we shipped made the product better, and can we undo it before lunch if it did not?
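A minimal sketch of what "a promotable artifact with lineage" could look like as a record, assuming a hypothetical `PromotableArtifact` type and `missing_lineage` helper. Neither is an Azure ML class, and the field values are made up for illustration; the fields simply mirror the lineage questions in the bullets above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PromotableArtifact:
    """Hypothetical lineage record for one model version (not an Azure ML type)."""
    model_name: str
    model_version: int
    data_version: str          # which data version trained it
    environment_image: str     # which pinned environment image ran the job
    pipeline_version: str      # which component graph produced it
    metrics: Dict[str, float] = field(default_factory=dict)
    approved_by: Optional[str] = None  # stays None until someone signs off


def missing_lineage(artifact: PromotableArtifact) -> List[str]:
    """Return the lineage fields that are still empty; promotable only if none."""
    missing = []
    for name in ("data_version", "environment_image", "pipeline_version"):
        if not getattr(artifact, name):
            missing.append(name)
    if not artifact.metrics:
        missing.append("metrics")
    if artifact.approved_by is None:
        missing.append("approved_by")
    return missing


# Illustrative values only; none of these identifiers come from the post.
candidate = PromotableArtifact(
    model_name="drinkability-score",
    model_version=7,
    data_version="cellar-holdings:12",
    environment_image="train-env:3",
    pipeline_version="train-pipeline:5",
    metrics={"mae": 0.42},
)
print(missing_lineage(candidate))  # ['approved_by']
```

The design choice worth noting is that approval is part of the record itself, so "who said this could ship" is answered by the same object that answers "what data trained it".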

The MLOps Maturity Model, Applied Honestly

Microsoft publishes an MLOps maturity model from Level 0 (manual notebooks) to Level 4 (fully automated, monitored, self-triggering operations). The professional move is not to claim Level 4 everywhere; it is to state, per model, where it genuinely sits and why. Each of the three CellarTracker models lived at a different, deliberately chosen maturity, matched to its risk and its blast radius.

  • The point of the maturity model is not to score points. It is to be honest about where each model actually lives.
  • The drinkability-score model ran at high maturity: scheduled pipelines, automated batch scoring across every cellar, drift baselines, and registry-gated promotion. It was a metronome, not a heroic event.
  • The recommendations model lived at mid-to-high maturity: automated training and offline evaluation gates, blue/green rollout on a managed endpoint, but a human still read the engagement trend before opening the traffic valve.
  • The "will I like this wine" model was the most operationally demanding because it shows a calibrated percentage to a human, so it earned the strictest gates: calibration monitoring, segment fairness checks, and champion/challenger before any cutover.
  • Naming the maturity level per model is itself an act of engineering maturity. Pretending every model is Level 4 is how teams ship silent regressions with confidence.
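One way to make "different gates per model" concrete is a small table of required checks. This is only a sketch: the gate names and the `outstanding_gates` helper are hypothetical, paraphrased from the bullets above rather than taken from the team's pipeline, and `offline_eval` on every model is an assumption.

```python
# Hypothetical gate names, paraphrased from the bullets above.
REQUIRED_GATES = {
    "drinkability-score": {"offline_eval", "drift_baseline"},
    "recommendations": {"offline_eval", "human_engagement_review"},
    "will-i-like-this-wine": {
        "offline_eval",
        "calibration",
        "segment_fairness",
        "champion_challenger",
    },
}


def outstanding_gates(model_name, passed):
    """Return the gates this model still has to pass, sorted for stable output."""
    return sorted(REQUIRED_GATES[model_name] - set(passed))


print(outstanding_gates("will-i-like-this-wine", ["offline_eval", "calibration"]))
# ['champion_challenger', 'segment_fairness']
```

A model is promotable when its list comes back empty, which keeps the stricter bar for the member-facing percentage explicit in configuration rather than in someone's memory.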

The Azure MLOps Reference Architecture We Operated

These are the building blocks of the platform, each one a versioned, governed asset rather than a loose script. The recurring theme is that everything has a version, an owner, and a lineage, which is what turns a pile of models into an operable system.

Azure ML Workspace

The control plane and identity boundary

SDK v2 and CLI v2

Authoring in Python, promoting in YAML

MLflow Tracking and Model Packaging

Open standards under Azure orchestration

Components and Pipeline Jobs

Typed, versioned, reusable steps in a DAG

Compute: Clusters, Spot, and Serverless

Capacity that exists only when a job runs

Environments and ACR

Reproducible, pinned execution contexts

Data Assets, Datastores, and MLTable

Versioned data is half of reproducibility

Managed Feature Store

One feature definition, two materializations

Model Registry and Registries

Promotion as a metadata transition

Managed Online Endpoints

Real-time serving with blue/green built in

Batch Endpoints

Score the entire universe on a schedule

Model Monitoring

Drift, data quality, and outcome degradation

Responsible AI Dashboard

Error analysis, interpretability, fairness

Security and Governance

Identity, RBAC, Key Vault, private networking

CI/CD and Infrastructure-as-Code

GitHub Actions, Azure DevOps, Bicep, Terraform

Azure AI Foundry: The GenAI Half

Prompt flow, LoRA, evaluations, one governance boundary

Three Models in Production

The architecture above is abstract until it carries real load. These are the three models we operated on it. Each one made a different demand on the platform, which is exactly why they are worth studying together: the recommender wanted hybrid batch-plus-online serving, the drinkability model wanted massive scheduled batch scoring, and the preference model wanted low-latency, calibrated, real-time inference. One platform, three genuinely different operational shapes.

1. Recommendations Model

The recommendations model answers a deceptively simple member question: "what should I open or buy next?" Underneath, it is a two-stage retrieve-and-rank system. A candidate-generation stage proposes a few hundred plausible wines from a catalog of millions, and a ranking stage orders them for this specific member in this specific moment. Conflating those stages is the classic mistake; separating them is what lets the system be both fast and personalized.
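The two-stage split can be sketched in a few lines. This is an illustration under stated assumptions, not the production recommender: the embeddings, the `context_boost` signal, and both function names are hypothetical, and the exhaustive sort in stage one stands in for whatever approximate retrieval a catalog of millions would really need.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(member_vec, catalog: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Stage 1: one cheap similarity score over the whole catalog, keep the top k."""
    by_similarity = sorted(
        catalog, key=lambda wine: dot(member_vec, catalog[wine]), reverse=True
    )
    return by_similarity[:k]


def rank(member_vec, candidates, catalog, context_boost: Dict[str, float]) -> List[str]:
    """Stage 2: a richer score, computed only for the shortlist."""
    def score(wine):
        return dot(member_vec, catalog[wine]) + context_boost.get(wine, 0.0)
    return sorted(candidates, key=score, reverse=True)


# Toy data: two-dimensional embeddings for four wines.
catalog = {"a": [0.9, 0.1], "b": [0.8, 0.5], "c": [0.1, 0.9], "d": [0.5, 0.5]}
member_vec = [1.0, 0.0]

shortlist = generate_candidates(member_vec, catalog, k=2)
ranked = rank(member_vec, shortlist, catalog, context_boost={"b": 0.3})
print(shortlist)  # ['a', 'b']
print(ranked)     # ['b', 'a']
```

The expensive, context-aware score never touches wines "c" and "d", which is the whole reason for keeping the stages separate.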

Data and Features

Structured holdings meet unstructured tasting language

Training Pipeline

Candidate generation, ranking, and gated registration

Serving and Deployment

Batch precompute plus online re-ranking

Monitoring

Prediction drift, feature drift, engagement, fairness

Azure Lessons

What the platform specifically made possible

2. Drinkability Score Model

The drinkability-score model answers "is this bottle in its window right now, and when is its peak?" It is fundamentally a temporal regression over an aging curve: drinkability is not a static label but a function of wine, age, and storage. The engineering challenge is producing physically plausible curves at the scale of every bottle in every member's cellar, recomputed as time marches forward and the wines quietly age in the dark.
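To make "a function of wine, age, and storage" concrete, here is a toy aging curve. It is not the trained model: the piecewise-linear shape, the `storage_factor` multiplier, and the 0.8 threshold are all assumptions chosen so the arithmetic is easy to follow.

```python
def drinkability(age_years, peak_age, fade_age, storage_factor=1.0):
    """Toy score: rises linearly to 1.0 at peak_age, falls linearly to 0.0 at fade_age.

    storage_factor > 1.0 is an assumed way to model warm storage aging a wine faster.
    """
    effective_age = age_years * storage_factor
    if effective_age <= 0 or effective_age >= fade_age:
        return 0.0
    if effective_age <= peak_age:
        return effective_age / peak_age
    return (fade_age - effective_age) / (fade_age - peak_age)


def drinking_window(peak_age, fade_age, threshold=0.8, storage_factor=1.0):
    """Calendar ages (in years) between which the toy score is at or above threshold."""
    start = threshold * peak_age / storage_factor
    end = (fade_age - threshold * (fade_age - peak_age)) / storage_factor
    return (start, end)


print(drinkability(10, peak_age=10, fade_age=20))  # 1.0
print(drinking_window(peak_age=10, fade_age=20))   # (8.0, 12.0)
```

Because the curve only rises before the peak and only falls after it, every bottle gets exactly one contiguous window, which is the kind of physical plausibility the paragraph above asks for.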

Data and Features

Aging priors, storage physics, and temporal signals

Training Pipeline

Monotonic constraints and curve calibration

Serving and Deployment

Nightly batch across every cellar

Monitoring

Seasonal drift and late-arriving labels

Azure Lessons

Batch primitives and physical credibility

3. Will I Like This Wine Model

The "will I like this wine" model answers the most personal question on the platform: given this specific member's palate, what is the probability they will love this specific bottle? It is a calibrated, per-member preference model, and it is distinct from the population recommender. The recommender asks "what is good for people like you," while this model asks "what is good for you," and it has to express that as an honest percentage a human will trust.
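An "honest percentage" is a calibration claim: among predictions near 75%, about 75% should turn out to be wines the member liked. Expected calibration error (ECE) is one standard way to measure that. The post does not say which calibration metric the team used, so treat this as a sketch with an assumed equal-width binning.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between predicted probability and observed hit rate.

    probs: predicted probabilities in [0, 1]; outcomes: 1 if liked, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(p for p, _ in bucket) / len(bucket)
        hit_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_confidence - hit_rate)
    return ece


# Four predictions of 75%: three liked, one not, so the percentage was honest.
print(expected_calibration_error([0.75] * 4, [1, 1, 1, 0]))  # 0.0
# Same predictions, only half liked: the model overclaimed by 25 points.
print(expected_calibration_error([0.75] * 4, [1, 1, 0, 0]))  # 0.25
```

A ranking metric would score both cases the same, since the order of the predictions did not change. That is why a model that shows a number to a human needs a calibration metric alongside it.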

Data and Features

Personal signals, cold start, and honest negatives

Training Pipeline

Two towers and a calibration layer

Serving and Deployment

Low-latency online scoring with champion/challenger

Monitoring

Calibration drift is the metric that matters

Azure Lessons

The data flywheel and the honest percentage

The Pros of Azure MLOps

An honest assessment starts with where the platform genuinely earns its keep. These are not marketing bullets; they are the properties that changed how the team worked day to day across three production models.

  • Unified control plane: one workspace ties data, compute, environments, jobs, the registry, endpoints, and monitoring together, so lineage and governance are properties of the system instead of a spreadsheet someone maintains by hand.
  • MLflow-native and open at the edges: tracking and model packaging follow open standards, so experiment history and model artifacts stay portable even while orchestration is Azure-idiomatic.
  • Components plus registries give genuine reuse and clean dev-to-prod promotion: the exact validated artifact moves across workspaces and regions without a rebuild.
  • Managed online and batch endpoints remove the operational toil of hosting models, and blue/green traffic splitting plus mirror traffic make safe rollout a first-class feature, not a homegrown script.
  • The managed feature store attacks training-serving skew at the root with point-in-time-correct offline features and low-latency online features defined exactly once.
  • Enterprise security and governance are native: Entra identity, RBAC, managed identities, Key Vault, and private networking are defaults, not integrations you bolt on after an audit finding.
  • Built-in model monitoring and the Responsible AI dashboard make drift, error analysis, interpretability, and fairness part of the platform rather than a separate stack.
  • A coherent GenAI story through Azure AI Foundry and prompt flow means LLM and retrieval work lives under the same identity, governance, and evaluation discipline as the classic models.
  • Real cost controls: scale-to-zero compute clusters, spot and low-priority nodes, and batch endpoints that pay for throughput instead of idle capacity.
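The blue/green rollout mentioned above reduces to a small decision rule: step traffic toward the new deployment while it stays healthy, and send everything back the moment it does not. The sketch below is plain Python, not an Azure SDK call. The ramp percentages and the `next_traffic` helper are hypothetical, and the returned mapping of deployment name to percentage is only meant to resemble how an endpoint's traffic split is expressed.

```python
RAMP_STEPS = (0, 10, 50, 100)  # hypothetical green-traffic percentages


def next_traffic(green_pct, green_healthy, steps=RAMP_STEPS):
    """Return the next blue/green split given the current green share and its health."""
    if not green_healthy:
        new_green = 0  # roll back: all traffic returns to blue
    else:
        higher = [step for step in steps if step > green_pct]
        new_green = higher[0] if higher else 100
    return {"blue": 100 - new_green, "green": new_green}


print(next_traffic(0, green_healthy=True))    # {'blue': 90, 'green': 10}
print(next_traffic(10, green_healthy=True))   # {'blue': 50, 'green': 50}
print(next_traffic(50, green_healthy=False))  # {'blue': 100, 'green': 0}
```

Keeping the old deployment alive until green reaches 100% is what makes the rollback a traffic change rather than a redeploy.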

The Cons and Sharp Edges

Credibility comes from naming the friction, not hiding it. A team that cannot tell you where a platform hurts has not shipped on it under real constraints. These are the rough edges we actually hit, and the fact that we shipped anyway is the point.

  • The SDK and CLI v1-to-v2 migration is a real tax. Mixed-version examples are everywhere online, and adopting v2 idioms wholesale is the only way to stop subtle, version-mismatched breakage.
  • YAML schema sprawl is steep: components, pipelines, environments, and endpoints each have verbose specifications, and the ramp for a new engineer is non-trivial.
  • Managed online endpoints can be slow to deploy. Image builds in ACR and cold starts add latency to the iteration loop, and debugging a failed deployment is too often an exercise in log archaeology.
  • Quota and SKU friction is real: GPU quota requests, regional SKU availability, and vCPU caps can block a Friday plan until a quota increase clears.
  • Newer surfaces (managed feature store, model monitoring) move fast and occasionally show preview-grade rough edges and shifting APIs, which is the cost of living near the frontier.
  • Cost can surprise the careless: an idle compute instance, an always-on online endpoint, or accumulating ACR images quietly bill while proving nothing.
  • The platform is opinionated. Adopting its mental model is productive; fighting its conventions is painful, and "do it our way" is the implicit contract.
  • Managed virtual networks and private endpoints are powerful but fiddly, and aggressive egress restrictions will break a naive pip install until the networking is done properly.
  • Vendor gravity is real: MLflow models are portable, but a deeply Azure-idiomatic pipeline, feature store, and endpoint topology are not something you lift to another cloud over a weekend.

Monitoring, Evaluation, and the Release Bar

Operating a model is not the deployment; it is everything after it. We tied Azure ML model monitoring and the Responsible AI dashboard to the same evaluation discipline the team built for CellarChat, so promotion decisions for both the tabular models and the agentic surfaces were made against evidence on a dashboard stakeholders could actually read. Drift and degradation fed retraining triggers and release gates rather than scrolling past as untouched charts. The evaluations dashboard the team designed and built was the operational home where quality trends, version comparisons, and regressions became legible enough to gate releases on.

The evaluations dashboard the team built, where model and agent quality trends gated production releases.
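For drift to feed a retraining trigger, it first has to be reduced to a number with a threshold. The population stability index (PSI) is one common drift statistic. The post does not say which statistic the team used, so this is a sketch: the bin edges, the 0.2 threshold, and both function names are assumptions, and the inputs are assumed to be non-empty.

```python
import math


def population_stability_index(baseline, current, edges, eps=1e-6):
    """PSI between two samples of one feature, binned at the given sorted edges."""
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # number of edges at or below v
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    b, c = fractions(baseline), fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


def should_retrain(psi, threshold=0.2):
    """Assumed rule: trigger only when drift crosses the threshold."""
    return psi >= threshold


baseline = [1, 2, 3, 4]
same = population_stability_index(baseline, [1, 2, 3, 4], edges=[2.5, 4.5])
shifted = population_stability_index(baseline, [3, 4, 5, 6], edges=[2.5, 4.5])
print(same, should_retrain(same))        # 0.0 False
print(should_retrain(shifted))           # True
```

The threshold is what keeps a schedule from retraining for no reason: an unchanged distribution scores zero and triggers nothing.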

Azure MLOps, As We Would Teach It

These are the questions we use to separate someone who has clicked through the Azure ML portal from someone who has operated real models on it under production pressure. Each answer is the reasoning the team actually applied across the three CellarTracker models, not a recital of service names.

How do you decide between a managed online endpoint and a batch endpoint?

Explain how Azure Machine Learning uses MLflow, and why that matters.

Why build component-based pipelines instead of one training script?

How do you prevent training-serving skew?

How do you execute a safe model rollout on a managed online endpoint?

How do you trigger retraining, and how do you avoid retraining for no reason?

How do you monitor a model whose labels arrive late?

Where does Azure AI Foundry fit relative to Azure Machine Learning?

How do you keep Azure MLOps costs from quietly exploding?

Closing: The Discipline, Not the Dashboard

Azure MLOps is not impressive because of any single service. It is valuable because it lets a small team treat data, features, training, evaluation, deployment, and monitoring as one versioned, governable system, and then operate genuinely different models (a hybrid recommender, a massive scheduled batch regressor, and a low-latency calibrated classifier) on one coherent platform. The mark of a mature team is not naming the services; it is choosing the right primitive for each model's real shape, naming the maturity honestly, monitoring the signal that actually matters, and keeping every change reversible. That is the standard we held the platform to, and it is the standard we would bring to yours.
