🍷 CellarChat™ RAG Agent for 1M+ Users

I built CellarChat™, an AI wine assistant for CellarTracker's 1M+ member platform, using RAG, agents, OpenAPI tools, and LLM evals on a stack of Python, TypeScript, React Native, Redis, and PGVector. I focused on making CellarChat more useful to a member than simply asking ChatGPT would be. I also designed and shipped the internal React Native evaluations dashboard that product and leadership used to compare model versions, catch regressions, and gate releases with evidence.

CellarChat in motion

CellarChat experience showcasing conversational context, multi-step tool calls, and the product feel that I shipped to our global community. The animation also nods to the computer vision and optical character recognition (OCR) pipelines powered by Tesseract, which keep label parsing grounded in real wine data.

CellarChat animation highlighting AI conversations, workflows, and response orchestration.

Development Challenges & Solutions

The core challenge was building an assistant that felt genuinely useful instead of generic. Members wanted personalized recommendations, not broad wine advice, which meant combining structured collection state with unstructured tasting language and returning grounded responses mapped to real actions. Shipping CellarChat required solving several production constraints that were easy to miss in demos. Early model context windows were much smaller than they are now, so retrieval and orchestration quality had to carry real load. I also needed a way to prove quality movement to product and executive stakeholders as the system changed week to week. I built the evaluations dashboard specifically so that proof was visible, repeatable, and tied to release decisions.

Context window limits: at the time, the practical ceiling was around 150k tokens, so I used embeddings and retrieval strategies to compress and prioritize relevant context instead of trying to pass everything directly. This reduced token cost and fit critical context into a single request window.
Evaluation discipline for investment decisions: leadership wanted to know whether weekly system changes were improving or regressing behavior, so I implemented CI/CD-connected evals, wired them into a React Native quality-ops dashboard I owned end to end, and used those trend lines to make go or no-go release calls with evidence instead of anecdote.
Deterministic math for agentic workflows: for questions like bottle counts and inventory arithmetic, I extended tool-calling paths with math-capable tools so the agent could compute reliably instead of depending on non-deterministic LLM-only calculations.
"Mendocino" can mean region, subregion, or appellation depending on user intent, which was causing errors in retrieving accurate wines. Therefore, I generated thousands of synthetic conversation examples that taught disambiguation behavior, then fine-tuned with Azure AI Foundry LoRA on an OpenAI model to improve how the assistant asked clarifying questions instead of assuming.

Technical Approach: Evaluations & Observability

I owned evaluations and observability end to end: the golden-set methodology, the CI/CD eval pipeline, and the React Native dashboard product teams used every week. The goal was to answer one leadership question with data, not opinion: were the AI changes I shipped improving the product or regressing it? I designed a golden set made of both canonical user prompts and golden-set user accounts so each run tested the model against realistic cellar profiles, not synthetic averages. I split golden questions into quantitative and qualitative tracks, piped results into a dashboard I designed and built, and made trend lines, inflection points, and regressions visible before broad rollout.

Golden set design: I curated stable, high-value prompts and paired them with representative user accounts covering different cellar sizes, data quality patterns, and usage behaviors.
Quantitative track: questions with objectively verifiable answers, where expected outputs were computed from hard-coded SQL queries against user profile and cellar tables. This gave deterministic pass/fail signals for metrics like bottle counts, holdings by region or variety, and other inventory facts.
Qualitative track: recommendation and interpretation prompts where correctness is not a single database value. For these, I used an LLM-as-judge framework with explicit rubrics to score dimensions like relevance, grounding, clarity, and actionability.
Evaluations dashboard (my build): I designed and shipped a React Native dashboard for agentic AI/ML quality operations. It put run history, version comparisons, quantitative pass rates, qualitative judge scores, and regression alerts in one surface leadership could read without opening notebooks.
CI/CD and release gating: I wired eval runs into the pipeline and used dashboard trends in release reviews so prompt, retrieval, tool, and model changes shipped only when both tracks showed acceptable movement.
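The quantitative track can be sketched with an in-memory SQLite database. The table layout, the golden questions, and the function names below are hypothetical stand-ins, not the actual CellarTracker schema; the point is only that the expected answer comes from SQL, so the comparison is a deterministic pass or fail.

```python
import sqlite3

# Hypothetical cellar table for one golden-set user.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cellar (user_id INTEGER, wine TEXT, region TEXT, bottles INTEGER)"
)
conn.executemany(
    "INSERT INTO cellar VALUES (?, ?, ?, ?)",
    [
        (1, "Barolo 2016", "Piedmont", 3),
        (1, "Barbaresco 2018", "Piedmont", 2),
        (1, "Anderson Valley Pinot Noir 2020", "Mendocino", 4),
    ],
)

# Each golden question pairs a prompt with the SQL that defines the truth.
GOLDEN_QUESTIONS = [
    {
        "prompt": "How many bottles do I have in total?",
        "sql": "SELECT SUM(bottles) FROM cellar WHERE user_id = ?",
    },
    {
        "prompt": "How many Piedmont bottles do I own?",
        "sql": "SELECT SUM(bottles) FROM cellar WHERE user_id = ? AND region = 'Piedmont'",
    },
]


def expected_answer(sql: str, user_id: int) -> int:
    row = conn.execute(sql, (user_id,)).fetchone()
    return row[0] or 0  # SUM over zero rows is NULL, treat it as 0


def run_quantitative_track(assistant_answers: dict[str, int], user_id: int) -> float:
    """Return the fraction of golden questions the assistant answered exactly."""
    passed = sum(
        1
        for q in GOLDEN_QUESTIONS
        if assistant_answers.get(q["prompt"]) == expected_answer(q["sql"], user_id)
    )
    return passed / len(GOLDEN_QUESTIONS)


# Stand-in for numbers parsed from the assistant's replies; the second is wrong.
answers = {
    "How many bottles do I have in total?": 9,
    "How many Piedmont bottles do I own?": 4,
}
pass_rate = run_quantitative_track(answers, user_id=1)
print(pass_rate)
```

Because the second stand-in answer disagrees with the SQL result, this run scores one pass out of two.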

Evaluation Dashboard

I designed and built this React Native dashboard as the operational home for CellarChat quality. It is not a side report; it is the surface I used with product and leadership to run agentic AI/ML evaluations, compare builds, inspect regressions, and justify release decisions. Quantitative SQL-backed checks and qualitative LLM-as-judge scores live in one place so teams could see whether a weekly change actually improved member-facing behavior.

Evaluations dashboard I designed and built: React Native quality ops for CellarChat releases.
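The build comparison behind a release decision can be sketched roughly as follows. The `EvalRun` shape, the version labels, and both thresholds are assumptions chosen for illustration, not the values used in the real pipeline.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    version: str
    quantitative_pass_rate: float    # fraction of SQL-backed checks passed, 0 to 1
    judge_scores: dict[str, float]   # mean LLM-as-judge score per rubric dimension


def find_regressions(
    baseline: EvalRun,
    candidate: EvalRun,
    max_pass_rate_drop: float = 0.02,  # hypothetical tolerance
    max_judge_drop: float = 0.25,      # hypothetical tolerance
) -> list[str]:
    """List every metric where the candidate is worse than the tolerance allows."""
    regressions = []
    drop = baseline.quantitative_pass_rate - candidate.quantitative_pass_rate
    if drop > max_pass_rate_drop:
        regressions.append(f"quantitative pass rate fell by {drop:.2f}")
    for dimension, base_score in baseline.judge_scores.items():
        judge_drop = base_score - candidate.judge_scores.get(dimension, 0.0)
        if judge_drop > max_judge_drop:
            regressions.append(f"{dimension} judge score fell by {judge_drop:.2f}")
    return regressions


baseline = EvalRun("build-41", 0.90, {"relevance": 4.2, "grounding": 4.5})
candidate = EvalRun("build-42", 0.93, {"relevance": 4.3, "grounding": 3.9})

regressions = find_regressions(baseline, candidate)
should_ship = not regressions  # ship only when neither track regressed
print(should_ship, regressions)
```

In this made-up comparison the candidate improves the quantitative pass rate but loses ground on grounding, so the gate blocks it. Requiring both tracks to hold is the design choice being illustrated.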

My Role

I worked across the stack to turn CellarChat from concept into a production AI surface during my August 2023 to July 2025 engagement. I led and implemented core RAG architecture, agentic workflows, OpenAPI tool integration, and LLM observability. I designed the golden-set methodology, built CI/CD-connected evaluations, and shipped the React Native evaluations dashboard that graphed quantitative and qualitative quality over time for product and executive stakeholders.

CellarTracker is a wine cellar management platform where members track inventory, decide what to open, monitor drinking windows, and learn from community tasting notes. The user problem sounds simple, but the data is not. Wine data spans structured records like bottle counts, vintages, storage locations, and dates, plus unstructured content like tasting notes, preferences, and free-form prompts. Good answers depend on both data types together, and the assistant has to reason in product context so it knows what the member owns, where bottles are stored, and what is actionable right now. At the scale of a 1M+ member platform, "ship and hope" is not viable. I built the evaluations dashboard so I could measure whether the assistant stayed grounded as data and models changed.

I contributed to the Python backend that powered AI orchestration and retrieval flows. On the product side, I integrated the AI system into TypeScript frontend experiences and React Native mobile surfaces so the assistant was available where members already manage their collections. I treated this as a systems problem, not a prompt-only problem.

RAG architecture and retrieval quality tuning for structured and unstructured wine data
Agentic workflows with OpenAPI tool-calling and multi-step orchestration paths
End-to-end ownership of LLM evals, observability, and the React Native evaluations dashboard used for release gating
Golden-set design, SQL-backed quantitative scoring, LLM-as-judge qualitative rubrics, and CI/CD trend visualization
Python backend delivery plus TypeScript and React Native product integration

I also shipped mobile barcode scanning and bottle label capture workflows powered by Tesseract OCR/CV to ground bottle data.

Technical Approach: Retrieval-Augmented Generation

For retrieval-augmented generation, I focused on grounding responses in member-specific context and trusted wine data. The retrieval path combined personal cellar information with broader tasting intelligence so the model could answer recommendation and exploration questions with context, not guesses. Query handling needed to support both direct lookups and fuzzy intent because users asked for wines by variety, pairing, vintage, readiness, region, and price constraints. The key engineering decision was to optimize retrieval quality and relevance first, then model behavior. In practice, better retrieval design delivered bigger gains than prompt tweaking alone, and every retrieval change showed up in the evaluations dashboard I built, so I could see whether member-specific grounding actually moved before I shipped.
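One way to keep retrieval member-specific is to apply hard metadata filters before ranking by vector similarity. The sketch below uses toy two-dimensional vectors and a plain Python list instead of PGVector; the `IndexedChunk` fields and the `retrieve` function are illustrative assumptions, not the production schema or query path.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    chunk_id: str
    member_id: int
    content_type: str     # e.g. "inventory" or "tasting_note"
    vector: list[float]   # toy embedding; real ones have many more dimensions


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    index: list[IndexedChunk],
    query_vector: list[float],
    member_id: int,
    content_types: set[str],
    k: int = 2,
) -> list[IndexedChunk]:
    # Filter first: another member's cellar is never a candidate, however similar.
    candidates = [
        c for c in index
        if c.member_id == member_id and c.content_type in content_types
    ]
    # Then rank the survivors by cosine similarity to the query.
    ranked = sorted(candidates, key=lambda c: cosine(query_vector, c.vector), reverse=True)
    return ranked[:k]


index = [
    IndexedChunk("inv-1", 1, "inventory", [1.0, 0.0]),
    IndexedChunk("note-1", 1, "tasting_note", [0.6, 0.8]),
    IndexedChunk("inv-2", 2, "inventory", [1.0, 0.0]),  # belongs to a different member
    IndexedChunk("note-2", 1, "tasting_note", [0.0, 1.0]),
]
results = retrieve(index, [0.9, 0.1], member_id=1, content_types={"inventory", "tasting_note"})
print([c.chunk_id for c in results])
```

Here `inv-2` matches the query vector as well as `inv-1` does, but it is excluded by the member filter before any similarity is computed.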

Technical Approach: Agentic Tool Use / OpenAPI Workflows

I architected OpenAPI-based tool-calling workflows so the model could execute reliable product-aware actions instead of only generating text. Tool access gave the assistant a controlled interface into collection and product capabilities, which improved response usefulness and reduced hallucinated behavior. I also led fine-tuning and orchestration patterns that improved how the system selected tools and handled multi-step requests. The objective was predictable behavior under real user prompts: select the right tool path, gather the right context, and produce a clear answer tied to what the user can actually do next. Tool and orchestration changes were evaluated through the same dashboard pipeline, so regressions in bottle-count math or multi-step flows were visible alongside RAG quality, not buried in ad hoc manual testing.
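A minimal sketch of the tool-calling idea, assuming the model emits a JSON tool call with a `name` and `arguments`. The `count_bottles` tool, the call format, and the sample rows are hypothetical; the real system exposed tools through OpenAPI definitions, which this sketch does not reproduce. What it shows is that the arithmetic runs in ordinary code, so the same question always yields the same count.

```python
import json
from typing import Optional

# Hypothetical structured cellar rows for one member.
CELLAR = [
    {"wine": "Barolo 2016", "location": "Seattle", "bottles": 3},
    {"wine": "Barolo 2017", "location": "Seattle", "bottles": 2},
    {"wine": "Barolo 2016", "location": "Offsite", "bottles": 6},
]


def count_bottles(wine_contains: str, location: Optional[str] = None) -> int:
    """Deterministically sum bottles whose name matches, optionally by location."""
    return sum(
        row["bottles"]
        for row in CELLAR
        if wine_contains.lower() in row["wine"].lower()
        and (location is None or row["location"] == location)
    )


# Registry of tools the model is allowed to call.
TOOLS = {"count_bottles": count_bottles}


def dispatch(tool_call_json: str):
    """Parse a tool call emitted by the model and execute the named tool."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


result = dispatch(
    '{"name": "count_bottles", "arguments": {"wine_contains": "Barolo", "location": "Seattle"}}'
)
print(result)
```

The model's job is reduced to choosing the tool and filling in arguments; the count itself never passes through free-form generation.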

Technical Approach: Full-Stack Product Integration

I connected AI workflows to real user surfaces across web and mobile. CellarChat was not a side experiment; it was integrated into the core CellarTracker experience, including the collections flow where members decide what to drink and manage inventory. I worked in Python on backend AI services, in TypeScript on frontend integration, and in React Native to deliver cohesive mobile experiences, including the evaluations dashboard as a first-class internal product, not a throwaway admin page. I also contributed beyond AI-only surfaces, including subscription and payment-related improvements, because shipping successful AI products requires alignment with broader product, platform, and business workflows.

Impact

CellarChat brought AI assistance into CellarTracker's core wine experience for a platform with 1M+ members. It connected personalized recommendations and collection intelligence to daily user decisions, not just novelty chat interactions. The evaluations dashboard I built became how we proved reliability to stakeholders: visible trends, comparable runs, and a repeatable release bar instead of subjective "it feels better" reviews. Through full-stack integration, I delivered AI capabilities directly in product surfaces across web and mobile, increasing practical adoption opportunities. The work also supported broader product and revenue efforts, including subscription and payment improvements.

Shipped production AI assistance for CellarTracker's 1M+ member platform.
Designed and built the React Native evaluations dashboard used for weekly quality reviews, regression detection, and CI/CD-informed release gating.
Connected AI workflows to real React Native mobile product surfaces and allowed members to thumbs up, thumbs down, and provide custom feedback about AI responses.
Established golden-set evals with quantitative SQL checks and qualitative LLM-as-judge scoring, operationalized through the dashboard, not one-off scripts.
Contributed to broader growth work: app rating improvements, Apple Pay subscriptions, and recurring revenue initiatives.

Lessons Learned

CellarChat made one lesson unavoidable: embeddings are the backbone of production RAG. They decide what context the model sees, what failures look like in evals, and whether weekly changes actually improve member-facing answers. The evaluations dashboard I built is what made those embedding and retrieval lessons legible to the org: every bullet below is something I could point to on a chart, not argue about in a meeting. The list is what I would tell any team shipping agentic AI on top of real user data. It is not theory from a notebook; these are patterns that showed up repeatedly on a 1M+ member wine platform.

Embeddings are the real interface between your product data and the model. If retrieval is wrong, no prompt engineering will save the answer.
Wine data taught me that one embedding space must serve structured facts and unstructured language at once: bottle counts and locations live beside tasting notes, preferences, and free-form questions.
When context windows were capped around 150k tokens, embeddings were not optional; they were how I compressed the cellar into what could fit in a single request without drowning the model in noise.
Embedding quality is a data-modeling problem first. Region names like "Mendocino" can mean region, subregion, or appellation; bad vectors surface the wrong documents and the model confidently answers the wrong question.
Chunk boundaries matter as much as the embedding model. Splitting tasting notes, inventory rows, and help articles poorly creates vectors that look relevant in cosine distance but fail in product context.
Metadata on every chunk is half the retrieval system: member ID, cellar scope, vintage, storage location, and content type let you filter before you rank, which is cheaper and more reliable than pure semantic search alone.
Hybrid retrieval beats embeddings-only in production: combine vector similarity with structured lookups and hard filters so inventory math and recommendation prompts do not compete in one undifferentiated pile of text.
Re-embedding is a release event. Changing models, dimensions, or chunking strategy without a backfill plan silently shifts which answers your users get, even when prompts and tools stay identical.
Stale embeddings are a silent regression. New bottles, edited tasting notes, and deleted holdings only help the assistant if your index refresh path keeps pace with member activity.
Duplicate and near-duplicate chunks inflate recall and confuse ranking. Deduplication and canonical sources for the same fact (one cellar row, one summary chunk) improved precision more than swapping embedding providers.
Embedding the question is not enough; embed the task. "What should I drink tonight?" and "How many bottles of Barolo do I have in Seattle?" need different retrieval profiles even when the surface wording is similar.
Negative retrieval signals are underrated: knowing what not to pull into context (other members' cellars, outdated vintages, generic wine encyclopedia filler) reduces hallucination pressure as much as finding the right passage.
Grounding checks should trace back to chunks, not vibes. In evals, I could only defend release decisions when failures mapped to specific retrieval misses: wrong chunk, missing metadata filter, or bad rank, not "the model felt off."
Token cost and latency follow embedding choices. Smaller, well-curated context from strong retrieval beat sending large raw dumps; embeddings are how you buy quality per dollar at inference time.
Fine-tuning and LoRA do not replace embeddings; they complement them. Disambiguation training helped Mendocino-style edge cases, but the assistant still needed retrieval to put the right cellar facts in front of the model on every turn.
Quantitative eval questions exposed embedding gaps quickly: when SQL-backed golden answers failed, the bug was often retrieval (wrong slice of the cellar vectorized), not the LLM's reasoning.
Qualitative evals still depend on embeddings. LLM-as-judge scores for relevance and grounding collapse when the retrieved set mixes generic wine advice with member-specific inventory; the judge penalizes answers that were doomed in retrieval.
Observability for RAG means logging embedding queries, top-k results, chunk IDs, and scores, and surfacing them in an evaluations dashboard your team actually opens. Without that trail in a product you own, you cannot tell whether a bad answer was a model failure or a vector failure.
Treat your embedding index like a product surface: version it, diff it in CI, and gate releases when retrieval metrics move, the same discipline as API contracts, because members experience retrieval as the product.
The biggest lesson: in production RAG, embeddings are not preprocessing; they are the system. Prompts and agents orchestrate; vectors decide what truth the model is allowed to see.
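Several of the lessons above depend on being able to say whether a failed eval was a retrieval miss or a model failure. A rough sketch of that attribution follows, assuming each eval logs the retrieved chunk IDs and that the golden set records which chunk holds the needed fact. The `RetrievalTrace` fields and the labels are illustrative assumptions, not the actual logging schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RetrievalTrace:
    question_id: str
    retrieved_chunk_ids: list[str]  # top-k chunk IDs in ranked order
    expected_chunk_id: str          # chunk that contains the fact needed to answer
    answer_correct: bool


def classify(trace: RetrievalTrace) -> str:
    """Attribute an eval outcome to retrieval or to the model."""
    if trace.answer_correct:
        return "pass"
    if trace.expected_chunk_id not in trace.retrieved_chunk_ids:
        return "retrieval_miss"   # the model never saw the right fact
    return "model_failure"        # the fact was in context, the answer was still wrong


traces = [
    RetrievalTrace("q1", ["inv-1", "note-1"], "inv-1", True),
    RetrievalTrace("q2", ["note-1", "note-2"], "inv-3", False),
    RetrievalTrace("q3", ["inv-3", "note-2"], "inv-3", False),
]
summary = Counter(classify(t) for t in traces)
print(dict(summary))
```

A breakdown like this is what lets a failing run point at a specific chunk or filter rather than at the model in general.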

Closing

CellarChat™ represents the kind of AI engineering I enjoy most: turning ambiguous product problems into reliable, evaluated, production-ready LLM systems, with a dashboard and eval discipline stakeholders can trust, not demos that only work on Tuesday.

CellarChat support documentation

RYAN ZERNACH

Full-Stack AI Systems Engineer
