Wine Pricing Data Architecture

WineBox estimates retail prices for wines using a multi-source pipeline that combines real market data with AI estimation. This document describes every data source, MongoDB collection, and script involved.

Data Sources

1. X-Wines Dataset (Reference Wine Catalogue)

  • Origin: github.com/rogerioxavier/X-Wines

  • Size: 100,646 wines with 21M+ community ratings

  • Import script: deploy/import_xwines_mongo.py

  • Collection: xwines_wines

  • Fields used: name, winery_name, wine_type, grapes, ABV, country, region_name, vintages, avg_rating, rating_count

This is the reference wine catalogue. Every wine in the autocomplete, every enrichment lookup, and every price annotation starts here. The vintages field (a string like "[2020, 2019, 2018]") drives per-vintage pricing.

2. Kaggle/Vivino Wine Prices (Real Market Data)

  • Origin: kaggle.com/datasets/budnyak/wine-rating-and-price

  • Size: 13,830 wines with actual retail prices from Vivino

  • Import script: scripts/import_kaggle_prices.py

  • Collection: kaggle_wine_prices

  • Fields: name, winery, country, region, wine_type, rating, rating_count, price_usd, vintage

  • Price range: $3.15 - $3,410.79 (median $15.95)

This is ground truth pricing data. The annotation script matches X-Wines entries against this dataset by winery name and wine name. Wines with a direct match get real Vivino prices (no LLM needed). Wines without a direct match get Kaggle prices from the same winery/region injected as context into the LLM prompt.

3. Claude Haiku API (AI Price Estimation)

  • Service: Anthropic Claude claude-haiku-4-5-20251001

  • Used by: scripts/annotate_xwines_prices.py

  • Purpose: Estimates US retail prices for wines that lack real market data

  • Input: Wine name, winery, region, grape, vintage, rating, plus Kaggle reference prices when available

  • Output: price_low_usd, price_high_usd, price_tier, confidence, note

The LLM is the fallback for the ~81% of wines not directly matched in the Kaggle dataset. Its accuracy improves when Kaggle reference prices from the same winery or region are included in the prompt.

4. Brave Search API (Price Validation)

  • Service: Brave Search API

  • Used by: scripts/validate_xwines_prices.py

  • Purpose: Spot-checks estimated prices against live retail data

  • Sources queried: Wine-Searcher, Wine.com, Total Wine, Vivino

  • Not in the serving path – validation only

MongoDB Collections

xwines_wines – Reference Wine Catalogue

Field

Type

Description

xwines_id

int

Unique wine identifier

name

string

Wine name

winery_name

string

Producer/winery

wine_type

string

Red, White, Rose, Sparkling

country

string

Country of origin

region_name

string

Wine region

grapes

string

Grape varieties (stringified list)

vintages

string

Available vintages (stringified list)

avg_rating

float

Community average rating

rating_count

int

Number of ratings

Written by: deploy/import_xwines_mongo.py Read by: xwines router (search), enrichment service, annotation script

xwines_prices – Estimated Retail Prices

Field

Type

Description

xwines_id

int

Links to xwines_wines.xwines_id

vintage

int or null

Vintage year; null = base/fallback price

price_low_usd

float

Lower bound of estimated price

price_high_usd

float

Upper bound of estimated price

price_tier

string

budget / value / mid_range / premium / luxury / ultra_premium

confidence

string

high / medium / low

note

string

Pricing justification

model

string

kaggle-vivino or claude-haiku-4-5-20251001

created_at

datetime

When the price was generated

Indexes:

  • (xwines_id, vintage) unique – one price per wine-vintage pair

  • (xwines_id) non-unique – fast lookups for “all prices for this wine”

Written by: scripts/annotate_xwines_prices.py Read by: xwines router (search + detail), enrichment service (check-in + batch)

kaggle_wine_prices – Vivino Market Reference

Field

Type

Description

name / name_lower

string

Wine name (original + lowercase)

winery / winery_lower

string

Winery name (original + lowercase)

country

string

Country

region

string

Region

wine_type

string

red / white / rose / sparkling

price_usd

float

Actual Vivino retail price

vintage

int or null

Vintage year

rating

float

Vivino rating

Written by: scripts/import_kaggle_prices.py Read by: scripts/annotate_xwines_prices.py (matching + context)

xwines_price_validations – Validation Results

Field

Type

Description

xwines_id

int

Wine reference

vintage

int or null

Vintage

estimated_low / estimated_high

float

Our estimate

search_price

float

Actual web price found

source

string

wine-searcher / wine.com / totalwine / vivino

status

string

accurate / close / overestimated / underestimated / no_data

Written by: scripts/validate_xwines_prices.py Read by: scripts/compare_pricing_models.py, --report flag

xwines_metadata – Dataset Version Tracking

Stores key-value pairs: version, import_date, rating_count, source.

Written by: deploy/import_xwines_mongo.py Read by: xwines router /stats endpoint

Data Flow

                    IMPORT PHASE
                    ============

X-Wines GitHub CSVs ──→ deploy/import_xwines_mongo.py ──→ xwines_wines (100K wines)
                                                          xwines_metadata

Kaggle Vivino CSVs  ──→ scripts/import_kaggle_prices.py ──→ kaggle_wine_prices (14K prices)


                    ANNOTATION PHASE
                    ================

                    scripts/annotate_xwines_prices.py
                              │
                    ┌─────────┴─────────┐
                    │                   │
              Kaggle Match?        No Match
              (winery+name)             │
                    │              ┌────┴────┐
                    │              │         │
                    │         Has Kaggle   No context
                    │         context?         │
                    │              │           │
                    ▼              ▼           ▼
              Use Vivino      LLM + refs   LLM only
              price ±15%      in prompt    (fallback)
                    │              │           │
                    └──────────────┴───────────┘
                              │
                              ▼
                    xwines_prices collection
                    (one doc per wine-vintage pair)


                    VALIDATION PHASE
                    ================

xwines_prices ──→ scripts/validate_xwines_prices.py ──→ Brave Search API
                                                              │
                                                              ▼
                                                    xwines_price_validations
                                                    (accuracy report)


                    SERVING PHASE
                    =============

User searches wine ──→ xwines router /search
                              │
                              ├── Atlas Search on xwines_wines
                              ├── Price filter on xwines_prices (any vintage in range)
                              └── Batch price lookup (vintage=null for base prices)
                                        │
                                        ▼
                              Search results with price_low, price_high, price_tier

User checks in wine ──→ enrichment service
                              │
                              ├── Match xwines_wines by name
                              ├── Fill missing fields (winery, grape, region, etc.)
                              └── Price lookup: vintage-specific → base price fallback
                                        │
                                        ▼
                              Enriched wine with estimated_price_low/high

Price Tiers

Tier

Range

Description

budget

< $15

Everyday wines

value

$15 - $25

Good value picks

mid_range

$25 - $50

Quality wines

premium

$50 - $100

Special occasion

luxury

$100 - $250

Collector wines

ultra_premium

> $250

Investment-grade

Scripts Reference

Script

Purpose

Run Command

deploy/import_xwines_mongo.py

Import X-Wines dataset

uv run python deploy/import_xwines_mongo.py

scripts/import_kaggle_prices.py

Import Kaggle/Vivino prices

uv run python scripts/import_kaggle_prices.py --path /tmp/wine-prices

scripts/annotate_xwines_prices.py

Generate price estimates

uv run python scripts/annotate_xwines_prices.py

scripts/validate_xwines_prices.py

Validate prices via web search

uv run python scripts/validate_xwines_prices.py --sample-size 50

scripts/compare_pricing_models.py

Compare Haiku vs Sonnet accuracy

uv run python scripts/compare_pricing_models.py

Annotation Script Options

# Dry run to see what would be processed
uv run python scripts/annotate_xwines_prices.py --dry-run

# Annotate 100 wines (popular first)
uv run python scripts/annotate_xwines_prices.py --max-wines 100

# Check progress
uv run python scripts/annotate_xwines_prices.py --status

# Estimate remaining cost
uv run python scripts/annotate_xwines_prices.py --estimate-cost

# Filter by grape variety
uv run python scripts/annotate_xwines_prices.py --grape "Pinot Noir" --max-wines 500

Accuracy

Based on validation of 50 randomly sampled prices against Brave Search:

  • 57% acceptable (in estimated range or within 30%)

  • 29% accurate (web price falls within estimated range)

  • 37% underestimated (mostly prestige wines: Burgundy Grand Cru, aged Port)

  • 6% overestimated

  • Kaggle-matched prices are expected to be significantly more accurate since they use real Vivino data

The underestimation bias primarily affects luxury and ultra-premium wines where the LLM lacks specific pricing knowledge. Budget and mid-range wines are ~70-80% accurate.