Wine Pricing Data Architecture
WineBox estimates retail prices for wines using a multi-source pipeline that combines real market data with AI estimation. This document describes every data source, MongoDB collection, and script involved.
Data Sources
1. X-Wines Dataset (Reference Wine Catalogue)
Origin: github.com/rogerioxavier/X-Wines
Size: 100,646 wines with 21M+ community ratings
Import script:
deploy/import_xwines_mongo.pyCollection:
xwines_winesFields used: name, winery_name, wine_type, grapes, ABV, country, region_name, vintages, avg_rating, rating_count
This is the reference wine catalogue. Every wine in the autocomplete, every
enrichment lookup, and every price annotation starts here. The vintages
field (a string like "[2020, 2019, 2018]") drives per-vintage pricing.
2. Kaggle/Vivino Wine Prices (Real Market Data)
Size: 13,830 wines with actual retail prices from Vivino
Import script:
scripts/import_kaggle_prices.pyCollection:
kaggle_wine_pricesFields: name, winery, country, region, wine_type, rating, rating_count, price_usd, vintage
Price range: $3.15 - $3,410.79 (median $15.95)
This is ground truth pricing data. The annotation script matches X-Wines entries against this dataset by winery name and wine name. Wines with a direct match get real Vivino prices (no LLM needed). Wines without a direct match get Kaggle prices from the same winery/region injected as context into the LLM prompt.
3. Claude Haiku API (AI Price Estimation)
Service: Anthropic Claude
claude-haiku-4-5-20251001Used by:
scripts/annotate_xwines_prices.pyPurpose: Estimates US retail prices for wines that lack real market data
Input: Wine name, winery, region, grape, vintage, rating, plus Kaggle reference prices when available
Output: price_low_usd, price_high_usd, price_tier, confidence, note
The LLM is the fallback for the ~81% of wines not directly matched in the Kaggle dataset. Its accuracy improves when Kaggle reference prices from the same winery or region are included in the prompt.
4. Brave Search API (Price Validation)
Service: Brave Search API
Used by:
scripts/validate_xwines_prices.pyPurpose: Spot-checks estimated prices against live retail data
Sources queried: Wine-Searcher, Wine.com, Total Wine, Vivino
Not in the serving path – validation only
MongoDB Collections
xwines_wines – Reference Wine Catalogue
Field |
Type |
Description |
|---|---|---|
|
int |
Unique wine identifier |
|
string |
Wine name |
|
string |
Producer/winery |
|
string |
Red, White, Rose, Sparkling |
|
string |
Country of origin |
|
string |
Wine region |
|
string |
Grape varieties (stringified list) |
|
string |
Available vintages (stringified list) |
|
float |
Community average rating |
|
int |
Number of ratings |
Written by: deploy/import_xwines_mongo.py
Read by: xwines router (search), enrichment service, annotation script
xwines_prices – Estimated Retail Prices
Field |
Type |
Description |
|---|---|---|
|
int |
Links to |
|
int or null |
Vintage year; null = base/fallback price |
|
float |
Lower bound of estimated price |
|
float |
Upper bound of estimated price |
|
string |
budget / value / mid_range / premium / luxury / ultra_premium |
|
string |
high / medium / low |
|
string |
Pricing justification |
|
string |
|
|
datetime |
When the price was generated |
Indexes:
(xwines_id, vintage)unique – one price per wine-vintage pair(xwines_id)non-unique – fast lookups for “all prices for this wine”
Written by: scripts/annotate_xwines_prices.py
Read by: xwines router (search + detail), enrichment service (check-in + batch)
kaggle_wine_prices – Vivino Market Reference
Field |
Type |
Description |
|---|---|---|
|
string |
Wine name (original + lowercase) |
|
string |
Winery name (original + lowercase) |
|
string |
Country |
|
string |
Region |
|
string |
red / white / rose / sparkling |
|
float |
Actual Vivino retail price |
|
int or null |
Vintage year |
|
float |
Vivino rating |
Written by: scripts/import_kaggle_prices.py
Read by: scripts/annotate_xwines_prices.py (matching + context)
xwines_price_validations – Validation Results
Field |
Type |
Description |
|---|---|---|
|
int |
Wine reference |
|
int or null |
Vintage |
|
float |
Our estimate |
|
float |
Actual web price found |
|
string |
wine-searcher / wine.com / totalwine / vivino |
|
string |
accurate / close / overestimated / underestimated / no_data |
Written by: scripts/validate_xwines_prices.py
Read by: scripts/compare_pricing_models.py, --report flag
xwines_metadata – Dataset Version Tracking
Stores key-value pairs: version, import_date, rating_count, source.
Written by: deploy/import_xwines_mongo.py
Read by: xwines router /stats endpoint
Data Flow
IMPORT PHASE
============
X-Wines GitHub CSVs ──→ deploy/import_xwines_mongo.py ──→ xwines_wines (100K wines)
xwines_metadata
Kaggle Vivino CSVs ──→ scripts/import_kaggle_prices.py ──→ kaggle_wine_prices (14K prices)
ANNOTATION PHASE
================
scripts/annotate_xwines_prices.py
│
┌─────────┴─────────┐
│ │
Kaggle Match? No Match
(winery+name) │
│ ┌────┴────┐
│ │ │
│ Has Kaggle No context
│ context? │
│ │ │
▼ ▼ ▼
Use Vivino LLM + refs LLM only
price ±15% in prompt (fallback)
│ │ │
└──────────────┴───────────┘
│
▼
xwines_prices collection
(one doc per wine-vintage pair)
VALIDATION PHASE
================
xwines_prices ──→ scripts/validate_xwines_prices.py ──→ Brave Search API
│
▼
xwines_price_validations
(accuracy report)
SERVING PHASE
=============
User searches wine ──→ xwines router /search
│
├── Atlas Search on xwines_wines
├── Price filter on xwines_prices (any vintage in range)
└── Batch price lookup (vintage=null for base prices)
│
▼
Search results with price_low, price_high, price_tier
User checks in wine ──→ enrichment service
│
├── Match xwines_wines by name
├── Fill missing fields (winery, grape, region, etc.)
└── Price lookup: vintage-specific → base price fallback
│
▼
Enriched wine with estimated_price_low/high
Price Tiers
Tier |
Range |
Description |
|---|---|---|
|
< $15 |
Everyday wines |
|
$15 - $25 |
Good value picks |
|
$25 - $50 |
Quality wines |
|
$50 - $100 |
Special occasion |
|
$100 - $250 |
Collector wines |
|
> $250 |
Investment-grade |
Scripts Reference
Script |
Purpose |
Run Command |
|---|---|---|
|
Import X-Wines dataset |
|
|
Import Kaggle/Vivino prices |
|
|
Generate price estimates |
|
|
Validate prices via web search |
|
|
Compare Haiku vs Sonnet accuracy |
|
Annotation Script Options
# Dry run to see what would be processed
uv run python scripts/annotate_xwines_prices.py --dry-run
# Annotate 100 wines (popular first)
uv run python scripts/annotate_xwines_prices.py --max-wines 100
# Check progress
uv run python scripts/annotate_xwines_prices.py --status
# Estimate remaining cost
uv run python scripts/annotate_xwines_prices.py --estimate-cost
# Filter by grape variety
uv run python scripts/annotate_xwines_prices.py --grape "Pinot Noir" --max-wines 500
Accuracy
Based on validation of 50 randomly sampled prices against Brave Search:
57% acceptable (in estimated range or within 30%)
29% accurate (web price falls within estimated range)
37% underestimated (mostly prestige wines: Burgundy Grand Cru, aged Port)
6% overestimated
Kaggle-matched prices are expected to be significantly more accurate since they use real Vivino data
The underestimation bias primarily affects luxury and ultra-premium wines where the LLM lacks specific pricing knowledge. Budget and mid-range wines are ~70-80% accurate.