# Wine Pricing Data Architecture

WineBox estimates retail prices for wines using a multi-source pipeline that
combines real market data with AI estimation. This document describes every
data source, MongoDB collection, and script involved.

## Data Sources

### 1. X-Wines Dataset (Reference Wine Catalogue)

- **Origin:** [github.com/rogerioxavier/X-Wines](https://github.com/rogerioxavier/X-Wines)
- **Size:** 100,646 wines with 21M+ community ratings
- **Import script:** `deploy/import_xwines_mongo.py`
- **Collection:** `xwines_wines`
- **Fields used:** name, winery_name, wine_type, grapes, ABV, country, region_name, vintages, avg_rating, rating_count

This is the reference wine catalogue. Every wine in the autocomplete, every
enrichment lookup, and every price annotation starts here. The `vintages`
field (a string like `"[2020, 2019, 2018]"`) drives per-vintage pricing.

### 2. Kaggle/Vivino Wine Prices (Real Market Data)

- **Origin:** [kaggle.com/datasets/budnyak/wine-rating-and-price](https://www.kaggle.com/datasets/budnyak/wine-rating-and-price)
- **Size:** 13,830 wines with actual retail prices from Vivino
- **Import script:** `scripts/import_kaggle_prices.py`
- **Collection:** `kaggle_wine_prices`
- **Fields:** name, winery, country, region, wine_type, rating, rating_count, price_usd, vintage
- **Price range:** $3.15 - $3,410.79 (median $15.95)

This is ground truth pricing data. The annotation script matches X-Wines
entries against this dataset by winery name and wine name. Wines with a
direct match get real Vivino prices (no LLM needed). Wines without a direct
match get Kaggle prices from the same winery/region injected as context into
the LLM prompt.

### 3. Claude Haiku API (AI Price Estimation)

- **Service:** Anthropic Claude `claude-haiku-4-5-20251001`
- **Used by:** `scripts/annotate_xwines_prices.py`
- **Purpose:** Estimates US retail prices for wines that lack real market data
- **Input:** Wine name, winery, region, grape, vintage, rating, plus Kaggle reference prices when available
- **Output:** price_low_usd, price_high_usd, price_tier, confidence, note

The LLM is the fallback for the ~81% of wines not directly matched in the
Kaggle dataset. Its accuracy improves when Kaggle reference prices from the
same winery or region are included in the prompt.

### 4. Brave Search API (Price Validation)

- **Service:** [Brave Search API](https://brave.com/search/api/)
- **Used by:** `scripts/validate_xwines_prices.py`
- **Purpose:** Spot-checks estimated prices against live retail data
- **Sources queried:** Wine-Searcher, Wine.com, Total Wine, Vivino
- **Not in the serving path** -- validation only

## MongoDB Collections

### `xwines_wines` -- Reference Wine Catalogue

| Field | Type | Description |
|-------|------|-------------|
| `xwines_id` | int | Unique wine identifier |
| `name` | string | Wine name |
| `winery_name` | string | Producer/winery |
| `wine_type` | string | Red, White, Rose, Sparkling |
| `country` | string | Country of origin |
| `region_name` | string | Wine region |
| `grapes` | string | Grape varieties (stringified list) |
| `vintages` | string | Available vintages (stringified list) |
| `avg_rating` | float | Community average rating |
| `rating_count` | int | Number of ratings |

**Written by:** `deploy/import_xwines_mongo.py`
**Read by:** xwines router (search), enrichment service, annotation script

### `xwines_prices` -- Estimated Retail Prices

| Field | Type | Description |
|-------|------|-------------|
| `xwines_id` | int | Links to `xwines_wines.xwines_id` |
| `vintage` | int or null | Vintage year; null = base/fallback price |
| `price_low_usd` | float | Lower bound of estimated price |
| `price_high_usd` | float | Upper bound of estimated price |
| `price_tier` | string | budget / value / mid_range / premium / luxury / ultra_premium |
| `confidence` | string | high / medium / low |
| `note` | string | Pricing justification |
| `model` | string | `kaggle-vivino` or `claude-haiku-4-5-20251001` |
| `created_at` | datetime | When the price was generated |

**Indexes:**
- `(xwines_id, vintage)` unique -- one price per wine-vintage pair
- `(xwines_id)` non-unique -- fast lookups for "all prices for this wine"

**Written by:** `scripts/annotate_xwines_prices.py`
**Read by:** xwines router (search + detail), enrichment service (check-in + batch)

### `kaggle_wine_prices` -- Vivino Market Reference

| Field | Type | Description |
|-------|------|-------------|
| `name` / `name_lower` | string | Wine name (original + lowercase) |
| `winery` / `winery_lower` | string | Winery name (original + lowercase) |
| `country` | string | Country |
| `region` | string | Region |
| `wine_type` | string | red / white / rose / sparkling |
| `price_usd` | float | Actual Vivino retail price |
| `vintage` | int or null | Vintage year |
| `rating` | float | Vivino rating |

**Written by:** `scripts/import_kaggle_prices.py`
**Read by:** `scripts/annotate_xwines_prices.py` (matching + context)

### `xwines_price_validations` -- Validation Results

| Field | Type | Description |
|-------|------|-------------|
| `xwines_id` | int | Wine reference |
| `vintage` | int or null | Vintage |
| `estimated_low` / `estimated_high` | float | Our estimate |
| `search_price` | float | Actual web price found |
| `source` | string | wine-searcher / wine.com / totalwine / vivino |
| `status` | string | accurate / close / overestimated / underestimated / no_data |

**Written by:** `scripts/validate_xwines_prices.py`
**Read by:** `scripts/compare_pricing_models.py`, `--report` flag

### `xwines_metadata` -- Dataset Version Tracking

Stores key-value pairs: `version`, `import_date`, `rating_count`, `source`.

**Written by:** `deploy/import_xwines_mongo.py`
**Read by:** xwines router `/stats` endpoint

## Data Flow

```
                    IMPORT PHASE
                    ============

X-Wines GitHub CSVs ──→ deploy/import_xwines_mongo.py ──→ xwines_wines (100K wines)
                                                          xwines_metadata

Kaggle Vivino CSVs  ──→ scripts/import_kaggle_prices.py ──→ kaggle_wine_prices (14K prices)


                    ANNOTATION PHASE
                    ================

                    scripts/annotate_xwines_prices.py
                              │
                    ┌─────────┴─────────┐
                    │                   │
              Kaggle Match?        No Match
              (winery+name)             │
                    │              ┌────┴────┐
                    │              │         │
                    │         Has Kaggle   No context
                    │         context?         │
                    │              │           │
                    ▼              ▼           ▼
              Use Vivino      LLM + refs   LLM only
              price ±15%      in prompt    (fallback)
                    │              │           │
                    └──────────────┴───────────┘
                              │
                              ▼
                    xwines_prices collection
                    (one doc per wine-vintage pair)


                    VALIDATION PHASE
                    ================

xwines_prices ──→ scripts/validate_xwines_prices.py ──→ Brave Search API
                                                              │
                                                              ▼
                                                    xwines_price_validations
                                                    (accuracy report)


                    SERVING PHASE
                    =============

User searches wine ──→ xwines router /search
                              │
                              ├── Atlas Search on xwines_wines
                              ├── Price filter on xwines_prices (any vintage in range)
                              └── Batch price lookup (vintage=null for base prices)
                                        │
                                        ▼
                              Search results with price_low, price_high, price_tier

User checks in wine ──→ enrichment service
                              │
                              ├── Match xwines_wines by name
                              ├── Fill missing fields (winery, grape, region, etc.)
                              └── Price lookup: vintage-specific → base price fallback
                                        │
                                        ▼
                              Enriched wine with estimated_price_low/high
```

## Price Tiers

| Tier | Range | Description |
|------|-------|-------------|
| `budget` | < $15 | Everyday wines |
| `value` | $15 - $25 | Good value picks |
| `mid_range` | $25 - $50 | Quality wines |
| `premium` | $50 - $100 | Special occasion |
| `luxury` | $100 - $250 | Collector wines |
| `ultra_premium` | > $250 | Investment-grade |

## Scripts Reference

| Script | Purpose | Run Command |
|--------|---------|-------------|
| `deploy/import_xwines_mongo.py` | Import X-Wines dataset | `uv run python deploy/import_xwines_mongo.py` |
| `scripts/import_kaggle_prices.py` | Import Kaggle/Vivino prices | `uv run python scripts/import_kaggle_prices.py --path /tmp/wine-prices` |
| `scripts/annotate_xwines_prices.py` | Generate price estimates | `uv run python scripts/annotate_xwines_prices.py` |
| `scripts/validate_xwines_prices.py` | Validate prices via web search | `uv run python scripts/validate_xwines_prices.py --sample-size 50` |
| `scripts/compare_pricing_models.py` | Compare Haiku vs Sonnet accuracy | `uv run python scripts/compare_pricing_models.py` |

### Annotation Script Options

```bash
# Dry run to see what would be processed
uv run python scripts/annotate_xwines_prices.py --dry-run

# Annotate 100 wines (popular first)
uv run python scripts/annotate_xwines_prices.py --max-wines 100

# Check progress
uv run python scripts/annotate_xwines_prices.py --status

# Estimate remaining cost
uv run python scripts/annotate_xwines_prices.py --estimate-cost

# Filter by grape variety
uv run python scripts/annotate_xwines_prices.py --grape "Pinot Noir" --max-wines 500
```

## Accuracy

Based on validation of 50 randomly sampled prices against Brave Search:

- **57% acceptable** (in estimated range or within 30%)
- **29% accurate** (web price falls within estimated range)
- **37% underestimated** (mostly prestige wines: Burgundy Grand Cru, aged Port)
- **6% overestimated**
- Kaggle-matched prices are expected to be significantly more accurate since they use real Vivino data

The underestimation bias primarily affects luxury and ultra-premium wines where
the LLM lacks specific pricing knowledge. Budget and mid-range wines are
~70-80% accurate.