Skip to content
Last updated

Sports Betting & Odds Aggregation

This guide demonstrates how to build production-ready sports betting pipelines that aggregate odds from multiple bookmakers, detect arbitrage opportunities, and track odds movements over time.

Business Use Case

Goal: Build a unified odds comparison system by aggregating betting odds from multiple bookmakers, enabling:

  • Odds comparison: Compare odds for the same event/market across different bookmakers
  • Arbitrage detection: Identify opportunities where the combined odds guarantee profit
  • Odds tracking: Monitor odds movements and line movements over time
  • Risk management: Calculate optimal bet sizing and manage exposure

Architecture Overview

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Bookmaker A │     │ Bookmaker B │     │ Bookmaker C │     │ Bookmaker D │
│  (Crawl)    │     │  (API)      │     │  (Crawl)    │     │  (API)      │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │                   │
       └───────────────────┼───────────────────┘

                    ┌──────▼──────┐
                    │   Union &   │
                    │  Normalize  │
                    └──────┬──────┘

                    ┌──────▼──────┐
                    │  Match by   │
                    │ Event/Market│
                    └──────┬──────┘

                    ┌──────▼──────┐
                    │  Calculate  │
                    │  Arbitrage  │
                    └──────┬──────┘

                    ┌──────▼──────┐
                    │  Save to    │
                    │   Storage   │
                    └─────────────┘

Canonical Odds Schema

A standardized schema is crucial for effective odds comparison:

Field NameTypeDescriptionExample
event_idStringUnique event identifier (e.g., match_id)"match_20250101_team1_team2"
event_nameStringHuman-readable event name"Manchester United vs Liverpool"
event_dateTimestampEvent start date/time"2025-01-15T15:00:00Z"
sportStringSport type"football", "basketball", "tennis"
market_typeStringBetting market type"match_winner", "over_under", "handicap"
selectionStringBetting selection/outcome"home_win", "away_win", "draw", "over_2.5"
odds_decimalDoubleDecimal odds (e.g., 2.50)2.50
odds_fractionalStringFractional odds (e.g., "3/2")"3/2"
odds_americanStringAmerican odds (e.g., "+150")"+150"
bookmakerStringBookmaker name"bet365", "pinnacle", "betfair"
urlStringURL of the betting page"https://www.bet365.com/event/12345"
last_updatedTimestampLast time odds were extracted"2025-01-01T10:30:00Z"

Concrete Example: Aggregating Odds from 5 Bookmakers

Pipeline Pattern: Multi-Source Union with Event Matching

This example shows how to aggregate odds from 5 different bookmakers for the same football match:

# Complete example: Multi-bookmaker odds aggregation
fetch:
  url: "https://www.bet365.com/football"  # Starting URL for Bookmaker A

pipeline:
  # ============================================
  # Bookmaker A: bet365 (Direct crawl)
  # ============================================
  - stage: explore
    args: [ "a.match-link", 10 ]  # Navigate to match pages
  - stage: join
    args: [ "div.market-container", "LeftOuter" ]
  - stage: extract
    args:
      - { selector: "h1.match-title", method: "text", as: "event_name" }
      - { selector: "span.match-date", method: "attr:data-timestamp", as: "event_date" }
      - { selector: "div.market-type", method: "text", as: "market_type" }
      - { selector: "span.selection-name", method: "text", as: "selection" }
      - { selector: "span.odds-decimal", method: "text", as: "odds_raw" }
      - { selector: "link[rel=canonical]", method: "attr:href", as: "url" }
  - stage: python_row_transform:normalize_bet365
    args: []
  - stage: python_row_transform:add_metadata
    args:
      - bookmaker: "bet365"
        sport: "football"
  - stage: cache
    args: []
  - stage: store
    args: [ "bookmaker_a" ]

  # ============================================
  # Bookmaker B: Pinnacle (API-based)
  # ============================================
  - stage: reset
    args: []
  - stage: load_csv
    args:
      - { path: "${PINNACLE_API_RESPONSE_PATH}", header: "true", inferSchema: "true" }
  - stage: python_row_transform:normalize_pinnacle
    args: []
  - stage: python_row_transform:add_metadata
    args:
      - bookmaker: "pinnacle"
        sport: "football"
  - stage: cache
    args: []
  - stage: store
    args: [ "bookmaker_b" ]

  # ============================================
  # Bookmaker C: Betfair (Crawl)
  # ============================================
  - stage: reset
    args: []
  - stage: visit
    args: [ "https://www.betfair.com/sport/football" ]
  - stage: explore
    args: [ "a.event-link", 10 ]
  - stage: join
    args: [ "div.market-row", "LeftOuter" ]
  - stage: extract
    args:
      - { selector: "span.event-name", method: "text", as: "event_name" }
      - { selector: "span.event-time", method: "attr:data-time", as: "event_date" }
      - { selector: "span.market-name", method: "text", as: "market_type" }
      - { selector: "span.selection", method: "text", as: "selection" }
      - { selector: "span.odds", method: "text", as: "odds_raw" }
      - { selector: "link[rel=canonical]", method: "attr:href", as: "url" }
  - stage: python_row_transform:normalize_betfair
    args: []
  - stage: python_row_transform:add_metadata
    args:
      - bookmaker: "betfair"
        sport: "football"
  - stage: cache
    args: []
  - stage: store
    args: [ "bookmaker_c" ]

  # ============================================
  # Bookmaker D: William Hill (Crawl)
  # ============================================
  - stage: reset
    args: []
  - stage: visit
    args: [ "https://sports.williamhill.com/betting/en-gb/football" ]
  - stage: explore
    args: [ "a.event", 10 ]
  - stage: join
    args: [ "div.bet-market", "LeftOuter" ]
  - stage: extract
    args:
      - { selector: "h2.event-title", method: "text", as: "event_name" }
      - { selector: "time.event-date", method: "attr:datetime", as: "event_date" }
      - { selector: "span.market", method: "text", as: "market_type" }
      - { selector: "button.selection", method: "text", as: "selection" }
      - { selector: "span.price", method: "text", as: "odds_raw" }
      - { selector: "link[rel=canonical]", method: "attr:href", as: "url" }
  - stage: python_row_transform:normalize_williamhill
    args: []
  - stage: python_row_transform:add_metadata
    args:
      - bookmaker: "williamhill"
        sport: "football"
  - stage: cache
    args: []
  - stage: store
    args: [ "bookmaker_d" ]

  # ============================================
  # Bookmaker E: Unibet (API-based)
  # ============================================
  - stage: reset
    args: []
  - stage: load_csv
    args:
      - { path: "${UNIBET_API_RESPONSE_PATH}", header: "true", inferSchema: "true" }
  - stage: python_row_transform:normalize_unibet
    args: []
  - stage: python_row_transform:add_metadata
    args:
      - bookmaker: "unibet"
        sport: "football"
  - stage: cache
    args: []
  - stage: store
    args: [ "bookmaker_e" ]

  # ============================================
  # Merge: Union all bookmakers
  # ============================================
  - stage: reset
    args: []
  - stage: union_with
    args: [ "bookmaker_a", "bookmaker_b", "bookmaker_c", "bookmaker_d", "bookmaker_e" ]

  # ============================================
  # Normalize event matching keys
  # ============================================
  - stage: python_row_transform:generate_event_id
    args: []  # Creates event_id from event_name + event_date

  # ============================================
  # Save unified odds dataset
  # ============================================
  - stage: save_csv
    args: [ "${OUTPUT_PATH_UNIFIED_ODDS}", "overwrite" ]

Python Extensions for Odds Normalization

# python_extensions:
#   stages:
#     normalize_bet365:
#       type: row_transform
#       function: |
def normalize_bet365(row):
    """Normalize bet365 odds format to decimal."""
    odds_raw = row.get("odds_raw", "")
    if odds_raw:
        # bet365 format: "2.50" or "3/2" -> convert to decimal
        try:
            if "/" in odds_raw:
                # Fractional odds: "3/2" -> 2.50
                parts = odds_raw.split("/")
                decimal = (float(parts[0]) / float(parts[1])) + 1
                row["odds_decimal"] = round(decimal, 2)
            else:
                row["odds_decimal"] = float(odds_raw)
        except:
            row["odds_decimal"] = None
    return row

#     normalize_pinnacle:
#       type: row_transform
#       function: |
def normalize_pinnacle(row):
    """Normalize Pinnacle API response to canonical format."""
    # Pinnacle API returns decimal odds directly
    row["odds_decimal"] = float(row.get("odds", 0))
    row["event_name"] = row.get("eventName", "")
    row["selection"] = row.get("selectionName", "")
    return row

#     generate_event_id:
#       type: row_transform
#       function: |
def generate_event_id(row):
    """Generate stable event_id for matching across bookmakers."""
    import hashlib
    event_name = row.get("event_name", "").lower().strip()
    event_date = row.get("event_date", "")
    # Normalize event name (remove special chars, standardize team names)
    normalized = f"{event_name}_{event_date}"
    event_id = hashlib.md5(normalized.encode()).hexdigest()[:16]
    row["event_id"] = f"event_{event_id}"
    return row

Event Matching Logic: Recognizing the Same Sports Event Across Bookmakers

Critical Challenge: The same sports event can appear with different names and formats across different bookmakers. For example:

  • "Manchester United vs Liverpool" vs "Man Utd v Liverpool" vs "MUFC vs LFC"
  • "Arsenal vs Chelsea FC" vs "Arsenal vs Chelsea" vs "Arsenal v Chelsea"
  • Date formats: "2025-01-15 15:00" vs "15/01/2025 3:00 PM" vs "Jan 15, 2025"

The surebet finder must consult aggregated odds by event to detect arbitrage opportunities.

Event Matching Strategy

  1. Event Normalization:

    • Standardize team names (abbreviations, full names, aliases)
    • Normalize date/time formats
    • Remove special characters and standardize separators
    • Handle league/competition variations
  2. Event ID Generation:

    • Create stable event_id from normalized event name + date
    • Use hashing for consistent matching across bookmakers
  3. Aggregation by Event: All odds comparisons must be aggregated by event_id before surebet detection.

Event Name Normalization

# python_extensions:
#   stages:
#     normalize_event_name:
#       type: row_transform
#       function: |
def normalize_event_name(row):
    """Normalize event name for matching across bookmakers."""
    import re
    from datetime import datetime
    
    event_name = row.get("event_name", "").strip()
    event_date = row.get("event_date", "")
    
    # Common team name mappings (expand as needed)
    team_mappings = {
        "manchester united": ["man utd", "manchester utd", "mufc", "man united"],
        "liverpool": ["liverpool fc", "lfc"],
        "arsenal": ["arsenal fc"],
        "chelsea": ["chelsea fc", "cfc"],
        "tottenham": ["tottenham hotspur", "spurs", "tottenham fc"],
        "manchester city": ["man city", "mcfc", "manchester city fc"],
        # Add more mappings as needed
    }
    
    # Normalize event name
    event_name_lower = event_name.lower()
    
    # Replace common separators
    event_name_lower = re.sub(r'\s+vs\.?\s+', ' vs ', event_name_lower)
    event_name_lower = re.sub(r'\s+v\.?\s+', ' vs ', event_name_lower)
    event_name_lower = re.sub(r'\s+-\s+', ' vs ', event_name_lower)
    
    # Standardize team names
    for standard_name, variations in team_mappings.items():
        for variation in variations:
            event_name_lower = event_name_lower.replace(variation, standard_name)
    
    # Remove common prefixes/suffixes
    event_name_lower = re.sub(r'^(live|upcoming|today|tomorrow)\s+', '', event_name_lower)
    event_name_lower = re.sub(r'\s+(live|upcoming|today|tomorrow)$', '', event_name_lower)
    
    # Normalize whitespace
    event_name_lower = re.sub(r'\s+', ' ', event_name_lower).strip()
    
    row["event_name_normalized"] = event_name_lower
    
    # Normalize event date
    if event_date:
        try:
            # Try to parse various date formats
            date_formats = [
                "%Y-%m-%d %H:%M:%S",
                "%Y-%m-%dT%H:%M:%SZ",
                "%d/%m/%Y %H:%M",
                "%m/%d/%Y %H:%M",
                "%d-%m-%Y %H:%M",
                "%Y-%m-%d %H:%M",
            ]
            
            parsed_date = None
            for fmt in date_formats:
                try:
                    parsed_date = datetime.strptime(event_date, fmt)
                    break
                except:
                    continue
            
            if parsed_date:
                # Standardize to ISO format
                row["event_date_normalized"] = parsed_date.strftime("%Y-%m-%d %H:%M:%S")
            else:
                row["event_date_normalized"] = event_date
        except:
            row["event_date_normalized"] = event_date
    else:
        row["event_date_normalized"] = ""
    
    return row

#     generate_event_id:
#       type: row_transform
#       function: |
def generate_event_id(row):
    """Generate stable event_id for matching across bookmakers."""
    import hashlib
    
    event_name = row.get("event_name_normalized", "").lower().strip()
    event_date = row.get("event_date_normalized", "")
    sport = row.get("sport", "football").lower()
    
    # Create normalized key
    normalized_key = f"{sport}_{event_name}_{event_date}"
    
    # Remove special characters for consistent hashing
    normalized_key = normalized_key.replace(" ", "_").replace("-", "_")
    normalized_key = "".join(c for c in normalized_key if c.isalnum() or c == "_")
    
    # Generate hash-based event_id
    event_id_hash = hashlib.md5(normalized_key.encode()).hexdigest()[:16]
    row["event_id"] = f"event_{event_id_hash}"
    
    return row

Aggregation by Event ID

Before surebet detection, aggregate all odds by event_id:

# Pipeline: Prepare row-level odds for downstream surebet detection
pipeline:
  - stage: load_csv
    args:
      - { path: "${OUTPUT_PATH_UNIFIED_ODDS}", header: "true", inferSchema: "true" }
  
  # Normalize event names and dates
  - stage: python_row_transform:normalize_event_name
    args: []
  
  # Generate stable event_id
  - stage: python_row_transform:generate_event_id
    args: []
  
  # Create matching key (event_id + market_type + selection)
  - stage: python_row_transform:create_matching_key
    args: []
  
  # Deduplicate row-level odds
  - stage: dedup
    args: [ "matching_key", "bookmaker", "url" ]

  # Save row-level odds dataset (ready for downstream aggregation + surebet detection)
  - stage: save_csv
    args: [ "${OUTPUT_PATH_ODDS_ROW_LEVEL}", "overwrite" ]

Surebet detection and bookmaker-odds pivoting are computed downstream after grouping by matching_key (recommended: Trino/Spark SQL).

Event Matching & Odds Comparison Table

After aggregating odds from all bookmakers by event_id, create a comparison table that shows odds for the same event/market across all bookmakers. The surebet finder consults these aggregated odds by event to detect arbitrage opportunities.

Step 1: Generate Event Matching Table

# Pipeline: Generate event matching table
pipeline:
  - stage: load_csv
    args:
      - { path: "${OUTPUT_PATH_UNIFIED_ODDS}", header: "true", inferSchema: "true" }
  
  # Group by event_id + market_type + selection to create matching table
  - stage: python_row_transform:create_matching_key
    args: []  # Creates key: event_id + market_type + selection
  
  # Save matching table
  - stage: save_csv
    args: [ "${OUTPUT_PATH_MATCHING_TABLE}", "overwrite" ]

Step 2: Pivot Table for Odds Comparison

# python_extensions:
#   stages:
#     create_matching_key:
#       type: row_transform
#       function: |
def create_matching_key(row):
    """Create a matching key for event + market + selection."""
    event_id = row.get("event_id", "")
    market_type = row.get("market_type", "").lower().replace(" ", "_")
    selection = row.get("selection", "").lower().replace(" ", "_")
    row["matching_key"] = f"{event_id}_{market_type}_{selection}"
    return row

Example Output: Odds Comparison Table

The final matching table will look like:

event_idevent_namemarket_typeselectionbet365pinnaclebetfairwilliamhillunibetbest_oddsbest_bookmaker
event_123Man Utd vs Liverpoolmatch_winnerhome_win2.102.152.122.082.132.15pinnacle
event_123Man Utd vs Liverpoolmatch_winneraway_win3.503.453.553.603.483.60williamhill
event_123Man Utd vs Liverpoolmatch_winnerdraw3.203.253.183.223.243.25pinnacle

Concrete Example: Surebet Detection with Intelligent Navigation

This example demonstrates how to use intelligent navigation (intelligent_explore) and intelligent table extraction (intelligent_flatSelect) to extract odds from bookmaker sites with complex, non-trivial table structures.

Challenge: Complex Table Structures

Many bookmaker sites use:

  • Dynamic tables with JavaScript-rendered content
  • Nested structures where odds are in different containers
  • Non-standard HTML that doesn't follow simple CSS selector patterns
  • Multiple market types on the same page (match winner, over/under, handicaps, etc.)

Solution: Intelligent Extraction

Use intelligent_explore for navigation and intelligent_flatSelect for extracting odds from complex table structures:

# Complete example: Surebet detection with intelligent extraction
fetch:
  url: "https://www.bet365.com/football"
  traces:
    - { action: "visit", params: { url: "https://www.bet365.com/football", cooldown: 1.0 } }
    - { action: "wait", params: { seconds: 2 } }  # Wait for dynamic content to load

pipeline:
  # ============================================
  # Step 1: Intelligent Navigation
  # Navigate through match/event pages using LLM to find links
  # ============================================
  - stage: intelligent_explore
    args: 
      - "links to football match pages or event listings"  # Segmentation prompt
      - 3  # Depth: navigate 3 levels deep

  # ============================================
  # Step 2: Intelligent Table Extraction
  # Extract odds from complex table structures using LLM
  # ============================================
  - stage: intelligent_flatSelect
    args:
      # Segmentation prompt: Find repeated elements (rows/cards) containing betting markets
      - "repeated betting market rows or cards, each containing a market type and multiple selection options with odds"
      # Extraction prompt: Extract structured data from each segment
      - "Extract: event_name (match/event title), event_date (date/time), market_type (e.g., match_winner, over_under, handicap), selection (e.g., home_win, away_win, draw, over_2.5, under_2.5), odds_decimal (decimal odds as number), odds_raw (original odds string)"
      # Optional prefix for extracted fields
      - "bet365_"  # Prefix for all extracted fields

  # ============================================
  # Step 3: Normalize and Enrich
  # ============================================
  - stage: python_row_transform:normalize_bet365_intelligent
    args: []
  - stage: python_row_transform:add_metadata
    args:
      - bookmaker: "bet365"
        sport: "football"
        extraction_method: "intelligent"

  # ============================================
  # Step 4: Cache and Store
  # ============================================
  - stage: cache
    args: []
  - stage: store
    args: [ "bet365_offers" ]

  # ============================================
  # Repeat for other bookmakers...
  # ============================================
  # Bookmaker 2: Pinnacle (similar pattern)
  - stage: reset
    args: []
  - stage: visit
    args: [ "https://www.pinnacle.com/en/football" ]
  - stage: intelligent_explore
    args:
      - "links to football match pages"
      - 2
  - stage: intelligent_flatSelect
    args:
      - "betting market sections with odds for different selections"
      - "Extract: event_name, event_date, market_type, selection, odds_decimal, odds_raw"
      - "pinnacle_"
  - stage: python_row_transform:normalize_pinnacle_intelligent
    args: []
  - stage: python_row_transform:add_metadata
    args:
      - bookmaker: "pinnacle"
        sport: "football"
        extraction_method: "intelligent"
  - stage: cache
    args: []
  - stage: store
    args: [ "pinnacle_offers" ]

  # Bookmaker 3: Betfair
  - stage: reset
    args: []
  - stage: visit
    args: [ "https://www.betfair.com/sport/football" ]
  - stage: intelligent_explore
    args:
      - "event or match links in the football section"
      - 2
  - stage: intelligent_flatSelect
    args:
      - "betting market rows or cards with selection buttons and odds"
      - "Extract: event_name, event_date, market_type, selection, odds_decimal, odds_raw"
      - "betfair_"
  - stage: python_row_transform:normalize_betfair_intelligent
    args: []
  - stage: python_row_transform:add_metadata
    args:
      - bookmaker: "betfair"
        sport: "football"
        extraction_method: "intelligent"
  - stage: cache
    args: []
  - stage: store
    args: [ "betfair_offers" ]

  # Bookmaker 4: William Hill
  - stage: reset
    args: []
  - stage: visit
    args: [ "https://sports.williamhill.com/betting/en-gb/football" ]
  - stage: intelligent_explore
    args:
      - "football match or event links"
      - 2
  - stage: intelligent_flatSelect
    args:
      - "betting market containers with selection options and price buttons"
      - "Extract: event_name, event_date, market_type, selection, odds_decimal, odds_raw"
      - "williamhill_"
  - stage: python_row_transform:normalize_williamhill_intelligent
    args: []
  - stage: python_row_transform:add_metadata
    args:
      - bookmaker: "williamhill"
        sport: "football"
        extraction_method: "intelligent"
  - stage: cache
    args: []
  - stage: store
    args: [ "williamhill_offers" ]

  # Bookmaker 5: Unibet
  - stage: reset
    args: []
  - stage: visit
    args: [ "https://www.unibet.com/betting/sports/filter/football" ]
  - stage: intelligent_explore
    args:
      - "football match links or event cards"
      - 2
  - stage: intelligent_flatSelect
    args:
      - "betting market sections with odds for different outcomes"
      - "Extract: event_name, event_date, market_type, selection, odds_decimal, odds_raw"
      - "unibet_"
  - stage: python_row_transform:normalize_unibet_intelligent
    args: []
  - stage: python_row_transform:add_metadata
    args:
      - bookmaker: "unibet"
        sport: "football"
        extraction_method: "intelligent"
  - stage: cache
    args: []
  - stage: store
    args: [ "unibet_offers" ]

  # ============================================
  # Step 5: Merge All Bookmakers
  # ============================================
  - stage: reset
    args: []
  - stage: union_with
    args: [ "bet365_offers", "pinnacle_offers", "betfair_offers", "williamhill_offers", "unibet_offers" ]

  # ============================================
  # Step 6: Normalize Event Names for Matching
  # Critical: Same event can have different names across bookmakers
  # ============================================
  - stage: python_row_transform:normalize_event_name
    args: []

  # ============================================
  # Step 7: Generate Event ID for Matching
  # Creates stable event_id from normalized event name + date
  # ============================================
  - stage: python_row_transform:generate_event_id
    args: []

  # ============================================
  # Step 8: Create Matching Key
  # Group by event_id + market_type + selection
  # ============================================
  - stage: python_row_transform:create_matching_key
    args: []

  # ============================================
  # Step 9: Note on aggregation
  # ============================================
  # Surebet detection requires aggregation across bookmakers by (event_id, market_type, selection).
  # This example exports row-level odds; compute the pivot + surebet math downstream (Trino/Spark SQL).

  - stage: dedup
    args: [ "matching_key", "bookmaker", "url" ]

  - stage: save_csv
    args: [ "${OUTPUT_PATH_ODDS_ROW_LEVEL}", "overwrite" ]

Python Extensions for Intelligent Extraction Normalization

# python_extensions:
#   stages:
#     normalize_bet365_intelligent:
#       type: row_transform
#       function: |
def normalize_bet365_intelligent(row):
    """Normalize odds extracted via intelligent_flatSelect from bet365."""
    # intelligent_flatSelect extracts fields with prefix "bet365_"
    odds_decimal = row.get("bet365_odds_decimal")
    odds_raw = row.get("bet365_odds_raw", "")
    
    # If odds_decimal is already a number, use it
    if odds_decimal:
        try:
            row["odds_decimal"] = float(odds_decimal)
        except:
            row["odds_decimal"] = None
    elif odds_raw:
        # Parse odds_raw if odds_decimal wasn't extracted
        # bet365 format: "2.50" or "3/2"
        try:
            if "/" in odds_raw:
                parts = odds_raw.split("/")
                decimal = (float(parts[0]) / float(parts[1])) + 1
                row["odds_decimal"] = round(decimal, 2)
            else:
                row["odds_decimal"] = float(odds_raw.replace(",", ""))
        except:
            row["odds_decimal"] = None
    
    # Map extracted fields to canonical schema
    row["event_name"] = row.get("bet365_event_name", "")
    row["event_date"] = row.get("bet365_event_date", "")
    row["market_type"] = row.get("bet365_market_type", "").lower().replace(" ", "_")
    row["selection"] = row.get("bet365_selection", "").lower().replace(" ", "_")
    row["url"] = row.get("_url", "")  # Current page URL
    
    return row

#     detect_surebet:
#       type: row_transform
#       function: |
def detect_surebet(row, min_profit_margin=0.02):
    """Detect surebet (arbitrage) opportunities for a market."""
    # For match_winner market with 3 outcomes
    # Collect all odds for this event+market from different bookmakers
    # This is a simplified version - in production, you'd group by event_id + market_type first
    
    # Example: if we have odds from multiple bookmakers in the same row (after pivot)
    odds_home = row.get("odds_home", 0) or 999
    odds_draw = row.get("odds_draw", 0) or 999
    odds_away = row.get("odds_away", 0) or 999
    
    # Find best odds for each outcome across all bookmakers
    bookmakers = ["bet365", "pinnacle", "betfair", "williamhill", "unibet"]
    
    best_home = min([row.get(f"{b}_odds_home", 999) for b in bookmakers if row.get(f"{b}_odds_home")])
    best_draw = min([row.get(f"{b}_odds_draw", 999) for b in bookmakers if row.get(f"{b}_odds_draw")])
    best_away = min([row.get(f"{b}_odds_away", 999) for b in bookmakers if row.get(f"{b}_odds_away")])
    
    # Calculate implied probability
    if best_home > 0 and best_draw > 0 and best_away > 0:
        implied_prob = (1/best_home) + (1/best_draw) + (1/best_away)
        
        # Surebet exists if implied_prob < 1.0
        if implied_prob < 1.0:
            profit_margin = (1.0 - implied_prob) * 100  # Percentage profit
            
            if profit_margin >= min_profit_margin * 100:
                row["surebet_exists"] = True
                row["surebet_profit_pct"] = profit_margin
                row["best_home_odds"] = best_home
                row["best_draw_odds"] = best_draw
                row["best_away_odds"] = best_away
            else:
                row["surebet_exists"] = False
                row["surebet_profit_pct"] = 0.0
        else:
            row["surebet_exists"] = False
            row["surebet_profit_pct"] = 0.0
    
    return row

Why Intelligent Extraction is Better for Complex Tables

Traditional CSS Selectors often fail with:

  • Dynamic content loaded via JavaScript
  • Nested structures where odds are in different containers
  • Non-standard HTML that doesn't follow predictable patterns
  • Multiple market types mixed on the same page

Intelligent Extraction (intelligent_flatSelect) solves this by:

  • Using LLM to understand the page structure semantically
  • Adapting to different HTML layouts automatically
  • Extracting structured data even from complex, nested tables
  • Handling dynamic content that changes structure

Critical: Event Matching Before Surebet Detection

Important: The surebet finder must consult aggregated odds by event. This means:

  1. Normalize event names across bookmakers (e.g., "Man Utd vs Liverpool" = "Manchester United v Liverpool")
  2. Generate stable event_id from normalized event name + date
  3. Aggregate all odds for the same event/market/selection from all bookmakers
  4. Then detect surebet by comparing aggregated odds across bookmakers

Without proper event matching, the same event might be treated as different events, preventing surebet detection.

Example Output: Surebet Opportunities Table

The final output will contain rows with surebet opportunities:

event_idevent_namemarket_typeselectionbet365_oddspinnacle_oddsbetfair_oddswilliamhill_oddsunibet_oddsbest_oddssurebet_existssurebet_profit_pct
event_123Man Utd vs Liverpoolmatch_winnerhome_win2.102.152.122.082.132.15False0.0
event_123Man Utd vs Liverpoolmatch_winnerdraw3.203.253.183.223.243.25False0.0
event_124Arsenal vs Chelseamatch_winnerhome_win2.502.552.482.522.512.55True2.3

Arbitrage Detection

After creating the matching table, detect arbitrage opportunities:

# Pipeline: Detect arbitrage opportunities
pipeline:
  - stage: load_csv
    args:
      - { path: "${OUTPUT_PATH_MATCHING_TABLE}", header: "true", inferSchema: "true" }
  
  # Calculate arbitrage for each market
  - stage: python_row_transform:calculate_arbitrage
    args: []
  
  # Filter only profitable arbitrage opportunities
  - stage: python_row_transform:filter_arbitrage
    args:
      - min_profit_margin: 0.02  # Minimum 2% profit margin
  
  - stage: save_csv
    args: [ "${OUTPUT_PATH_ARBITRAGE_OPPORTUNITIES}", "overwrite" ]
# python_extensions:
#   stages:
#     calculate_arbitrage:
#       type: row_transform
#       function: |
def calculate_arbitrage(row):
    """Calculate arbitrage opportunity for a market."""
    # For match_winner market with 3 outcomes (home, draw, away)
    odds_home = row.get("bet365_home", 0) or row.get("pinnacle_home", 0) or 999
    odds_draw = row.get("bet365_draw", 0) or row.get("pinnacle_draw", 0) or 999
    odds_away = row.get("bet365_away", 0) or row.get("pinnacle_away", 0) or 999
    
    # Find best odds for each outcome
    best_home = min([row.get(f"{b}_home", 999) for b in ["bet365", "pinnacle", "betfair", "williamhill", "unibet"] if row.get(f"{b}_home")])
    best_draw = min([row.get(f"{b}_draw", 999) for b in ["bet365", "pinnacle", "betfair", "williamhill", "unibet"] if row.get(f"{b}_draw")])
    best_away = min([row.get(f"{b}_away", 999) for b in ["bet365", "pinnacle", "betfair", "williamhill", "unibet"] if row.get(f"{b}_away")])
    
    # Calculate implied probability
    implied_prob = (1/best_home) + (1/best_draw) + (1/best_away)
    
    # Arbitrage exists if implied_prob < 1.0
    if implied_prob < 1.0:
        row["arbitrage_exists"] = True
        row["arbitrage_profit"] = (1.0 - implied_prob) * 100  # Percentage profit
    else:
        row["arbitrage_exists"] = False
        row["arbitrage_profit"] = 0.0
    
    return row

API Endpoints for Odds Comparison

After processing, query the unified odds dataset:

Get Best Odds for an Event

GET /api/webrobot/api/datasets/{datasetId}/query?sqlQuery=SELECT event_name, selection, MAX(odds_decimal) as best_odds, bookmaker FROM odds_table WHERE event_id = 'event_123' GROUP BY event_name, selection, bookmaker

Get Arbitrage Opportunities

GET /api/webrobot/api/datasets/{datasetId}/query?sqlQuery=SELECT * FROM arbitrage_table WHERE arbitrage_profit > 2.0 ORDER BY arbitrage_profit DESC