Vertical Use Cases

This section provides complete, production-ready examples for specific industry verticals, demonstrating how to build end-to-end data pipelines using WebRobot ETL.

What are Vertical Use Cases?

Vertical use cases are industry-specific data pipelines that solve real business problems:

  • Price Comparison & E-commerce: Aggregate product offers from multiple sources, compare prices, track availability
  • Real Estate: Monitor property listings, track price trends, identify arbitrage opportunities
  • Sports Betting: Collect odds from multiple bookmakers, detect arbitrage opportunities, track line movements
  • Financial Markets: Aggregate news, earnings transcripts, social sentiment for trading signals
  • Legal & Compliance: Monitor regulatory changes, extract contract terms, track compliance deadlines
  • Healthcare: Aggregate clinical guidelines, drug information, medical research

Common Patterns Across Verticals

Important: When starting a pipeline with crawling stages (explore, join, etc.), you must include a fetch: section with a starting URL. Pipelines that start with load_csv or other non-crawling stages don't require fetch:.

1. Multi-Source Aggregation

Most verticals require aggregating data from multiple sources:

# Pattern: Fetch from multiple sources, union, deduplicate
fetch:
  url: "https://example.com"  # Starting URL for Source A

pipeline:
  # Source A: Direct crawl
  - stage: explore
    args: [ "li.next a", 2 ]
  - stage: join
    args: [ "article.product_pod h3 a", "LeftOuter" ]
  - stage: extract
    args:
      - { selector: "h1", method: "text", as: "title" }
      - { selector: ".price_color", method: "text", as: "price_raw" }
  - stage: cache
    args: []
  - stage: store
    args: [ "source_a" ]

  # Source B: API-based discovery
  - stage: reset
    args: []
  - stage: load_csv
    args:
      - { path: "${INPUT_PATH}", header: "true", inferSchema: "true" }
  - stage: searchEngine
    args:
      - provider: "google"
        ean: "$ean"
        num_results: 5
  - stage: visit
    args: [ "$result_link" ]
  - stage: extract
    args:
      - { selector: "title", method: "text", as: "title" }
      - { selector: "meta[property='product:price:amount']", method: "attr:content", as: "price_raw" ]
  - stage: cache
    args: []
  - stage: store
    args: [ "source_b" ]

  # Merge sources
  - stage: reset
    args: []
  - stage: union_with
    args: [ "source_a", "source_b" ]
  - stage: dedup
    args: [ "url" ]  # or "ean", "sku", etc.

2. Entity Resolution & Deduplication

Use stable business keys for deduplication (a sketch for deriving such a key follows the list):

  • E-commerce: ean, sku, url + source
  • Real Estate: property_id, address + source
  • Sports Betting: match_id, bookmaker + market_type
  • Financial: ticker, date, source
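
Because dedup is shown above taking a single column, a composite key can be derived first with a row transform. Here is a minimal Python sketch, assuming the python_row_transform:<name> stage dispatches to a function that receives and returns each row as a dict (an assumption to verify against your WebRobot deployment):

# Hypothetical row transform: derive one stable dedup column from a
# business key plus the source, e.g. "4006381333931|shop_a"
def make_dedup_key(row: dict) -> dict:
    ean = (row.get("ean") or "").strip()
    source = (row.get("source") or "").strip().lower()
    row["dedup_key"] = f"{ean}|{source}"
    return row

The pipeline would then run dedup on the derived dedup_key column.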

3. Schema Normalization

Normalize data across sources to a canonical schema:

# Example: Price normalization
- stage: python_row_transform:normalize_price
  args: []
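
What normalize_price does is up to you; as a minimal sketch, assuming price_raw values like "£51.77" or "$1,299.00" and the same row-as-dict transform contract as above:

import re

# Assumed mapping from currency symbol to ISO code
CURRENCY_SIGNS = {"£": "GBP", "$": "USD", "€": "EUR"}

def normalize_price(row: dict) -> dict:
    raw = row.get("price_raw") or ""
    # Map the first recognized currency symbol to an ISO code
    row["currency"] = next(
        (code for sign, code in CURRENCY_SIGNS.items() if sign in raw), None
    )
    # Drop the symbol and thousands separators, keep the decimal point
    digits = re.sub(r"[^\d.]", "", raw.replace(",", ""))
    row["price"] = float(digits) if digits else None
    return row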

4. Enrichment & Validation

Enrich records with additional data:

  • EAN validation: Verify product codes (see the check-digit sketch after this list)
  • Geocoding: Convert addresses to coordinates
  • Sentiment analysis: Analyze text content
  • Image matching: Match product images
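
The EAN check is pure arithmetic and easy to run in a row transform. A self-contained sketch of EAN-13 check-digit validation:

def is_valid_ean13(code: str) -> bool:
    # EAN-13: weights alternate 1,3 over the first 12 digits; the 13th
    # digit must equal (10 - weighted_sum mod 10) mod 10
    if len(code) != 13 or not code.isdigit():
        return False
    digits = [int(c) for c in code]
    checksum = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - checksum % 10) % 10 == digits[12]

is_valid_ean13("4006381333931")  # True (a commonly cited valid EAN-13)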

Vertical-Specific Guides

Price Comparison & E-commerce

Use Case: Aggregate product offers from multiple e-commerce sites, compare prices, track availability.

Key Features:

  • Multi-source product aggregation
  • Price normalization and comparison
  • EAN-based product matching
  • Availability tracking
  • Historical price analysis

Example Output: Unified product catalog with prices from 10+ sources, deduplicated by EAN.


Real Estate

Use Case: Monitor property listings, track price trends, identify undervalued properties.

Key Features:

  • Multi-listing site aggregation
  • Price per square meter calculation
  • Property clustering (ML-based)
  • Arbitrage opportunity detection
  • Market trend analysis

Example Output: Properties flagged as "undervalued" based on cluster analysis.
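
To make the "undervalued" idea concrete, here is an illustrative pandas sketch (not the guide's exact pipeline); the price, sqm, and cluster column names are assumptions:

import pandas as pd

def flag_undervalued(df: pd.DataFrame, discount: float = 0.20) -> pd.DataFrame:
    df = df.copy()
    df["price_per_sqm"] = df["price"] / df["sqm"]
    # Compare each listing to the median of its (ML-assigned) cluster
    cluster_median = df.groupby("cluster")["price_per_sqm"].transform("median")
    df["undervalued"] = df["price_per_sqm"] < (1 - discount) * cluster_median
    return df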


Sports Betting

Use Case: Collect odds from multiple bookmakers, detect arbitrage opportunities, track line movements.

Key Features:

  • Multi-bookmaker odds aggregation
  • Real-time odds polling
  • Arbitrage detection (surebet scanner)
  • Line movement tracking
  • Implied probability calculation

Example Output: Arbitrage opportunities with profit margins > 2%.
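
The arbitrage math itself is compact. A minimal sketch: take the best decimal odds per outcome across bookmakers; if the implied probabilities sum below 1, the market guarantees a profit:

def implied_probability(decimal_odds: float) -> float:
    return 1.0 / decimal_odds

def surebet_margin(best_odds: list[float]) -> float:
    # Positive result = guaranteed profit fraction; negative = no arbitrage
    total = sum(implied_probability(o) for o in best_odds)
    return 1.0 / total - 1.0

# Example: best odds of 2.10 (outcome A) and 2.05 (outcome B) across books:
# 1/2.10 + 1/2.05 ≈ 0.964, so the margin is ≈ 3.7%, above the 2% threshold.
surebet_margin([2.10, 2.05])  # ≈ 0.037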


Financial Markets

Use Case: Aggregate news, earnings transcripts, social sentiment for trading signals.

Key Features:

  • News aggregation (licensed providers and/or public feeds)
  • Earnings call transcript extraction
  • Social sentiment analysis (licensed social/sentiment providers)
  • SEC filing extraction
  • Trading signal generation

Example Output: Trading signals based on sentiment + news correlation.
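
As a toy illustration of turning per-article sentiment into a daily signal (the threshold and column names are assumptions, not the guide's actual model):

import pandas as pd

def daily_signals(df: pd.DataFrame, threshold: float = 0.3) -> pd.DataFrame:
    # Expected columns: ticker, date, sentiment (one score per article, -1..1)
    daily = df.groupby(["ticker", "date"], as_index=False)["sentiment"].mean()
    daily["signal"] = daily["sentiment"].map(
        lambda s: "long" if s > threshold else ("short" if s < -threshold else "flat")
    )
    return daily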


Legal & Compliance

Use Case: Monitor regulatory changes, extract contract terms, track compliance deadlines.

Key Features:

  • Regulatory document monitoring
  • Contract clause extraction
  • Obligation tracking
  • Compliance deadline alerts
  • Risk flagging

Example Output: Compliance dashboard with upcoming deadlines.
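
Deadline alerting reduces to simple date arithmetic once obligations are extracted. A minimal sketch, with the record field names assumed:

from datetime import date

def upcoming_deadlines(obligations: list[dict], window_days: int = 30) -> list[dict]:
    # Keep obligations whose ISO-format deadline falls within the window
    today = date.today()
    return [
        o for o in obligations
        if 0 <= (date.fromisoformat(o["deadline"]) - today).days <= window_days
    ]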


Healthcare

Use Case: Aggregate clinical guidelines, drug information, medical research.

Key Features:

  • Clinical guideline aggregation
  • Drug interaction checking
  • Medical research paper extraction
  • Patient data anonymization
  • Regulatory compliance

Example Output: Drug interaction database with clinical evidence.


LLM Fine-Tuning Datasets

Use Case: Build high-quality datasets for fine-tuning Large Language Models (LLMs) by aggregating, cleaning, and structuring data from multiple sources.

Key Features:

  • Multi-source data aggregation (forums, documentation, Q&A sites, code repositories)
  • Text cleaning and normalization
  • Format conversion (instruction-following, chat, code completion)
  • Quality filtering and deduplication
  • Dataset balancing and train/val/test splitting
  • Export to JSONL/Parquet formats

Example Output: 100K+ instruction-following examples in Alpaca format, ready for LLM fine-tuning.
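
The Alpaca format is plain JSONL with instruction/input/output fields. A sketch of the export step (the source field names question, context, and answer are assumptions):

import json

def write_alpaca_jsonl(records: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            example = {
                "instruction": r["question"],   # e.g. a cleaned Q&A title
                "input": r.get("context", ""),  # optional supporting text
                "output": r["answer"],          # accepted, cleaned answer
            }
            f.write(json.dumps(example, ensure_ascii=False) + "\n")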


90-Day Asset Price Prediction

Use Case: Build training datasets for 90-day asset price prediction models, designed for LLM fine-tuning on NVIDIA DGX SPARK (Feeless portfolio management layer).

Key Features:

  • Multi-source financial data aggregation (prices, macro, sentiment, alternative data)
  • Technical indicator calculation (RSI, MACD, Bollinger Bands, volatility)
  • 90-day forward target generation
  • Time-series alignment and feature engineering
  • LLM fine-tuning format (instruction-following with temporal context)
  • Export for distributed training on NVIDIA DGX SPARK

Example Output: JSONL/Parquet training dataset with 50+ features for 90-day price prediction, ready for LLM fine-tuning.

Related: Part of the Feeless portfolio management layer for agentic pools.
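
Two of the steps above are easy to pin down in code: a technical indicator and the forward target. A pandas sketch (a simple-average RSI variant rather than Wilder's smoothing; the date and close column names are assumptions):

import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    return 100 - 100 / (1 + gain / loss)

def add_features_and_target(df: pd.DataFrame, horizon: int = 90) -> pd.DataFrame:
    df = df.sort_values("date").copy()
    df["rsi_14"] = rsi(df["close"])
    # Target: forward return over the 90-day prediction horizon
    df["fwd_return_90d"] = df["close"].shift(-horizon) / df["close"] - 1
    # The last `horizon` rows have no future price and are dropped
    return df.dropna(subset=["fwd_return_90d"])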

Getting Started

  1. Choose your vertical: Review the guides above to find the use case that matches your needs
  2. Review the architecture: Each guide includes a complete pipeline architecture
  3. Customize the pipeline: Adapt the YAML examples to your specific sources and requirements
  4. Deploy and monitor: Use the WebRobot API to deploy and monitor your pipelines

Best Practices

1. Start with a Single Source

Begin with one source, validate the extraction, then add more sources:

# Phase 1: Single source
fetch:
  url: "https://example.com"  # Starting URL

pipeline:
  - stage: explore
    args: [ "li.next a", 2 ]
  - stage: extract
    args: [ ... ]
  - stage: save_csv
    args: [ "${OUTPUT_PATH}", "overwrite" ]

2. Use Environment Variables

Always use environment variables for paths and configuration:

- stage: save_csv
  args: [ "${OUTPUT_PATH}", "overwrite" ]

3. Cache Intermediate Results

Use cache or persist before store for in-memory branching:

- stage: cache
  args: []
- stage: store
  args: [ "branch_label" ]

4. Choose Stable Deduplication Keys

Select keys that uniquely identify entities across sources:

  • ✅ Good: ean, sku, property_id, match_id
  • ❌ Bad: title, description, price

5. Normalize Schemas Early

Normalize data as early as possible in the pipeline:

- stage: extract
  args: [ ... ]
- stage: python_row_transform:normalize_schema
  args: []