This section provides complete, production-ready examples for specific industry verticals, demonstrating how to build end-to-end data pipelines using WebRobot ETL.
Vertical use cases are industry-specific data pipelines that solve real business problems:
- Price Comparison & E-commerce: Aggregate product offers from multiple sources, compare prices, track availability
- Real Estate: Monitor property listings, track price trends, identify arbitrage opportunities
- Sports Betting: Collect odds from multiple bookmakers, detect arbitrage opportunities, track line movements
- Financial Markets: Aggregate news, earnings transcripts, social sentiment for trading signals
- Legal & Compliance: Monitor regulatory changes, extract contract terms, track compliance deadlines
- Healthcare: Aggregate clinical guidelines, drug information, medical research
Important: When a pipeline starts with crawling stages (explore, join, etc.), you must include a fetch: section with a starting URL. Pipelines that start with load_csv or other non-crawling stages don't require fetch:.
Most verticals require aggregating data from multiple sources:
# Pattern: Fetch from multiple sources, union, deduplicate
fetch:
  url: "https://example.com"   # Starting URL for Source A
pipeline:
  # Source A: Direct crawl
  - stage: explore
    args: [ "li.next a", 2 ]
  - stage: join
    args: [ "article.product_pod h3 a", "LeftOuter" ]
  - stage: extract
    args:
      - { selector: "h1", method: "text", as: "title" }
      - { selector: ".price_color", method: "text", as: "price_raw" }
  - stage: cache
    args: []
  - stage: store
    args: [ "source_a" ]

  # Source B: API-based discovery
  - stage: reset
    args: []
  - stage: load_csv
    args:
      - { path: "${INPUT_PATH}", header: "true", inferSchema: "true" }
  - stage: searchEngine
    args:
      - provider: "google"
        ean: "$ean"
        num_results: 5
  - stage: visit
    args: [ "$result_link" ]
  - stage: extract
    args:
      - { selector: "title", method: "text", as: "title" }
      - { selector: "meta[property='product:price:amount']", method: "attr:content", as: "price_raw" }
  - stage: cache
    args: []
  - stage: store
    args: [ "source_b" ]

  # Merge sources
  - stage: reset
    args: []
  - stage: union_with
    args: [ "source_a", "source_b" ]
  - stage: dedup
    args: [ "url" ]   # or "ean", "sku", etc.

Use stable business keys for deduplication:
- E-commerce: ean, sku, url + source
- Real Estate: property_id, address + source
- Sports Betting: match_id, bookmaker + market_type
- Financial: ticker, date, source
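For composite keys such as url + source, one option is to derive a single key column before the dedup stage. The snippet below is a minimal sketch only: it assumes rows are passed to a custom row transform as plain dicts, and the function name is hypothetical, not part of the WebRobot API.

```python
# Minimal sketch: derive a composite dedup key before the dedup stage.
# Assumes rows are plain dicts; the function name is hypothetical.
from typing import Any, Dict


def add_dedup_key(row: Dict[str, Any]) -> Dict[str, Any]:
    """Combine url and source into a single stable key."""
    url = (row.get("url") or "").strip().lower()
    source = (row.get("source") or "").strip().lower()
    row["dedup_key"] = f"{source}|{url}"
    return row


if __name__ == "__main__":
    sample = {"url": "https://shop-a.example/p/123", "source": "shop_a", "price_raw": "£19.99"}
    print(add_dedup_key(sample)["dedup_key"])
```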
Normalize data across sources to a canonical schema:
# Example: Price normalization
- stage: python_row_transform:normalize_price
  args: []

Enrich records with additional data:
- EAN validation: Verify product codes
- Geocoding: Convert addresses to coordinates
- Sentiment analysis: Analyze text content
- Image matching: Match product images
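As an illustration of the normalization step above, here is a minimal sketch of what a row-level transform such as normalize_price might do. It assumes each row arrives as a dict with a price_raw string; the field names and the simplified currency handling are assumptions, not the actual implementation.

```python
# Minimal sketch of a price-normalization row transform.
# Assumes each row is a dict with a "price_raw" string such as "£19.99" or "19,99 EUR".
import re
from typing import Any, Dict, Optional

CURRENCY_SYMBOLS = {"£": "GBP", "$": "USD", "€": "EUR"}


def parse_price(raw: str) -> Optional[float]:
    """Extract a numeric price, tolerating a comma as decimal separator."""
    match = re.search(r"(\d+[.,]?\d*)", raw)
    if not match:
        return None
    return float(match.group(1).replace(",", "."))


def normalize_price(row: Dict[str, Any]) -> Dict[str, Any]:
    raw = str(row.get("price_raw", ""))
    row["price"] = parse_price(raw)
    row["currency"] = next((code for sym, code in CURRENCY_SYMBOLS.items() if sym in raw), "UNKNOWN")
    return row


if __name__ == "__main__":
    print(normalize_price({"title": "Sample product", "price_raw": "£51.77"}))
```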
Use Case: Aggregate product offers from multiple e-commerce sites, compare prices, track availability.
Key Features:
- Multi-source product aggregation
- Price normalization and comparison
- EAN-based product matching
- Availability tracking
- Historical price analysis
Example Output: Unified product catalog with prices from 10+ sources, deduplicated by EAN.
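EAN-based matching is only as reliable as the EAN values themselves, so a validation step is useful before joining or deduplicating on that key. Below is a minimal sketch of an EAN-13 check-digit validator (the standard GS1 rule); where it plugs into the pipeline, for example as a row transform, is left open.

```python
# Minimal sketch: validate EAN-13 codes with the standard GS1 check digit.
def is_valid_ean13(ean: str) -> bool:
    """Return True if ean is a 13-digit string with a correct check digit."""
    if not (ean.isdigit() and len(ean) == 13):
        return False
    digits = [int(d) for d in ean]
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    checksum = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    check_digit = (10 - checksum % 10) % 10
    return check_digit == digits[12]


if __name__ == "__main__":
    print(is_valid_ean13("4006381333931"))  # True: a well-known valid example
    print(is_valid_ean13("4006381333932"))  # False: wrong check digit
```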
Use Case: Monitor property listings, track price trends, identify undervalued properties.
Key Features:
- Multi-listing site aggregation
- Price per square meter calculation
- Property clustering (ML-based)
- Arbitrage opportunity detection
- Market trend analysis
Example Output: Properties flagged as "undervalued" based on cluster analysis.
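As a sketch of the price-per-square-meter and clustering idea: the column names, the scikit-learn KMeans clustering, and the 20% undervaluation threshold below are all assumptions chosen for illustration, not the production model.

```python
# Minimal sketch: flag listings whose price per square meter sits well below
# the average of their location cluster. Column names and KMeans usage are illustrative.
import pandas as pd
from sklearn.cluster import KMeans

listings = pd.DataFrame({
    "property_id": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "price":       [200_000, 150_000, 400_000, 390_000, 210_000, 120_000],
    "sqm":         [80, 75, 120, 118, 82, 74],
    "lat":         [45.07, 45.07, 45.10, 45.10, 45.07, 45.07],
    "lon":         [7.68, 7.69, 7.65, 7.66, 7.68, 7.69],
})

listings["price_per_sqm"] = listings["price"] / listings["sqm"]

# Cluster by location so each listing is compared with nearby peers.
listings["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(listings[["lat", "lon"]])

cluster_avg = listings.groupby("cluster")["price_per_sqm"].transform("mean")
listings["undervalued"] = listings["price_per_sqm"] < 0.8 * cluster_avg

print(listings[["property_id", "price_per_sqm", "cluster", "undervalued"]])
```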
Use Case: Collect odds from multiple bookmakers, detect arbitrage opportunities, track line movements.
Key Features:
- Multi-bookmaker odds aggregation
- Real-time odds polling
- Arbitrage detection (surebet scanner)
- Line movement tracking
- Implied probability calculation
Example Output: Arbitrage opportunities with profit margins > 2%.
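The arbitrage math itself is simple: convert decimal odds to implied probabilities and check whether the best prices across bookmakers sum to less than 1. The sketch below assumes decimal odds and a two-outcome market; the bookmaker names are illustrative.

```python
# Minimal sketch: two-outcome surebet check from the best decimal odds per outcome.
from typing import Dict


def best_odds(offers: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """offers maps outcome -> {bookmaker: decimal_odds}; return the best price per outcome."""
    return {outcome: max(books.values()) for outcome, books in offers.items()}


def arbitrage_margin(odds: Dict[str, float]) -> float:
    """Sum of implied probabilities; a value below 1.0 means a guaranteed-profit opportunity."""
    return sum(1.0 / o for o in odds.values())


if __name__ == "__main__":
    offers = {
        "home": {"book_a": 2.10, "book_b": 1.95},
        "away": {"book_a": 1.85, "book_b": 2.05},
    }
    best = best_odds(offers)          # {'home': 2.10, 'away': 2.05}
    margin = arbitrage_margin(best)   # 1/2.10 + 1/2.05 ≈ 0.964
    if margin < 1.0:
        print(f"Arbitrage: guaranteed return ≈ {(1.0 / margin - 1.0) * 100:.2f}% of total stake")
```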
Use Case: Aggregate news, earnings transcripts, social sentiment for trading signals.
Key Features:
- News aggregation (licensed providers and/or public feeds)
- Earnings call transcript extraction
- Social sentiment analysis (licensed social/sentiment providers)
- SEC filing extraction
- Trading signal generation
Example Output: Trading signals based on sentiment + news correlation.
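As a highly simplified sketch of how per-ticker sentiment and news volume might be combined into a signal: the thresholds, column names, and aggregation are assumptions for illustration only, not a trading recommendation.

```python
# Minimal sketch: aggregate per-ticker daily sentiment and emit naive signals.
# Thresholds and column names are illustrative only.
import pandas as pd

news = pd.DataFrame({
    "ticker":    ["ACME", "ACME", "ACME", "XYZ", "XYZ"],
    "date":      pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-02", "2024-05-01", "2024-05-02"]),
    "sentiment": [0.6, 0.4, 0.7, -0.5, -0.2],   # e.g. from an upstream sentiment stage
})

daily = (news.groupby(["ticker", "date"])
             .agg(avg_sentiment=("sentiment", "mean"), n_articles=("sentiment", "size"))
             .reset_index())


def to_signal(row) -> str:
    if row.avg_sentiment > 0.3 and row.n_articles >= 2:
        return "long"
    if row.avg_sentiment < -0.3 and row.n_articles >= 2:
        return "short"
    return "neutral"


daily["signal"] = daily.apply(to_signal, axis=1)
print(daily)
```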
Use Case: Monitor regulatory changes, extract contract terms, track compliance deadlines.
Key Features:
- Regulatory document monitoring
- Contract clause extraction
- Obligation tracking
- Compliance deadline alerts
- Risk flagging
Example Output: Compliance dashboard with upcoming deadlines.
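A minimal sketch of the deadline-alert step, assuming obligations have already been extracted into rows with a deadline column; the column names, sample regulations, and 30-day alert window are assumptions.

```python
# Minimal sketch: flag compliance obligations whose deadline falls within 30 days.
from datetime import date

import pandas as pd

obligations = pd.DataFrame({
    "regulation": ["GDPR Art. 30", "AML review", "Annual filing"],
    "obligation": ["Update processing register", "Refresh KYC files", "Submit report"],
    "deadline":   pd.to_datetime(["2024-06-10", "2024-07-01", "2024-12-31"]),
})

today = pd.Timestamp(date.today())
window = today + pd.Timedelta(days=30)

upcoming = obligations[(obligations["deadline"] >= today) & (obligations["deadline"] <= window)]
print(upcoming.sort_values("deadline"))
```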
Use Case: Aggregate clinical guidelines, drug information, medical research.
Key Features:
- Clinical guideline aggregation
- Drug interaction checking
- Medical research paper extraction
- Patient data anonymization
- Regulatory compliance
Example Output: Drug interaction database with clinical evidence.
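For the anonymization step, one common approach is to replace direct identifiers with salted hashes before data leaves the pipeline. The sketch below is illustrative only: field names and salt handling are assumptions, and real deployments need proper key management and a compliance review.

```python
# Minimal sketch: pseudonymize direct identifiers with a salted SHA-256 hash.
# Field names are illustrative; salt management is deliberately simplified.
import hashlib
import os
from typing import Any, Dict

SALT = os.environ.get("ANON_SALT", "change-me")
DIRECT_IDENTIFIERS = ("patient_id", "name", "email")


def anonymize(row: Dict[str, Any]) -> Dict[str, Any]:
    for field in DIRECT_IDENTIFIERS:
        if field in row and row[field]:
            digest = hashlib.sha256((SALT + str(row[field])).encode("utf-8")).hexdigest()
            row[field] = digest[:16]  # truncated pseudonym
    return row


if __name__ == "__main__":
    print(anonymize({"patient_id": "P-0042", "name": "Jane Doe", "diagnosis": "..."}))
```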
Use Case: Build high-quality datasets for fine-tuning Large Language Models (LLMs) by aggregating, cleaning, and structuring data from multiple sources.
Key Features:
- Multi-source data aggregation (forums, documentation, Q&A sites, code repositories)
- Text cleaning and normalization
- Format conversion (instruction-following, chat, code completion)
- Quality filtering and deduplication
- Dataset balancing and train/val/test splitting
- Export to JSONL/Parquet formats
Example Output: 100K+ instruction-following examples in Alpaca format, ready for LLM fine-tuning.
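A minimal sketch of the final formatting step: map cleaned rows to Alpaca-style instruction records, deduplicate, shuffle, and write train/val/test JSONL files. The input field names and the 90/5/5 split ratios are assumptions.

```python
# Minimal sketch: export cleaned Q&A rows as Alpaca-format JSONL with a train/val/test split.
import json
import random
from typing import Dict, List


def to_alpaca(row: Dict[str, str]) -> Dict[str, str]:
    """Map a cleaned Q&A row to the Alpaca instruction/input/output schema."""
    return {"instruction": row["question"], "input": row.get("context", ""), "output": row["answer"]}


def write_jsonl(path: str, records: List[Dict[str, str]]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def export(rows: List[Dict[str, str]], seed: int = 42) -> None:
    records = [to_alpaca(r) for r in rows]
    # Deduplicate on the instruction text, then shuffle deterministically.
    records = list({r["instruction"]: r for r in records}.values())
    random.Random(seed).shuffle(records)
    n = len(records)
    train, val = records[: int(0.9 * n)], records[int(0.9 * n): int(0.95 * n)]
    test = records[int(0.95 * n):]
    write_jsonl("train.jsonl", train)
    write_jsonl("val.jsonl", val)
    write_jsonl("test.jsonl", test)


if __name__ == "__main__":
    export([{"question": "What does the dedup stage do?", "answer": "It removes duplicate rows by key."}])
```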
Use Case: Build training datasets for 90-day asset price prediction models, designed for LLM fine-tuning on NVIDIA DGX SPARK (Feeless portfolio management layer).
Key Features:
- Multi-source financial data aggregation (prices, macro, sentiment, alternative data)
- Technical indicator calculation (RSI, MACD, Bollinger Bands, volatility)
- 90-day forward target generation
- Time-series alignment and feature engineering
- LLM fine-tuning format (instruction-following with temporal context)
- Export for distributed training on NVIDIA DGX SPARK
Example Output: JSONL/Parquet training dataset with 50+ features for 90-day price prediction, ready for LLM fine-tuning.
Related: Part of the Feeless portfolio management layer for agentic pools.
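As a sketch of the target-generation and indicator step: the 90-day horizon matches the use case above, but the column names, the simple moving-average RSI variant, and the pandas-based approach are assumptions.

```python
# Minimal sketch: compute a 90-day forward return target and a simple RSI feature
# from a daily close-price series. Column names are illustrative.
import numpy as np
import pandas as pd


def add_features(prices: pd.DataFrame, horizon: int = 90, rsi_window: int = 14) -> pd.DataFrame:
    df = prices.sort_values("date").copy()
    # Forward target: percentage return over the next `horizon` rows (trading days).
    df["target_90d_return"] = df["close"].shift(-horizon) / df["close"] - 1.0
    # Simple RSI: ratio of average gains to average losses over the window.
    delta = df["close"].diff()
    gain = delta.clip(lower=0).rolling(rsi_window).mean()
    loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    df["rsi"] = 100 - 100 / (1 + gain / loss)
    return df


if __name__ == "__main__":
    dates = pd.bdate_range("2023-01-02", periods=300)
    closes = 100 + np.cumsum(np.random.default_rng(0).normal(0, 1, len(dates)))
    prices = pd.DataFrame({"date": dates, "close": closes})
    print(add_features(prices)[["date", "close", "rsi", "target_90d_return"]].dropna().head())
```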
- Choose your vertical: Review the guides above to find the use case that matches your needs
- Review the architecture: Each guide includes a complete pipeline architecture
- Customize the pipeline: Adapt the YAML examples to your specific sources and requirements
- Deploy and monitor: Use the WebRobot API to deploy and monitor your pipelines
Begin with one source, validate the extraction, then add more sources:
# Phase 1: Single source
fetch:
  url: "https://example.com"   # Starting URL
pipeline:
  - stage: explore
    args: [ "li.next a", 2 ]
  - stage: extract
    args: [ ... ]
  - stage: save_csv
    args: [ "${OUTPUT_PATH}", "overwrite" ]

Always use environment variables for paths and configuration:
- stage: save_csv
  args: [ "${OUTPUT_PATH}", "overwrite" ]

Use cache or persist before store for in-memory branching:
- stage: cache
  args: []
- stage: store
  args: [ "branch_label" ]

Select keys that uniquely identify entities across sources:
- ✅ Good: ean, sku, property_id, match_id
- ❌ Bad: title, description, price
Normalize data as early as possible in the pipeline:
- stage: extract
  args: [ ... ]
- stage: python_row_transform:normalize_schema
  args: []

- Build a Pipeline: Learn the fundamentals of building pipelines
- Pipeline Stages Reference: Complete reference of available stages
- Pipeline Examples: Generic pipeline examples
- Observability & Metrics: Monitor your vertical pipelines