WebRobot provides comprehensive observability and metrics collection capabilities, with built-in correlation to pay-to-use billing models for cloud provider partners.
WebRobot's observability stack enables:
- Real-time monitoring: Track pipeline execution, resource usage, and performance metrics
- Cost correlation: Automatically correlate metrics with cloud provider billing for accurate pay-to-use pricing
- Multi-cloud support: Unified metrics collection across different cloud providers (AWS, GCP, Azure, etc.)
- Partner integration: Seamless integration with cloud provider billing APIs for transparent cost tracking
WebRobot collects comprehensive infrastructure, execution, and application metrics for each pipeline execution (a sketch of a per-execution metrics record follows the lists below):
CPU Usage: CPU time consumed (CPU-seconds)
- Measured per executor and driver
- Aggregated at job level
- Unit: CPU-seconds or vCPU-hours
Memory Usage: Peak and average memory consumption
- Driver memory (GB-hours)
- Executor memory (GB-hours)
- Total memory footprint
Storage I/O: Data transfer metrics
- Input data read (GB)
- Output data written (GB)
- S3/MinIO read/write operations
- Network transfer (GB)
Network Metrics: Network bandwidth and latency
- Ingress/egress data transfer
- Cross-region transfer costs
- API call volumes
Job Duration: Total execution time
- Wall-clock time (seconds)
- Spark application runtime
- Breakdown by stage
Task Metrics: Spark task-level metrics
- Tasks completed/failed
- Shuffle read/write (GB)
- GC time and frequency
Pipeline Stage Metrics: Per-stage execution metrics
- Stage duration
- Records processed
- Data volume processed
Application Metrics: Business-level metrics per execution
- Records Processed: Number of records/rows processed
- Data Volume: Total data processed (GB)
- API Calls: External API calls made during execution
- Web Requests: HTTP requests made for web scraping
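Taken together, these categories describe a per-execution metrics record. A minimal sketch of such a record, with illustrative field names (the authoritative shape is the `/metrics` endpoint response shown later on this page):

```python
# Illustrative only: field names are assumptions, not the exact WebRobot schema.
from dataclasses import dataclass

@dataclass
class ExecutionMetrics:
    execution_id: str
    cpu_seconds: float          # total CPU time across driver + executors
    memory_gb_hours: float      # aggregated GB-hours for driver + executors
    storage_read_gb: float      # input data read from S3/MinIO
    storage_write_gb: float     # output data written
    network_transfer_gb: float  # ingress/egress transfer
    duration_seconds: float     # wall-clock job duration
    records_processed: int
    api_calls: int
    web_requests: int

metrics = ExecutionMetrics(
    execution_id="exec-123",
    cpu_seconds=3600,
    memory_gb_hours=8.5,
    storage_read_gb=100,
    storage_write_gb=50,
    network_transfer_gb=5,
    duration_seconds=1800,
    records_processed=1_000_000,
    api_calls=250,
    web_requests=10_000,
)
```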
WebRobot automatically correlates collected metrics with cloud provider billing APIs to enable transparent pay-to-use pricing (a cost-estimation sketch follows the provider lists below):
AWS:
- EC2/EMR Costs: Correlates CPU-hours and memory-hours with EC2/EMR pricing
- S3 Costs: Tracks S3 storage, request, and data transfer costs
- Data Transfer: Monitors cross-region and internet data transfer costs
- CloudWatch Integration: Pulls billing data from the CloudWatch Billing API
GCP:
- Compute Engine Costs: Correlates vCPU-hours and memory-hours with GCP pricing
- Cloud Storage Costs: Tracks GCS storage and operations costs
- Network Egress: Monitors network egress costs
- Billing API Integration: Uses the GCP Billing API for real-time cost tracking
Azure:
- VM Costs: Correlates compute hours with Azure VM pricing
- Blob Storage Costs: Tracks Azure Blob storage and operations costs
- Data Transfer: Monitors Azure data transfer costs
- Cost Management API: Integrates with the Azure Cost Management API
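To make the correlation concrete, here is a simplified sketch of combining usage metrics with per-unit rates. The rates and the function are placeholder assumptions for illustration, not real provider prices or WebRobot's actual pricing logic; in practice the rates come from the provider billing APIs:

```python
# Assumed per-unit rates, for illustration only.
RATES = {
    "vcpu_hour": 0.048,        # $/vCPU-hour
    "memory_gb_hour": 0.005,   # $/GB-hour
    "storage_read_gb": 0.02,   # $/GB read
    "storage_write_gb": 0.006, # $/GB written
    "network_gb": 0.09,        # $/GB transferred
}

def estimate_cost(cpu_seconds, memory_gb_hours, read_gb, write_gb, network_gb):
    """Estimate a single execution's cost from collected metrics."""
    vcpu_hours = cpu_seconds / 3600.0
    return round(
        vcpu_hours * RATES["vcpu_hour"]
        + memory_gb_hours * RATES["memory_gb_hour"]
        + read_gb * RATES["storage_read_gb"]
        + write_gb * RATES["storage_write_gb"]
        + network_gb * RATES["network_gb"],
        2,
    )

print(estimate_cost(3600, 8.5, 100, 50, 5))  # -> 2.84 with the assumed rates
```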
WebRobot provides granular cost attribution at multiple levels:
- Organization Level: Total costs per organization
- Project Level: Costs broken down by project
- Job Level: Individual job execution costs
- Pipeline Level: Costs per pipeline definition
- Stage Level: Cost breakdown by pipeline stage
- Real-time Cost Tracking: Monitor costs as jobs execute
- Cost Forecasting: Estimate costs before job execution
- Cost Optimization: Recommendations for cost reduction
- Multi-currency Support: Support for different cloud provider currencies
WebRobot persists all metrics directly in the database for:
- Reliability: Direct database persistence ensures metrics are never lost
- Query Performance: Fast queries and aggregations using SQL
- Data Integrity: ACID guarantees for metric data consistency
- Cost Attribution: Direct correlation with billing and usage data
- Historical Analysis: Long-term storage and analysis of historical metrics
Database Schema:
- Metrics are stored in dedicated tables with proper indexing
- Time-series data optimized for range queries
- Aggregated metrics for fast dashboard queries
- Raw metrics for detailed analysis
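As an illustration of the storage model described above, a minimal sketch using SQLite; the table, column, and index names are hypothetical, not WebRobot's actual schema:

```python
import sqlite3

# Hypothetical per-execution metrics table; illustrative only.
DDL = """
CREATE TABLE IF NOT EXISTS execution_metrics (
    execution_id     TEXT PRIMARY KEY,
    job_id           TEXT NOT NULL,
    project_id       TEXT NOT NULL,
    started_at       TEXT NOT NULL,   -- ISO 8601 timestamp
    cpu_seconds      REAL,
    memory_gb_hours  REAL,
    storage_read_gb  REAL,
    storage_write_gb REAL,
    total_cost       REAL,
    currency         TEXT
);
CREATE INDEX IF NOT EXISTS idx_metrics_job_time
    ON execution_metrics (job_id, started_at);  -- supports time-range queries
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# The kind of aggregation a dashboard query would run:
total = conn.execute(
    "SELECT COALESCE(SUM(total_cost), 0) FROM execution_metrics WHERE project_id = ?",
    ("proj-1",),
).fetchone()[0]
print(total)
```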
Logging:
- Structured Logging: JSON-formatted logs for easy parsing (see the sketch after this list)
- Log Levels: DEBUG, INFO, WARN, ERROR with appropriate filtering
- Log Correlation: Trace IDs for correlating logs across services
- Database Persistence: Critical logs persisted to database for audit and analysis
- Retention Policies: Configurable log retention based on organization tier
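A minimal sketch of structured, JSON-formatted logging with a correlating trace ID, using only the Python standard library; the field names and logger name are illustrative assumptions:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "trace_id": getattr(record, "trace_id", None),  # correlates logs across services
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("webrobot.pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("stage completed", extra={"trace_id": "trace-abc-123"})
# -> {"level": "INFO", "message": "stage completed", "logger": "webrobot.pipeline", "trace_id": "trace-abc-123"}
```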
Tracing:
- Distributed Tracing: OpenTelemetry-compatible tracing (see the sketch after this list)
- Span Correlation: Correlate traces with metrics and logs
- Performance Analysis: Identify bottlenecks and optimization opportunities
- Database Storage: Trace data persisted for historical analysis
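For the OpenTelemetry-compatible tracing described above, a span around a pipeline stage could be emitted roughly as follows; this is a sketch using the standard OpenTelemetry Python SDK, and the tracer, span, and attribute names are assumptions:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for illustration; a real deployment would
# export to a collector instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("webrobot.pipeline")

with tracer.start_as_current_span("extract-stage") as span:
    span.set_attribute("records.processed", 1_000_000)  # correlate span with metrics
    # ... stage work happens here ...
```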
WebRobot uses Trino (formerly PrestoSQL) for post-processing queries on datasets produced by Spark pipelines.
All datasets produced by Spark pipelines are automatically indexed in Trino:
- Automatic Registration: When a pipeline execution completes and writes data to S3/MinIO, the dataset is automatically registered in Trino
- Schema Discovery: Trino automatically discovers the schema from the stored data (Parquet, Delta, Iceberg, etc.)
- Catalog Integration: Datasets are registered in Trino catalogs (e.g., s3, minio, hive) based on storage location
- Metadata Management: Table metadata is maintained in Trino's metastore for fast query planning
The WebRobot API uses Trino for all post-processing queries:
- SQL Interface: Query datasets using standard SQL through Trino
- Performance: Trino's distributed query engine provides fast analytical queries
- Federation: Query across multiple data sources (S3, databases, etc.) in a single query
- API Integration: All dataset query endpoints use Trino under the hood
Benefits:
- Fast Queries: Trino's columnar processing and distributed execution for fast analytical queries
- Standard SQL: Use familiar SQL syntax for data exploration and analysis
- Unified Access: Single query interface for all datasets regardless of storage format
- Real-time Access: Query data immediately after pipeline execution completes
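For direct, ad-hoc exploration of a registered dataset, a sketch using the Trino Python client; the coordinator host, catalog, schema, and table names are assumptions, and in normal use the WebRobot query endpoint below wraps this access for you:

```python
from trino.dbapi import connect  # pip install trino

conn = connect(
    host="trino.example.internal",  # assumed Trino coordinator
    port=8080,
    user="webrobot",
    catalog="s3",       # catalog follows storage location, e.g. s3 / minio / hive
    schema="default",
)
cur = conn.cursor()
cur.execute(
    "SELECT title, price FROM pipeline_output WHERE price > 10 LIMIT 100"
)
for title, price in cur.fetchall():
    print(title, price)
```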
POST /api/webrobot/api/projects/{projectId}/jobs/{jobId}/datasets/{datasetId}/query

Request Body:

```json
{
"sql": "SELECT * FROM dataset_table WHERE column = 'value' LIMIT 100",
"format": "json"
}
```

Response:

```json
{
"queryId": "query-123",
"status": "completed",
"rows": [
{ "column1": "value1", "column2": "value2" },
{ "column1": "value3", "column2": "value4" }
],
"rowCount": 2,
"executionTimeMs": 150
}
```

GET /api/webrobot/api/projects/{projectId}/jobs/{jobId}/datasets/{datasetId}/schema

Response:

```json
{
"datasetId": "dataset-123",
"tableName": "s3.default.pipeline_output",
"schema": {
"columns": [
{ "name": "id", "type": "bigint" },
{ "name": "title", "type": "varchar" },
{ "name": "price", "type": "double" },
{ "name": "created_at", "type": "timestamp" }
]
},
"rowCount": 1000000,
"sizeBytes": 52428800
}
```

GET /api/webrobot/api/projects/{projectId}/jobs/{jobId}/executions/{executionId}/metrics

Response:

```json
{
"executionId": "exec-123",
"jobId": "job-456",
"metrics": {
"infrastructure": {
"cpuSeconds": 3600,
"memoryGbHours": 8.5,
"storageReadGb": 100,
"storageWriteGb": 50,
"networkTransferGb": 5
},
"execution": {
"durationSeconds": 1800,
"tasksCompleted": 1000,
"recordsProcessed": 1000000,
"dataVolumeGb": 150
},
"costs": {
"computeCost": 12.50,
"storageCost": 2.30,
"networkCost": 0.50,
"totalCost": 15.30,
"currency": "USD"
}
},
"cloudProvider": "aws",
"region": "us-east-1"
}
```

GET /api/webrobot/api/projects/{projectId}/jobs/{jobId}/executions/{executionId}/costs

Response:

```json
{
"executionId": "exec-123",
"costBreakdown": {
"compute": {
"driver": {
"instanceType": "m5.xlarge",
"hours": 0.5,
"cost": 5.00
},
"executors": {
"instanceType": "m5.2xlarge",
"count": 3,
"hours": 0.5,
"cost": 7.50
}
},
"storage": {
"s3Read": {
"gb": 100,
"cost": 2.00
},
"s3Write": {
"gb": 50,
"cost": 0.30
}
},
"network": {
"dataTransfer": {
"gb": 5,
"cost": 0.50
}
},
"total": 15.30,
"currency": "USD"
}
}
```

GET /api/webrobot/api/organizations/{orgId}/metrics?startDate=2025-01-01&endDate=2025-01-31

Query Parameters:
- startDate: Start date for metrics aggregation (ISO 8601)
- endDate: End date for metrics aggregation (ISO 8601)
- groupBy: Group by project, job, or pipeline (default: project)
Response:

```json
{
"organizationId": "org-123",
"period": {
"start": "2025-01-01T00:00:00Z",
"end": "2025-01-31T23:59:59Z"
},
"summary": {
"totalJobs": 150,
"totalExecutions": 500,
"totalCost": 1250.75,
"totalCpuHours": 720,
"totalMemoryGbHours": 1200,
"totalDataProcessedGb": 5000
},
"breakdown": [
{
"projectId": "proj-1",
"projectName": "E-commerce Scraping",
"jobs": 50,
"executions": 200,
"cost": 500.25,
"cpuHours": 300,
"memoryGbHours": 500
}
]
}
```

GET /api/webrobot/api/projects/{projectId}/jobs/{jobId}/executions/{executionId}/logs?podType=driver&tail=100

Query Parameters:
- podType: driver (default) or executor
- tail: Number of lines from the end (default: 100)
- level: Filter by log level (DEBUG, INFO, WARN, ERROR)
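A client-side sketch of calling two of the endpoints above with Python's requests library; the base URL and bearer-token header are assumptions about the deployment rather than part of the documented contract:

```python
import requests

BASE = "https://webrobot.example.com"          # assumed deployment URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumed auth scheme

# Run a post-processing query against a dataset (served by Trino).
query = requests.post(
    f"{BASE}/api/webrobot/api/projects/proj-1/jobs/job-456/datasets/dataset-123/query",
    json={"sql": "SELECT * FROM dataset_table LIMIT 100", "format": "json"},
    headers=HEADERS,
    timeout=60,
)
query.raise_for_status()
print(query.json()["rowCount"])

# Fetch infrastructure, execution, and cost metrics for one execution.
metrics = requests.get(
    f"{BASE}/api/webrobot/api/projects/proj-1/jobs/job-456/executions/exec-123/metrics",
    headers=HEADERS,
    timeout=30,
)
metrics.raise_for_status()
print(metrics.json()["metrics"]["costs"]["totalCost"])
```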
WebRobot provides API-based dashboards that query metrics directly from the database:
- Organization Dashboard: Overview of all projects and costs (via API endpoints)
- Project Dashboard: Detailed project metrics and costs
- Job Dashboard: Individual job execution metrics
- Cost Dashboard: Cost analysis and trends
- Real-time Metrics: Live metrics via WebSocket or polling APIs
Dashboard API Endpoints:
- All dashboard data is served via REST API endpoints
- Clients can build custom visualizations using the metrics API
- Real-time updates via WebSocket connections for live dashboards
WebRobot integrates with cloud provider partners to enable transparent pay-to-use billing:
- Real-time Cost Sync: Automatically sync costs from cloud provider billing APIs
- Cost Reconciliation: Reconcile WebRobot metrics with provider billing data
- Invoice Generation: Generate detailed invoices based on actual usage
- Multi-provider Support: Support for multiple cloud providers in a single organization
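The reconciliation step can be illustrated with a small sketch that compares WebRobot's metric-derived cost against the amount reported by the provider billing API; the tolerance and sample values are made-up assumptions:

```python
def reconcile(webrobot_cost: float, provider_billed: float, tolerance: float = 0.05):
    """Return (matched, relative_difference) between the two cost figures."""
    if provider_billed == 0:
        return webrobot_cost == 0, 0.0
    diff = abs(webrobot_cost - provider_billed) / provider_billed
    return diff <= tolerance, diff

matched, diff = reconcile(webrobot_cost=15.30, provider_billed=15.75)
print(matched, round(diff, 4))  # True within the 5% tolerance; diff ≈ 0.0286
```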
WebRobot enables usage-based pricing models for partners:
- Per-execution pricing: Charge based on number of job executions
- Resource-based pricing: Charge based on CPU-hours, memory-hours, data processed
- Tiered pricing: Different rates based on usage tiers
- Custom pricing models: Flexible pricing models based on partner agreements
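As an illustration of the tiered, resource-based models above, a minimal sketch with made-up tier boundaries and rates (a real partner agreement would define these):

```python
TIERS = [  # (upper bound in CPU-hours, $ per CPU-hour) -- illustrative values
    (100, 0.06),
    (1_000, 0.05),
    (float("inf"), 0.04),
]

def tiered_cost(cpu_hours: float) -> float:
    """Charge each consumed CPU-hour at the rate of the tier it falls into."""
    cost, lower = 0.0, 0.0
    for upper, rate in TIERS:
        if cpu_hours > lower:
            cost += (min(cpu_hours, upper) - lower) * rate
        lower = upper
    return round(cost, 2)

print(tiered_cost(720))  # 100*0.06 + 620*0.05 = 37.00
```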
Cloud provider partners can access:
- Usage Analytics: Detailed usage analytics per customer
- Revenue Tracking: Track revenue from WebRobot usage
- Cost Analysis: Analyze costs and margins
- Customer Insights: Understand customer usage patterns
Cost Optimization:
- Right-sizing: Choose appropriate instance types based on workload
- Spot Instances: Use spot instances for non-critical workloads
- Data Locality: Minimize cross-region data transfer
- Caching: Cache frequently accessed data to reduce I/O costs
- Batch Processing: Batch multiple operations to reduce API call costs
Cost Monitoring:
- Set Alerts: Configure alerts for cost thresholds and anomalies
- Regular Reviews: Review costs regularly to identify optimization opportunities
- Cost Attribution: Use cost attribution to understand spending patterns
- Forecasting: Use cost forecasting to plan budgets
Security and Compliance:
- Data Isolation: Metrics and logs are isolated per organization
- Access Control: Role-based access control for metrics and cost data
- Encryption: All metrics and logs are encrypted at rest and in transit
- Compliance: Support for GDPR, SOC 2, and other compliance requirements
Related Guides: