This page documents the EAN Image Sourcing Jersey plugin and the pipeline stages it uses internally.
The plugin exposes simplified endpoints under:
- `POST /webrobot/api/ean-image-sourcing/{country}/upload` (multipart CSV upload)
- `POST /webrobot/api/ean-image-sourcing/{country}/execute`
- `POST /webrobot/api/ean-image-sourcing/{country}/schedule`
- `POST /webrobot/api/ean-image-sourcing/{country}/query` (query the latest dataset)
- `POST /webrobot/api/ean-image-sourcing/{country}/images` (retrieve images, optional base64)
- `GET /webrobot/api/ean-image-sourcing/{country}/status`
- `GET /webrobot/api/ean-image-sourcing/info`
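As a minimal sketch, triggering a run for a country is a plain JSON POST to the execute endpoint. The base URL, country, and credential UUID below are placeholders, not values from any real deployment:

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # placeholder base URL

# Credential IDs go in the request body (see the credentials section below);
# the UUID here is a placeholder.
payload = {"cloudCredentialIds": ["00000000-0000-0000-0000-000000000001"]}
req = urllib.request.Request(
    f"{BASE}/webrobot/api/ean-image-sourcing/italy/execute",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would submit the request; omitted here.
```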
Behind the scenes it manages Projects/Agents/Jobs and stores the YAML pipeline on the Agent (pipelineYaml).
The plugin bootstraps Agents with a default pipeline similar to:
```yaml
pipeline:
  - stage: load_csv
    args:
      - path: "${INPUT_CSV_PATH}"
        header: "true"
  # Search by EAN (credentials can be provided via env vars injected from CloudCredentials)
  - stage: searchEngine
    args:
      - provider: "google"
        ean: "$EAN number"
        num_results: 10
        enrich: true
        image_search: false
  # Visit each result link with a browser
  - stage: visit
    args:
      - "$result_link"
  # Extract structured fields with an LLM and cache selectors per template cluster
  - stage: iextract
    args:
      - selector: "body"
        method: "code"
      - "Extract product information from this e-commerce page: product name as product_name, current price as price (include currency symbol), brand as brand, EAN/GTIN/SKU code as ean_code, full product description as description, and all product image URLs (main product images, not logos) as product_image_urls. Also preserve the original input data: EAN number from CSV, item description, brand, search result title, snippet, matching score, and images from Google search results."
      - "prod_"
  # Score and select best images (LLM + heuristics)
  - stage: imageSimilarity
    args: []
output:
  path: "${OUTPUT_PARQUET_PATH}"
  mode: "overwrite"
  format: "parquet"
```

Notes:
- The plugin replaces `${INPUT_CSV_PATH}`/`${OUTPUT_PARQUET_PATH}` at runtime.
- The stages used above are documented in the general stage reference: `load_csv`, `searchEngine`, `visit`, `iextract`, `imageSimilarity`.
- `load_csv`: reads the uploaded CSV from MinIO/S3 and creates the initial dataset.
- `searchEngine`: searches by EAN and enriches results; expects search credentials via env vars or CloudCredentials injection.
- `visit`: browser-based fetch for each row (uses Steel Dev if configured in the cluster).
- `iextract`: LLM extraction; uses selector-cache + template clustering to reduce repeated inference.
- `imageSimilarity`: ranks candidate images using LLM + heuristics (EAN in URL, context match, etc.).
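The runtime substitution of the path placeholders can be pictured as simple template expansion. This is an illustrative sketch only (the plugin's actual replacement logic is internal, and the paths below are made up):

```python
from string import Template

# Fragment of a pipeline with unresolved placeholders.
yaml_fragment = 'path: "${INPUT_CSV_PATH}"\noutput_path: "${OUTPUT_PARQUET_PATH}"'

resolved = Template(yaml_fragment).substitute(
    INPUT_CSV_PATH="s3://uploads/italy/input.csv",          # made-up path
    OUTPUT_PARQUET_PATH="s3://datasets/italy/out.parquet",  # made-up path
)
print(resolved)
```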
The EAN plugin auto-discovers (if not explicitly provided) these CloudCredential providers:
- `GOOGLE_SEARCH`: provides `GOOGLE_SEARCH_API_KEY` + `GOOGLE_SEARCH_ENGINE_ID`
- `TOGETHERAI`: provides `TOGETHERAI_API_KEY` (uses the generic credential `apiKey`)
- `STEEL_DEV`: provides `STEEL_DEV_API_KEY` (uses the generic credential `apiKey`)
When you call the plugin execute endpoint, credential IDs can be provided in the request body:
- `cloudCredentialIds`: list of credential UUIDs (preferred)
- `cloudCredentialId`: single credential UUID (legacy)
If neither is provided, the plugin auto-discovers credentials by provider:
1. First, try an enabled credential for the request Organization (`organizationId`);
2. then fall back to a global enabled credential (`organizationId = null`);
3. otherwise, use the first enabled credential for that provider.
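The fallback order above can be sketched as follows (an illustrative model only; the field and function names are assumptions, not the plugin's actual types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Credential:
    provider: str
    organization_id: Optional[str]
    enabled: bool

def discover(creds, provider, org_id):
    """Mirror the three-step fallback described above (illustrative)."""
    pool = [c for c in creds if c.provider == provider and c.enabled]
    # 1) enabled credential scoped to the request organization
    for c in pool:
        if c.organization_id == org_id:
            return c
    # 2) global enabled credential (organizationId = null)
    for c in pool:
        if c.organization_id is None:
            return c
    # 3) first enabled credential for that provider, if any
    return pool[0] if pool else None
```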
During Spark job submission, the Kubernetes runner resolves CloudCredentials and injects them into the Spark driver/executor environment:
- `GOOGLE_SEARCH_API_KEY`
- `GOOGLE_SEARCH_ENGINE_ID`
- `BING_SEARCH_API_KEY`
- `STEEL_DEV_API_KEY`
- `TOGETHERAI_API_KEY`
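Inside the Spark job, a stage might then read the injected variables from the environment; a sketch (the helper name is hypothetical):

```python
import os

def google_search_credentials():
    """Read the injected Google Search credentials (illustrative helper)."""
    return {
        "api_key": os.environ.get("GOOGLE_SEARCH_API_KEY"),
        "engine_id": os.environ.get("GOOGLE_SEARCH_ENGINE_ID"),
    }
```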
If sensitive fields are encrypted, you can provide an encryption key:
- Header: `X-Encryption-Key` (plugin endpoints)
- Internally forwarded as `encryptionKey` for credential decryption during job submission.
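Attaching the header from a client is straightforward; a sketch (base URL and key value are placeholders):

```python
import urllib.request

BASE = "http://localhost:8080"  # placeholder base URL

# Pass the decryption key via the X-Encryption-Key header; the value is a placeholder.
req = urllib.request.Request(
    f"{BASE}/webrobot/api/ean-image-sourcing/italy/execute",
    data=b"{}",
    headers={
        "Content-Type": "application/json",
        "X-Encryption-Key": "key-material-placeholder",
    },
    method="POST",
)
# urllib.request.urlopen(req) would submit the request; omitted here.
```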
The EAN plugin is commonly used to build vision+text datasets (e.g., product catalog enrichment) that can be consumed for model training or fine-tuning.
There is no dedicated “download file” endpoint under the plugin; instead you can:
- Query the latest EAN dataset for a country via:
POST /webrobot/api/ean-image-sourcing/{country}/query
This is the recommended way to fetch filtered subsets (e.g., a list of EANs, enriched columns, top-N).
Use:
POST /webrobot/api/ean-image-sourcing/{country}/images
Key request fields:
- `eans`: list of EAN codes
- `limit`: max images per EAN
- `includeBase64`: set `true` to embed base64 (useful for training pipelines where you want a single JSON payload)
Example (retrieve 1 best image with base64 per EAN):
```shell
curl -X POST "${WEBROBOT_BASE_URL}/webrobot/api/ean-image-sourcing/italy/images" \
  -H "Content-Type: application/json" \
  -d '{
    "eans": ["5901234123457", "5901234123458"],
    "includeBase64": true,
    "limit": 1
  }'
```

If you need the full dataset as a file, use the generic dataset endpoints:
- `GET /webrobot/api/datasets` (list datasets)
- `GET /webrobot/api/datasets/{datasetId}` (returns `storagePath`/`filePath`/`format`)
Then download from object storage (MinIO/S3) using your infrastructure credentials.
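Once you have the dataset's `storagePath`, splitting it into bucket and object key for your S3/MinIO client is simple; a sketch (the path below is a made-up example):

```python
from urllib.parse import urlparse

storage_path = "s3://webrobot-datasets/ean/italy/latest.parquet"  # made-up example

parts = urlparse(storage_path)
bucket, key = parts.netloc, parts.path.lstrip("/")
# bucket and key can then be passed to any S3/MinIO client (boto3, mc, aws cli).
```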
For fine-tuning/training you typically want a record like:
- `ean`
- `product_name`
- `brand`
- `image_url` (or `image_base64`)
- provenance fields: `source`, `url`, `retrieved_at`, `license`
This makes it straightforward to produce JSONL training records downstream (Spark/Trino + your training stack).
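As a sketch, mapping one dataset row to such a JSONL record might look like this (the input column names follow the `iextract` stage above; the exact row keys and defaults are assumptions):

```python
import json
from datetime import datetime, timezone

def to_training_record(row: dict) -> str:
    """Build one JSONL line with the fields listed above (illustrative)."""
    record = {
        "ean": row["ean_code"],
        "product_name": row["product_name"],
        "brand": row["brand"],
        "image_url": row["product_image_urls"][0],
        # provenance fields
        "source": "ean-image-sourcing",
        "url": row["url"],
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "license": row.get("license", "unknown"),
    }
    return json.dumps(record, ensure_ascii=False)
```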