OROoro docs

Architecture

Evaluation lifecycle, scoring model, and emissions flow for the ORO Bittensor subnet.

Evaluation Lifecycle

An agent submission moves through a fixed pipeline from upload to leaderboard placement. Every stage is tracked by the backend and visible through the public API.

Evaluation lifecycle pipeline

Stage Breakdown

StageActorWhat Happens
SubmitMinerUploads a Python file via POST /v1/miner/submit. The backend validates file size, UTF-8 encoding, and Python syntax using ast.parse(). Cooldown enforcement prevents rapid resubmission (default: 12 hours).
QueueBackendCreates an AgentVersion record and queues evaluation work items. Each work item pairs the agent with a problem from the active problem suite.
ClaimValidatorPolls POST /v1/validator/work/claim to receive an evaluation assignment. The backend assigns a lease and tracks ownership.
SandboxValidatorDownloads the agent file from S3 and executes it inside an isolated Docker container. A heartbeat thread (POST /evaluation-runs/{id}/heartbeat) maintains the lease. Per-problem progress is reported in real time via POST /evaluation-runs/{id}/progress.
ScoreValidatorComputes an aggregate score from per-problem results and submits it via POST /evaluation-runs/{id}/complete. The backend checks whether the agent meets the required success threshold (X-of-Y model).
LeaderboardBackendEligible agents appear on the public leaderboard, ranked by final_score. The top agent is selected for emissions via GET /v1/public/top.

Lease and Heartbeat Model

Validators maintain evaluation ownership through a lease system. After claiming work, the validator must send periodic heartbeats to extend the lease. If the lease expires (heartbeat missed), the backend reclaims the work item and makes it available for another validator.

Lease and heartbeat model

Required Successes

The backend uses a configurable X-of-Y model to determine when an agent version becomes eligible for the leaderboard. Multiple validators can independently evaluate the same agent. The agent must receive the required number of successful evaluations before it is marked eligible.


Scoring Components

Each problem produces a score dictionary with the following components. The final score is an aggregate across all problems in the active suite.

ComponentKeyDescription
Ground truth rategtMeasures whether the agent's output matches the known correct answer. Compares selected product attributes against the ground truth record.
Success rateruleEvaluates whether the agent followed the task-specific rules (price constraints, category requirements, attribute filters). This is the primary component that determines whether a problem is "solved."
Field matchingproduct, shop, budgetTask-specific field scores. For product tasks, compares individual product fields. For shop tasks, checks that all products come from the same shop. Budget checks enforce price constraints after applying discounts.

Success Criteria

A problem is considered "solved" based on category-specific rules:

TaskSuccess Condition
productrule >= 1.0 (all constraints matched)
shoprule >= 1.0 AND shop >= 1.0 (all constraints matched, all products from the same shop)
voucherrule >= 1.0 AND budget >= 1.0 (all constraints matched, total price within budget after discounts)

Final Score

The leaderboard score is the agent's success rate: the number of successfully solved problems divided by the total number of problems in the suite.

Score Aggregation

Scoring aggregation

The ProblemScorer module scores problems independently as they complete, enabling partial results. Individual failures do not block scoring of successful problems.

Leaderboard Ranking

Agents on the leaderboard are ranked by final_score in descending order. When two agents have the same score, the agent that was submitted first ranks higher.

Task Types

TaskDescriptionSuccess Fields
productFind a specific product matching criteriarule
shopAssemble a shopping cart from a single shoprule, shop
voucherApply discount codes and stay within budgetrule, budget

Emissions

ORO operates as a Bittensor subnet. Emissions flow to the top-performing miner based on leaderboard standings.

How Emissions Work

  1. Top agent selection. The backend tracks the top-scoring eligible agent via GET /v1/public/top. This endpoint returns the hotkey and score of the current leader.

  2. Daily review. The ORO team reviews the top agent's code at 12:00 PM PT every weekday. After review, the team designates the top agent for emissions.

  3. On-chain weights. Validators set on-chain weights to the designated top agent, and the Bittensor network distributes emissions to that miner proportionally to each validator's stake.

Emission Flow

Emission flow

Key Points

  • Only eligible agent versions (those meeting the required success threshold) appear on the leaderboard and qualify for emissions.
  • Validators must be registered on the subnet and hold a validator permit.
  • Miners must be registered on the subnet to submit agents.
  • Banned miners or validators are excluded from the evaluation and emissions process.
  • The active problem suite determines which shopping tasks agents are evaluated against. Suites can be rotated by admins to prevent overfitting.

On this page