Architecture
Evaluation lifecycle, scoring model, and emissions flow for the ORO Bittensor subnet.
Evaluation Lifecycle
An agent submission moves through a fixed pipeline from upload to leaderboard placement. Every stage is tracked by the backend and visible through the public API.
Stage Breakdown
| Stage | Actor | What Happens |
|---|---|---|
| Submit | Miner | Uploads a Python file via POST /v1/miner/submit. The backend validates file size, UTF-8 encoding, and Python syntax using ast.parse(). Cooldown enforcement prevents rapid resubmission (default: 12 hours). |
| Queue | Backend | Creates an AgentVersion record and queues evaluation work items. Each work item pairs the agent with a problem from the active problem suite. |
| Claim | Validator | Polls POST /v1/validator/work/claim to receive an evaluation assignment. The backend assigns a lease and tracks ownership. |
| Sandbox | Validator | Downloads the agent file from S3 and executes it inside an isolated Docker container. A heartbeat thread (POST /evaluation-runs/{id}/heartbeat) maintains the lease. Per-problem progress is reported in real time via POST /evaluation-runs/{id}/progress. |
| Score | Validator | Computes an aggregate score from per-problem results and submits it via POST /evaluation-runs/{id}/complete. The backend checks whether the agent meets the required success threshold (X-of-Y model). |
| Leaderboard | Backend | Eligible agents appear on the public leaderboard, ranked by final_score. The top agent is selected for emissions via GET /v1/public/top. |
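The validation described in the Submit stage can be sketched as follows. This is a minimal illustration, not the backend's actual code; the size limit is an assumed placeholder.

```python
import ast

MAX_FILE_BYTES = 512 * 1024  # assumed limit for illustration; the real value is backend configuration


def validate_submission(raw: bytes) -> str:
    """Submit-stage checks: file size, UTF-8 encoding, and Python syntax."""
    if len(raw) > MAX_FILE_BYTES:
        raise ValueError("file exceeds the maximum allowed size")
    try:
        source = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("file is not valid UTF-8") from exc
    try:
        ast.parse(source)  # syntax check only; the agent code is not executed here
    except SyntaxError as exc:
        raise ValueError(f"invalid Python syntax: {exc}") from exc
    return source
```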
Lease and Heartbeat Model
Validators maintain evaluation ownership through a lease system. After claiming work, the validator must send periodic heartbeats to extend the lease. If the lease expires because a heartbeat was missed, the backend reclaims the work item and makes it available to another validator.
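A simplified validator-side heartbeat loop, assuming a requests-based client; the base URL, auth headers, and interval are placeholders, and the endpoint path is the one named above.

```python
import threading

import requests

BASE_URL = "https://api.example.com"  # placeholder; use the actual backend URL
HEARTBEAT_INTERVAL_S = 30             # assumed interval; must be shorter than the lease duration


def start_heartbeat(run_id: str, headers: dict, stop: threading.Event) -> threading.Thread:
    """Send periodic heartbeats so the backend keeps extending this run's lease."""

    def beat() -> None:
        while not stop.wait(HEARTBEAT_INTERVAL_S):
            resp = requests.post(
                f"{BASE_URL}/evaluation-runs/{run_id}/heartbeat",
                headers=headers,
                timeout=10,
            )
            if resp.status_code != 200:
                # The lease may already have expired; the backend can reassign the work item.
                break

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    return thread
```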
Required Successes
The backend uses a configurable X-of-Y model to determine when an agent version becomes eligible for the leaderboard. Multiple validators can independently evaluate the same agent. The agent must receive the required number of successful evaluations before it is marked eligible.
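In code, the eligibility check is a simple count; the threshold below is hypothetical, since the real X-of-Y configuration lives in the backend.

```python
REQUIRED_SUCCESSES = 2  # X: successful evaluations needed (hypothetical value)


def is_eligible(evaluation_outcomes: list[bool]) -> bool:
    """An agent version becomes leaderboard-eligible once X of its evaluations succeed."""
    return sum(evaluation_outcomes) >= REQUIRED_SUCCESSES
```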
Scoring Components
Each problem produces a score dictionary with the following components. The final score is an aggregate across all problems in the active suite.
| Component | Key | Description |
|---|---|---|
| Ground truth rate | gt | Measures whether the agent's output matches the known correct answer. Compares selected product attributes against the ground truth record. |
| Success rate | rule | Evaluates whether the agent followed the task-specific rules (price constraints, category requirements, attribute filters). This is the primary component that determines whether a problem is "solved." |
| Field matching | product, shop, budget | Task-specific field scores. For product tasks, compares individual product fields. For shop tasks, checks that all products come from the same shop. Budget checks enforce price constraints after applying discounts. |
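An illustrative per-problem score dictionary for a shop task; the values are made up, and which task-specific keys appear depends on the task type.

```python
# Example result for one shop-task problem (values are illustrative only).
problem_score = {
    "gt": 1.0,    # output matched the ground truth record
    "rule": 1.0,  # all task-specific constraints satisfied
    "shop": 1.0,  # every selected product came from the same shop
}
```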
Success Criteria
A problem is considered "solved" based on category-specific rules:
| Task | Success Condition |
|---|---|
| product | rule >= 1.0 (all constraints matched) |
| shop | rule >= 1.0 AND shop >= 1.0 (all constraints matched, all products from the same shop) |
| voucher | rule >= 1.0 AND budget >= 1.0 (all constraints matched, total price within budget after discounts) |
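The table above translates directly into a per-problem check; the function name and dictionary access pattern are illustrative.

```python
def is_solved(task: str, scores: dict[str, float]) -> bool:
    """Apply the category-specific success criteria to one problem's score dict."""
    if scores.get("rule", 0.0) < 1.0:
        return False  # every task type requires full rule compliance
    if task == "shop":
        return scores.get("shop", 0.0) >= 1.0
    if task == "voucher":
        return scores.get("budget", 0.0) >= 1.0
    return True  # product tasks need only the rule check
```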
Final Score
The leaderboard score is the agent's success rate: the number of successfully solved problems divided by the total number of problems in the suite.
Score Aggregation
The ProblemScorer module scores problems independently as they complete, enabling partial results. Individual failures do not block scoring of successful problems.
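A sketch of the aggregation described above, reusing the is_solved helper from the previous example; ProblemScorer's real interface is not shown here, so the input shape and failure handling are assumptions.

```python
def aggregate_final_score(results: list[tuple[str, dict | None]]) -> float:
    """Final score = solved problems / total problems in the suite.

    Each entry is (task_type, score_dict). A None score dict represents a
    problem whose evaluation failed; it counts as unsolved but does not block
    scoring of the problems that completed.
    """
    if not results:
        return 0.0
    solved = sum(
        1 for task, scores in results if scores is not None and is_solved(task, scores)
    )
    return solved / len(results)
```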
Leaderboard Ranking
Agents on the leaderboard are ranked by final_score in descending order. When two agents have the same score, the agent that was submitted first ranks higher.
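As a sort key, the ranking rule looks like this; the submitted_at field name is illustrative, standing in for whatever submission timestamp the backend stores.

```python
def rank_leaderboard(agents: list[dict]) -> list[dict]:
    """Rank by final_score descending; ties go to the earlier submission."""
    return sorted(agents, key=lambda a: (-a["final_score"], a["submitted_at"]))
```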
Task Types
| Task | Description | Success Fields |
|---|---|---|
| product | Find a specific product matching criteria | rule |
| shop | Assemble a shopping cart from a single shop | rule, shop |
| voucher | Apply discount codes and stay within budget | rule, budget |
Emissions
ORO operates as a Bittensor subnet. Emissions flow to the top-performing miner based on leaderboard standings.
How Emissions Work
- Top agent selection. The backend tracks the top-scoring eligible agent via GET /v1/public/top. This endpoint returns the hotkey and score of the current leader.
- Daily review. The ORO team reviews the top agent's code at 12:00 PM PT every weekday. After review, the team designates the top agent for emissions.
- On-chain weights. Validators set on-chain weights to the designated top agent, and the Bittensor network distributes emissions to that miner proportionally to each validator's stake.
Emission Flow
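A simplified sketch of the flow from leaderboard query to weight setting, assuming a requests-based client; the base URL is a placeholder, and the actual on-chain call is left to the validator's normal Bittensor tooling.

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder for the ORO backend


def get_top_agent() -> dict:
    """Fetch the current leader (hotkey and score) from the public endpoint."""
    resp = requests.get(f"{BASE_URL}/v1/public/top", timeout=10)
    resp.raise_for_status()
    return resp.json()


def build_weight_vector(top_hotkey: str, metagraph_hotkeys: list[str]) -> list[float]:
    """Winner-take-all weights: 1.0 for the designated top agent, 0.0 for everyone else.

    The resulting vector is what a validator would pass to its usual Bittensor
    set-weights call; that call is intentionally omitted here.
    """
    return [1.0 if hk == top_hotkey else 0.0 for hk in metagraph_hotkeys]
```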
Key Points
- Only eligible agent versions (those meeting the required success threshold) appear on the leaderboard and qualify for emissions.
- Validators must be registered on the subnet and hold a validator permit.
- Miners must be registered on the subnet to submit agents.
- Banned miners or validators are excluded from the evaluation and emissions process.
- The active problem suite determines which shopping tasks agents are evaluated against. Suites can be rotated by admins to prevent overfitting.