ADR: Why Our Agent Registry Uses Static Manifests Instead of Dynamic Discovery
At Ledd Consulting, we run 25 services coordinated by 7 specialized agents on a single VPS. When we needed to build an agent registry — the system that knows which agents exist, what they can do, and how to reach them — we had a real architectural choice to make. This is the record of that decision, why we made it, and what happened after.
Context — What Decision Needed to Be Made and Why
Our platform routes queries to specialized agents based on capability matching. An inbound request might need a research agent, a code analysis agent, or a lead qualification agent. Something has to know which agents are available, what each one is good at, and whether it's healthy enough to receive traffic.
This is a service discovery problem. And the industry has strong opinions about how to solve it.
We had 7 agents at the time of the decision, with plans for maybe 12. New agents onboard through a registration flow, get validated, receive credentials, and then start receiving routed queries. The topology changes when we onboard a new agent — which happens a few times per month, not a few times per minute.
The question was: should agent capabilities and routing information be discovered dynamically at runtime, or declared statically in a manifest that gets validated at onboarding time?
Options Considered
Option 1: Dynamic Discovery with Consul or etcd
The "proper" microservices answer. Each agent registers itself on startup, publishes a health check endpoint, and the registry queries the discovery service in real-time to build its routing table.
Pros:
- Agents can appear and disappear without any central coordination
- Health checks are automatic and continuous
- Well-documented pattern with mature tooling
Cons:
- Another distributed system to operate (Consul cluster, etcd quorum)
- Agents need a sidecar or SDK to participate in discovery
- Split-brain scenarios when the discovery service itself has issues
- Massive operational overhead for a team of one running 7 agents
Option 2: DNS-Based Discovery with SRV Records
Use DNS SRV records to map agent capabilities to endpoints. The registry resolves capabilities to agent addresses through DNS lookups.
Pros:
- No additional infrastructure beyond what we already run
- DNS is well-understood and battle-tested
- TTL-based caching handles most staleness concerns
Cons:
- DNS doesn't carry capability metadata — only addresses
- No way to encode specialties, pricing tiers, or performance history in a DNS record
- Updating records requires DNS propagation delays
- Doesn't solve the "what can this agent do" question, only "where is it"
Option 3: Static JSON Manifests with Onboarding-Time Validation
Each agent declares its capabilities, pricing, and performance expectations in a structured manifest at registration time. The manifest is validated, stored in a JSON file, and becomes the single source of truth for routing.
Pros:
- Zero additional infrastructure
- The registry file is human-readable documentation
- Validation happens once, at onboarding — not continuously at runtime
- Easy to audit, diff, and version control
- No network dependency for capability lookups
Cons:
- Stale data if an agent's capabilities change without re-registration
- No automatic health-based deregistration
- Manual process to update agent metadata
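To make the option concrete, here's a sketch of what a single manifest entry might contain. The field names mirror the registration code later in this post; the specific values and the agent itself are invented for illustration:

```javascript
// A hypothetical manifest entry — the agent and its values are illustrative.
const exampleManifest = {
  agent_id: 'research-agent',
  name: 'Research Agent',
  specialties: ['web-research', 'summarization'],
  pricing: { base_cost: 0.05, per_token: 0.0001, min_query: 0.01 },
  performance: { success_rate: 1.0, avg_response_time_ms: 0, uptime_percent: 100.0 },
};

// Everything the router needs to know is declared, not discovered.
console.log(JSON.stringify(exampleManifest, null, 2));
```

Everything a routing decision depends on sits in one declarative object that can be diffed, reviewed, and rejected at onboarding time.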
Decision Criteria — What Mattered Most and Why
We ranked our priorities explicitly before choosing:
- Operational simplicity — We are a small team. Every piece of infrastructure we add is infrastructure we maintain at 2 AM when it breaks. The registry cannot be the thing that takes down the whole system.
- Debuggability — When a query gets misrouted, we need to understand why in under 60 seconds. "Read the JSON file" beats "query the Consul API, check the service mesh, inspect the health check history."
- Registry-as-documentation — New team members (and our own future selves) should be able to open one file and understand every agent in the system, what it does, and how it's configured.
- Correctness at onboarding time — We would rather reject a bad agent manifest during registration than discover at runtime that an agent's self-reported capabilities don't match reality.
- Change frequency — Our agent topology changes a few times per month. Dynamic discovery solves a problem we don't have.
Our Decision — What We Chose and How We Implemented It
We chose static JSON manifests with onboarding-time validation. Here's how it works in production.
The Registry: A JSON File That Is Also the Documentation
The core of the system is a registry class that loads agent manifests from a single JSON file at startup:
const fs = require('fs');

class AgentRegistry {
  constructor(registryPath = './agent-registry.json') {
    this.registryPath = registryPath;
    this.agents = [];
    this.loadRegistry();
  }

  loadRegistry() {
    try {
      if (fs.existsSync(this.registryPath)) {
        const data = fs.readFileSync(this.registryPath, 'utf8');
        this.agents = JSON.parse(data);
        console.log(`✓ Loaded ${this.agents.length} agents from registry`);
      } else {
        this.agents = [];
        console.log('ℹ No existing registry found, starting fresh');
      }
    } catch (error) {
      console.error('Failed to load registry:', error.message);
      this.agents = [];
    }
  }
}
No network calls. No quorum. No consensus protocol. fs.readFileSync and JSON.parse. The registry loads in under 5ms with 7 agents. It would load in under 5ms with 70.
Onboarding-Time Validation: Reject Bad Manifests Early
This is where the real value of the static approach shows up. When a new agent registers, we validate everything upfront — required fields, duplicate detection, credential generation — before the agent ever enters the registry:
registerAgent(agentData) {
  // Validate required fields
  const required = ['agent_id', 'name', 'specialties', 'pricing', 'performance'];
  for (const field of required) {
    if (!agentData[field]) {
      throw new Error(`Missing required field: ${field}`);
    }
  }

  // Check for duplicate agent_id
  if (this.agents.find(a => a.agent_id === agentData.agent_id)) {
    throw new Error(`Agent ID already registered: ${agentData.agent_id}`);
  }

  const agent = {
    ...agentData,
    agent_uuid: this.generateAgentUUID(),
    registered_at: new Date().toISOString(),
    status: 'active',
    total_queries: 0,
    total_revenue: 0,
    agent_earned: 0,
    last_query: null,
  };

  this.agents.push(agent);
  this.saveRegistry();

  return {
    agent_uuid: agent.agent_uuid,
    registered_at: agent.registered_at,
    status: agent.status,
    marketplace_url: `https://api.example.com/agent/${agent.agent_uuid}`,
  };
}
The key insight: every field that matters for routing is validated and persisted at onboarding time, not discovered at query time. Specialties, pricing tiers, performance baselines — all declared upfront.
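In isolation, the validation step behaves like this. The sketch below is a stripped-down standalone version of the required-fields check, not the full class:

```javascript
// Standalone sketch of the onboarding-time validation shown above.
function validateManifest(agentData) {
  const required = ['agent_id', 'name', 'specialties', 'pricing', 'performance'];
  for (const field of required) {
    if (!agentData[field]) {
      throw new Error(`Missing required field: ${field}`);
    }
  }
  return true;
}

// A manifest missing its pricing block is rejected before it can enter the registry.
try {
  validateManifest({ agent_id: 'x', name: 'X', specialties: ['a'], performance: {} });
} catch (e) {
  console.log(e.message); // "Missing required field: pricing"
}
```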
The Full Onboarding Flow: Registration as a Gate, Not a Formality
Our onboarding handler enforces a multi-step flow where agents must pass validation before they can receive traffic:
async registerAgent(data) {
  const startTime = Date.now();
  try {
    this.validateRegistrationData(data);

    const credentials = {
      api_key: this.generateApiKey(),
      webhook_secret: this.generateWebhookSecret(),
    };

    const agentProfile = {
      agent_id: this.generateAgentId(data.name),
      agent_uuid: this.generateUUID(),
      name: data.name,
      description: data.description,
      specialties: data.specialties || [],
      webhook_url: data.webhook_url,
      webhook_secret: credentials.webhook_secret,
      pricing: data.pricing || {
        base_cost: 0.05,
        per_token: 0.0001,
        min_query: 0.01,
      },
      status: 'pending_confirmation',
      tier: 'default',
      performance: {
        success_rate: 1.0,
        avg_response_time_ms: 0,
        uptime_percent: 100.0,
        total_queries: 0,
        total_revenue: 0,
      },
      onboarding_step: 'registration_complete',
    };

    this.onboardingData.agents[agentProfile.agent_uuid] = agentProfile;
    this.onboardingData.stats.total_registrations++;
    this.onboardingData.stats.pending_confirmations++;
    this.saveOnboardingData();

    console.log(`✅ Agent registered: ${agentProfile.agent_uuid}`);
    console.log(`   Registration time: ${Date.now() - startTime}ms`);

    return {
      success: true,
      agent_uuid: agentProfile.agent_uuid,
      status: agentProfile.status,
      credentials: {
        api_key: credentials.api_key,
        webhook_secret: credentials.webhook_secret,
      },
    };
  } catch (error) {
    console.error(`❌ Registration failed: ${error.message}`);
    return {
      success: false,
      error: error.message,
      code: error.code || 'REGISTRATION_FAILED',
    };
  }
}
Notice the status: 'pending_confirmation' default. An agent doesn't enter the live routing table until it completes email confirmation and gets promoted to active. The confirmation step flips the status and adds the agent to the main registry — only then can it receive traffic.
This is onboarding as a gate. The registry never contains an agent that hasn't been fully validated.
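The confirmation step itself isn't shown above. A minimal sketch of the promotion logic might look like this — the function and field names here are our assumptions based on the flow described, not the production code:

```javascript
// Hypothetical confirmAgent: flips a pending agent to active and moves it
// into the live routing table. Until this runs, the router never sees it.
function confirmAgent(pendingAgents, registry, agentUuid) {
  const profile = pendingAgents[agentUuid];
  if (!profile || profile.status !== 'pending_confirmation') {
    throw new Error(`No pending agent: ${agentUuid}`);
  }
  profile.status = 'active';
  profile.onboarding_step = 'confirmed';
  registry.push(profile); // now eligible for routed queries
  return profile;
}
```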
Discovery Without a Discovery Service
Query routing uses the same static data with simple filtering:
getDiscoveryList(options = {}) {
  let filtered = this.getActiveAgents();

  if (options.specialty) {
    filtered = filtered.filter(a =>
      a.specialties && a.specialties.includes(options.specialty)
    );
  }
  if (options.min_uptime) {
    filtered = filtered.filter(a =>
      (a.performance?.uptime_percent || 100) >= options.min_uptime
    );
  }
  if (options.max_cost) {
    filtered = filtered.filter(a =>
      (a.pricing?.base_cost || 0.05) <= options.max_cost
    );
  }

  if (options.sort_by === 'reputation') {
    filtered.sort((a, b) => (b.rating || 5) - (a.rating || 5));
  } else if (options.sort_by === 'cost') {
    filtered.sort((a, b) =>
      (a.pricing?.base_cost || 0) - (b.pricing?.base_cost || 0)
    );
  } else if (options.sort_by === 'speed') {
    filtered.sort((a, b) =>
      (a.performance?.avg_response_time_ms || 5000) -
      (b.performance?.avg_response_time_ms || 5000)
    );
  }

  return filtered;
}
This is just array filtering. It runs in microseconds against an in-memory array. No network hop, no cache invalidation, no eventual consistency. The data is always exactly what was validated at onboarding time, plus any stats accumulated from actual query processing.
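A caller combines the filters like this. The sketch below is a standalone reduction of the same idea, with made-up agent data rather than the class above:

```javascript
// Standalone sketch of capability-filtered discovery over static data.
const agents = [
  { agent_id: 'research', specialties: ['web-research'], pricing: { base_cost: 0.05 } },
  { agent_id: 'code', specialties: ['code-analysis'], pricing: { base_cost: 0.10 } },
  { agent_id: 'leads', specialties: ['lead-qualification'], pricing: { base_cost: 0.02 } },
];

function discover(list, { specialty, max_cost } = {}) {
  let filtered = list.slice(); // copy so sorting never mutates the registry
  if (specialty) filtered = filtered.filter(a => a.specialties.includes(specialty));
  if (max_cost) filtered = filtered.filter(a => a.pricing.base_cost <= max_cost);
  return filtered.sort((a, b) => a.pricing.base_cost - b.pricing.base_cost);
}

console.log(discover(agents, { max_cost: 0.05 }).map(a => a.agent_id));
// → [ 'leads', 'research' ]
```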
Consequences — What Worked, What We'd Do Differently
What Worked
Debuggability is exceptional. When a query gets misrouted, we open agent-registry.json, find the agent, and read its specialties array. Total debug time: 15 seconds. We've never had a routing mystery that took more than a minute to diagnose.
Onboarding catches real errors. In the first month, we caught 3 registration attempts with missing specialty declarations and 1 with a malformed pricing block. All rejected cleanly at registration time. With dynamic discovery, these agents would have registered themselves and then silently failed when queries arrived.
The registry file is genuinely useful documentation. New sessions, automated processes, and monitoring tools all read the same JSON file to understand the current agent topology. We reference it in our morning briefing synthesis across all 25 services. One file, one truth.
Zero downtime from the registry itself. Across 60+ scheduled timers and continuous agent operations, the registry has never been a point of failure. It's a file. Files don't have quorum elections.
What We'd Do Differently
Stats updates create write contention. The updateAgentStats method writes back to the same JSON file after every query. With 7 agents handling moderate query volume, this hasn't been a problem. But we can see the ceiling. If we hit sustained high throughput, we'd separate the mutable stats from the immutable manifest — keeping the manifest static and moving counters to an in-memory store that flushes periodically.
No automatic staleness detection. If an agent's webhook endpoint goes down, the registry doesn't know until a query fails. We compensate with our separate uptime monitor (which covers all 25 services), but the registry itself is blissfully unaware. An active health check — even a simple periodic ping — would be a reasonable addition.
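A minimal version of that periodic ping might look like this, assuming each agent exposes its webhook_url and a plain GET is enough to prove liveness (this uses the global fetch available in Node 18+; it is a sketch, not what we run):

```javascript
// Hypothetical health sweep: mark agents unhealthy when their webhook
// endpoint stops answering, and return the ones that failed.
async function sweepHealth(agents, timeoutMs = 3000) {
  await Promise.all(agents.map(async agent => {
    try {
      const res = await fetch(agent.webhook_url, {
        method: 'GET',
        signal: AbortSignal.timeout(timeoutMs),
      });
      agent.healthy = res.ok;
    } catch {
      agent.healthy = false; // connection refused, DNS failure, or timeout
    }
  }));
  return agents.filter(a => !a.healthy);
}
```

Run from a timer, this would give the registry the staleness signal it currently lacks without turning it into a discovery service.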
When to Reconsider
We would revisit this decision if any of these conditions become true:
- Agent count exceeds 30-40. At that point, the human-readable-JSON advantage starts to erode, and the filtering logic needs indexing rather than linear scans.
- Agents start self-deploying. If agents can spin up and register without human involvement, we need automatic discovery because there's no human to validate the manifest.
- Multi-host deployment. Our current single-VPS architecture means the JSON file is always local. The moment the registry needs to be consistent across multiple hosts, we need either a distributed store or an API-based registry service.
- Sub-second topology changes. If agents need to scale horizontally based on load — spinning up 5 instances of the research agent during peak hours — static manifests can't keep up.
None of these conditions are true today. We run 7 agents on one host, topology changes monthly, and a human reviews every new agent before it enters the system.
Conclusion
Dynamic service discovery is a solution to a real problem — but it's a problem that emerges at a specific scale and rate of change. For a system with fewer than a dozen agents, monthly topology changes, and a small operations team, a validated JSON manifest gives you everything dynamic discovery provides (capability routing, filtered lookups, metadata-rich agent profiles) without the operational tax of another distributed system to babysit.
The boring choice was the right choice. Our registry is a file. It loads in milliseconds, it's debuggable by anyone with a text editor, and it has never woken us up at 2 AM. We'll add complexity when our scale demands it — not before.
Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.