The Review Layer Your Autonomous Agent Needs Before It Takes Real-World Actions
Your agent drafts a perfect proposal. It scores an 8 out of 10. The LLM greenlights it. Then it submits the bid to a public marketplace with your company name on it — and it includes an email address in the body text, which gets the entire bid rejected by the platform's content filter.
This is the kind of failure that doesn't show up in demos. It shows up at 2 AM when your autonomous system is running unsupervised and making decisions that hit real APIs with real consequences.
At Ledd Consulting, we run autonomous agents that submit proposals, draft outreach emails, and take actions against external APIs — all without a human in the loop. We learned early that the hard part isn't getting the agent to act. It's getting it to act correctly across every edge case, every time, at 2 AM when nobody's watching.
This post walks through the review architecture we built to solve that problem.
The Pain Point
If you're building agents that call external APIs — placing orders, submitting bids, sending emails, creating tickets — you've already hit the uncomfortable realization: LLMs are confident, fluent, and occasionally wrong in ways that are expensive to undo.
The failure modes are specific and predictable:
- The model includes contact information the platform explicitly blocks
- It bids $5,000 on a project when your account verification caps you at $2,400
- It submits a proposal to a project in a market you don't serve
- It drafts an outreach email and sends it automatically instead of saving a draft for review
- It takes action on a closed or expired listing
Each of these is a real bug we caught. Not in testing — in production.
The core tension: you want the agent to act autonomously (that's the whole point), but every action that touches an external system is potentially irreversible. You can't undo a submitted bid. You can't unsend an email. You can't un-notify a prospect with a malformed message.
Why Common Solutions Fall Short
Most teams reach for one of three patterns, and all of them break down at scale.
Human-in-the-loop for everything. This is the safe default, but it defeats the purpose of automation. If a human has to approve every action, you've built a suggestion engine, not an autonomous agent. We tried this first. At 15-20 actions per day across multiple pipelines, the approval queue became a bottleneck that negated all the efficiency gains.
Confidence thresholds on the LLM output. "Only act if the model is more than 90% confident." The problem is that LLM confidence doesn't correlate with correctness in the way you'd hope. A model will confidently submit a bid with a phone number embedded in the text because it doesn't know the platform blocks contact information. Confidence measures reasoning quality, not domain constraint awareness.
Post-hoc monitoring and rollback. Log everything, alert on anomalies, fix it after the fact. This works for internal systems where you control both sides. It doesn't work for external APIs. There is no "rollback" on a Freelancer bid or a sent email.
What we needed was a system that applies domain-specific quality gates before the irreversible action, uses an LLM reviewer when judgment is required, and falls back to hard programmatic constraints for things the LLM should never decide.
Our Approach
We built a three-stage review pipeline that separates concerns cleanly:
- Programmatic gates — hard filters that run before the LLM even sees the proposal
- LLM-as-judge — a dedicated model call that reviews and rewrites before submission
- Output sanitization — post-LLM cleaning that catches what the reviewer misses
The key insight: the LLM reviewer is one layer, not the only layer. We don't trust it to enforce account limits or content policies. We use code for constraints and the LLM for judgment.
Model Routing
Every task in our system routes through a central model selector that matches task complexity to model capability:
const ROUTES = {
  // Tier 1: Haiku — fast, structured tasks
  'exploration': 'haiku',
  'classification': 'haiku',
  'daily-plan': 'haiku',
  'social-post': 'haiku',

  // Tier 2: Sonnet — reasoning-heavy decisions
  'bid-review': 'sonnet',
  'proposal-draft': 'sonnet',
  'cold-outreach': 'sonnet',
  'action-extract': 'sonnet',

  // Tier 3: Opus — high-stakes, long-form
  'blog-post': 'opus',
  'strategy': 'opus',
};

const DEFAULT_MODEL = 'sonnet';

function routeModel(taskType) {
  return ROUTES[taskType] || DEFAULT_MODEL;
}
Bid review routes to Sonnet (Tier 2) because it requires genuine reasoning — evaluating project fit, rewriting proposal text, deciding on pricing. But we don't use Opus for it, because the structured output format keeps the task bounded. Model routing isn't about throwing your best model at everything. It's about matching capability to decision complexity.
Implementation
Stage 1: Programmatic Gates
Before any LLM call, we run five hard filters. If any of them fail, the proposal is rejected immediately with a logged reason. No tokens spent.
Score threshold:
var proposalScore = proposal.score || 0;
if (typeof proposalScore === "number" && proposalScore > 0 && proposalScore < 7) {
  console.log("  SKIP: Score " + proposalScore + "/10 below minimum threshold of 7");
  appendReviewLog({
    title: proposal.title, projectId: projectId,
    action: "skipped", reason: "score " + proposalScore + " below threshold 7",
    score: proposalScore
  });
  return;
}
The score itself comes from an upstream drafter that runs a weighted skill-matching algorithm — primary skill hits score +2 (max 6 points), secondary skills +1 (max 3), keyword bonus +0.5, budget alignment +1. This isn't LLM-generated confidence; it's deterministic scoring against a defined skill profile.
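The drafter's scorer isn't shown here, but it condenses to a few lines. A sketch with an illustrative profile shape (the skill lists and field names are placeholders, and the keyword bonus is applied once here, where the original may apply it per keyword; the increments and caps match the description above):

```javascript
function scoreProject(project, profile) {
  var text = ((project.title || "") + " " + (project.description || "")).toLowerCase();
  var score = 0;

  // Primary skill hits: +2 each, capped at 6 points
  var primary = profile.primary.filter(function (s) { return text.includes(s); }).length;
  score += Math.min(primary * 2, 6);

  // Secondary skill hits: +1 each, capped at 3 points
  var secondary = profile.secondary.filter(function (s) { return text.includes(s); }).length;
  score += Math.min(secondary, 3);

  // Keyword bonus: +0.5 if any profile keyword appears
  if (profile.keywords.some(function (k) { return text.includes(k); })) score += 0.5;

  // Budget alignment: +1 if the project's minimum clears the profile floor
  if (project.budget && project.budget.minimum >= profile.minBudget) score += 1;

  return Math.min(score, 10);
}
```

A project hitting two primary skills, one secondary skill, one keyword, and the budget floor scores 4 + 1 + 0.5 + 1 = 6.5, which still fails the threshold of 7. Partial matches are exactly what the gate is there to catch.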
Budget floor enforcement:
var projBudgetMin = project.budget && project.budget.minimum
  ? project.budget.minimum : 0;
var projType = project.type || "fixed";

if (projType === "hourly" && projBudgetMin < 15) {
  appendReviewLog({ title: proposal.title, projectId: projectId,
    action: "skipped", reason: "hourly budget $" + projBudgetMin + " below $15/hr" });
  return;
}
if (projType !== "hourly" && projBudgetMin > 0 && projBudgetMin < 100) {
  appendReviewLog({ title: proposal.title, projectId: projectId,
    action: "skipped", reason: "fixed budget $" + projBudgetMin + " below $100" });
  return;
}
Geographic filter — a regex-based location classifier that checks project location, client country, title, and description against a known non-US location list:
var locationStr = [
  projectLocation, clientCountryName,
  project.title || "",
  (project.description || "").substring(0, 500)
].join(" ");

if (!isUSOrRemote(locationStr)) {
  appendReviewLog({ title: proposal.title, projectId: projectId,
    action: "skipped", reason: "non-US location: " + (clientCountryName || "unknown") });
  return;
}
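isUSOrRemote itself is the regex classifier mentioned above. A trimmed-down sketch (the production blocklist is much longer, and the default-allow for unrecognized locations is an assumption in this sketch):

```javascript
function isUSOrRemote(locationStr) {
  // Explicit US or remote signals pass immediately
  var allowed = /\b(usa|united states|remote|anywhere|worldwide)\b/i;
  if (allowed.test(locationStr)) return true;

  // Known non-US markers fail the gate; anything unrecognized passes (assumption)
  var blocked = /\b(india|pakistan|bangladesh|philippines|united kingdom|germany|australia)\b/i;
  return !blocked.test(locationStr);
}
```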
We also verify the project is still active (project.status !== "active" → skip). These five gates filter roughly 60-70% of candidates before the LLM reviewer runs.
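Taken together, Stage 1 is just an ordered list of cheap predicates. A condensed sketch of the wiring, with field names flattened from the snippets above:

```javascript
// Each gate returns a skip reason string, or null to pass
const gates = [
  p => (p.score > 0 && p.score < 7 ? "score below threshold 7" : null),
  p => (p.type === "hourly" && p.budgetMin < 15 ? "hourly budget below $15/hr" : null),
  p => (p.type !== "hourly" && p.budgetMin > 0 && p.budgetMin < 100 ? "fixed budget below $100" : null),
  p => (!p.isUSOrRemote ? "non-US location" : null),
  p => (p.status !== "active" ? "project not active" : null),
];

function runGates(candidate) {
  for (const gate of gates) {
    const reason = gate(candidate);
    if (reason) return { pass: false, reason }; // first failure wins; no LLM tokens spent
  }
  return { pass: true };
}
```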
Stage 2: LLM-as-Judge
For proposals that pass all programmatic gates, we invoke a dedicated review model. The critical design decision: the reviewer gets explicit decision authority and a structured output format.
var prompt = [
  "You are reviewing a proposal for Ledd Consulting.",
  "",
  "DECISION AUTHORITY: You have FULL authority to submit or reject this bid.",
  "No human approval needed.",
  "",
  "PROJECT DETAILS:",
  "- Title: " + project.title,
  "- Type: " + projectType,
  "- Budget: $" + budgetMin + " - $" + budgetMax,
  "- Bid count: " + bidCount,
  "- Description: " + projectDesc,
  "",
  "ACCOUNT CONSTRAINTS:",
  "- Max hourly rate: $45/hr",
  "- Max fixed bid: $2400",
  "",
  "RESPOND WITH EXACTLY ONE LINE:",
  "SUBMIT|<amount>|<period_days>|<proposal_text>",
  "or",
  "REJECT|<reason>",
].join("\n");
Three things matter here:
Explicit authority statement. Without "You have FULL authority to submit or reject", the model hedges. It returns preamble like "I would recommend..." instead of a clean decision. Telling the model it has authority produces more decisive, parseable output.
Account constraints in the prompt. Even though we validate these programmatically after the LLM responds, including them in the prompt reduces the rate of invalid responses from ~15% to ~3%. The LLM respects constraints it knows about.
Strict output format. SUBMIT|amount|period|text or REJECT|reason. One line. This makes parsing reliable — but we still handle the case where the model wraps its response in explanation:
function parseDecision(result) {
  var line = result.trim();

  // Fallback: extract the decision line when the model buries it in preamble
  if (!line.startsWith("SUBMIT|") && !line.startsWith("REJECT|")) {
    var match = result.match(/^((?:SUBMIT|REJECT)\|.+)$/m);
    if (!match) return null; // unparseable: logged and counted as an error upstream
    line = match[1];
  }

  if (line.startsWith("REJECT|")) {
    return { decision: "reject", reason: line.slice("REJECT|".length) };
  }

  var parts = line.split("|");
  return {
    decision: "submit",
    amount: parseFloat(parts[1]),
    period: parseInt(parts[2], 10) || 14,
    text: parts.slice(3).join("|") // proposal text may itself contain "|"
  };
}
This fallback extraction catches roughly 5-8% of responses where the model adds preamble despite being told not to. Without it, those would all become errors.
Stage 3: Output Sanitization
Even after the LLM reviewer rewrites the proposal text, we run programmatic sanitization before the API call. This is the layer that catches what the model doesn't know to avoid:
function sanitizeBidText(text) {
  return text
    .replace(/[\w.-]+@[\w.-]+\.\w{2,}/g, '')                      // emails
    .replace(/https?:\/\/\S+/g, '')                               // URLs
    .replace(/www\.[\w.-]+\.\w{2,}/g, '')                         // URLs without protocol
    .replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, '')                // phone numbers
    .replace(/email\s+(?:me|us)\s+at[^\n]*/gi, '')                // "email me at..."
    .replace(/visit\s+(?:our|my)\s+(?:website|site)[^\n]*/gi, '')
    .replace(/contact\s+(?:us|me)\s+at[^\n]*/gi, '')
    .replace(/calendly[^\n]*/gi, '')                              // scheduling links
    .replace(/\n{3,}/g, '\n\n')                                   // collapse whitespace
    .trim();
}
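A quick demonstration of what the sanitizer strips from a typical offending draft (the function is repeated so the snippet stands alone):

```javascript
function sanitizeBidText(text) {
  return text
    .replace(/[\w.-]+@[\w.-]+\.\w{2,}/g, '')                      // emails
    .replace(/https?:\/\/\S+/g, '')                               // URLs
    .replace(/www\.[\w.-]+\.\w{2,}/g, '')                         // URLs without protocol
    .replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, '')                // phone numbers
    .replace(/email\s+(?:me|us)\s+at[^\n]*/gi, '')                // "email me at..."
    .replace(/visit\s+(?:our|my)\s+(?:website|site)[^\n]*/gi, '')
    .replace(/contact\s+(?:us|me)\s+at[^\n]*/gi, '')
    .replace(/calendly[^\n]*/gi, '')                              // scheduling links
    .replace(/\n{3,}/g, '\n\n')                                   // collapse whitespace
    .trim();
}

const dirty = "I can build this.\nEmail me at jane@example.com or call 555-123-4567.\nPortfolio: https://example.com/work";
const clean = sanitizeBidText(dirty);
// clean keeps the pitch but drops the email, phone number, and URL
```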
We also enforce account limits after the LLM response, as a hard stop:
if (projectType === "hourly" && amount > 45) {
  appendReviewLog({ title: project.title, projectId: projectId,
    action: "rejected", reason: "rate exceeds limit" });
  return;
}
if (projectType !== "hourly" && amount > 2400) {
  appendReviewLog({ title: project.title, projectId: projectId,
    action: "rejected", reason: "bid exceeds limit" });
  return;
}
The model is told the limits. It usually respects them. But "usually" isn't good enough for an autonomous system making financial commitments.
The Action Category Gate
Beyond individual bid review, we apply the same pattern to our broader action pipeline. When our research system extracts actionable tasks, we filter by category and urgency before any task gets submitted for autonomous execution:
const submittable = actions
  .filter(a => ['APPLY', 'CODE', 'OUTREACH', 'CONTENT'].includes(a.category)
    && ['high', 'medium'].includes(a.urgency))
  .slice(0, 5);
And for categories with higher risk — outreach and content — we explicitly route to draft mode instead of auto-execution:
if (task.category === 'OUTREACH') {
  taskDescription = `OUTREACH TASK — Draft an email or message, do NOT send automatically.
Save the draft for human review.`;
}
if (task.category === 'CONTENT') {
  taskDescription = `CONTENT TASK — Draft content for review.
Do NOT publish. Save for human review.`;
}
This is a graduated authority model: some categories (APPLY, CODE) get full autonomous execution. Others (OUTREACH, CONTENT) get autonomous drafting but require human sign-off before external action. The pipeline doesn't treat all actions equally because the consequences aren't equal.
Results
Across our production pipeline with 25 services and 60+ automated timers running daily:
- Programmatic gates reject 60-70% of candidates before spending any LLM tokens
- LLM reviewer processes the remaining 30-40% with a structured SUBMIT/REJECT decision
- Output sanitization catches contact information in ~12% of LLM-generated bid text
- Account limit validation catches over-budget responses in ~3% of SUBMIT decisions
- Fallback parsing recovers valid decisions from ~5-8% of non-conforming LLM responses
- Zero platform rejections since implementing the sanitization layer — previously, contact info in bid text was our #1 API error
The three-stage approach means each layer only needs to catch what the previous layer missed. Programmatic gates handle the obvious. The LLM handles judgment. Sanitization handles platform-specific constraints the LLM wasn't trained on.
Adapting This for Your System
The pattern generalizes to any agent that takes external actions:
1. Identify your irreversible actions. What API calls can't be undone? Submitted orders, sent messages, created records, financial transactions. These are your review pipeline targets.
2. Separate constraints from judgment. Budget limits, rate caps, geographic restrictions, content policies — these are rules, not judgment calls. Enforce them in code, not prompts. Tell the LLM about them (it reduces invalid responses), but don't trust it to enforce them.
3. Design your output format for parsing. SUBMIT|amount|period|text is ugly but unambiguous. JSON works too. Whatever you choose, always implement fallback extraction for when the model wraps its answer in explanation text.
4. Graduate authority by consequence. Not every action category needs the same review depth. Auto-execute low-risk actions. Auto-draft high-risk actions. Require human approval for irreversible high-stakes actions. Match the review layer to the blast radius.
5. Log everything with structured entries. Every gate, every decision, every skip. Our review log captures action, reason, score, amount, and timestamp for every proposal that enters the pipeline. This data feeds back into a win/loss analyzer that continuously improves the system's scoring and review criteria.
6. Build the feedback loop. We track proposal outcomes — wins and losses — and feed them back into an AI-powered analyzer that updates our proposal playbook. The review pipeline doesn't just filter; it learns which patterns lead to successful outcomes and adjusts scoring calibration over time.
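Condensed into one skeleton, the whole pattern reads like this. The judge is injected as a plain function so it can be stubbed in tests; in production it is the LLM call (all names here are illustrative, not our exact API):

```javascript
function reviewPipeline(candidate, opts) {
  // Stage 1: programmatic gates, cheapest checks first
  for (const gate of opts.gates) {
    const reason = gate(candidate);
    if (reason) return { action: "skip", reason };
  }

  // Stage 2: LLM-as-judge, returning a structured decision
  const decision = opts.judge(candidate);
  if (decision.action !== "submit") return decision;

  // Hard limit check: never trust the judge with financial constraints
  if (decision.amount > opts.maxAmount) {
    return { action: "reject", reason: "amount exceeds limit" };
  }

  // Stage 3: platform-specific output sanitization before the API call
  return { action: "submit", amount: decision.amount, text: opts.sanitize(decision.text) };
}
```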
Conclusion
The uncomfortable truth about autonomous agents is that the hard engineering isn't the agent itself — it's the review layer around it. The agent is the easy part. Making it trustworthy at 2 AM with nobody watching is the real work.
The pattern is straightforward: programmatic gates for constraints, LLM-as-judge for decisions, output sanitization for platform-specific rules, and graduated authority based on consequence severity. Each layer catches what the previous one missed. None of them alone is sufficient.
We've been running this architecture in production across 25 services with full autonomous authority. The review pipeline is what makes that possible — not by preventing the agents from acting, but by ensuring that when they act, they act correctly.
Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.