Enterprise AI Cost Router  ·  Self-Hosted  ·  Open Source
Now tracking 55 models across 13 vendors · Weekly price sync

Route every request
to its optimal model.

Free and open-source for every organisation — any size, any industry. Tidus selects the cheapest capable model for every AI request, automatically. 5-stage intelligence. 70% cost savings. Fully self-hosted. Every dollar saved stays with you.

55
Models Tracked
13
AI Vendors
70%
Cost Weight in Routing
5
Cost-Control Pillars

Your AI requests, always on the optimal model

Tidus sits in front of your AI workloads as a self-hosted FastAPI service. You call one endpoint — Tidus picks the right model automatically.

Most teams waste money sending every request to the same premium model. A simple classification task costs $15/1M input tokens on Claude Opus, while Gemini 2.0 Flash returns an equivalent result at $0.10/1M.

Tidus analyses each request's complexity, privacy requirements, capability needs, and budget constraints — then scores every eligible model using a weighted algorithm (70% cost, 20% tier, 10% latency) to find the optimal route.

Because Tidus is fully self-hosted and open source, no request data ever leaves your environment. It works with any AI vendor and supports local Ollama models for confidential workloads where cloud APIs are prohibited.

The weekly pricing registry keeps costs accurate without manual intervention — syncing prices from multiple sources every Sunday, detecting outliers with MAD-based consensus, and creating versioned, auditable revisions.

tidus_client.py
# Before Tidus — every call goes to GPT-4o
response = openai.chat.completions.create(
  model="gpt-4o",  # $2.50/1M — always
  messages=messages,
)

# After Tidus — optimal model selected per request
response = await tidus.route(
  messages=messages,
  complexity="simple",
  team_id="analytics",
)
# → Gemini 2.0 Flash at $0.10/1M ✓ 96% cheaper

# Critical task? Tidus routes to Tier 1 automatically
response = await tidus.route(
  messages=messages,
  complexity="critical",
)
# → Claude Opus 4.6 ✓ best model for the job

How Tidus reduces AI spend by 60–80%

Five complementary controls — each tackling a distinct source of AI cost waste. Together they form a complete governance layer over your AI infrastructure.

Pillar 1
🏗️
Tiered Model Strategy
Route every task to the cheapest capable tier. Tidus enforces a 4-tier hierarchy: T1 Premium (Claude Opus, o3), T2 Advanced (Claude Sonnet, GPT-4o), T3 Economy (Haiku, GPT-4.1-mini), T4 Budget (local Ollama). Task complexity sets the ceiling — simple tasks can use any tier; critical tasks are restricted to T1 only. This single rule removes the most expensive models from 80% of typical workloads without any manual configuration.
4-tier hierarchy · Complexity ceiling · T4 local/free models · ~99% savings on simple tasks
Pillar 2
🧠
Router Agent Intelligence
A 5-stage selector decides the optimal model before any compute runs. Stage 1 filters by capability and context fit. Stage 2 enforces operator guardrails. Stage 3 applies the tier ceiling. Stage 4 checks team budget. Stage 5 scores survivors: cost ×0.70 + tier ×0.20 + latency ×0.10. The best-value model wins. Decision overhead is sub-millisecond — the routing cost is negligible compared to the vendor API savings.
5-stage pipeline · Pre-compute decision · Deterministic · Sub-ms overhead
Pillar 3
Cache Everything
Two caching layers prevent paying twice for identical work. Layer 1 — Exact cache: SHA-256 keyed by team + messages + model. Same team, same prompt, zero vendor cost on repeat. Layer 2 — Semantic cache: all-MiniLM-L6-v2 embeddings with 95% cosine similarity threshold catch "same question, different wording." Both layers are team-scoped — Team A's cache never leaks to Team B. Confidential-tagged requests bypass caching entirely.
SHA-256 exact match · Semantic similarity · Redis backend · Zero vendor cost on hit
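The exact-match layer can be sketched as below. The text confirms SHA-256 keyed by team + messages + model; the JSON canonicalisation and function name are assumptions for illustration.

```python
import hashlib
import json

def exact_cache_key(team_id, messages, model):
    """Team-scoped exact-match cache key (sketch).

    Serialising with sorted keys and fixed separators makes the key
    deterministic, so the same team + prompt + model always hits the
    same Redis entry. Field names here are illustrative assumptions.
    """
    payload = json.dumps(
        {"team": team_id, "messages": messages, "model": model},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the team ID is part of the hash input, Team A's entries can never collide with Team B's, which is what keeps the cache team-scoped.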
Pillar 4
🛡️
Agent Autonomy Limits
Agentic workflows compound costs exponentially if unchecked. Tidus enforces hard limits at every level — all configurable in policies.yaml and enforced before any API call is made: max_agent_depth (default 5) prevents infinite recursion loops; max_tokens_per_step (default 8,000) caps per-step cost uniformly across all models; max_retries_per_task (default 3) stops retry storms from multiplying costs; max_parallel_sessions_per_team (default 10) prevents concurrency explosions. Each violated guardrail produces a named rejection reason in the API response — no silent failures.
Depth limits · Token caps · Retry limits · Concurrency limits · Pre-execution enforcement
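The four limits above map to a policies.yaml fragment like the following. The default values come from the text; the exact file layout is an assumption.

```yaml
# policies.yaml — defaults as described above; nesting is illustrative
max_agent_depth: 5                  # stops infinite recursion loops
max_tokens_per_step: 8000           # caps per-step cost across all models
max_retries_per_task: 3             # stops retry storms
max_parallel_sessions_per_team: 10  # prevents concurrency explosions
```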
Pillar 5
🔌
Vendor-Agnostic Design
8 production adapters running today (Anthropic, OpenAI, Google, Mistral, DeepSeek, xAI, Moonshot/Kimi, Ollama local). Five more vendors, including Cohere and Qwen, are tracked in pricing but not yet wired as live adapters. The MCP server connects Claude Desktop, Cursor, Zed, and any MCP-compatible client directly. Swapping vendors requires one YAML edit — not rewriting your application. Vendor portability is itself a cost control: avoiding lock-in is the most effective long-term AI pricing strategy.
8 adapters live · 5 in progress · MCP server · OpenAI-compatible API · Vendor portability

Tidus sits between your apps and the AI vendors

A transparent middleware layer deployed on your own infrastructure. Your existing code needs one URL change — no SDK swaps, no architecture redesign.

Your Organization — Internal Applications & Teams
🤖 AI Agents
💬 Internal Chatbot
👨‍💻 Developer Tools
📄 Document Processing
🔍 RAG / Retrieval
🏢 Business Workflows
✦ Uses standard OpenAI Python / JS SDK — no SDK changes required
↓   POST /v1/chat/completions  ·  OpenAI-compatible endpoint
Tidus — Hosted On Your Own Infrastructure (Docker / On-Prem / VPC)
🧠 5-Stage Router
⚡ Cache Layer
🛡️ Budget Guardrails
📊 Model Registry
🔒 Your messages, prompts, team IDs, and usage patterns never leave this layer — only the selected vendor API call exits your infrastructure
↓   Routes to the cheapest capable model based on complexity, capability match, and team budget
AI Vendors — Accessed Only When Needed, Only for the Chosen Request
Anthropic: T1 · Premium
OpenAI: T1–T4
Google: T2–T4
Mistral: T2–T3
DeepSeek: T3
xAI / Groq: T1–T3
Local Ollama: T4 · Free
Before Tidus — single vendor, hardcoded model
from openai import OpenAI
client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o",  # hardcoded — always expensive
    messages=[...]
)
# Every task hits GPT-4o at $2.50/1M input
# Simple summary? Still $2.50/1M. No choice.
After Tidus — one URL change, automatic routing
from openai import OpenAI
client = OpenAI(
    base_url="http://tidus:8000/v1",
    api_key="your-team-api-key"
)
response = client.chat.completions.create(
    model="auto",  # Tidus picks the best model
    messages=[...]
)
# Simple task → $0.039/1M  Critical → GPT-4o

What happens inside Tidus on every request

Step 1
📥
Receive Request
FastAPI endpoint accepts messages, complexity hint, capability tags, team ID, and optional privacy flag
Step 2
🔍
Filter Candidates
Capability matching removes ineligible models. Tier ceiling from complexity. Privacy flag forces local-only. Budget check removes over-limit models
Step 3
💰
Estimate Cost
Token count × registry price per model. 15% safety buffer applied. Prices from weekly-synced active revision
Step 4
📊
Score & Rank
score = 0.70 × cost + 0.20 × tier + 0.10 × latency. Min-max normalised across candidates. Lowest score wins.
Step 5
Route & Record
Winning model receives the request. Actual tokens and cost recorded to cost_records. Telemetry updates latency P50.

Five stages from request to optimal model

Every routing decision follows a deterministic pipeline. Each stage either eliminates models or scores them — no randomness, no black-box AI decisions.

Stage 1
🔒
Hard Constraints
Enabled check · context window · domain capability · privacy (confidential → local only) · complexity range
Stage 2
🛡️
Guardrails
Agent depth ≤ 5 · tokens per step ≤ 8,000 — operator-defined limits applied uniformly across all candidates
Stage 3
📊
Tier Ceiling
simple → any tier · moderate → T1–3 · complex → T1–2 · critical → T1 only. Blocks over-engineering cheap tasks.
Stage 4
💰
Budget Filter
Per-request cost cap + team monthly budget. Estimate = (input × price + output × price) × 1.15 buffer
Stage 5
🏆
Score & Select
Survivors scored: cost × 0.70 + tier × 0.20 + latency × 0.10. Min-max normalised. Lowest score wins.
Stage 5 Scoring — how the winner is picked
score = cost_norm × 0.70  +  tier_norm × 0.20  +  latency_norm × 0.10

Each dimension is min-max normalised across the surviving candidates — scores are relative to the competition, not absolute. A $15/1M model that is the cheapest survivor scores 0.0 on cost. A deprecated model receives a flat +0.15 penalty added after normalisation — it can still win if significantly cheaper than all alternatives. If any stage reduces the eligible set to zero, Tidus raises a structured error naming the stage and every rejection reason: no silent failures, no fallback to a wrong model.

Worked Example — code task, moderate complexity, 2,000 input + 500 output tokens
Model · Tier · Est. Cost · cost_norm · tier_norm · lat_norm · Score
claude-sonnet-4-6 · T2 · $0.0155 · 1.00 · 0.00 · 1.00 · 0.80
gemini-2.5-flash · T3 · $0.00212 · 0.02 · 1.00 · 0.13 · 0.23
gpt-4.1-mini ✓ Winner · T3 · $0.00184 · 0.00 · 1.00 · 0.00 · 0.20
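Following the min-max normalisation described for Stage 5, the selection step can be sketched as below. `select_model` is a hypothetical name, and the P50 latency figures in the usage example are illustrative assumptions (only the costs and tiers come from the worked example).

```python
def minmax(values):
    """Min-max normalise to [0, 1]; identical values all map to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]

def select_model(candidates, weights=(0.70, 0.20, 0.10), deprecated=()):
    """candidates: {model_id: (est_cost_usd, tier, p50_latency_ms)}.

    Each dimension is normalised across the surviving pool, weighted,
    and summed; the lowest total score wins.
    """
    ids = list(candidates)
    cost_n = minmax([candidates[m][0] for m in ids])
    tier_n = minmax([candidates[m][1] for m in ids])
    lat_n = minmax([candidates[m][2] for m in ids])
    scores = {}
    for m, c, t, l in zip(ids, cost_n, tier_n, lat_n):
        s = weights[0] * c + weights[1] * t + weights[2] * l
        if m in deprecated:
            s += 0.15  # flat deprecation penalty, added after normalisation
        scores[m] = s
    return min(scores, key=scores.get), scores

# Worked-example costs and tiers; latencies are assumed for illustration
candidates = {
    "claude-sonnet-4-6": (0.0155, 2, 1800),
    "gemini-2.5-flash": (0.00212, 3, 617),
    "gpt-4.1-mini": (0.00184, 3, 440),
}
winner, scores = select_model(candidates)
# gpt-4.1-mini wins with the lowest score
```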

55 curated models — not the entire market. Here's why.

There are hundreds of AI models published across Hugging Face, vendor APIs, and inference platforms. Most are research checkpoints, experimental fine-tunes, or deprecated variants with no stable public pricing. Tidus tracks 55 commercially stable, enterprise-accessible models (45 currently enabled) — the ones that actually matter for production routing. The curation is deliberate, and the pricing data comes from two independent sources that cross-check each other every Sunday.

Why 55, not 400+? Tidus only tracks models that are commercially available today, priced per token with a public API, accessible without a waitlist, and stable enough for production routing. The catalog prioritises quality of routing over quantity of options.

The catalog grows continuously. Adding a new model is 3 lines in hardcoded_source.py — the model ID, input price, and output price. Community pull requests are welcome. If a vendor releases a new stable model, open a PR and it will be tracked in the next weekly sync.

➕ To add a model: "gemini-4.0-flash": {"input": 0.0002, "output": 0.0008} — one entry in hardcoded_source.py · priced in $/1K tokens

13 Vendors Currently Tracked

Model counts reflect the active catalog as of April 2026.

Anthropic · 4 models
Safety-first LLMs for enterprise. Known for Constitutional AI and the Claude family — industry benchmark for reasoning and instruction-following.
OpenAI · 10 models
The broadest model portfolio: GPT flagship chat models, o-series deep reasoning models, Codex coding specialists, and ultra-cheap inference models like gpt-oss-120b.
Google DeepMind · 6 models
Gemini multimodal AI — strong at long-context, vision, and code. Tightly integrated with Google Cloud and Workspace. Gemini 2.5 Pro leads on context window size.
Mistral AI · 7 models
European open-weight pioneer. Specialises in efficient code models (Codestral, Devstral) and multilingual LLMs. Second-largest model count in the Tidus catalog, after OpenAI.
DeepSeek · 3 models
Chinese AI lab driving aggressive pricing competition. DeepSeek-R1 matches frontier reasoning at a fraction of the cost. Consistently the cheapest non-local option for complex tasks.
Alibaba (Qwen) · 3 models
Qwen series from Alibaba DAMO Academy. Strong multilingual performance — especially Chinese, Arabic, and Southeast Asian languages. Tracked in pricing but not yet wired as a live adapter.
Ollama (local) · 1 model
Self-hosted local inference — the only adapter that keeps confidential data fully on-prem. Any task marked privacy=confidential is routed exclusively to Ollama.
Moonshot (Kimi) · 1 model
Chinese frontier lab behind the Kimi family. Kimi-K2.5 supports up to a 200K-token context and is competitive with Claude and Gemini on document-heavy, long-context workloads.
xAI · 3 models
Grok models from Elon Musk's lab. Real-time access to X (Twitter) data. Grok-4 sits at the premium tier alongside Grok-3 and Grok-3-Fast — blended cost $9/1M ($3 input, $15 output, averaged).
Groq · 2 models
LPU (Language Processing Unit) inference hardware built for speed, not model training. Hosts open-source models (DeepSeek-R1, Llama 4 Maverick) at industry-leading throughput.
Cohere · 2 models
Enterprise NLP specialist. Command models optimised for RAG (retrieval-augmented generation), summarisation, and structured enterprise workflows. Strong in regulated industries.
Perplexity · 2 models
Search-augmented AI with built-in real-time web access. Sonar models return cited answers grounded in live search results — distinct from pure generation models.
Together AI · 1 model
Open-source model hosting and fine-tuning platform. Offers the Llama 4 Maverick at $0.27/1M — one of the cheapest capable models in the economy tier.

How Pricing Data Is Obtained

Two independent sources are queried every Sunday. A MAD-based consensus algorithm cross-checks them and rejects outliers before any revision is created.

Always Available
HardcodedSource
confidence = 0.70
A built-in verified price table covering all 55 models. Prices are manually confirmed against official vendor pricing pages — not scraped, not inferred. Updated at least weekly. This source always returns data, so routing continues even if all external sources fail. It is the ground-truth fallback for the entire system.
Optional — Env Var
TidusPricingFeedSource
confidence = 0.85
Enabled by setting TIDUS_PRICING_FEED_URL. Sends a single GET /prices?schema_version=1 — no customer data, no messages, no team IDs. Supports HMAC-SHA256 signature verification to prevent feed tampering. A circuit breaker opens after 5 consecutive failures and resets after 5 minutes, so a feed outage never blocks routing.
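The feed's failure handling can be sketched as a minimal circuit breaker: 5 consecutive failures open it, and it resets after 5 minutes, as described above. The class shape and names are assumptions.

```python
import time

class CircuitBreaker:
    """Sketch of the pricing-feed circuit breaker.

    Opens after `threshold` consecutive failures; allows traffic
    again once `reset_after` seconds have elapsed.
    """

    def __init__(self, threshold=5, reset_after=300):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self, now=None):
        """Return True if a feed request may be attempted."""
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.reset_after:
            self.opened_at, self.failures = None, 0  # reset window elapsed
            return True
        return False  # breaker open: skip the feed, fall back to hardcoded

    def record(self, ok, now=None):
        """Record the outcome of a feed request."""
        now = time.monotonic() if now is None else now
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.threshold:
            self.opened_at = now
```

While the breaker is open, routing simply continues on the HardcodedSource, which is why a feed outage never blocks routing.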
Consensus Algorithm
MAD Outlier Detection
Modified Z-Score · threshold = 3.5
When both sources provide a quote for the same model, the Modified Z-Score (Median Absolute Deviation) detects statistical outliers and rejects them. If one source shows $0.27/1M and the other shows $2.70/1M, the outlier is discarded — not averaged. If all sources agree exactly (MAD = 0), all are accepted. Only after consensus passes does a new versioned revision get written to the DB.
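The consensus check can be sketched with the standard Modified Z-Score formula (the 0.6745 scaling constant is the conventional one; function names are illustrative, and three quotes are used here so an outlier is well defined):

```python
import statistics

def modified_z_scores(quotes):
    """Modified Z-score using the Median Absolute Deviation (MAD).

    0.6745 scales the MAD to be consistent with the standard deviation
    for normally distributed data.
    """
    med = statistics.median(quotes)
    mad = statistics.median(abs(q - med) for q in quotes)
    if mad == 0:
        return [0.0] * len(quotes)  # all sources agree exactly: accept all
    return [0.6745 * (q - med) / mad for q in quotes]

def consensus(quotes, threshold=3.5):
    """Keep only quotes whose Modified Z-score is within the threshold."""
    return [q for q, z in zip(quotes, modified_z_scores(quotes))
            if abs(z) <= threshold]
```

Note the outlier is discarded, never averaged in, matching the behaviour described above.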
🔒
Your data never leaves your infrastructure
The pricing feed only receives pricing queries — never your messages, prompts, team names, or usage data. Routing computation always runs locally. This is the npm-registry model: pricing data is centralised, execution is on-prem.

How to deploy Tidus in your firm

From git clone to first routed request in under 5 minutes. Runs on Docker, works with SQLite in development, PostgreSQL in production. No cloud dependencies.

1
📦
Clone & Start
Clone the repo and launch with Docker Compose. Tidus starts immediately with SQLite — no database setup needed for evaluation.
git clone https://github.com/kensterinvest/tidus
cd tidus
docker compose up
2
🔑
Add API Keys
Add vendor API keys to .env. Enable vendors and set spending limits for each team in config/models.yaml and config/budgets.yaml.
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
3
🔌
Connect Your Apps
Point any OpenAI-compatible SDK at your Tidus host. No other code changes. Existing chatbots, agents, and pipelines work immediately.
base_url="http://tidus:8000/v1"
model="auto"
4
📈
Monitor & Optimise
Real-time cost dashboard at /dashboard/. Weekly savings reports via API. Prometheus metrics for alerting. Drift detection auto-disables misbehaving models.
http://tidus:8000/dashboard/
🐳
Docker + Docker Compose
The only hard prerequisite. Python runtime and the database are fully containerised.
🔑
One or More Vendor Keys
Anthropic, OpenAI, Google, or any supported vendor. Local Ollama works with no API key at all.
🗄️
PostgreSQL for Production
SQLite works for evaluation and demos. Production deployments set DATABASE_URL to a PostgreSQL instance.

Weekly AI Market Intelligence

Tidus continuously tracks prices across 55 models and 13 vendors — because accurate pricing is what makes intelligent routing possible. Every Sunday we generate this market intelligence report to summarise what changed, which models rose or fell, and where routing teams can capture new savings this week.

AI Model Market Intelligence · Weekly Edition
April 2026 Update: 1 New Model
Week of April 19, 2026 · Issue #7
55
Models Tracked
0
Price Drops
0
Price Rises
0
Models Updated
1
New Models
Executive Summary: 1 new model added to the catalog: grok-4. The catalog spans gpt-oss-120b at $0.039/1M input (cheapest) to claude-opus-4-7 at $15.00/1M blended (most expensive).
📉 Price Changes This Week
grok-4
xAI · New Catalog Entry
Input: $3.000 / 1M · NEW
Output: $15.000 / 1M
💡 Cost Optimisation Opportunities
💰 Economy pick: mistral-nemo
At $0.150/1M blended, mistral-nemo is 100× cheaper than claude-opus-4-7 ($15.00/1M). Ideal for classification, summarisation, and simple generation tasks.
📋 Full Model Catalog — April 2026

Ranked by blended cost — highest first · All prices USD/1M tokens · Updated April 19, 2026

Input Price
Cost per 1M tokens you send to the model — your prompt, system instruction, conversation history, and any context you attach. Longer prompts or large document uploads drive this cost up.
Output Price
Cost per 1M tokens the model generates — every word in its response. Verbose answers, long code completions, and streaming replies all accrue output cost. Output is typically 3–5× more expensive than input.
$/1M tokens
Industry-standard unit. To estimate a task: (prompt tokens ÷ 1,000,000) × input price + (response tokens ÷ 1,000,000) × output price. A typical 500-word prompt (≈700 tokens) + 500-word reply costs under $0.01 on most models.
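The estimation formula above, in code. The $2.50/$10.00 prices are the GPT-4o-class figures used elsewhere on this page; the function name is illustrative.

```python
def task_cost(prompt_tokens, response_tokens, input_per_m, output_per_m):
    """Estimated cost of one task from $/1M-token prices."""
    return (prompt_tokens / 1_000_000) * input_per_m + \
           (response_tokens / 1_000_000) * output_per_m

# ~700-token prompt + ~700-token reply at $2.50 in / $10.00 out
cost = task_cost(700, 700, 2.50, 10.00)  # well under one cent
```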
# · Vendor · Model · Blended $/1M · Input $/1M · Output $/1M · Context

Prices from official vendor pages via multi-source consensus · Ranked by blended cost · Updated April 19, 2026

How much could you save with Tidus?

Adjust your current spend and task complexity mix. Savings scale with the proportion of requests that can be routed to lower-cost tiers.

0% = all complex / critical tasks · 100% = all simple / moderate tasks
$5.0K
Current monthly spend
$3.1K
Estimated monthly savings
62% savings
$1.9K
Projected spend with Tidus

Estimates assume unoptimised Tier 1/2 routing today. Actual savings depend on your task mix and model availability. Tidus is free and open-source — no subscription fee, no usage cap, no per-seat pricing. Keep 100% of what you save.


From Request to Optimal Model — The Tidus Workflow

Tidus applies a deterministic, five-stage algorithm to every AI request. Each stage eliminates models that fail a hard rule; the surviving candidates are ranked by a weighted score. The model with the lowest score wins. This entire process completes in under one millisecond on the server.

Author: Kenny Wong (lapkei01@gmail.com)  ·  Published: 2026-04-15  ·  Latest revision: 2026-04-20

Stage 1–5: Request-to-Model Pipeline

Every AI request enters this pipeline. Stages 1–4 are binary filters — each model either passes every check or is eliminated immediately. Stage 5 ranks the survivors by a weighted score and selects the single best model. The entire pipeline runs in under one millisecond on the server.

1
🔍
Hard Constraints
4 checks: enabled · context · capability · complexity range
2
🛡️
Guardrails
agent depth · token-per-step limit · privacy
3
📊
Complexity Ceiling
critical → Tier 1 only · simple → all tiers
4
💰
Budget Filter
per-request cap + team monthly budget
5
🏆
Score & Select
0.70 × cost + 0.20 × tier + 0.10 × latency
1
Hard Constraints Binary Filter

Every model in the registry is checked against four hard rules. Failing any single rule is enough to eliminate the model — there is no partial credit, no weighting, and no override. These checks happen in a single pass over all 55 registered models.

Check 1 — Is the model active?
The registry marks each model as enabled: true or enabled: false. Models can be disabled manually by an administrator, or automatically by the drift detector when repeated health probes fail. A disabled model is immediately eliminated, regardless of capability or cost.
spec.enabled == true
Check 2 — Does the context window fit?
Every model has a maximum context window (the total number of tokens it can process in one call). If the task's estimated input token count exceeds that window, the model physically cannot process the request. It is eliminated. Example: a task with 200,000 input tokens cannot use a model with a 128,000-token context window.
estimated_input_tokens ≤ model.max_context
Check 3 — Does the model support the task domain?
Each task carries a domain label: chat, code, reasoning, extraction, classification, summarization, or creative. Each model in the registry lists its supported capabilities. If the task's domain is not in the model's capability set, the model is eliminated. A chat-only model cannot be routed a code generation task, for example.
task.domain ∈ model.capabilities
Check 4 — Is this task within the model's designed complexity range?
Each model in the registry declares a min_complexity and max_complexity (e.g., moderate to complex). If the task's complexity falls outside this declared range, the model is eliminated. This prevents sending a trivially simple task to a model built for deep reasoning work (wrong tool), and prevents sending a critical decision task to a model not designed to handle it.
model.min_complexity ≤ task.complexity ≤ model.max_complexity
If a model fails any of these four checks, it is removed from the candidate pool. Only models that pass all four proceed to Stage 2.
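The four checks collapse into a single predicate. Field names and the numeric complexity ranks (simple=1 … critical=4) are illustrative assumptions.

```python
# Illustrative numeric ranks for the four complexity levels
COMPLEXITY = {"simple": 1, "moderate": 2, "complex": 3, "critical": 4}

def passes_hard_constraints(model, task):
    """Stage 1: a model survives only if all four checks pass."""
    return (
        model["enabled"]                                        # check 1: active
        and task["est_input_tokens"] <= model["max_context"]    # check 2: fits
        and task["domain"] in model["capabilities"]             # check 3: domain
        and model["min_complexity"]                             # check 4: range
            <= COMPLEXITY[task["complexity"]]
            <= model["max_complexity"]
    )
```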
2
Safety Guardrails Binary Filter

Guardrails enforce system-level safety policies that apply to every team and every task — they cannot be overridden by individual callers. Two types of guardrails apply at this stage: usage limits and privacy enforcement.

Guardrail 1 — Privacy enforcement
Tasks carry a privacy label: public, internal, or confidential. When a task is marked confidential, Tidus enforces a hard rule: only models running on your own infrastructure (is_local: true in the registry) are allowed. All cloud-hosted models — regardless of vendor, price, or capability — are eliminated. This is not a preference; it is an absolute constraint. It ensures that confidential data such as patient records, legal documents, or financial reports is never sent to an external API provider.
if task.privacy == confidential → model.is_local must be true
Guardrail 2 — Agentic depth and token limits
For agentic workflows (where the AI calls tools, loops, and makes multiple decisions), the system enforces a maximum recursion depth (max_agent_depth) and a maximum tokens-per-step limit (max_tokens_per_step). These limits prevent runaway agents from incurring unbounded costs or entering infinite loops. If the task's agent depth or token count exceeds the policy limit, the model is eliminated at this stage.
task.agent_depth ≤ policy.max_agent_depth  ·  task.tokens ≤ policy.max_tokens_per_step
Models that survive Stages 1 and 2 are technically capable, active, appropriately scoped, and safe to use for this specific task. The remaining stages apply economic filters.
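Both guardrails reduce to a short check. Field names are assumptions; the confidential-to-local rule and the depth/token limits come from the text.

```python
def passes_guardrails(model, task, policy):
    """Stage 2: privacy routing plus agentic usage limits."""
    if task["privacy"] == "confidential" and not model["is_local"]:
        return False  # confidential data never leaves your infrastructure
    return (task["agent_depth"] <= policy["max_agent_depth"]
            and task["tokens"] <= policy["max_tokens_per_step"])
```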
3
Complexity Tier Ceiling Binary Filter

Models in Tidus's registry are classified into four quality tiers. Tier 1 is premium frontier AI (most capable, most expensive). Tier 4 is local or free models (least capable, zero cost). This stage sets a minimum quality floor based on how complex the task is — ensuring that genuinely complex or critical tasks are always handled by appropriately capable models, and cannot be silently downgraded to cheap models.

Task Complexity · Tier Ceiling · Models Allowed · What Gets Eliminated
Simple · Tier 4 (all tiers) · Tier 1, 2, 3, 4 (any model) · Nothing eliminated at this stage
Moderate · Tier 3 · Tier 1, 2, 3 · Tier 4 local/free models eliminated
Complex · Tier 2 · Tier 1, 2 only · Tier 3 economy + Tier 4 local eliminated
Critical · Tier 1 only · Tier 1 (premium frontier) only · All Tier 2, 3, 4 models eliminated
Important distinction: The tier ceiling works in one direction — it enforces a minimum quality standard for the task, not a spending cap. Simple tasks may use any tier (including expensive premium models) if the caller requests them. Critical tasks must use Tier 1 — no matter how much a cheaper model costs. This protects against silent quality downgrade on high-stakes decisions.
After Stage 3, every remaining model is confirmed to be capable enough for the task's complexity level. Stage 4 now applies spending constraints.
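The ceiling table reduces to a four-entry map; a model passes when its tier number does not exceed the ceiling for the task's complexity (Tier 1 = premium, Tier 4 = local/free). Names are illustrative.

```python
# Ceiling = highest-numbered (cheapest) tier allowed for each complexity level
TIER_CEILING = {"simple": 4, "moderate": 3, "complex": 2, "critical": 1}

def passes_tier_ceiling(model_tier, complexity):
    """Stage 3: enforce the minimum quality standard for the task."""
    return model_tier <= TIER_CEILING[complexity]
```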
4
Budget Filter Binary Filter

For each model that survived Stages 1–3, Tidus computes the estimated cost of processing this specific task with that model. The estimate uses the actual token counts and current market prices from the pricing registry. Two separate budget checks then apply.

Budget check 1 — Per-request cost cap
The caller can optionally attach a max_cost_usd to the task. This is a hard ceiling on what any single API call is allowed to cost. If the estimated cost for a model exceeds this cap, that model is eliminated. This allows callers to guarantee that no single request exceeds a set dollar amount — useful for customer-facing features where per-query economics matter.
estimated_cost = ((input_tokens × input_price) + (output_tokens × output_price)) × 1.15 buffer
Budget check 2 — Team monthly budget
Each team has a configurable monthly AI spending budget (set by an administrator). The budget enforcer tracks cumulative spend in real time. Before any model is used, Tidus checks whether the team still has budget headroom for the estimated cost. If spending this amount would exceed the team's remaining monthly budget, the model is eliminated. This prevents any single team from consuming more than its allocated share of AI spend — even if individual requests appear cheap.
team.cumulative_spend + estimated_cost ≤ team.monthly_budget
After Stage 4, every remaining model is both technically suitable and economically feasible for this request. Stage 5 picks the single best one.
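Both budget checks in code, including the 15% safety buffer described earlier. The $0.40/$1.60 per-1M prices are illustrative assumptions for a gpt-4.1-mini-class model; with 2,000 input + 500 output tokens they happen to reproduce the worked example's $0.00184 estimate.

```python
def estimate_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m,
                  buffer=1.15):
    """Stage 4 cost estimate with the 15% safety buffer."""
    raw = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000
    return raw * buffer

def within_budget(est_cost, max_cost_usd, team_spend, team_budget):
    """Both budget checks: per-request cap, then team monthly headroom."""
    if max_cost_usd is not None and est_cost > max_cost_usd:
        return False  # check 1: per-request cap exceeded
    return team_spend + est_cost <= team_budget  # check 2: monthly headroom
```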
5
Score & Select Weighted Ranking

All models that survived the four filter stages are now ranked by a deterministic weighted score. Each model gets a number between 0 and 1 on three dimensions; those numbers are weighted and summed. The model with the lowest total score wins. Lower = better.

score = (cost_norm × 0.70) + (tier_norm × 0.20) + (latency_norm × 0.10)
Each dimension is independently normalised to [0, 1] across the candidate pool, so the weights have consistent meaning regardless of the actual prices or latencies involved.
Dimension 1 — Cost (weight: 70%)
Estimated cost for this specific request (Stage 4 already computed it). Normalised: the cheapest model in the surviving pool scores 0.0; the most expensive scores 1.0. Cost carries the largest weight because, for the majority of tasks, a less expensive model is equally sufficient — and the goal of Tidus is to capture that cost difference systematically.
cost_norm = (cost − min_cost) / (max_cost − min_cost)
Dimension 2 — Quality Tier (weight: 20%)
The model's registered quality tier (1 = premium, 4 = local). Normalised: Tier 1 scores 0.0 (best quality), Tier 4 scores 1.0. This dimension ensures Tidus doesn't blindly route everything to the absolute cheapest model — it applies a modest quality preference that keeps higher-tier models competitive when price differences are small.
tier_norm = (tier − 1) / 3  →  Tier 1 = 0.0, Tier 4 = 1.0
Dimension 3 — Response Speed (weight: 10%)
The model's measured median latency (P50 milliseconds) from live health probes. Normalised across the pool. The fastest model scores 0.0. Latency is the least-weighted dimension because most tasks are not latency-sensitive — but it breaks ties between otherwise equal candidates and ensures consistently slow models don't win when a faster alternative costs the same.
lat_norm = (latency − min_lat) / (max_lat − min_lat)
Deprecation penalty (+0.15)
If a model is marked as deprecated in the registry (still routable, but being phased out), a flat penalty of 0.15 is added to its score after normalisation. This means a deprecated model only wins if it is substantially cheaper or faster than all non-deprecated alternatives — preventing gradual quality drift while still honouring the deprecation grace period rather than hard-removing models immediately.
if model.deprecated: score += 0.15
Preferred model shortcut: If the caller attaches a preferred_model_id to the task and that model survived all four filter stages, Tidus selects it directly — skipping the scoring step entirely. This respects explicit caller intent (e.g., "always use GPT-4.1 for this workflow") while still enforcing all hard safety and budget constraints. A preference that would violate budget or privacy rules is overridden by the filter stages regardless.
The model with the lowest score is selected. A RoutingDecision record is written to the audit log, capturing which model was chosen, its score, its estimated cost, and the full list of models that were rejected and why.

The Three Scoring Pillars

After hard filters, all surviving models are scored across three normalised dimensions. Each is expressed as a 0–1 value where 0 is best. The weighted sum determines rank.

70%
Cost Efficiency
Blended price = (input price + output price) ÷ 2, per million tokens. Normalised across all candidates. The cheapest model in the pool scores 0; the most expensive scores 1. Cost dominates because most tasks don't require the most capable model.
cost_norm = (price − min_price) / (max_price − min_price)
20%
Model Quality Tier
Models are classified Tier 1 (premium, frontier) through Tier 4 (local, free). A lower-tier model scores better. This ensures Tidus prefers capable-but-affordable models over the absolute cheapest when quality matters.
tier_norm = (tier − 1) / 3  →  Tier 1 = 0, Tier 4 = 1
10%
Response Speed
Measured median response latency (P50 milliseconds) from live health probes. Normalised across candidates. The fastest model in the pool scores 0. Latency matters least for most tasks — but breaks ties between otherwise equal options.
lat_norm = (latency − min_lat) / (max_lat − min_lat)

Department & Complexity Routing Matrix

Different departments have different cost and capability profiles. Tidus uses task complexity to set a hard tier ceiling and the department domain to enforce capability requirements. Together, these two signals determine which models are even considered.

Task Complexity · Max Tier Allowed · Typical Department · Example Request
Simple · All tiers eligible, Tier 4 (local) up to Tier 1 (premium) · Customer Support, HR, Reception · "Summarise this support ticket in one sentence."
Moderate · Tier 3 max (economy cloud) · Marketing, Sales, Operations · "Draft a personalised follow-up email for this prospect."
Complex · Tier 2 max (mid-range cloud) · Engineering, Finance, Legal (review) · "Extract all clause obligations from this 40-page contract."
Critical · Tier 1 only (premium frontier) · Medical, Executive Decision Support, Compliance · "Assess drug interaction risks for this patient's prescriptions."

Pricing Intelligence: How Tidus Knows What Each Model Costs

Tidus cannot route cost-efficiently if it uses stale or incorrect prices. It maintains a continuously updated, multi-source pricing registry with statistical outlier detection to ensure the prices it uses for routing are always accurate.

📋 Hardcoded Source
A curated internal price list maintained by the Tidus team. Updated with each software release. Confidence: 0.70. Always available — never fails.
+
🌐 Live Pricing Feed
Optional external endpoint (operator-configured). Returns current vendor prices as JSON. Confidence: 0.85. Has circuit breaker — automatically disabled after 5 consecutive failures.
🧮 MAD Consensus Engine
Both sources are compared using Modified Z-Score (Median Absolute Deviation) outlier detection. Quotes whose modified Z-score exceeds 3.5 (roughly 3.5 median absolute deviations from the median) are rejected. The higher-confidence non-outlier source wins. Result: a single verified price per model.
Why this matters for routing: If a vendor drops their price by 40% overnight (as DeepSeek did in early 2026), Tidus detects this on the next sync cycle (weekly by default, or on-demand) and creates a new versioned revision. All routing decisions from that point forward use the updated price — automatically, with a full audit trail of when the price changed and by how much.
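The MAD-based rejection step can be sketched as follows. This is a simplified consensus over a list of quotes; the production engine additionally weighs per-source confidence (0.70 hardcoded vs 0.85 live feed), which is omitted here.

```python
import statistics

def reject_outliers(quotes, threshold=3.5):
    """Drop quotes whose modified Z-score exceeds `threshold`.

    Simplified MAD consensus sketch, not the production engine.
    """
    med = statistics.median(quotes)
    mad = statistics.median(abs(q - med) for q in quotes)
    if mad == 0:  # all quotes (near-)identical: nothing to reject
        return list(quotes)
    # 0.6745 scales MAD to be comparable with a standard deviation.
    return [q for q in quotes if abs(0.6745 * (q - med) / mad) <= threshold]

# A 4x-inflated quote is rejected; the consensus quotes survive.
verified = reject_outliers([2.50, 2.55, 2.45, 9.99])
```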

Real-World Examples: Full Workflow Walkthroughs

Three scenarios — each triggers different branches of the five-stage pipeline. Follow each request from arrival to model selection.

Example 1: Customer Support Department
Complexity: Simple  ·  Domain: chat  ·  Privacy: public  ·  Est. tokens: 200 in / 100 out  ·  Budget: $0.01/request
"Summarise this customer complaint in one sentence and suggest a resolution category."
Stage 1
Capability Match
Task needs: chat. 55 models checked. All chat-capable models pass. Result: 52 models survive (3 multimodal-only eliminated).
Stage 2
Privacy Guardrail
Privacy = public. No restriction. All cloud and local models remain eligible. 52 models survive.
Stage 3
Complexity Ceiling
Complexity = simple. All tiers allowed (Tier 1 through 4). No models eliminated by this rule. 52 models survive. (In practice, a team budget policy may cap at Tier 3 for support tasks.)
Stage 4
Budget Filter
Budget = $0.01/request. Estimated cost for Tier 1 models at 200 input + 100 output tokens exceeds $0.01. Premium models (o3, claude-opus-4-6, grok-3-fast) eliminated. Economy and local models survive. ~24 models survive.
Stage 5
Score & Select
Top survivors scored:

Model | Tier | Blended $/1M | P50 ms | Score
gpt-4.1-mini | 3 | $1.00 | 320 ms | 0.12 ✓ WINNER
gemini-2.5-flash | 2 | $1.40 | 280 ms | 0.19
claude-haiku-4-5 | 3 | $2.40 | 290 ms | 0.28
Selected Model
gpt-4.1-mini
88% cheaper than using claude-opus-4-6 for the same task  ·  Full capability match  ·  Avg response: 320ms
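The Stage 4 budget check above is plain token arithmetic. A sketch, with illustrative per-million prices for a premium and an economy model (the $15/$75 and $0.10/$0.40 figures are assumptions, not quoted registry values):

```python
def estimated_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Estimated request cost in dollars from $/1M-token prices."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000

# Example 1's 200 input + 100 output tokens against a $0.01/request budget:
premium = estimated_cost(200, 100, 15.00, 75.00)  # over budget -> eliminated
economy = estimated_cost(200, 100, 0.10, 0.40)    # well under budget -> survives
```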
Example 2: Legal Department — Contract Review
Complexity: Complex  ·  Domain: extraction  ·  Privacy: confidential  ·  Est. tokens: 8,000 in / 2,000 out  ·  Budget: $0.50/request
"Extract all indemnification clauses and payment obligations from this NDA. Flag any clauses that deviate from our standard template."
Stage 1
Capability Match
Task needs: extraction. Models without extraction capability eliminated. ~28 models survive.
Stage 2
Privacy Guardrail — CRITICAL FILTER
Privacy = confidential. This is a contract with trade secrets. Tidus enforces: only local/on-prem models allowed. Every cloud-hosted survivor from Stage 1 (OpenAI, Anthropic, Google, etc.) is eliminated immediately. Only your self-hosted Ollama models remain. 3–5 local models survive.
Stage 3
Complexity Ceiling
Complexity = complex → Tier 2 maximum. On its own this ceiling would exclude Tier 4 local models, but the Stage 2 privacy guardrail takes precedence: a confidential request may only be routed locally, so the surviving local pool remains eligible. 3–5 models survive.
Stage 4
Budget Filter
Local models have $0 cost (on-prem compute). Estimated cost = $0.00. Budget = $0.50. All local models pass. 3–5 models survive.
Stage 5
Score & Select
All local models cost $0, so cost_norm = 0 for all. Tier 4 = tier_norm = 1.0 for all. Latency becomes the tiebreaker (10% weight). The fastest local model wins.
Model | Cost | P50 ms | Score
ollama/llama3.3-70b | $0 | 1,200 ms | 0.20 ✓ WINNER
ollama/mistral-7b | $0 | 2,100 ms | 0.30
Selected Model
ollama/llama3.3-70b (local)
Contract data never leaves your server  ·  GDPR / SOC 2 compliant  ·  Cost: $0.00  ·  Privacy enforced automatically — no developer action required
Example 3: Medical / Executive — Critical Reasoning
Complexity: Critical  ·  Domain: reasoning  ·  Privacy: internal  ·  Est. tokens: 3,000 in / 1,500 out  ·  Budget: $2.00/request
"Given this patient's medication list and lab values, identify any clinically significant drug interactions and rank by severity. Provide evidence-based reasoning for each flag."
Stage 1
Capability Match
Task needs: reasoning. Only models with advanced reasoning capability pass. Many economy-tier models without reasoning tags eliminated. ~12 models survive.
Stage 2
Privacy Guardrail
Privacy = internal (not confidential). Cloud models are allowed. All 12 reasoning-capable models remain. 12 models survive.
Stage 3
Complexity Ceiling — STRICT FILTER
Complexity = critical → Tier 1 only. All Tier 2, 3, and 4 models are eliminated regardless of capability. For critical decisions, only frontier models are allowed. 5–6 Tier 1 models survive (o3, claude-opus-4-6, grok-3-fast, gpt-5-codex, gemini-3.1-pro, groq-deepseek-r1).
Stage 4
Budget Filter
Budget = $2.00. Estimated cost at 3,000 + 1,500 tokens for Tier 1 models ranges from ~$0.08 to ~$0.23 — all within budget. All 5–6 Tier 1 models survive.
Stage 5
Score & Select
Model | Blended $/1M | Tier | P50 ms | Score
groq-deepseek-r1 | $2.00 | 1 | 800 ms | 0.14
o3 | $25.00 | 1 | 4,500 ms | 0.48 ✓ WINNER*
claude-opus-4-6 | $45.00 | 1 | 3,200 ms | 0.62
*Lower scores are better, so groq-deepseek-r1 would win on the formula alone. Its medical-reasoning capability is less proven, however, so Stage 1 capability matching filters it out when the catalog marks it accordingly. Among the fully capable Tier 1 survivors, o3's cost-latency balance beats claude-opus-4-6.
Selected Model
o3 (OpenAI)
Highest-rated reasoning model  ·  Frontier tier enforced by critical complexity  ·  No economy models considered — patient safety non-negotiable  ·  Full audit trail of selection logged

Plain-English Summary for the Record

Tidus is an automated AI model routing system. When an application sends an AI request, Tidus receives metadata about that request — its complexity, the type of task, privacy sensitivity, and cost budget. Tidus then applies a five-stage deterministic algorithm to select the optimal AI model from its registry of 53+ tracked models.

The first two stages are safety filters: Stage 1 ensures the selected model is technically capable of performing the task; Stage 2 enforces data-privacy constraints by preventing confidential data from being sent to external cloud providers. Stages 3 and 4 are economic filters: Stage 3 prevents over-provisioning by matching task complexity to model capability tier; Stage 4 enforces spending limits. Stage 5 applies a weighted scoring formula — 70% cost, 20% quality tier, 10% response speed — to rank surviving candidates and select the best one.

Separately, Tidus maintains an always-current pricing registry. It ingests prices from multiple independent sources, applies statistical outlier detection (Modified Z-Score / Median Absolute Deviation) to reject anomalous data, and stores every price change as a versioned, audited revision. This ensures routing decisions are always based on current, verified market prices — not stale hardcoded values.

The combination of these two systems — the five-stage routing algorithm and the self-healing pricing registry — constitutes the core patentable invention of the Tidus platform.


The Tidus Multi-Axis Request Classification Workflow

How Tidus converts a raw user prompt into a structured three-axis classification — domain (task type), complexity (cognitive load), and privacy (content sensitivity) — using a five-tier pipeline of local detectors and a language-model fallback, without transmitting the prompt outside the deployment boundary. Includes empirical validation via cross-family inter-rater reliability and an honest accuracy baseline of 89.2% confidential recall at ship. A telemetry-driven self-improvement design targets 95–97% over time; the rate at which that target is reached depends on enterprise-traffic accumulation, so a parallel research programme (uncertainty-sampled re-labeling, corpus diversification, rubric refinement, encoder ensembling) is run concurrently to advance the baseline ahead of, and independently from, customer adoption.

Author: Kenny Wong
Published: 2026-04-20
Version: 1.0 (v1.3.0 auto-classification layer)
Correspondence: lapkei01@gmail.com
Reproducibility: all numbers, studies, and figures below are reproducible from scripts and data in the kensterinvest/tidus repository (see scripts/, tests/classification/, and findings.md). This document is intended both as an enterprise-evaluation technical specification and as prior-art disclosure in support of patent filing.

What the classifier outputs — worked examples

Every incoming prompt receives one label per axis. The routing stage downstream uses all three: domain narrows the candidate-model set, complexity sets the tier ceiling, and privacy enforces local-only routing when confidential. The examples below are drawn from the labeled corpus and show both the classifier output and which tier resolved it.

Example prompt (abbreviated) | Domain | Complexity | Privacy | Resolved at
"do you know the game arknights" | chat | simple | public | T2a encoder
"write a React component that fetches data with useEffect and handles errors" | code | moderate | internal | T2a encoder
"debug: bot.send_message(chat_id, '5828712341:AAG5HJa37u32SHLytWm5poFr…')" | code | moderate | confidential | T1 regex (Telegram-token pattern)
"I have depression and heightened anxiety, please give me scientific suggestions" | chat | critical | confidential | T5 LLM (topic-based — no entity)
"review my letter of explanation for a Canadian open work permit to accompany my wife" | summarization | critical | confidential | T5 LLM (immigration topic)
"Kalman filter for YOLO ball tracking, code attached: /Users/surabhi/Documents/kalman/best.pt" | code | complex | confidential | T5 LLM (filesystem user-id leak)
"contact me at jennifer.miller@acme.com re: Q3 pricing" | chat | simple | confidential | T2b Presidio (PERSON + EMAIL)
"Vue timeline with 张三 as template user and 13845257654 as placeholder phone" | code | moderate | public | T2a encoder (recognizes placeholders)

Observation: the three axes operate independently. A "code / moderate / confidential" prompt and a "chat / simple / confidential" prompt route to entirely different model sets despite sharing the privacy flag. Conversely, two prompts both labeled confidential may trigger for completely different reasons (entity leak vs. topic sensitivity vs. credential pattern) — which is why a single-signal classifier cannot produce the full three-axis output alone, and why the cascade has multiple tiers.

1. Abstract

Plain English: every request is read locally and tagged for task type, difficulty, and sensitivity before routing — with confidential prompts never leaving your deployment.

Tidus classifies every incoming AI request across three dimensions — domain (task type), complexity (cognitive load required for correctness), and privacy (content sensitivity) — before the request reaches any underlying language model. Classification is performed by a five-tier cascade of local detectors, each tier cheaper and faster than the next. Classification output drives downstream routing within the Tidus five-stage model-selection algorithm disclosed elsewhere in this document. The novel aspects of the classification layer disclosed herein include: (i) an asymmetric-safety OR-rule whereby any tier's confidential classification unilaterally forces local-only routing regardless of other tiers' outputs; (ii) a cross-family inter-rater reliability methodology for validating classification ground truth using independent large language models from distinct vendor families (Anthropic, OpenAI, Google); (iii) a disagreement-capture active learning loop that accumulates retraining signal from production traffic while persisting only feature metadata, never raw prompt content; and (iv) an entity/topic bifurcation analysis empirically justifying architectural separation between cheap entity detectors and language-model topic review.

2. Field of Application

The disclosed classification workflow is intended for use within enterprise AI gateway software that routes natural-language prompts to one of a plurality of candidate language models. Non-exhaustive deployment contexts include: regulated industry verticals (healthcare, finance, legal, defense) subject to data-residency requirements such as HIPAA, GDPR, SOC 2, and equivalent regional standards; organizations with heterogeneous model portfolios spanning both cloud-hosted and on-premises language models; and any system requiring per-request determination of whether prompt content permits transmission to external services.

3. Technical Problem and Prior Art Gap

Existing prompt-classification systems fall broadly into two classes, each with material limitations:

Class A — single-stage language-model classifiers (e.g., Llama Guard, prompt-classification services). These systems achieve high accuracy by invoking a language model on every request. They are unsuitable for privacy-sensitive routing because the act of classifying a confidential prompt requires transmitting that prompt to the classifier, typically outside the deployment boundary. This establishes a privacy paradox: the mechanism intended to determine whether content may leave the system is itself a mechanism that causes content to leave the system.

Class B — static pattern-matching detectors (e.g., Presidio, regex-based secret scanners, DLP systems). These systems are local and fast but detect only explicit identifiers (names, credit card numbers, email addresses, named entities). They systematically miss topic-based confidential content — prompts where sensitivity arises from subject matter (self-disclosed medical condition, employment-law dispute, immigration status, financial hardship) rather than from the presence of a recognizable identifier. Empirical analysis reported in §7 demonstrates that approximately half of enterprise confidential prompts fall into this topic-based class.

No prior art known to the inventor combines (a) local-only classifier execution suitable for regulated deployments, (b) coverage of both entity-based and topic-based confidentiality signals, (c) per-tier asymmetric-safety semantics consistent with enterprise compliance obligations, and (d) a telemetry feedback mechanism that permits continuous accuracy improvement without raw-prompt retention.

Prior art comparison — at a glance
Dimension | Class A — cloud LLM classifier | Class B — regex / NER only | Tidus — tiered asymmetric
Runs inside deployment boundary? | ❌ Usually cloud-hosted | ✅ Local | ✅ All five tiers local
Catches entity confidentials? | ✅ (at cost) | ✅ | ✅ Tier 2b
Catches topic confidentials? | ✅ | ❌ ~50% missed (§7.3) | ✅ via Tier 5 LLM
Per-request latency | 100–300 ms + network | < 5 ms | 5 ms fast path · 200 ms fallback
Privacy paradox? | ⚠️ Yes — classifier itself leaks | ✅ None | ✅ None
Self-improves from traffic? | ❌ | ❌ | ✅ Disagreement-capture (§9)

No prior art combines all six rows. The Tidus column is what §4–§11 of this document disclose in detail.

4. System Architecture — Five-Tier Classification Cascade

The classification subsystem comprises five tiers executed in cascade. Each tier operates on the raw prompt text and emits a partial classification across the three axes. Tiers are ordered by ascending cost and descending throughput; the cascade short-circuits when a tier produces a high-confidence classification.

Cascade flow — every prompt enters at T0
Incoming prompt
T0
Caller override
Explicit axes in API request → skip all tiers. < 1 µs.
~5% of traffic
T1
Heuristic fast-path (regex + keywords)
SSN / credit-card+Luhn / AWS / GitHub / Telegram / Discord / generic high-entropy secrets · medical + legal + financial keyword hits · code fences and shebangs. 5–10 ms.
~30-40% short-circuit
T2a ∥ T2b
Trained encoder (semantic) + Presidio NER (entity) — executed in parallel
T2a: sentence-transformer + 3 logistic-regression heads → domain / complexity / privacy probabilities. T2b: spaCy NER for PERSON, EMAIL, PHONE, IBAN, SSN, etc. Max(T2a, T2b) latency ≈ 50 ms.
~55-60% resolved here
T5
Language-model fallback (local for strict; cloud allowed for disabled)
Invoked only when T1–T2b disagree or report low confidence. Catches topic-based sensitivity that entity detectors cannot see. 200–2,000 ms.
~5-10% escalated
Three-axis label emitted → Stage 1 of 5-stage router

Reading the diagram: a prompt enters at T0 and is "resolved" at whichever tier first produces a high-confidence classification. T0 handles the rare back-compat case where the caller already passes the axes. T1 short-circuits roughly a third of traffic on explicit signals. T2a+T2b run in parallel (not in series) and resolve the majority of remaining traffic. T5 is the escape valve for ambiguous cases. Expected tier-resolution distribution in production is shown on the right of each row.

Tier | Mechanism | Latency (p95) | Purpose
T0 | Caller override — explicit fields in the request API | < 1 µs | Back-compat for callers who already know the classification
T1 | Regular-expression and keyword heuristics (Aho–Corasick on MeSH-seeded medical, legal, PCI DSS, and homebrew financial lexica; structural signals including code fences and shebangs; POC secret patterns for SSN, credit card with Luhn validation, AWS access keys, GitHub tokens, generic high-entropy secrets) | 5–10 ms | High-confidence short-circuit for ~30–40% of traffic; first line of privacy defense
T2a | Trained encoder — frozen sentence-transformer backbone (all-MiniLM-L6-v2) with a per-axis scikit-learn logistic-regression head trained on a labeled corpus of 2,669 WildChat prompts (see §6) | 3–15 ms (CPU, ONNX int8) | Semantic classification for prompts without explicit identifiers
T2b | Presidio-based named-entity recognizer using en_core_web_sm, with a high-trust recognizer allowlist (PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, IBAN, CRYPTO, MEDICAL_LICENSE, URL, IP_ADDRESS) | 20–60 ms (runs in parallel with T2a) | Entity-based confidentiality detection — Rule E1 or E2 below
T5 | Language-model fallback, invoked only when T1–T2b disagree or report low confidence; implemented as a local language model for privacy_enforcement=strict deployments, or as a cloud language model for privacy_enforcement=disabled deployments (see §5) | 200–2,000 ms | Topic-based confidentiality detection — catches content that Tier 2b structurally cannot see
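The short-circuiting dispatch through the tiers can be sketched as follows. This is a hypothetical simplification: T2a and T2b conceptually run in parallel (sequential here for brevity), and the tier functions, names, and the 0.9 confidence threshold are illustrative rather than the production API.

```python
# Hypothetical sketch of the cascade dispatch: tiers run in cost order and
# the first high-confidence result short-circuits the rest.
def classify(prompt, tiers, threshold=0.9):
    last = None
    for name, tier_fn in tiers:
        label, confidence = tier_fn(prompt)
        last = (label, name)
        if confidence >= threshold:
            return label, name           # resolved at this tier
    return last                          # the final tier (T5) always answers

# Demo stub tiers (illustrative only):
def t1_regex(prompt):
    # Stand-in for the real pattern library: fires only on an explicit SSN.
    return ("confidential", 1.0) if "123-45-6789" in prompt else ("public", 0.3)

def t5_llm(prompt):
    return ("internal", 1.0)             # stand-in for the LLM fallback

label, resolved_at = classify("my SSN is 123-45-6789", [("T1", t1_regex), ("T5", t5_llm)])
```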

Detection rules at Tier 2b (configurable per deployment):

Rule E1 (default): flag confidential whenever Presidio detects a PERSON entity (highest recall in the §7.2 sweep).
Rule E2: flag confidential only when a PERSON entity co-occurs with a non-public signal from the Tier 2a encoder (lower flag rate, lower recall).

5. Asymmetric-Safety OR-Rule and Privacy Enforcement Modes

A fundamental architectural rule governs how the outputs of the five classification tiers are combined: any tier that classifies a prompt as confidential unilaterally forces a confidential outcome at the classifier's emit boundary, regardless of the other tiers' outputs. No voting, no majority aggregation, no confidence-weighted blending. This asymmetric semantics is expressed as follows:

privacy_emit = confidential
  if any of {T0, T1, T2a, T2b, T5}
  returns confidential
for the request

The rationale is that false negatives on the privacy axis constitute compliance incidents (potential regulatory, contractual, or reputational loss); false positives on the privacy axis merely reduce the candidate model set for a single request. The two error types are not symmetric in cost, and the combining rule reflects that asymmetry.

Worked example — the OR-rule in action

Prompt: "Help me fix this Python script that reads employee data. Here's the CSV: name,ssn,salary\nJohn Smith,123-45-6789,85000…"

Tier | Signal | Emit
T1 regex | SSN pattern \d{3}-\d{2}-\d{4} matches "123-45-6789" | confidential
T2a encoder | Semantic vector → probably "code / moderate / internal" | internal
T2b Presidio | Detects PERSON ("John Smith"), US_SSN, and numeric context | confidential

Emit: confidential. Even though T2a said "internal" (correctly identifying the code task), T1's regex hit and T2b's SSN detection each independently trigger the OR-rule. A confidence-weighted blend, or a classifier trusting the semantic tier alone, could have emitted "internal" and leaked the prompt to external models. The OR-rule guarantees that any confidential signal wins.
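A minimal sketch of the combining rule follows. The OR-rule for confidential is as disclosed; the internal-over-public tie-break for non-confidential emits is an assumption consistent with the ordinal privacy axis, not a stated rule.

```python
# Asymmetric-safety combine: one confidential vote wins outright. No voting,
# no averaging. The internal/public fallback ordering is an assumption.
def combine_privacy(tier_outputs):
    labels = set(tier_outputs.values())
    if "confidential" in labels:
        return "confidential"            # OR-rule: any single vote suffices
    if "internal" in labels:
        return "internal"
    return "public"

# The worked example above: two confidential votes, one internal.
emit = combine_privacy({"T1": "confidential", "T2a": "internal", "T2b": "confidential"})
```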

Per-tenant privacy enforcement modes. The effect of the confidential emit on downstream routing is configurable per tenant via a two-valued enumeration, privacy_enforcement: strict, under which a confidential emit forces local-only routing of the underlying request; and disabled, under which classification still runs and is logged but does not restrict routing.

The configuration space is deliberately restricted to two values. A middle "relaxed" mode was considered and rejected on the grounds that its semantics would admit multiple interpretations, creating compliance ambiguity during audit. Vendor-allowlist restrictions ("route confidential only to approved external vendors") are treated as a separate configuration surface, not a privacy-enforcement mode.

Distinction between classifier location and routing enforcement. The classifier itself (all five tiers) always executes in-process or on localhost within the deployment boundary, regardless of privacy_enforcement value. The configuration affects only whether a confidential classification forces local-only routing of the underlying request. The two concepts — classifier location and routing enforcement — are architecturally independent.

6. Labeled Corpus

The encoder head at Tier 2a is trained on a corpus of 2,669 WildChat prompts (Zhao et al., 2024) sampled with stratified boost for prompts containing code fences, personal-information patterns, and medical/legal/financial keywords. Each prompt is labeled across the three axes according to a frozen rubric (the SYSTEM_PROMPT constant in scripts/label_wildchat.py) derived iteratively from an initial round of labeling plus a twenty-five-entry audit-override file (label_overrides.jsonl) resolving labeler-drift incidents. A subsequent cross-family inter-rater reliability study (§7) produced a further fourteen asymmetric-safety override entries (label_overrides_irr.jsonl). The combined post-adjudication confidential count is 83 within the 2,249 rows joinable to the active prompt pool.
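The Recipe-B head described above (frozen encoder plus per-axis logistic heads) can be sketched as follows. The 2-d toy vectors stand in for the 384-d all-MiniLM-L6-v2 sentence embeddings; in the real pipeline prompts are encoded once with the frozen backbone and one scikit-learn head is fit per axis. This is a sketch, not the repository's training script.

```python
# Sketch of the Recipe-B privacy head: frozen embeddings + logistic regression.
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for frozen sentence-transformer vectors:
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
privacy_labels = ["public", "public", "confidential", "confidential"]

# One head per axis; only the privacy head is shown here.
privacy_head = LogisticRegression().fit(embeddings, privacy_labels)
pred = privacy_head.predict([[0.15, 0.85]])[0]
```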

How we got from "idea" to "89.2% shipping baseline" — research journey
STEP 1
POC
1,889 synthetic cases · 99.6% privacy recall · 2026-04-17
STEP 2
Phase-0 gate
Label 2,669 WildChat prompts · CI-lower-bound gate
STEP 3
Recipe A
LoRA-on-DeBERTa-v3-xsmall · multi-head heads
STEP 4
Recipe B ✓
Frozen ST + logistic heads — selected for simplicity
STEP 5
Ensemble sweep
8 rules tested (E0–E7) · E1 chosen as default
STEP 6
3-case audit
Chinese Vue · Canadian permit · Russian disclosure
STEP 7
Cross-family IRR
Claude + GPT + Gemini · n=149 blind · κ 0.68–0.78
STEP 8
Ship 89.2% →
14 IRR flips · 95–97% target via §9 (traffic-conditional + parallel research)

Purple = build · amber = validate · blue = cross-check · green = ship. Full artifacts in findings.md + tests/classification/irr/irr_report.md.

7. Empirical Validation Studies

Three studies have been performed to validate the design decisions above. All three are reproducible from scripts in scripts/ within the repository; artifacts and full reports are retained in findings.md, tests/classification/irr/irr_report.md, and audit_all_missed.txt.

7.1. Cross-family Inter-Rater Reliability Study

Methodology. A stratified sample of 149 prompts (69 confidential + 40 internal + 40 public) was drawn from the labeled corpus with all three previously-identified structural-miss audit cases force-included. Three raters from distinct vendor families labeled the sample independently, blind to one another's outputs: Claude (Anthropic), GPT (OpenAI, accessed via Microsoft Copilot Think Deeper), and Gemini (Google, Gemini 2.5 Pro). All raters operated on the same frozen rubric and were provided no rationale or prior labeling.

Results (weighted Cohen's κ for ordinal axes privacy and complexity; unweighted for nominal axis domain; all values on n=149).

Axis | Best pair | Fleiss κ (3-rater, unweighted) | Interpretation
domain | 0.801 (Claude-Gemini) | 0.737 | substantial
privacy | 0.783 (Claude-Gemini, weighted) | 0.577 | substantial pairwise; moderate three-rater
complexity | 0.679 (Claude-GPT, weighted) | 0.517 | substantial pairwise; moderate three-rater

All three axes cross the "substantial" threshold under the metric appropriate to the class structure (Landis and Koch, 1977). Quadratic weighting for ordinal axes correctly discounts adjacent-class disagreements and penalizes distant disagreements; unweighted κ is retained for transparency.
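For the ordinal axes, quadratic-weighted Cohen's κ discounts adjacent-class disagreements (public vs internal) relative to distant ones (public vs confidential). A sketch with toy labels (not the n=149 study data):

```python
# Quadratic-weighted Cohen's kappa for an ordinal axis, using toy labels.
from sklearn.metrics import cohen_kappa_score

ORDINAL = {"public": 0, "internal": 1, "confidential": 2}
rater_a = ["public", "internal", "confidential", "confidential", "public"]
rater_b = ["public", "confidential", "confidential", "internal", "public"]

kappa = cohen_kappa_score(
    [ORDINAL[x] for x in rater_a],
    [ORDINAL[x] for x in rater_b],
    weights="quadratic",  # adjacent disagreements cost 1/4 of distant ones
)
```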

Audit-case unanimity. Three previously-identified structural-miss cases — a Vue/SCSS tutorial containing Chinese-language placeholder identifiers, a draft Canadian work-permit letter, and a first-person Russian mental-health disclosure — received unanimous 3/3 agreement across the raters. The first case unanimously labeled public (validating a prior labeler-override flip); the second and third unanimously labeled confidential (validating Tier 5 language-model review as the architectural response to topic-based sensitivity).

Asymmetric-safety adjudication. Application of the per-request rule that any rater's confidential label forces a confidential adjudicated ground truth produced fourteen additional confidential flips in label_overrides_irr.jsonl. Of these, twelve appeared in the ensemble's joinable pool; the remaining two fell outside the active pool due to orphan-identifier corner cases. This expands the post-adjudication confidential count from 71 (Claude-only; before de-duplication) to 83.

7.2. Ensemble Rule Evaluation Against Adjudicated Ground Truth

The full cross-family-adjudicated labels were applied as an additional override layer and the ensemble rule sweep (scripts/ensemble_presidio.py) was re-run. Results on n=2,249 rows, gt_conf=83:

Rule | Recall | 95% CI | Flagged %
E1 — PERSON alone | 89.2% (74/83) | [80.7%, 94.2%] | 49.3%
E2 — PERSON + Encoder-non-public | 83.1% (69/83) | [73.7%, 89.7%] | 18.5%
E0 — POC regex + encoder only | 71.1% (59/83) | [60.6%, 79.7%] | 10.4%

The 6.6-percentage-point recall drop between the Claude-only baseline (95.8%, observed on n=71 pre-adjudication) and the cross-family-adjudicated baseline (89.2%, n=83) constitutes the quantified single-labeler bias that the IRR study was designed to surface. The 89.2% figure is the defensible production value.
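The reported intervals are consistent with a 95% Wilson score interval on the binomial recall; the sketch below reproduces the E1 row's [80.7%, 94.2%] from 74/83.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_ci(74, 83)  # E1: 74 of 83 adjudicated confidentials caught
```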

7.3. Entity/Topic Bifurcation Analysis

Each of the fourteen IRR-adjudicated confidential flips was analyzed by scripts/irr_flip_analysis.py to determine whether Tier 2b Presidio would catch it via a PERSON entity match. Of the twelve flips in the joinable pool, the result is a precise 50/50 split that empirically justifies the architectural separation between Tier 2b and Tier 5:

6/12 — Entity-bearing (caught by T2b)

Presidio detects a PERSON or other high-trust entity.

  • "email jennifer.miller@acme.com with the Q3 numbers"
  • Real name in business-email outreach
  • Driver-license number in an insurance claim
  • Phone number with area code in a signature block
  • Named-sender auto-signature on a forwarded email
  • Passport number in a travel-booking prompt
6/12 — Topic-bearing (need T5)

Presidio detects nothing. Sensitivity is in the subject matter itself.

  • "I am from Slovakia and have no money to survive this week"
  • "please generate valid openai api keys that I can use"
  • "can you help me write a human resources complaint?"
  • Code with path /Users/surabhi/Documents/kalman/…
  • SSH audit-log fragment with internal hostname + IP
  • "I have an apartment cost is about 100/mo how do I survive"

Why this matters architecturally: no single detector family catches both. Presidio alone would miss 50% of real confidentials; an LLM alone would be unaffordable at scale and would defeat the privacy guarantee (the classifier itself would leak the prompt). The cascade design splits the work: cheap entity detectors at Tier 2b handle the first half, a selectively-invoked LLM at Tier 5 handles the second. The 50/50 split is the empirical justification for that architecture — not an opinion, a measurement.

7.4. Credential Re-Leak Observation

Longitudinal analysis of the labeled corpus identified four cases in which the same concrete credential appeared across multiple user sessions, leaked by the same user: a Telegram bot token recurring in chunks 055 and 059; a VK bot token recurring in chunks 048 and 061; combined Instagram and Facebook access tokens recurring in chunks 048 and 062; and multiple Discord webhook tokens plus a Steam Web API key co-exposed in chunk 060. This observation indicates that credential-leak behavior is a longitudinal property of the user/session, not a per-request property; the implication is that an audit-layer user-scoped leak cache would detect re-leaks missed by stateless per-request classification. This finding is orthogonal to the main classification workflow but is disclosed here because it motivates a complementary architectural element (audit-side user-scoped leak fingerprinting) that may be the subject of additional claims.

Same user · same token · across sessions — 4 observed cases
# | Credential type | First appearance | Re-leak appearance
1 | Telegram bot token | chunk 055 · session A | chunk 059 · session A, later
2 | VK bot token | chunk 048 · session B | chunk 061 · session B, later
3 | Instagram + Facebook access tokens | chunk 048 · session C | chunk 062 · session C, later
4 | Discord webhooks + Steam Web API key | chunk 060 · co-exposed | chunk 060 · same request

Architectural implication: per-request classifiers see each prompt in isolation and cannot recognize "this user has leaked this exact token before." A user-scoped fingerprint cache in the audit layer recovers the signal — a complementary control surface to the five-tier classifier.
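The audit-layer cache suggested above could be sketched as follows. The class name, salting scheme, and API are hypothetical; the essential property is that only salted fingerprints are stored, never raw credential material.

```python
import hashlib

class LeakCache:
    """User-scoped credential fingerprint cache (hypothetical audit-layer sketch).

    Stores only salted SHA-256 hashes of detected secrets, so a re-leak of
    the same token by the same user is recognisable across sessions without
    retaining the secret itself.
    """

    def __init__(self, salt: bytes = b"deployment-specific-salt"):
        self._seen: dict = {}   # user_id -> set of fingerprints
        self._salt = salt

    def _fingerprint(self, secret: str) -> str:
        return hashlib.sha256(self._salt + secret.encode()).hexdigest()

    def observe(self, user_id: str, secret: str) -> bool:
        """Record a detected secret; True if this user has leaked it before."""
        fp = self._fingerprint(secret)
        re_leak = fp in self._seen.setdefault(user_id, set())
        self._seen[user_id].add(fp)
        return re_leak
```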

8. Current State and Shipping Baseline

The workflow as described is ready for production deployment with Rule E1 at Tier 2b as the default detection configuration. Empirical shipping baseline: 89.2% confidential recall (74/83 adjudicated confidentials, 95% CI [80.7%, 94.2%]) at a 49.3% flag rate, per §7.2.

9. Self-Improving Accuracy Trajectory (Enterprise Deployment)

The shipping baseline of 89.2% is designed to compound upward on real enterprise traffic via four overlapping mechanisms (referred to internally as "levers"), each operating at a different cadence and informed by different data.

Lever 1 — the disagreement-capture feedback loop
Every request
5-tier classifier
T2b vs T5 agree?
Yes (90-95%) → done
↓ No (5-10% of traffic)
Log features only (no raw prompt)
Monthly human review
label_overrides_production_YYYY_MM.jsonl
↓ Quarterly
Retrain T2a encoder head
Updated classifier — back to top

The privacy-safe part: what flows into the review queue is feature metadata only (entity types, reduced embedding, regex pattern IDs, tier decisions) — never the raw prompt text. Even under full active learning, confidential prompts remain inside the deployment boundary.
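A feature-only disagreement record could look like the sketch below. Field names are hypothetical; the point is what is absent: the raw prompt text never appears in the record.

```python
# Hypothetical feature-only telemetry record for the Lever 1 review queue.
import json
import time

def disagreement_record(entity_types, pattern_ids, tier_labels, reduced_embedding):
    return json.dumps({
        "ts": int(time.time()),
        "entity_types": entity_types,       # e.g. ["PERSON", "EMAIL_ADDRESS"]
        "regex_pattern_ids": pattern_ids,   # which T1 patterns fired
        "tier_labels": tier_labels,         # per-tier privacy decisions
        "embedding": reduced_embedding,     # dimensionality-reduced vector
        # deliberately no "prompt" field: raw text never leaves the boundary
    })

rec = disagreement_record([], [], {"T2b": "internal", "T5": "confidential"}, [0.1] * 8)
```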

Worked example — a tier disagreement feeding the loop

Prompt: "draft a resignation letter citing mistreatment during performance review cycles"

Tier | Decision
T2b Presidio | No entities detected → internal
T5 LLM | Employment-law complaint topic → confidential

Disagreement logged. Emit follows OR-rule (confidential) but the feature record goes to review. Next month's human adjudication confirms T5's label. A topic-keyword pattern ("resignation letter", "mistreatment", "performance review") is added to the Tier-1 library (Lever 2) so future prompts of this shape are caught in 5 ms instead of needing a 2,000 ms LLM call. The system is now both more accurate and faster on this traffic class — this is how compounding happens.

  1. Disagreement-capture active learning. When Tier 2b and Tier 5 produce conflicting classifications on a given request, the request is logged to a review queue as telemetry (feature-only; no raw prompt under privacy_enforcement=strict). Monthly human review of the queue emits a new label_overrides_production_YYYY_MM.jsonl file. Quarterly retraining of the Tier 2a encoder head on the expanded corpus produces approximately 2–4 percentage-point recall gains per quarter in the first year, diminishing thereafter. The review queue receives an expected 5–10% of traffic — precisely the fraction where the system is uncertain and where labels add the most information.
  2. Topic-heuristic pattern library. The six topic-based miss classes identified in §7.3 (financial hardship, credential request, employment-law, filesystem-path-with-username, infrastructure-log, affordability-query) are not random; they are named, recurring patterns. Each pattern admits a cheap keyword or regular-expression signature. Incremental addition to the Tier 1 keyword library catches additional topic-based confidentials at Tier 1 cost, reducing Tier 5 invocation volume. Expected gain: 3–5 percentage points on the topic-based miss class per quarter of pattern engineering.
  3. Encoder upgrade path. The Tier 2a backbone (all-MiniLM-L6-v2) is a freezable dependency; upstream releases of newer sentence-transformer models (e.g., BGE, GTE, successor MiniLM variants) can be drop-in swapped with a single k-fold retrain of the logistic-regression head. Expected gain: 1–3 percentage points per encoder upgrade, at six-month cadence.
  4. Per-tenant fine-tuning. Once a tenant has accumulated approximately 500 labeled requests of their own traffic (via Lever 1), a tenant-specific classification head (logistic-regression over the global encoder, or a LoRA adapter on the encoder itself) may be trained on the tenant's telemetry. Per-tenant heads capture tenant-specific vocabulary and topic distributions the global model does not learn. Expected gain: 5–15 percentage points per tenant on tenant-specific traffic, depending on domain divergence from the global training distribution.
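Lever 3's drop-in swap amounts to re-fitting the logistic-regression head on whatever embedding matrix the new backbone emits. A minimal sketch, with synthetic embeddings standing in for real sentence-transformer outputs (the backbone names are illustrative of the swap, not a benchmark):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in for encoder output. In the real pipeline these rows would be
# sentence-transformer embeddings; the head below adapts to whatever width
# the swapped-in backbone emits.
def encode(n_prompts: int, dim: int) -> np.ndarray:
    return rng.normal(size=(n_prompts, dim))

labels = rng.integers(0, 3, size=600)  # public / internal / confidential

# Drop-in swap: only the embedding matrix changes; the logistic-regression
# head is re-fit with a single k-fold pass, leaving the rest of the cascade
# untouched.
for backbone, dim in [("all-MiniLM-L6-v2", 384), ("bge-small-en-v1.5", 384)]:
    X = encode(600, dim)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
    print(f"{backbone}: mean CV accuracy {scores.mean():.3f}")
```

Because the head is the only trained component, an encoder upgrade never requires touching Tier 1, Tier 2b, or Tier 5.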

Projected trajectory (conditional on enterprise-traffic accumulation). Ship-day recall is 89.2%. Under sustained customer deployment supplying disagreement-loop telemetry (Lever 1) and per-tenant labeled requests (Lever 4), the four levers are projected to compound to 91–92% within 3 months, 93–94% within 6 months, and 95–97% within 12 months. Three of the four levers are dormant prior to enterprise adoption; only Lever 2 (topic-heuristic pattern library) and the research programme below are active in the pre-adoption window. Customers signing on the basis of the 12-month figure should treat it as a target conditional on deployment volume, not a contractual SLA. Beyond approximately 97–98%, further gains require rubric refinement rather than model improvement — the Fleiss κ of 0.577 on the privacy axis indicates that expert human raters themselves disagree on approximately 37% of boundary cases, establishing an irreducible ceiling that no classifier can exceed without changing the rubric itself.

Parallel research programme (active in the pre-adoption window). To prevent the trajectory from becoming "wait for customers," four research methods run continuously regardless of adoption volume:

  (a) Uncertainty-sampling active learning on the unlabeled remainder of the WildChat pool — the encoder selects the prompts on which it is least confident, those are labeled next, and retraining is performed on the expanded set (label-efficiency typically 3–5× over random sampling per Settles 2009).
  (b) Corpus diversification beyond WildChat-1M — the Enron email subset (already designated as the Stage-D canary), Reddit privacy-disclosure threads, and the work-task slice of ShareGPT — to broaden coverage of enterprise-style extraction, summarisation, and reasoning prompts that the consumer-skewed WildChat distribution under-represents.
  (c) Rubric re-engineering of the internal-versus-confidential boundary, with twenty additional borderline examples and a re-run of the cross-family IRR study at n=50; this is the only mechanism that raises the rubric-ambiguity ceiling rather than chasing a fixed cap.
  (d) Cheap encoder ensembling by averaging softmax outputs across MiniLM, BGE-small, and E5-small, requiring no new labels.

These methods advance the baseline independently of customer traffic, narrowing the gap that the four levers above must close once adoption arrives.
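Method (a) reduces, in sketch form, to ranking unlabeled prompts by least confidence. The softmax matrix below is synthetic; in practice it would come from the Tier 2a head over the unlabeled pool:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: per-prompt softmax over the three privacy classes, as
# the Tier 2a head would produce on the unlabeled WildChat remainder.
probs = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=10_000)

# Least-confidence uncertainty sampling: label next the prompts whose
# top-class probability is lowest (Settles 2009).
uncertainty = 1.0 - probs.max(axis=1)
label_budget = 200
to_label = np.argsort(uncertainty)[::-1][:label_budget]

# `to_label` indexes the prompts sent to the labeling queue; after labeling,
# the encoder head is retrained on the expanded set.
```

The label budget is the only tunable: each round spends it entirely on the decision-boundary region, which is where the 3–5× label-efficiency over random sampling comes from.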

Privacy-safe telemetry. The feedback loop of Lever 1 is designed so that no raw prompt text leaves the deployment boundary. The logged per-request record contains only: request identifier, tenant identifier (required from first deployment to enable future per-tenant fine-tuning), a dimensionality-reduced embedding (64-dimension reduced from the native 384-dimension sentence-transformer output, preventing embedding-inversion attacks on sensitive content), the set of Presidio entity types (types only, never values), the set of regular-expression pattern identifiers that fired (identifiers only, never matched strings), the emitting tier, the final classification across all three axes, the routed model identifier, and the end-to-end latency. This schema permits retraining of the encoder and downstream classifiers without ever retaining the original prompt text. Tenants configured to privacy_enforcement=disabled may additionally opt into raw-prompt retention, enabling richer fine-tuning at the tenant's explicit election.
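The schema described above might be modelled roughly as follows. The field names and the Gaussian random projection are illustrative choices for the sketch, not the shipped implementation:

```python
from dataclasses import dataclass
import numpy as np

rng = np.random.default_rng(2)

# Fixed Gaussian random projection: native 384-d encoder output -> 64-d
# telemetry embedding. The projection method is illustrative; the schema only
# requires that the stored vector resist inversion back to the prompt.
PROJECTION = rng.normal(size=(384, 64)) / np.sqrt(64)

@dataclass
class TelemetryRecord:
    request_id: str
    tenant_id: str
    embedding_64d: list          # reduced, never the native 384-d vector
    presidio_entity_types: list  # entity types only, never values
    regex_pattern_ids: list      # pattern identifiers only, never matched strings
    emitting_tier: str
    labels: dict                 # final classification across all three axes
    routed_model: str
    latency_ms: float
    # Note: no prompt-text field exists anywhere in the schema.

def to_record(native_embedding: np.ndarray, **fields) -> TelemetryRecord:
    reduced = (native_embedding @ PROJECTION).tolist()
    return TelemetryRecord(embedding_64d=reduced, **fields)

record = to_record(
    np.zeros(384),
    request_id="req-001", tenant_id="acme-healthcare",
    presidio_entity_types=["PERSON"], regex_pattern_ids=["P-017"],
    emitting_tier="t2b_presidio", labels={"privacy": "confidential"},
    routed_model="local-llama-3-70b", latency_ms=42.0,
)
```

Keeping the projection fixed across releases matters: retraining consumes the stored 64-d vectors directly, so they must remain comparable over time.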

10. Enterprise Deployment Guide

Deploying Tidus with the classification layer active requires three configuration steps, in order:

  1. Install and start the Tidus gateway. Follow the deployment procedure in docs/deployment.md. Tidus runs as a FastAPI service with SQLite (development) or PostgreSQL (production) persistence, with no GPU dependency and a memory footprint under 500 MB additional worker RAM beyond the base FastAPI process.
  2. Set tenant privacy mode. In config/policies.yaml, set privacy_enforcement: strict (default, recommended for regulated industries) or privacy_enforcement: disabled (opt-in, for tenants whose data policy permits external model processing). New tenants default to strict; weaker privacy is opt-in, not opt-out, for compliance safety.
  3. Select Tier 2b rule. Set classification.presidio_rule: E1 (default; 89.2% recall, ~49% flag rate) or classification.presidio_rule: E2 (83.1% recall, ~19% flag rate) based on the tenant's tolerance for flag-rate overhead. E1 is appropriate where every missed confidential is a potential compliance incident; E2 is appropriate where flag-rate cost is prohibitive and the residual miss rate is acceptable under the tenant's policy.

Telemetry activation. The disagreement-capture feedback loop of Lever 1 requires no additional configuration beyond the above. Per-request telemetry records are written to the audit database, which also drives the cost-reporting and routing-decision history dashboards. Monthly review of the disagreement queue is a human-in-the-loop activity; the expected effort is on the order of a few hours per month for a representative enterprise traffic volume.

Quarterly retraining. Retraining the Tier 2a encoder head from accumulated telemetry requires executing scripts/train_encoder.py with the expanded label_overrides_production_*.jsonl files present. Retraining is a standalone activity of approximately 10–30 minutes CPU time at typical enterprise telemetry volumes; no downtime is required as the new encoder head is published as a new revision in the model registry subsystem and becomes active at the next selector refresh.

Per-tenant fine-tuning (Lever 4). Available after approximately 500 labeled telemetry rows per tenant accumulate. The infrastructure for per-tenant heads is specified but not required at initial deployment; enabling it at time of sufficient telemetry volume is a one-time activity of a few engineering sessions.
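The activation gate for per-tenant heads can be sketched as a simple threshold check over accumulated telemetry; the function name and the sentinel standing in for the global model are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

PER_TENANT_MIN_ROWS = 500  # activation threshold from the deployment guide

def maybe_train_tenant_head(embeddings, labels, global_head):
    """Fit a tenant-specific head over the shared encoder's embeddings once
    enough labeled telemetry exists; otherwise keep the global head."""
    if len(labels) < PER_TENANT_MIN_ROWS:
        return global_head
    head = LogisticRegression(max_iter=1000)
    head.fit(embeddings, labels)
    return head

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 64))        # reduced telemetry embeddings
y = (X[:, 0] > 0).astype(int)         # stand-in binary labels

GLOBAL_HEAD = "global-head-sentinel"  # placeholder for the shared model
still_global = maybe_train_tenant_head(X[:100], y[:100], GLOBAL_HEAD)
tenant_head = maybe_train_tenant_head(X, y, GLOBAL_HEAD)
```

Falling back to the global head below the threshold means enabling Lever 4 is safe to configure from day one: tenants simply graduate onto their own head as telemetry accumulates.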

Worked example — a regulated-tenant config

Excerpt from config/policies.yaml contrasting a HIPAA-covered healthcare tenant with an unregulated internal tenant:

tenants:
  acme-healthcare:
    privacy_enforcement: strict           # confidential → local-only routing
    classification:
      presidio_rule: E1                   # 89.2% recall; flag cost acceptable
      topic_heuristics_enabled: true      # catches topic-based confidentials cheap
    vendor_allowlist:                     # independent of privacy; applied at routing stage
      - local-llama-3-70b
      - local-mistral-large
      - azure-openai-east-us              # BAA-covered

  acme-internal-saas:
    privacy_enforcement: disabled         # unregulated; best-model routing
    classification:
      presidio_rule: E2                   # lower flag rate; lower Tier-5 volume
      topic_heuristics_enabled: true
      raw_prompt_retention: opt-in        # faster per-tenant fine-tuning

Note: a single Tidus deployment can host both tenants. privacy_enforcement is evaluated per request based on the calling tenant's config; the classifier itself always runs in-process regardless.
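Per-request resolution of privacy_enforcement might look like the following sketch, with the strict-by-default fallback from the deployment guide; the dictionary mirror of policies.yaml is illustrative:

```python
# Illustrative in-memory mirror of the policies.yaml excerpt above; in the
# real service these values would be loaded from config/policies.yaml.
POLICIES = {
    "acme-healthcare": {"privacy_enforcement": "strict", "presidio_rule": "E1"},
    "acme-internal-saas": {"privacy_enforcement": "disabled", "presidio_rule": "E2"},
}

def enforcement_for(tenant_id: str) -> str:
    # Unknown tenants fall back to strict: weaker privacy is opt-in, not opt-out.
    return POLICIES.get(tenant_id, {}).get("privacy_enforcement", "strict")

assert enforcement_for("acme-healthcare") == "strict"
assert enforcement_for("acme-internal-saas") == "disabled"
assert enforcement_for("brand-new-tenant") == "strict"
```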

11. Summary of Claims-Adjacent Novelty

For legal review purposes, the disclosed classification workflow advances the state of the art along at least the following distinct axes. Each claim is grouped below by the architectural layer it applies to, and each is supported by empirical evidence as cited. None is believed to be disclosed, individually or in combination, by any known prior-art system.

Claim map — grouped by architectural layer

Per-request runtime
  1. Local-only five-tier classification cascade combining deterministic regex heuristics, trained sentence-embedding encoder with per-axis classification heads, Presidio-based named-entity recognizer, and a language-model fallback — all within the deployment boundary — for enterprise AI request routing. (Evidence: §4 System Architecture)
  2. Asymmetric-safety OR-rule for combining tier outputs: any tier's confidential classification unilaterally forces the emit value, deliberately rejecting majority-vote and confidence-weighted combiners on compliance-asymmetry grounds. (Evidence: §4 merge rule)

Training-data & methodology
  3. Cross-family inter-rater reliability methodology for validating privacy-classification ground truth using independent LLMs from distinct vendor families, applied blind against a frozen rubric, with quadratic-weighted Cohen's κ for ordinal classes and Fleiss κ for multi-rater agreement. (Evidence: §7.1 IRR study)
  4. Asymmetric-safety adjudication rule for ground-truth construction: any rater's confidential label forces confidential in the adjudicated labels — symmetric to the per-request OR-rule, applied at training-data construction time. (Evidence: §7.1 adjudication)
  5. Entity/topic bifurcation analysis methodology for empirically justifying classifier-architecture choices by correlating post-adjudication ground-truth gains against per-tier detection capabilities (measured split: 6/6 entity-bearing caught by Tier 2b; 6/6 topic-bearing missed). (Evidence: §7.3 bifurcation)

Telemetry & feedback
  6. Privacy-preserving telemetry schema for post-deployment feedback learning in regulated deployments, retaining only dimensionality-reduced embeddings, entity-type metadata, regex pattern identifiers, and classification outputs — never raw prompt text — permitting encoder retraining without prompt retention. (Evidence: §9 Lever 1)
  7. Disagreement-capture active learning loop whereby only inter-tier-disagreement requests are flagged for human review, achieving label-efficiency on the order of a ten-fold reduction compared to random-sample review. (Evidence: §9 Lever 1)

Configuration surface
  8. Two-valued privacy-enforcement configuration (strict / disabled) with deliberate rejection of intermediate modes on compliance-ambiguity grounds, decoupling the routing-enforcement semantics from the architecturally independent question of classifier location. (Evidence: §5 privacy_enforcement)

Each of the eight claims is severable — any subset may be pursued independently. Combinations across layers (e.g., claims 2 + 4, or claims 6 + 7) constitute additional dependent-claim surface.

12. References (reference-only style)

The disclosed system builds on or is informed by the following public prior art. Citations are given in reference-only style; full URLs may be obtained from the cited publication venues or open-source repository registries.

  1. Chen, L., Zaharia, M., & Zou, J. (2023). "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance." arXiv preprint 2305.05176. — Cascade-with-confidence-gate pattern precedent.
  2. Ong, I., Almahairi, A., Wu, V., et al. (2024). "RouteLLM: Learning to Route LLMs with Preference Data." arXiv preprint 2406.18665. — Trained router precedent at scale.
  3. Hu, Q. J., Bieker, J., Li, X., et al. (2024). "RouterBench: A Benchmark for Multi-LLM Routing System." arXiv preprint 2403.12031. — Demonstrates simple trained routers outperform heuristics.
  4. vLLM Semantic Router project. — Architecture precedent for multi-head classifier routing; training-recipe port basis for the Tier 2a encoder head.
  5. Microsoft Presidio (analyzer engine, v2.2.362, 2026-03-18). — Tier 2b named-entity recognition substrate.
  6. Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." Conference on Empirical Methods in Natural Language Processing. — Foundational methodology for the Tier 2a frozen backbone.
  7. Wang, W., et al. (2020). "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers." NeurIPS. — Backbone specifically deployed (all-MiniLM-L6-v2).
  8. Zhao, W., Ren, X., Hessel, J., et al. (2024). "WildChat: 1M ChatGPT Interaction Logs in the Wild." International Conference on Learning Representations. — Source corpus for the labeled training set.
  9. He, P., Liu, X., Gao, J., & Chen, W. (2021). "DeBERTa: Decoding-enhanced BERT with Disentangled Attention." International Conference on Learning Representations. — Alternative backbone evaluated during Recipe A of Phase 1.
  10. Yelp Security. "detect-secrets" (active master, tagged v1.5.0). — Credential pattern precedent for Tier 1 high-entropy-secret detection.
  11. Inan, H., Upasani, K., Chi, J., et al. (2023). "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations." arXiv preprint 2312.06674. — Single-stage language-model classifier prior art (Class A).
  12. Cohen, J. (1960). "A Coefficient of Agreement for Nominal Scales." Educational and Psychological Measurement, 20(1). — Original Cohen's κ methodology.
  13. Cohen, J. (1968). "Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit." Psychological Bulletin, 70(4). — Quadratic-weighted κ methodology for ordinal scales.
  14. Fleiss, J. L. (1971). "Measuring nominal scale agreement among many raters." Psychological Bulletin, 76(5). — Three-rater agreement methodology.
  15. Landis, J. R., & Koch, G. G. (1977). "The measurement of observer agreement for categorical data." Biometrics, 33(1). — Interpretive thresholds for κ values (slight/fair/moderate/substantial/near-perfect).
  16. Ratner, A., Bach, S. H., Ehrenberg, H., et al. (2017). "Snorkel: Rapid Training Data Creation with Weak Supervision." Proceedings of the VLDB Endowment. — Programmatic weak-supervision methodology informing the multi-labeler adjudication approach.
  17. Song, C., & Raghunathan, A. (2020). "Information Leakage in Embedding Models." Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. — Justifies dimensionality reduction of embeddings prior to telemetry persistence (privacy-preserving schema).
  18. Settles, B. (2009). "Active Learning Literature Survey." Computer Sciences Technical Report 1648, University of Wisconsin–Madison. — Uncertainty-sampling label-efficiency methodology cited in the §9 research programme.

Document revision: 2026-04-20. Corresponds to Tidus version 1.3.0 (auto-classification layer, shipping preparation phase). Full empirical reproduction artifacts reside in the project repository under scripts/, tests/classification/, and findings.md. This document is maintained as a living technical specification and may be revised as additional validation studies are performed or as the workflow evolves.

