Free and open-source for every organisation — any size, any industry. Tidus selects the cheapest capable model for every AI request, automatically. 5-stage intelligence. 70% cost savings. Fully self-hosted. Every dollar saved stays with you.
What is Tidus?
Tidus sits in front of your AI workloads as a self-hosted FastAPI service. You call one endpoint — Tidus picks the right model automatically.
Most teams waste money sending every request to the same premium model. A simple classification task costs $15/1M tokens on Claude Opus when Gemini 2.0 Flash at $0.10/1M returns an equivalent result.
Tidus analyses each request's complexity, privacy requirements, capability needs, and budget constraints — then scores every eligible model using a weighted algorithm (70% cost, 20% tier, 10% latency) to find the optimal route.
Because Tidus is fully self-hosted and open source, no request data ever leaves your environment. It works with any AI vendor and supports local Ollama models for confidential workloads where cloud APIs are prohibited.
The weekly pricing registry keeps costs accurate without manual intervention — syncing prices from multiple sources every Sunday, detecting outliers with MAD-based consensus, and creating versioned, auditable revisions.
Five Cost-Control Pillars
Five complementary controls — each tackling a distinct source of AI cost waste. Together they form a complete governance layer over your AI infrastructure.
Tier access follows task criticality: simple tasks can use any tier; critical tasks are restricted to T1 only. This single rule removes the most expensive models from 80% of typical workloads without any manual configuration.

Usage guardrails are declared in policies.yaml and enforced before any API call is made: max_agent_depth (default 5) prevents infinite recursion loops; max_tokens_per_step (default 8,000) caps per-step cost uniformly across all models; max_retries_per_task (default 3) stops retry storms from multiplying costs; max_parallel_sessions_per_team (default 10) prevents concurrency explosions. Each violated guardrail produces a named rejection reason in the API response — no silent failures.

How It Works
A transparent middleware layer deployed on your own infrastructure. Your existing code needs one URL change — no SDK swaps, no architecture redesign.
```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o",  # hardcoded — always expensive
    messages=[...],
)
# Every task hits GPT-4o at $2.50/1M input
# Simple summary? Still $2.50/1M. No choice.
```
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://tidus:8000/v1",
    api_key="your-team-api-key",
)

response = client.chat.completions.create(
    model="auto",  # Tidus picks the best model
    messages=[...],
)
# Simple task → $0.039/1M · Critical → GPT-4o
```
Selection Algorithm
Every routing decision follows a deterministic pipeline. Each stage either eliminates models or scores them — no randomness, no black-box AI decisions.
Each dimension is min-max normalised across the surviving candidates — scores are relative to the competition, not absolute. A $15/1M model that is the cheapest survivor scores 0.0 on cost. A deprecated model receives a flat +0.15 penalty added after normalisation — it can still win if significantly cheaper than all alternatives. If any stage reduces the eligible set to zero, Tidus raises a structured error naming the stage and every rejection reason: no silent failures, no fallback to a wrong model.
| Model | Tier | Est. Cost | cost_norm | tier_norm | lat_norm | Score |
|---|---|---|---|---|---|---|
| claude-sonnet-4-6 | T2 | $0.0155 | 1.00 | 0.00 | 1.00 | 0.80 |
| gemini-2.5-flash | T3 | $0.00212 | 0.21 | 1.00 | 0.13 | 0.36 |
| gpt-4.1-mini ✓ Winner | T3 | $0.00184 | 0.00 | 1.00 | 0.00 | 0.20 |
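The Score column above is just the weighted sum of the three normalised dimensions. A minimal sketch, using the normalised values from the table as inputs (the min-max normalisation step itself is not repeated here):

```python
def weighted_score(cost_norm: float, tier_norm: float, lat_norm: float) -> float:
    """Stage-5 ranking: 70% cost, 20% tier, 10% latency. Lowest score wins."""
    return 0.70 * cost_norm + 0.20 * tier_norm + 0.10 * lat_norm

# (cost_norm, tier_norm, lat_norm) taken from the worked table above
candidates = {
    "claude-sonnet-4-6": (1.00, 0.00, 1.00),
    "gemini-2.5-flash": (0.21, 1.00, 0.13),
    "gpt-4.1-mini": (0.00, 1.00, 0.00),
}
scores = {name: round(weighted_score(*dims), 2) for name, dims in candidates.items()}
winner = min(scores, key=scores.get)  # "gpt-4.1-mini", score 0.20
```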
Model Discovery & Pricing Sources
There are hundreds of AI models published across Hugging Face, vendor APIs, and inference platforms. Most are research checkpoints, experimental fine-tunes, or deprecated variants with no stable public pricing. Tidus tracks 55 commercially stable, enterprise-accessible models (45 currently enabled) — the ones that actually matter for production routing. The curation is deliberate, and the pricing data comes from two independent sources that cross-check each other every Sunday.
Why 55, not 400+? Most published models are research checkpoints, experimental fine-tunes, deprecated versions, or waitlisted previews with no stable public pricing. Tidus only tracks models that are commercially available today, priced per token with a public API, accessible without a waitlist, and stable enough for production routing. The catalog prioritises quality of routing over quantity of options.
The catalog grows continuously. Adding a new model is 3 lines in hardcoded_source.py — the model ID, input price, and output price. Community pull requests are welcome. If a vendor releases a new stable model, open a PR and it will be tracked in the next weekly sync.
```python
"gemini-4.0-flash": {"input": 0.0002, "output": 0.0008}
```
— one entry in hardcoded_source.py · priced in $/1K tokens
Model counts reflect the active catalog as of April 2026.
Tasks marked privacy=confidential are routed exclusively to Ollama.

Two independent sources are queried every Sunday. A MAD-based consensus algorithm cross-checks them and rejects outliers before any revision is created.
The feed URL is configured via TIDUS_PRICING_FEED_URL. Each sync sends a single GET /prices?schema_version=1 — no customer data, no messages, no team IDs. The feed supports HMAC-SHA256 signature verification to prevent tampering. A circuit breaker opens after 5 consecutive failures and resets after 5 minutes, so a feed outage never blocks routing.

Getting Started
From git clone to first routed request in under 5 minutes. Runs on Docker, works with SQLite in development, PostgreSQL in production. No cloud dependencies.
Configure credentials in .env. Enable vendors and set spending limits for each team in config/models.yaml and config/budgets.yaml. Monitor routing from the dashboard at /dashboard/. Weekly savings reports via API. Prometheus metrics for alerting. Drift detection auto-disables misbehaving models. For production, point DATABASE_URL to a PostgreSQL instance.

Latest Report
Tidus continuously tracks prices across 55 models and 13 vendors — because accurate pricing is what makes intelligent routing possible. Every Sunday we generate this market intelligence report to summarise what changed, which models rose or fell, and where routing teams can capture new savings this week.
Ranked by blended cost — highest first · All prices USD/1M tokens · Updated April 19, 2026
| # | Vendor | Model | Blended $/1M | Input $/1M | Output $/1M | Context |
|---|---|---|---|---|---|---|
Prices from official vendor pages via multi-source consensus · Ranked by blended cost · Updated April 19, 2026
ROI Calculator
Adjust your current spend and task complexity mix. Savings scale with the proportion of requests that can be routed to lower-cost tiers.
Estimates assume unoptimised Tier 1/2 routing today. Actual savings depend on your task mix and model availability. Tidus is free and open-source — no subscription fee, no usage cap, no per-seat pricing. Keep 100% of what you save.
How It Works
Tidus applies a deterministic, five-stage algorithm to every AI request. Each stage eliminates models that fail a hard rule; the surviving candidates are ranked by a weighted score. The model with the lowest score wins. This entire process completes in under one millisecond on the server.
Author: Kenny Wong (lapkei01@gmail.com) · Published: 2026-04-15 · Latest revision: 2026-04-20
Every AI request enters this pipeline. Stages 1–4 are binary filters — each model either passes every check or is eliminated immediately. Stage 5 ranks the survivors by a weighted score and selects the single best model. The entire pipeline runs in under one millisecond on the server.
Every model in the registry is checked against four hard rules. Failing any single rule is enough to eliminate the model — there is no partial credit, no weighting, and no override. These checks happen in a single pass over all 55 registered models.
Enabled flag: every registry entry is marked enabled: true or enabled: false. Models can be disabled manually by an administrator, or automatically by the drift detector when repeated health probes fail. A disabled model is immediately eliminated, regardless of capability or cost.

Capability match: tasks declare a domain — chat, code, reasoning, extraction, classification, summarization, or creative. Each model in the registry lists its supported capabilities. If the task's domain is not in the model's capability set, the model is eliminated. A chat-only model cannot be routed a code generation task, for example.

Complexity range: each model declares min_complexity and max_complexity (e.g., moderate to complex). If the task's complexity falls outside this declared range, the model is eliminated. This prevents sending a trivially simple task to a model built for deep reasoning work (wrong tool), and prevents sending a critical decision task to a model not designed to handle it.

Guardrails enforce system-level safety policies that apply to every team and every task — they cannot be overridden by individual callers. Two types of guardrails apply at this stage: usage limits and privacy enforcement.
Tasks carry a privacy level: public, internal, or confidential. When a task is marked confidential, Tidus enforces a hard rule: only models running on your own infrastructure (is_local: true in the registry) are allowed. All cloud-hosted models — regardless of vendor, price, or capability — are eliminated. This is not a preference; it is an absolute constraint. It ensures that confidential data such as patient records, legal documents, or financial reports is never sent to an external API provider.

Usage limits cap agent recursion depth (max_agent_depth) and tokens per step (max_tokens_per_step). These limits prevent runaway agents from incurring unbounded costs or entering infinite loops. If the task's agent depth or token count exceeds the policy limit, the model is eliminated at this stage.

Models in Tidus's registry are classified into four quality tiers. Tier 1 is premium frontier AI (most capable, most expensive). Tier 4 is local or free models (least capable, zero cost). This stage sets a minimum quality floor based on how complex the task is — ensuring that genuinely complex or critical tasks are always handled by appropriately capable models, and cannot be silently downgraded to cheap models.
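The hard-filter checks described above can be sketched as a single predicate. The registry field names follow the text (enabled, capabilities, min_complexity/max_complexity, is_local); the exact record shape is an assumption for illustration:

```python
COMPLEXITY_ORDER = ["simple", "moderate", "complex", "critical"]

def passes_hard_filters(model: dict, task: dict) -> bool:
    """Filter-stage sketch: any failed check eliminates the model outright."""
    if not model["enabled"]:
        return False  # disabled manually or by the drift detector
    if task["domain"] not in model["capabilities"]:
        return False  # capability mismatch
    c = COMPLEXITY_ORDER.index(task["complexity"])
    lo = COMPLEXITY_ORDER.index(model["min_complexity"])
    hi = COMPLEXITY_ORDER.index(model["max_complexity"])
    if not lo <= c <= hi:
        return False  # outside the model's declared complexity range
    if task["privacy"] == "confidential" and not model["is_local"]:
        return False  # absolute constraint: confidential stays local
    return True
```

A confidential task eliminates every cloud-hosted candidate while an otherwise identical local model survives.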
For each model that survived Stages 1–3, Tidus computes the estimated cost of processing this specific task with that model. The estimate uses the actual token counts and current market prices from the pricing registry. Two separate budget checks then apply.
Callers can attach a per-request cap, max_cost_usd, to the task. This is a hard ceiling on what any single API call is allowed to cost. If the estimated cost for a model exceeds this cap, that model is eliminated. This allows callers to guarantee that no single request exceeds a set dollar amount — useful for customer-facing features where per-query economics matter.

All models that survived the four filter stages are now ranked by a deterministic weighted score. Each model gets a number between 0 and 1 on three dimensions; those numbers are weighted and summed. The model with the lowest total score wins. Lower = better.
If a model is marked deprecated in the registry (still routable, but being phased out), a flat penalty of 0.15 is added to its score after normalisation. This means a deprecated model only wins if it is substantially cheaper or faster than all non-deprecated alternatives — preventing gradual quality drift while still honouring the deprecation grace period rather than hard-removing models immediately.

If a caller attaches a preferred_model_id to the task and that model survived all four filter stages, Tidus selects it directly — skipping the scoring step entirely. This respects explicit caller intent (e.g., "always use GPT-4.1 for this workflow") while still enforcing all hard safety and budget constraints. A preference that would violate budget or privacy rules is overridden by the filter stages regardless.
A RoutingDecision record is written to the audit log, capturing which model was chosen, its score, its estimated cost, and the full list of models that were rejected and why.

After hard filters, all surviving models are scored across three normalised dimensions. Each is expressed as a 0–1 value where 0 is best. The weighted sum determines rank.
Different departments have different cost and capability profiles. Tidus uses task complexity to set a hard tier ceiling and the department domain to enforce capability requirements. Together, these two signals determine which models are even considered.
Tidus cannot route cost-efficiently if it uses stale or incorrect prices. It maintains a continuously updated, multi-source pricing registry with statistical outlier detection to ensure the prices it uses for routing are always accurate.
Three scenarios — each triggers different branches of the five-stage pipeline. Follow each request from arrival to model selection.
Domain: chat. 55 models checked. All chat-capable models pass. Result: 52 models survive (3 multimodal-only eliminated).

| Model | Tier | Blended $/1M | P50 ms | Score |
|---|---|---|---|---|
| gpt-4.1-mini | 3 | $1.00 | 320ms | 0.12 ✓ WINNER |
| gemini-2.5-flash | 2 | $1.40 | 280ms | 0.19 |
| claude-haiku-4-5 | 3 | $2.40 | 290ms | 0.28 |
Domain: extraction. Models without extraction capability eliminated. ~28 models survive.

| Model | Cost | P50 ms | Score |
|---|---|---|---|
| ollama/llama3.3-70b | $0 | 1,200ms | 0.20 ✓ WINNER |
| ollama/mistral-7b | $0 | 2,100ms | 0.21 |
Domain: reasoning. Only models with advanced reasoning capability pass. Many economy-tier models without reasoning tags eliminated. ~12 models survive.

| Model | Blended $/1M | Tier | P50 ms | Score |
|---|---|---|---|---|
| groq-deepseek-r1 | $2.00 | 1 | 800ms | 0.14 |
| o3 | $25.00 | 1 | 4,500ms | 0.48 ✓ WINNER* |
| claude-opus-4-6 | $45.00 | 1 | 3,200ms | 0.62 |
* At the critical complexity level, only Tier 1 models with reasoning capability remain. Among these, o3's cost-latency balance wins over the cheapest option (groq-deepseek-r1 scores well on cost but has less proven medical reasoning capability — capability matching at Stage 1 may have already filtered it if the catalog marks it accordingly).
Tidus is an automated AI model routing system. When an application sends an AI request, Tidus receives metadata about that request — its complexity, the type of task, privacy sensitivity, and cost budget. Tidus then applies a five-stage deterministic algorithm to select the optimal AI model from its registry of 55 tracked models.
The first two stages are safety filters: Stage 1 ensures the selected model is technically capable of performing the task; Stage 2 enforces data privacy law by preventing confidential data from being sent to external cloud providers. Stages 3 and 4 are economic filters: Stage 3 prevents over-provisioning by matching task complexity to model capability tier; Stage 4 enforces spending limits. Stage 5 applies a patent-pending weighted scoring formula — 70% cost, 20% quality tier, 10% response speed — to rank surviving candidates and select the best one.
Separately, Tidus maintains an always-current pricing registry. It ingests prices from multiple independent sources, applies statistical outlier detection (Modified Z-Score / Median Absolute Deviation) to reject anomalous data, and stores every price change as a versioned, audited revision. This ensures routing decisions are always based on current, verified market prices — not stale hardcoded values.
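The Modified Z-Score / MAD check named above fits in a few lines. A minimal sketch; the 3.5 cutoff is the common Iglewicz-Hoaglin default and is an assumption here, since the document does not state Tidus's exact threshold:

```python
from statistics import median

def mad_outliers(prices: list[float], threshold: float = 3.5) -> list[bool]:
    """Flag prices whose Modified Z-Score exceeds the threshold."""
    med = median(prices)
    mad = median(abs(p - med) for p in prices)
    if mad == 0:
        return [False] * len(prices)  # all sources agree exactly
    return [abs(0.6745 * (p - med) / mad) > threshold for p in prices]

# One source misreports $9.99 against a ~$2.50 consensus
flags = mad_outliers([2.50, 2.55, 2.48, 9.99])  # [False, False, False, True]
```

Only the anomalous quote is rejected; the agreeing sources proceed into a versioned revision.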
The combination of these two systems — the five-stage routing algorithm and the self-healing pricing registry — constitutes the core patentable invention of the Tidus platform.
Technical Specification
How Tidus converts a raw user prompt into a structured three-axis classification — domain (task type), complexity (cognitive load), and privacy (content sensitivity) — using a five-tier pipeline of local detectors and a language-model fallback, without transmitting the prompt outside the deployment boundary. Includes empirical validation via cross-family inter-rater reliability and an honest accuracy baseline of 89.2% confidential recall at ship. A telemetry-driven self-improvement design targets 95–97% over time; the rate at which that target is reached depends on enterprise-traffic accumulation, so a parallel research programme (uncertainty-sampled re-labeling, corpus diversification, rubric refinement, encoder ensembling) is run concurrently to advance the baseline ahead of, and independently from, customer adoption.
All supporting artifacts are reproducible from the repository (scripts/, tests/classification/, and findings.md). This document is intended both as an enterprise-evaluation technical specification and as prior-art disclosure in support of patent filing.

Every incoming prompt receives one label per axis. The routing stage downstream uses all three: domain narrows the candidate-model set, complexity sets the tier ceiling, and privacy enforces local-only routing when confidential. The examples below are drawn from the labeled corpus and show both the classifier output and which tier resolved it.
| Example prompt (abbreviated) | domain | complexity | privacy | Resolved at |
|---|---|---|---|---|
| "do you know the game arknights" | chat | simple | public | T2a encoder |
| "write a React component that fetches data with useEffect and handles errors" | code | moderate | internal | T2a encoder |
| "debug: bot.send_message(chat_id, '5828712341:AAG5HJa37u32SHLytWm5poFr…')" | code | moderate | confidential | T1 regex (Telegram-token pattern) |
| "I have depression and heightened anxiety, please give me scientific suggestions" | chat | critical | confidential | T5 LLM (topic-based — no entity) |
| "review my letter of explanation for a Canadian open work permit to accompany my wife" | summarization | critical | confidential | T5 LLM (immigration topic) |
| "Kalman filter for YOLO ball tracking, code attached: /Users/surabhi/Documents/kalman/best.pt" | code | complex | confidential | T5 LLM (filesystem user-id leak) |
| "contact me at jennifer.miller@acme.com re: Q3 pricing" | chat | simple | confidential | T2b Presidio (PERSON + EMAIL) |
| "Vue timeline with 张三 as template user and 13845257654 as placeholder phone" | code | moderate | public | T2a encoder (recognizes placeholders) |
Observation: the three axes operate independently. A "code / moderate / confidential" prompt and a "chat / simple / confidential" prompt route to entirely different model sets despite sharing the privacy flag. Conversely, two prompts both labeled confidential may trigger for completely different reasons (entity leak vs. topic sensitivity vs. credential pattern) — which is why a single-signal classifier cannot produce the full three-axis output alone, and why the cascade has multiple tiers.
Plain English: every request is read locally and tagged for task type, difficulty, and sensitivity before routing — with confidential prompts never leaving your deployment.
Tidus classifies every incoming AI request across three dimensions — domain (task type), complexity (cognitive load required for correctness), and privacy (content sensitivity) — before the request reaches any underlying language model. Classification is performed by a five-tier cascade of local detectors, each tier cheaper and faster than the next. Classification output drives downstream routing within the Tidus five-stage model-selection algorithm disclosed elsewhere in this document. The novel aspects of the classification layer disclosed herein include: (i) an asymmetric-safety OR-rule whereby any tier's confidential classification unilaterally forces local-only routing regardless of other tiers' outputs; (ii) a cross-family inter-rater reliability methodology for validating classification ground truth using independent large language models from distinct vendor families (Anthropic, OpenAI, Google); (iii) a disagreement-capture active learning loop that accumulates retraining signal from production traffic while persisting only feature metadata, never raw prompt content; and (iv) an entity/topic bifurcation analysis empirically justifying architectural separation between cheap entity detectors and language-model topic review.
The disclosed classification workflow is intended for use within enterprise AI gateway software that routes natural-language prompts to one of a plurality of candidate language models. Non-exhaustive deployment contexts include: regulated industry verticals (healthcare, finance, legal, defense) subject to data-residency requirements such as HIPAA, GDPR, SOC 2, and equivalent regional standards; organizations with heterogeneous model portfolios spanning both cloud-hosted and on-premises language models; and any system requiring per-request determination of whether prompt content permits transmission to external services.
Existing prompt-classification systems fall broadly into two classes, each with material limitations:
Class A — single-stage language-model classifiers (e.g., Llama Guard, prompt-classification services). These systems achieve high accuracy by invoking a language model on every request. They are unsuitable for privacy-sensitive routing because the act of classifying a confidential prompt requires transmitting that prompt to the classifier, typically outside the deployment boundary. This establishes a privacy paradox: the mechanism intended to determine whether content may leave the system is itself a mechanism that causes content to leave the system.
Class B — static pattern-matching detectors (e.g., Presidio, regex-based secret scanners, DLP systems). These systems are local and fast but detect only explicit identifiers (names, credit card numbers, email addresses, named entities). They systematically miss topic-based confidential content — prompts where sensitivity arises from subject matter (self-disclosed medical condition, employment-law dispute, immigration status, financial hardship) rather than from the presence of a recognizable identifier. Empirical analysis reported in §7 demonstrates that approximately half of enterprise confidential prompts fall into this topic-based class.
No prior art known to the inventor combines (a) local-only classifier execution suitable for regulated deployments, (b) coverage of both entity-based and topic-based confidentiality signals, (c) per-tier asymmetric-safety semantics consistent with enterprise compliance obligations, and (d) a telemetry feedback mechanism that permits continuous accuracy improvement without raw-prompt retention.
| Dimension | Class A — cloud LLM classifier | Class B — regex / NER only | Tidus — tiered asymmetric |
|---|---|---|---|
| Runs inside deployment boundary? | ❌ Usually cloud-hosted | ✅ | ✅ All five tiers local |
| Catches entity confidentials? | ✅ (at cost) | ✅ | ✅ Tier 2b |
| Catches topic confidentials? | ✅ | ❌ ~50% missed (§7.3) | ✅ via Tier 5 LLM |
| Per-request latency | 100–300 ms + network | < 5 ms | 5 ms fast path · 200 ms fallback |
| Privacy paradox? | ⚠️ Yes — classifier itself leaks | ✅ None | ✅ None |
| Self-improves from traffic? | ❌ | ❌ | ✅ Disagreement-capture (§9) |
No prior art combines all six rows. The Tidus column is what §4–§11 of this document disclose in detail.
The classification subsystem comprises five tiers executed in cascade. Each tier operates on the raw prompt text and emits a partial classification across the three axes. Tiers are ordered by ascending cost and descending throughput; the cascade short-circuits when a tier produces a high-confidence classification.
(T5 runs as a local language model under privacy_enforcement=strict; cloud is allowed for disabled.)

Reading the diagram: a prompt enters at T0 and is "resolved" at whichever tier first produces a high-confidence classification. T0 handles the rare back-compat case where the caller already passes the axes. T1 short-circuits roughly a third of traffic on explicit signals. T2a+T2b run in parallel (not in series) and resolve the majority of remaining traffic. T5 is the escape valve for ambiguous cases. Expected tier-resolution distribution in production is shown on the right of each row.
| Tier | Mechanism | Latency (p95) | Purpose |
|---|---|---|---|
| T0 | Caller override — explicit fields in the request API | < 1 µs | Back-compat for callers who already know the classification |
| T1 | Regular-expression and keyword heuristics (Aho–Corasick on MeSH-seeded medical, legal, PCI DSS, and homebrew financial lexica; structural signals including code fences and shebangs; POC secret patterns for SSN, credit card with Luhn validation, AWS access keys, GitHub tokens, generic high-entropy secrets) | 5–10 ms | High-confidence short-circuit for ~30–40% of traffic; first line of privacy defense |
| T2a | Trained encoder — frozen sentence-transformer backbone (all-MiniLM-L6-v2) with a per-axis scikit-learn logistic-regression head trained on a labeled corpus of 2,669 WildChat prompts (see §6) | 3–15 ms (CPU, ONNX int8) | Semantic classification for prompts without explicit identifiers |
| T2b | Presidio-based named-entity recognizer using en_core_web_sm, with a high-trust recognizer allowlist (PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, IBAN, CRYPTO, MEDICAL_LICENSE, URL, IP_ADDRESS) | 20–60 ms (runs in parallel with T2a) | Entity-based confidentiality detection — Rule E1 or E2 below |
| T5 | Language-model fallback, invoked only when T1–T2b disagree or report low confidence; implemented as a local language model for privacy_enforcement=strict deployments, or as a cloud language model for privacy_enforcement=disabled deployments (see §5) | 200–2,000 ms | Topic-based confidentiality detection — catches content that Tier 2b structurally cannot see |
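Tier 1's explicit secret patterns are ordinary code. A sketch of two of the detectors named in the table, the SSN pattern and Luhn-validated credit-card digits (patterns simplified for illustration; production lexica are broader):

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # matches e.g. 123-45-6789

def luhn_valid(number: str) -> bool:
    """Luhn checksum over the digits of a candidate card number."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 12:  # too short to be a card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A regex hit on digit groups that also passes Luhn is a high-confidence credit-card signal, which is why T1 can short-circuit without semantic analysis.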
Detection rules at Tier 2b (configurable per deployment):
Rule E1: if Presidio detects a PERSON entity, classify as confidential. Recall 89.2% (95% CI [80.7%, 94.2%]) on cross-family-validated ground truth (§7). Flags ~49% of traffic. Ship default.

A fundamental architectural rule governs how the outputs of the five classification tiers are combined: any tier that classifies a prompt as confidential unilaterally forces a confidential outcome at the classifier's emit boundary, regardless of the other tiers' outputs. No voting, no majority aggregation, no confidence-weighted blending. This asymmetric semantics is expressed as follows:
```
privacy_emit = confidential
    if any of {T0, T1, T2a, T2b, T5}
    returns confidential
    for the request
```
The rationale is that false negatives on the privacy axis constitute compliance incidents (potential regulatory, contractual, or reputational loss); false positives on the privacy axis merely reduce the candidate model set for a single request. The two error types are not symmetric in cost, and the combining rule reflects that asymmetry.
Prompt: "Help me fix this Python script that reads employee data. Here's the CSV: name,ssn,salary\nJohn Smith,123-45-6789,85000…"
| Tier | Signal | Emit |
|---|---|---|
| T1 regex | SSN pattern \d{3}-\d{2}-\d{4} matches "123-45-6789" | confidential |
| T2a encoder | Semantic vector → probably "code / moderate / internal" | internal |
| T2b Presidio | Detects PERSON ("John Smith"), US_SSN, and numeric context | confidential |
Emit: confidential. Even though T2a said "internal" (correctly identifying the code task), T1's regex hit and T2b's SSN detection each independently trigger the OR-rule. Under a voting or confidence-weighted scheme, a strong non-privacy signal could outweigh a correct detection and leak the prompt to external models; the OR-rule guarantees that any confidential signal wins.
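The emit logic of this example follows the pseudocode above and can be made concrete in a few lines. Only the confidential short-circuit is specified by the document; the ordering of the non-confidential labels (internal stricter than public) is an assumption here:

```python
def privacy_emit(tier_outputs: list[str]) -> str:
    """Asymmetric OR-rule: any single 'confidential' verdict wins outright."""
    if "confidential" in tier_outputs:
        return "confidential"
    # Assumed tie-break for the remaining labels: internal over public
    return "internal" if "internal" in tier_outputs else "public"

# The worked SSN example: T1 and T2b say confidential, T2a says internal
emit = privacy_emit(["confidential", "internal", "confidential"])  # "confidential"
```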
Per-tenant privacy enforcement modes. The effect of the confidential emit on downstream routing is configurable per tenant via a two-valued enumeration, privacy_enforcement:
- strict (default, opt-out required). A confidential classification forces local-only model selection at Stage 1 of the downstream routing algorithm. Candidate models whose inference endpoint resides outside the deployment boundary are removed from the eligible set. No raw-prompt retention. Intended for healthcare, finance, defense, and other regulated verticals.
- disabled (opt-in). Classification still executes for cost-tier routing, complexity ceiling, and telemetry; however the confidential emit does not force local-only routing. All models remain eligible subject to other gating rules. Optional opt-in raw-prompt retention enabled. Intended for unregulated tenants whose data policy permits external model processing and who therefore benefit from faster improvement cycles (§9).

The configuration space is deliberately restricted to two values. A middle "relaxed" mode was considered and rejected on the grounds that its semantics would admit multiple interpretations, creating compliance ambiguity during audit. Vendor-allowlist restrictions ("route confidential only to approved external vendors") are treated as a separate configuration surface, not a privacy-enforcement mode.
Distinction between classifier location and routing enforcement. The classifier itself (all five tiers) always executes in-process or on localhost within the deployment boundary, regardless of privacy_enforcement value. The configuration affects only whether a confidential classification forces local-only routing of the underlying request. The two concepts — classifier location and routing enforcement — are architecturally independent.
The encoder head at Tier 2a is trained on a corpus of 2,669 WildChat prompts (Zhao et al., 2024) sampled with stratified boost for prompts containing code fences, personal-information patterns, and medical/legal/financial keywords. Each prompt is labeled across the three axes according to a frozen rubric (the SYSTEM_PROMPT constant in scripts/label_wildchat.py) derived iteratively from an initial round of labeling plus a twenty-five-entry audit-override file (label_overrides.jsonl) resolving labeler-drift incidents. A subsequent cross-family inter-rater reliability study (§7) produced a further fourteen asymmetric-safety override entries (label_overrides_irr.jsonl). The combined post-adjudication confidential count is 83 within the 2,249 rows joinable to the active prompt pool.
Purple = build · amber = validate · blue = cross-check · green = ship. Full artifacts in findings.md + tests/classification/irr/irr_report.md.
Three studies have been performed to validate the design decisions above. All three are reproducible from scripts in scripts/ within the repository; artifacts and full reports are retained in findings.md, tests/classification/irr/irr_report.md, and audit_all_missed.txt.
Methodology. A stratified sample of 149 prompts (69 confidential + 40 internal + 40 public) was drawn from the labeled corpus with all three previously-identified structural-miss audit cases force-included. Three raters from distinct vendor families labeled the sample independently, blind to one another's outputs: Claude (Anthropic), GPT (OpenAI, accessed via Microsoft Copilot Think Deeper), and Gemini (Google, Gemini 2.5 Pro). All raters operated on the same frozen rubric and were provided no rationale or prior labeling.
Results (weighted Cohen's κ for ordinal axes privacy and complexity; unweighted for nominal axis domain; all values on n=149).
| Axis | Best pair | Fleiss κ (3-rater, unweighted) | Interpretation |
|---|---|---|---|
| domain | 0.801 (Claude-Gemini) | 0.737 | substantial |
| privacy | 0.783 (Claude-Gemini, weighted) | 0.577 | substantial pairwise; moderate three-rater |
| complexity | 0.679 (Claude-GPT, weighted) | 0.517 | substantial pairwise; moderate three-rater |
All three axes cross the "substantial" threshold under the metric appropriate to the class structure (Landis and Koch, 1977). Quadratic weighting for ordinal axes correctly discounts adjacent-class disagreements and penalizes distant disagreements; unweighted κ is retained for transparency.
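For reference, quadratic-weighted Cohen's κ can be computed without external libraries. This is a generic sketch of the standard formula, not code from the Tidus repository; it assumes ordinal labels encoded as comparable values:

```python
def quadratic_weighted_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa with quadratic weights w_ij = ((i - j) / (k - 1))**2."""
    cats = sorted(set(a) | set(b))
    k, n = len(cats), len(a)
    idx = {c: i for i, c in enumerate(cats)}
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[idx[x]][idx[y]] += 1
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) / (k - 1)) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    return 1.0 - num / den
```

Perfect agreement yields κ = 1.0, and adjacent-class disagreements are discounted relative to distant ones, which is the behaviour the ordinal axes rely on.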
Audit-case unanimity. Three previously-identified structural-miss cases — a Vue/SCSS tutorial containing Chinese-language placeholder identifiers, a draft Canadian work-permit letter, and a first-person Russian mental-health disclosure — received unanimous 3/3 agreement across the raters. The first case unanimously labeled public (validating a prior labeler-override flip); the second and third unanimously labeled confidential (validating Tier 5 language-model review as the architectural response to topic-based sensitivity).
Asymmetric-safety adjudication. Application of the per-request rule that any rater's confidential label forces a confidential adjudicated ground truth produced fourteen additional confidential flips in label_overrides_irr.jsonl. Of these, twelve appeared in the ensemble's joinable pool; the remaining two fell outside the active pool due to orphan-identifier corner cases. This expands the post-adjudication confidential count from 71 (Claude-only; before de-duplication) to 83.
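The adjudication rule itself is compact enough to sketch directly. In the sketch below, the label names mirror the privacy axis, but the majority-vote fallback and its tie-break are illustrative assumptions — the study's exact fallback behavior is documented in the adjudication scripts:

```python
PRIVACY_ORDER = ["public", "internal", "confidential"]  # ordinal, least to most sensitive

def adjudicate(rater_labels):
    """Asymmetric-safety adjudication: any rater's 'confidential' vote wins.

    A missed confidential is a compliance incident; an over-flag only costs
    routing efficiency, so splits resolve toward the safer label.
    """
    if "confidential" in rater_labels:
        return "confidential"
    # Illustrative fallback: majority vote, ties broken toward the more
    # sensitive label (an assumption, not the documented study tie-break).
    counts = {lbl: rater_labels.count(lbl) for lbl in set(rater_labels)}
    best = max(counts.values())
    tied = [lbl for lbl, c in counts.items() if c == best]
    return max(tied, key=PRIVACY_ORDER.index)
```

The same function, applied per request instead of per training item, is the emit-side OR-rule: the symmetry between the two is deliberate.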
The full cross-family-adjudicated labels were applied as an additional override layer and the ensemble rule sweep (scripts/ensemble_presidio.py) was re-run. Results on n=2,249 rows, gt_conf=83:
| Rule | Recall | 95% CI | Flagged % |
|---|---|---|---|
| E1 — PERSON alone | 89.2% (74/83) | [80.7%, 94.2%] | 49.3% |
| E2 — PERSON + Encoder-non-public | 83.1% (69/83) | [73.7%, 89.7%] | 18.5% |
| E0 — POC regex + encoder only | 71.1% (59/83) | [60.6%, 79.7%] | 10.4% |
The 6.6-percentage-point recall drop between the Claude-only baseline (95.8%, observed on n=71 pre-adjudication) and the cross-family-adjudicated baseline (89.2%, n=83) constitutes the quantified single-labeler bias that the IRR study was designed to surface. The 89.2% figure is the defensible production value.
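The confidence intervals in the table are consistent with 95% Wilson score intervals on the flipped counts, which can be reproduced in a few lines:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_ci(74, 83)   # Rule E1: 74 of 83 adjudicated confidentials caught
# roughly (0.807, 0.942), matching the table's [80.7%, 94.2%]
```

The Wilson interval is preferable to the normal approximation here because the proportions sit close to 1 and n is small; the same function reproduces the E2 and E0 rows from 69/83 and 59/83.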
Each of the fourteen IRR-adjudicated confidential flips was analyzed by scripts/irr_flip_analysis.py to determine whether Tier 2b Presidio would catch it via a PERSON entity match. Of the twelve flips in the joinable pool, the result is a precise 50/50 split that empirically justifies the architectural separation between Tier 2b and Tier 5:

- Six flips: Presidio detects a PERSON or other high-trust entity.
- Six flips: Presidio detects nothing; the sensitivity is in the subject matter itself.

Why this matters architecturally: no single detector family catches both halves. Presidio alone would miss 50% of real confidentials; an LLM alone would be unaffordable at scale and would defeat the privacy guarantee (the classifier itself would leak the prompt). The cascade design splits the work: cheap entity detectors at Tier 2b handle the first half, a selectively-invoked LLM at Tier 5 handles the second. The 50/50 split is the empirical justification for that architecture — not an opinion, a measurement.
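A schematic of that split, with stand-in callables for the real tiers (entity_detector, llm_review, and the escalation predicate are illustrative interfaces, not the production implementations):

```python
def classify_privacy(prompt, entity_detector, llm_review, escalate):
    """Bifurcated cascade sketch: cheap entity detection first, selective LLM second.

    entity_detector: returns high-trust entity types (Presidio stand-in, Tier 2b).
    llm_review: local LLM topic reviewer (Tier 5 stand-in), invoked selectively.
    escalate: predicate deciding whether an entity-free prompt is risky enough
              to pay the ~2,000 ms Tier 5 cost (illustrative policy hook).
    """
    entities = entity_detector(prompt)
    if entities:                       # entity-bearing half of the 50/50 split
        return "confidential", "tier2b", entities
    if escalate(prompt):               # topic-bearing half, caught only by the LLM
        return llm_review(prompt), "tier5", []
    return "internal", "default", []
```

The structure makes the economics visible: the expensive reviewer only runs when the cheap detectors come back empty and the escalation policy fires.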
Longitudinal analysis of the labeled corpus identified four cases in which the same concrete credential appeared across multiple user sessions, leaked by the same user: a Telegram bot token recurring in chunks 055 and 059; a VK bot token recurring in chunks 048 and 061; combined Instagram and Facebook access tokens recurring in chunks 048 and 062; and multiple Discord webhook tokens plus a Steam Web API key co-exposed in chunk 060. This observation indicates that credential-leak behavior is a longitudinal property of the user/session, not a per-request property; the implication is that an audit-layer user-scoped leak cache would detect re-leaks missed by stateless per-request classification. This finding is orthogonal to the main classification workflow but is disclosed here because it motivates a complementary architectural element (audit-side user-scoped leak fingerprinting) that may be the subject of additional claims.
Architectural implication: per-request classifiers see each prompt in isolation and cannot recognize "this user has leaked this exact token before." A user-scoped fingerprint cache in the audit layer recovers the signal — a complementary control surface to the five-tier classifier.
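A minimal sketch of such a cache, assuming salted SHA-256 fingerprints and an illustrative two-pattern token library (the production regex set is larger):

```python
import hashlib
import re

# Illustrative token patterns only; not the production regex library.
TOKEN_PATTERNS = [
    re.compile(r"\b\d{8,10}:[A-Za-z0-9_-]{30,}\b"),   # Telegram-bot-style token
    re.compile(r"\b[A-Z0-9]{32}\b"),                   # Steam-Web-API-style key
]

class LeakCache:
    """User-scoped credential fingerprint cache (audit-layer sketch).

    Stores salted SHA-256 fingerprints of matched credential strings, never
    the credentials themselves, so the cache is safe to persist in the audit DB.
    """
    def __init__(self, salt=b"deployment-secret"):
        self.salt = salt
        self.seen = {}                 # user_id -> set of fingerprints

    def _fingerprint(self, token):
        return hashlib.sha256(self.salt + token.encode()).hexdigest()

    def observe(self, user_id, prompt):
        """Record this prompt's credentials; return fingerprints seen before."""
        fps = {
            self._fingerprint(m.group(0))
            for pat in TOKEN_PATTERNS for m in pat.finditer(prompt)
        }
        previous = self.seen.setdefault(user_id, set())
        releaks = fps & previous
        previous |= fps
        return releaks
```

Because only fingerprints persist, the cache recovers the longitudinal signal (chunks 055/059, 048/061, and so on) without retaining a single live credential.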
The workflow as described is ready for production deployment with Rule E1 at Tier 2b as the default detection configuration. Empirical shipping baseline: 89.2% recall (74/83 adjudicated confidentials) at a 49.3% flag rate. Classification runs entirely inside the deployment boundary regardless of the privacy_enforcement value; the cross-family IRR methodology does not require raw-prompt transmission beyond a controlled one-time validation sample.

The shipping baseline of 89.2% is designed to compound upward on real enterprise traffic via four overlapping mechanisms (referred to internally as "levers"), each operating at a different cadence and informed by different data.
Monthly human review of the disagreement queue emits a new label_overrides_production_YYYY_MM.jsonl file. The privacy-safe part: what flows into the review queue is feature metadata only (entity types, reduced embedding, regex pattern IDs, tier decisions) — never the raw prompt text. Even under full active learning, confidential prompts remain inside the deployment boundary.
Prompt: "draft a resignation letter citing mistreatment during performance review cycles"
| Tier | Decision |
|---|---|
| T2b Presidio | No entities detected → internal |
| T5 LLM | Employment-law complaint topic → confidential |
Disagreement logged. Emit follows OR-rule (confidential) but the feature record goes to review. Next month's human adjudication confirms T5's label. A topic-keyword pattern ("resignation letter", "mistreatment", "performance review") is added to the Tier-1 library (Lever 2) so future prompts of this shape are caught in 5 ms instead of needing a 2,000 ms LLM call. The system is now both more accurate and faster on this traffic class — this is how compounding happens.
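The compounding step in the example above can be sketched as follows; Tier1Patterns, promote, and the pattern identifier are hypothetical names for illustration, not the production API:

```python
import json
import re

class Tier1Patterns:
    """Fast topic-keyword library consulted before any slower tier (sketch)."""
    def __init__(self):
        self.patterns = []             # list of (pattern_id, compiled_regex)

    def match(self, prompt):
        return [pid for pid, rx in self.patterns if rx.search(prompt)]

    def promote(self, pattern_id, keywords):
        """Lever 2: turn adjudicated disagreement keywords into a ~5 ms check."""
        rx = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
        self.patterns.append((pattern_id, rx))

def log_disagreement(record_sink, request_id, tier_decisions, features):
    """Queue feature metadata only -- never raw prompt text -- for review."""
    record_sink.append(json.dumps({
        "request_id": request_id,
        "tier_decisions": tier_decisions,  # e.g. {"t2b": "internal", "t5": "confidential"}
        "features": features,              # entity types / pattern IDs only
    }))
```

After next month's adjudication confirms T5's label, promote("resignation-complaint-001", ["resignation letter", "mistreatment", "performance review"]) moves this traffic class from the LLM path to the millisecond path.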
Lever 1 (disagreement-capture feedback loop). Confidential traffic continues to route local-only throughout (privacy_enforcement=strict). Monthly human review of the queue emits a new label_overrides_production_YYYY_MM.jsonl file. Quarterly retraining of the Tier 2a encoder head on the expanded corpus produces approximately 2–4 percentage-point recall gains per quarter in the first year, diminishing thereafter. The review queue receives an expected 5–10% of traffic — precisely the fraction where the system is uncertain and where labels add the most information.

Lever 3 (encoder upgrades). The Tier 2a sentence encoder (all-MiniLM-L6-v2) is a freezable dependency; upstream releases of newer sentence-transformer models (e.g., BGE, GTE, successor MiniLM variants) can be drop-in swapped with a single k-fold retrain of the logistic-regression head. Expected gain: 1–3 percentage points per encoder upgrade, at six-month cadence.

Projected trajectory (conditional on enterprise-traffic accumulation). Ship-day recall is 89.2%. Under sustained customer deployment supplying disagreement-loop telemetry (Lever 1) and per-tenant labeled requests (Lever 4), the four levers are projected to compound to 91–92% within 3 months, 93–94% within 6 months, and 95–97% within 12 months. Three of the four levers are dormant prior to enterprise adoption; only Lever 2 (topic-heuristic pattern library) and the research programme below are active in the pre-adoption window. Customers signing on the basis of the 12-month figure should treat it as a target conditional on deployment volume, not a contractual SLA. Beyond approximately 97–98%, further gains require rubric refinement rather than model improvement — the Fleiss κ of 0.577 on the privacy axis indicates that expert human raters themselves disagree on approximately 37% of boundary cases, establishing an irreducible ceiling that no classifier can exceed without changing the rubric itself.
Parallel research programme (active in the pre-adoption window). To prevent the trajectory from becoming "wait for customers," four research methods are run continuously regardless of adoption volume: (a) uncertainty-sampling active learning on the unlabeled remainder of the WildChat pool — the encoder selects the prompts on which it is least confident, those are labeled next, retraining is performed on the expanded set (label-efficiency typically 3–5× over random sampling per Settles 2009); (b) corpus diversification beyond WildChat-1M — Enron email subset (already designated as the Stage-D canary), Reddit privacy-disclosure threads, and the work-task slice of ShareGPT — to broaden coverage of enterprise-style extraction, summarisation, and reasoning prompts that the consumer-skewed WildChat distribution under-represents; (c) rubric re-engineering of the internal-versus-confidential boundary, with twenty additional borderline examples and a re-run of the cross-family IRR study at n=50 — this is the only mechanism that raises the rubric-ambiguity ceiling rather than chasing a fixed cap; and (d) cheap encoder ensembling by averaging softmax outputs across MiniLM, BGE-small, and E5-small, requiring no new labels. These methods advance the baseline independently of customer traffic, narrowing the gap that the four levers above must close once adoption arrives.
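Method (a) reduces, at its core, to ranking the unlabeled pool by the encoder's top softmax probability and labeling the least-confident prompts first; a minimal sketch (predict_proba is a stand-in for the Tier 2a encoder head):

```python
def least_confident(pool, predict_proba, batch_size=20):
    """Uncertainty-sampling selection for active learning.

    Confidence = max softmax probability; the prompts the model is least
    sure about carry the most information per label.
    """
    scored = [(max(predict_proba(p)), p) for p in pool]
    scored.sort(key=lambda t: t[0])            # lowest confidence first
    return [p for _, p in scored[:batch_size]]
```

This is the simplest uncertainty measure in the active-learning literature; margin- or entropy-based variants drop in by changing one line.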
Privacy-safe telemetry. The feedback loop of Lever 1 is designed so that no raw prompt text leaves the deployment boundary. The logged per-request record contains only: request identifier, tenant identifier (required from first deployment to enable future per-tenant fine-tuning), a dimensionality-reduced embedding (64-dimension reduced from the native 384-dimension sentence-transformer output, preventing embedding-inversion attacks on sensitive content), the set of Presidio entity types (types only, never values), the set of regular-expression pattern identifiers that fired (identifiers only, never matched strings), the emitting tier, the final classification across all three axes, the routed model identifier, and the end-to-end latency. This schema permits retraining of the encoder and downstream classifiers without ever retaining the original prompt text. Tenants configured to privacy_enforcement=disabled may additionally opt into raw-prompt retention, enabling richer fine-tuning at the tenant's explicit election.
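The schema can be sketched as a frozen dataclass; the field names below paraphrase the prose and are not the exact production schema:

```python
from dataclasses import asdict, dataclass
from typing import List

@dataclass(frozen=True)
class TelemetryRecord:
    """Privacy-safe per-request record (sketch of the schema described above).

    Deliberately has no field for prompt text: the schema itself is the
    enforcement mechanism -- there is nowhere to put the raw prompt.
    """
    request_id: str
    tenant_id: str
    reduced_embedding: List[float]   # 64-dim, reduced from the native 384
    entity_types: List[str]          # Presidio types only, never values
    pattern_ids: List[str]           # regex IDs only, never matched strings
    emitting_tier: str
    privacy: str                     # final classification, all three axes
    complexity: str
    domain: str
    routed_model: str
    latency_ms: float
```

Freezing the dataclass and omitting any text field makes the "no raw prompt leaves the boundary" property structural rather than procedural.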
Deploying Tidus with the classification layer active requires three configuration steps, in order:
1. Installation. Follow docs/deployment.md. Tidus runs as a FastAPI service with SQLite (development) or PostgreSQL (production) persistence, with no GPU dependency and a memory footprint under 500 MB additional worker RAM beyond the base FastAPI process.
2. Privacy mode. In config/policies.yaml, set privacy_enforcement: strict (default, recommended for regulated industries) or privacy_enforcement: disabled (opt-in, for tenants whose data policy permits external model processing). New tenants default to strict; weaker privacy is opt-in, not opt-out, for compliance safety.
3. Detection rule. Set classification.presidio_rule: E1 (default; 89.2% recall, ~49% flag rate) or classification.presidio_rule: E2 (83.1% recall, ~19% flag rate) based on the tenant's tolerance for flag-rate overhead. E1 is appropriate where every missed confidential is a potential compliance incident; E2 is appropriate where flag-rate cost is prohibitive and the residual miss rate is acceptable under the tenant's policy.

Telemetry activation. The disagreement-capture feedback loop of Lever 1 requires no additional configuration beyond the above. Per-request telemetry records are written to the audit database, which also drives the cost-reporting and routing-decision history dashboards. Monthly review of the disagreement queue is a human-in-the-loop activity; the expected effort is on the order of a few hours per month for a representative enterprise traffic volume.
Quarterly retraining. Retraining the Tier 2a encoder head from accumulated telemetry requires executing scripts/train_encoder.py with the expanded label_overrides_production_*.jsonl files present. Retraining is a standalone activity of approximately 10–30 minutes CPU time at typical enterprise telemetry volumes; no downtime is required as the new encoder head is published as a new revision in the model registry subsystem and becomes active at the next selector refresh.
Per-tenant fine-tuning (Lever 4). Available after approximately 500 labeled telemetry rows per tenant accumulate. The infrastructure for per-tenant heads is specified but not required at initial deployment; enabling it at time of sufficient telemetry volume is a one-time activity of a few engineering sessions.
Excerpt from config/policies.yaml for a HIPAA-covered healthcare SaaS:
```yaml
tenants:
  acme-healthcare:
    privacy_enforcement: strict        # confidential → local-only routing
    classification:
      presidio_rule: E1                # 89.2% recall; flag cost acceptable
      topic_heuristics_enabled: true   # catches topic-based confidentials cheaply
    vendor_allowlist:                  # independent of privacy; applied at routing stage
      - local-llama-3-70b
      - local-mistral-large
      - azure-openai-east-us           # BAA-covered

  acme-internal-saas:
    privacy_enforcement: disabled      # unregulated; best-model routing
    classification:
      presidio_rule: E2                # lower flag rate; lower Tier-5 volume
      topic_heuristics_enabled: true
    raw_prompt_retention: opt-in       # faster per-tenant fine-tuning
```
Note: a single Tidus deployment can host both tenants. privacy_enforcement is evaluated per request based on the calling tenant's config; the classifier itself always runs in-process regardless.
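A sketch of that per-request evaluation (function and parameter names are illustrative, not the production routing interface):

```python
def resolve_routing(tenant_cfg, privacy_label, candidate_models, local_models):
    """Per-request enforcement sketch.

    The classifier always runs in-process; only the routing consequence
    depends on the calling tenant's configuration.
    """
    enforcement = tenant_cfg.get("privacy_enforcement", "strict")   # strict by default
    allowlist = tenant_cfg.get("vendor_allowlist")
    pool = [m for m in candidate_models if allowlist is None or m in allowlist]
    if enforcement == "strict" and privacy_label == "confidential":
        pool = [m for m in pool if m in local_models]    # local-only routing
    return pool
```

Under this shape, a strict tenant's confidential request can only ever reach models in the local set, while a disabled tenant's identical request sees the full candidate pool.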
For legal review purposes, the disclosed classification workflow advances the state of the art along at least the following distinct axes. Each claim is grouped below by the architectural layer it applies to. Each is supported by empirical evidence as cited. None are believed to be disclosed in combination, or individually, by any known prior-art system.
| Layer | # | Claim | Evidence in this document |
|---|---|---|---|
| Per-request runtime | 1 | Local-only five-tier classification cascade combining deterministic regex heuristics, trained sentence-embedding encoder with per-axis classification heads, Presidio-based named-entity recognizer, and a language-model fallback — all within the deployment boundary — for enterprise AI request routing. | §4 System Architecture |
| | 2 | Asymmetric-safety OR-rule for combining tier outputs: any tier's confidential classification unilaterally forces the emit value, deliberately rejecting majority-vote and confidence-weighted combiners on compliance-asymmetry grounds. | §4 merge rule |
| Training-data & methodology | 3 | Cross-family inter-rater reliability methodology for validating privacy-classification ground truth using independent LLMs from distinct vendor families, applied blind against a frozen rubric, with quadratic-weighted Cohen's κ for ordinal classes and Fleiss κ for multi-rater agreement. | §7.1 IRR study |
| | 4 | Asymmetric-safety adjudication rule for ground-truth construction: any rater's confidential label forces confidential in the adjudicated labels — symmetric to the per-request OR-rule, applied at training-data construction time. | §7.1 adjudication |
| | 5 | Entity/topic bifurcation analysis methodology for empirically justifying classifier-architecture choices by correlating post-adjudication ground-truth gains against per-tier detection capabilities (measured split: 6/6 entity-bearing caught by Tier 2b; 6/6 topic-bearing missed). | §7.3 bifurcation |
| Telemetry & feedback | 6 | Privacy-preserving telemetry schema for post-deployment feedback learning in regulated deployments, retaining only dimensionality-reduced embeddings, entity-type metadata, regex pattern identifiers, and classification outputs — never raw prompt text — permitting encoder retraining without prompt retention. | §9 Lever 1 |
| | 7 | Disagreement-capture active learning loop whereby only inter-tier-disagreement requests are flagged for human review, achieving label-efficiency on the order of a ten-fold reduction compared to random-sample review. | §9 Lever 1 |
| Configuration surface | 8 | Two-valued privacy-enforcement configuration (strict / disabled) with deliberate rejection of intermediate modes on compliance-ambiguity grounds, decoupling the routing-enforcement semantics from the architecturally independent question of classifier location. | §5 privacy_enforcement |
Each of the eight claims is severable — any subset may be pursued independently. Combinations across layers (e.g., claims 2 + 4, or claims 6 + 7) constitute additional dependent-claim surface.
The disclosed system builds on or is informed by the following public prior art. Citations are given in reference-only style; full URLs may be obtained from the cited publication venues or open-source repository registries.
- Sentence-transformer embedding models (all-MiniLM-L6-v2).

Document revision: 2026-04-20. Corresponds to Tidus version 1.3.0 (auto-classification layer, shipping preparation phase). Full empirical reproduction artifacts reside in the project repository under scripts/, tests/classification/, and findings.md. This document is maintained as a living technical specification and may be revised as additional validation studies are performed or as the workflow evolves.