Frequently Asked Questions
40 questions about developer privacy APIs, answered with data.
Hybrid Recognizer System
Our de-identification tool misses PHI in clinical notes; LLM studies show miss rates above 50%. What should we use instead?
Hybrid three-tier detection provides both high recall (ML-based NER for names and contextual PHI) and high precision (regex for structured identifiers). The 260+ entity types include medical-specific identifiers: MRN formats, NPI, DEA numbers, health plan IDs. Confidence thresholds can be set for maximum recall in high-risk PHI scenarios. Example: A hospital system is building a de-identified research dataset from 500,000 clinical notes. Their current tool (Presidio default) misses ~30% of PHI based on internal testing. This creates research IRB compliance issues and potential HIPAA violations. anonym.legal's hybrid approach with healthcare-specific entity types reduces the miss rate to under 5%.
Over-redaction in e-discovery is causing sanctions; our tool blacks out too much. What causes this, and how do we fix it?
Configurable confidence thresholds per entity type allow legal teams to calibrate precision vs. recall. The hybrid system's regex component provides reproducible, defensible detection for structured PII. The preview modal in the Chrome Extension shows what will be redacted before committing; the same principle applies across platforms. Example: A litigation support team at a large law firm handles 200,000-document e-discovery productions monthly. Their previous ML-only tool's 35% false positive rate exposed them to over-redaction sanctions. anonym.legal's configurable threshold system reduces false positives while maintaining privilege protection, and generates the entity-level audit log needed for privilege logs.
How do I ensure my automated redaction tool doesn't over-redact and hide evidence that opposing counsel needs?
Confidence scoring per entity (0-100%) provides the basis for audit trails. Per-entity operator configuration allows legal teams to apply different handling rules to different entity types (e.g., replace party names with pseudonyms but redact SSNs). Reversible encryption maintains the ability to restore original text when authorized review is needed. Example: A legal technology team at a large law firm is preparing document production in a commercial litigation matter. They need to redact client identifiers from 15,000 DOCX and PDF files while preserving all non-protected content. anonym.legal's hybrid detection with per-entity configuration and confidence scoring allows them to produce a defensible redaction log for the court.
Our PII detection tool redacts too many things that aren't PII, creating a huge manual review burden. How do we reduce false positives?
Three-tier hybrid: regex handles structured data with 100% reproducibility; spaCy NLP handles contextual name/org/location detection; XLM-RoBERTa handles cross-lingual ambiguity. Confidence thresholds are configurable per entity type; a legal team can set names to 90% confidence while keeping phone numbers at regex-certainty. Example: A large law firm's e-discovery team processes 50,000 documents per litigation matter. Their ML-only redaction tool produces a 35% false positive rate, requiring attorney review for each flagged item. At $400/hour and 10 false positives per document, the manual review cost exceeds the automation savings. anonym.legal's hybrid approach with configurable thresholds reduces the false positive rate to under 5%, making automation economically viable.
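Conceptually, per-entity thresholding is a simple filter over detection results. The sketch below is illustrative only; the `detections` payload shape, entity type names, and threshold values are assumptions, not anonym.legal's actual API.

```python
# Illustrative sketch of per-entity confidence thresholds.
# The data shapes and values are hypothetical, not anonym.legal's API.
THRESHOLDS = {
    "PERSON": 0.90,        # NLP-based: require high confidence
    "PHONE_NUMBER": 0.0,   # regex-based: a pattern match is certainty enough
    "US_SSN": 0.0,
}
DEFAULT_THRESHOLD = 0.75

def filter_detections(detections):
    """Keep only detections at or above their entity type's threshold."""
    kept = []
    for d in detections:
        threshold = THRESHOLDS.get(d["entity_type"], DEFAULT_THRESHOLD)
        if d["score"] >= threshold:
            kept.append(d)
    return kept

detections = [
    {"entity_type": "PERSON", "text": "Apple", "score": 0.62},       # dropped
    {"entity_type": "PERSON", "text": "Jane Smith", "score": 0.97},  # kept
    {"entity_type": "PHONE_NUMBER", "text": "+1-555-0100", "score": 1.0},
]
print([d["text"] for d in filter_detections(detections)])
```

Raising the PERSON threshold trades recall for precision; regex-backed types stay at certainty because the match itself is the evidence.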
How do I explain to auditors exactly why a specific piece of text was redacted or not redacted?
Confidence scoring per entity provides the audit trail foundation. The hybrid approach's use of regex for structured data makes those detections fully reproducible and explainable (exact pattern matched). NLP detections include entity type, model, and confidence, which is sufficient for compliance documentation. Example: A clinical research organization must demonstrate to an IRB (Institutional Review Board) that their de-identification process meets HIPAA Expert Determination standards. The audit requires documentation showing which identifiers were removed and by what method. anonym.legal's confidence scoring and entity-type classification provide the audit evidence required.
We need PII detection for KYC document processing; false positives slow down customer onboarding. How do we balance speed and accuracy?
Context-aware hybrid detection with configurable thresholds per entity type. Financial-specific entity types (bank accounts, SWIFT codes, BICs, IBAN formats) use regex for deterministic detection. Names use NLP with context words and confidence scoring. Threshold configuration allows financial teams to tune for their specific volume/accuracy trade-off. Example: A digital banking platform processes 5,000 KYC applications daily across 15 European countries. Their PII detection step creates a 2-day backlog due to false positive rates requiring manual review. anonym.legal's hybrid approach reduces manual review to under 3% of documents, eliminating the bottleneck while maintaining AML compliance.
Presidio is flagging everything as PII in our log files. How do I reduce false positives without missing real PII?
The hybrid three-tier architecture separates structured data (regex with 100% reproducibility) from contextual detection (NLP) from cross-lingual detection (transformers). Confidence thresholds are configurable per entity type. Context-aware enhancement boosts scores when context words appear near matches and suppresses false positives when context is absent. The result is dramatically lower false positive rates than Presidio defaults. Example: A data engineering team at a healthcare company running Presidio on clinical notes exported to JSON. The raw Presidio output flags hundreds of numeric sequences as SSNs and phone numbers that are actually medical record numbers, dosage amounts, and procedure codes. Manual review of false positives consumes 3+ hours per batch. anonym.legal's hybrid system with configurable thresholds and the MRN entity type reduces false positives by ~70% while maintaining PHI recall.
MCP Server Integration
How do I prevent developers from accidentally pasting API keys and source code into Claude or Cursor?
MCP Server intercepts all prompts sent to Claude Desktop and Cursor before they reach the AI model. API keys, connection strings, and credentials are detected (custom entity patterns support proprietary secret formats) and anonymized/redacted before transmission. The developer's workflow is unchanged; the protection is transparent. Example: A software development team at a fintech company uses Cursor IDE with Claude for code review and debugging. Their security team discovered three instances of database credentials in Claude conversation history over one quarter. Installing anonym.legal's MCP Server on developer workstations provides automatic credential scrubbing before every prompt, without requiring developers to change how they work.
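A minimal sketch of the kind of pre-transmission scrubbing described above. The patterns shown (AWS access key IDs, generic connection strings) are well-known public formats used here for illustration; they are not anonym.legal's built-in rule set.

```python
import re

# Illustrative secret patterns; a real deployment would tune these and add
# proprietary formats via custom entities.
SECRET_PATTERNS = {
    "AWS_ACCESS_KEY_ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "DB_CONNECTION_STRING": re.compile(r"\b\w+://[^\s:@]+:[^\s@]+@[^\s]+"),
}

def scrub_secrets(prompt: str) -> str:
    """Replace matched secrets with type-labeled placeholders."""
    for name, pattern in SECRET_PATTERNS.items():
        prompt = pattern.sub(f"[{name}]", prompt)
    return prompt

prompt = "Why does psql fail with postgres://admin:hunter2@db.internal/prod?"
print(scrub_secrets(prompt))
```

The scrub runs before the prompt leaves the workstation, so the model only ever sees the placeholder.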
Our lawyers are using Claude for contract review. How do we prevent client PII and deal terms from being sent to Anthropic?
MCP Server anonymizes client names, company names, deal terms, and financial figures before they reach Claude. The AI processes anonymized versions and produces output with placeholders. With reversible encryption enabled, anonym.legal automatically de-anonymizes the AI's output; the lawyer sees the original names restored in the AI response. Example: A mid-size law firm's M&A practice group uses Claude for first-pass contract review. Client names ("TechCorp acquiring MegaStartup for $450M") are replaced with tokens ("CompanyA acquiring CompanyB for $[AMOUNT]M") before Claude processes them. Claude's redlined contract comes back with the original names restored. Attorney-client privilege is preserved; AI productivity is maintained.
Samsung banned ChatGPT after employees leaked source code. How do we allow AI tools without banning them entirely?
MCP Server acts as a transparent proxy between AI tools and the AI model. Sensitive data (source code secrets, customer PII, financial figures) is anonymized before reaching the AI. Employees continue using Claude Desktop and Cursor normally. Security teams have the control they need without productivity sacrifice. Example: A semiconductor manufacturer's security team wants to allow AI coding assistants after an earlier Samsung-style ban hurt developer morale and productivity. They deploy anonym.legal's MCP Server on all developer workstations. Source code snippets are automatically scrubbed of credentials and proprietary algorithm identifiers before reaching Claude. AI productivity is enabled; IP protection is maintained.
A government contractor pasted FEMA flood relief applicant data into ChatGPT. What technical controls should have prevented this?
Chrome Extension intercepts clipboard content before it reaches ChatGPT's input field. MCP Server intercepts at the model layer for Claude/Cursor. Both provide real-time detection with a preview modal before submission; employees see what will be anonymized and can proceed with protected data or cancel. No training required; the tool catches what employees miss. Example: A federal agency grants its FOIA processing team access to ChatGPT for summarization tasks. Policy prohibits including claimant PII. The Chrome Extension intercepts any paste containing names, addresses, or SSNs and anonymizes them before they appear in the ChatGPT input field. Contractors can use AI for efficiency without accidental PII exposure.
83% of organizations lack controls to prevent sensitive data from entering AI tools. What does a practical solution look like?
Chrome Extension installs in minutes and immediately intercepts PII before it reaches ChatGPT, Claude.ai, and Gemini. No DLP configuration required. MCP Server for Claude Desktop and Cursor requires minimal setup. Both tools work without network-level changes, making them deployable on individual workstations or enterprise-wide via policy. Example: A 200-person professional services firm learns from industry news that 83% of organizations lack AI controls. Their CISO wants to implement controls within 30 days without a major IT project. anonym.legal Chrome Extension is deployed to all workstations via Chrome Enterprise policy in one afternoon. The MCP Server is installed for the development team. Full AI PII protection deployed in hours, not months.
How do I use Cursor/Claude for coding without accidentally sending API keys, database credentials, and proprietary algorithms to the AI?
The MCP Server on port 3100 acts as a transparent proxy. All text passed to Claude Desktop or Cursor through the MCP protocol is filtered for PII before reaching the AI model. Developers configure once; protection is automatic. All 5 anonymization methods are available; developers can use reversible encryption to pseudonymize code identifiers (e.g., customer IDs in database queries) and decrypt AI responses automatically. Example: A senior developer at a healthcare SaaS company uses Cursor to write database migration scripts. The scripts contain patient record IDs, database connection strings, and proprietary data models. The MCP Server intercepts the prompt, replaces sensitive identifiers with encrypted tokens (using reversible encryption), and sends the clean prompt to Claude. The AI response arrives with tokens; the MCP Server auto-decrypts to restore original context. Developer productivity is preserved; PHI never reaches Anthropic's servers.
How do I let developers use AI tools while preventing PII from leaving our corporate network?
The MCP Server provides exactly this technical control layer. It sits between the user's AI tool and the AI model API. All prompts pass through the anonymization engine; sensitive data is replaced/encrypted before transmission. Security teams get audit trails. Developers get AI productivity. The reversible encryption option means responses from the AI can reference the pseudonymized data and be automatically decrypted for the developer's view. Example: The CISO at a German automotive manufacturer needs to enable AI coding assistance for 500 developers while complying with GDPR and protecting trade secrets (proprietary manufacturing algorithms in the codebase). The MCP Server deployment filters all prompts through anonym.legal's engine before they reach Claude/Cursor APIs. Security team approves; developers keep AI access; IP stays protected.
Reversible Encryption (UNIQUE Tokens)
We anonymized documents for sharing, but now legal needs the originals for discovery. How do we get them back?
AES-256-GCM reversible encryption preserves the mathematical relationship between the anonymized token and the original value. With the client-held encryption key, any anonymized document can be fully restored to its original content. Without the key, the anonymized version is computationally indistinguishable from a permanently redacted document. Legal teams share encrypted versions; produce originals when required using the retained key. Example: A pharmaceutical company shares clinical trial data with external statisticians using anonym.legal's encrypted anonymization. Two years later, the FDA requests original patient records as part of a drug safety review. The company restores the original data using their retained encryption key: no spoliation, no missing records, full regulatory compliance. The statisticians' encrypted copies remain protected throughout.
We de-identified patient data for research, but now need to contact specific patients based on research findings. How?
Reversible encryption creates a protected pseudonymization layer. The research dataset uses encrypted tokens. The decryption key is held by the designated data custodian. When re-contact is clinically justified and IRB-approved, the custodian decrypts the specific participant records to enable follow-up. The broader dataset remains protected; only the specific authorized decryption is performed. Example: A European oncology research center conducts a 5,000-patient study using anonym.legal's encrypted anonymization. Mid-study analysis reveals a subgroup of 47 participants showing markers for an aggressive cancer variant. The ethics committee approves re-contact. The data custodian uses the retained encryption key to identify the 47 real patients. Those patients are contacted, and 23 are found to have actionable findings. The remaining 4,953 participants' data remains fully protected.
We anonymized documents to share with outside counsel, but now we need to produce the originals in discovery. How do we recover the original data?
Reversible encryption using AES-256-GCM generates deterministic encrypted tokens from original PII. The key is held only by the user. "John Smith" becomes "[ENC:x9f3a...]" consistently throughout the document, maintaining referential integrity. When authorized de-anonymization is needed (discovery production, audit verification, research follow-up), the user applies their key and all tokens restore to originals. The Chrome Extension auto-decrypts AI responses, so working with encrypted data is transparent in the AI workflow. Example: A compliance officer at a pharmaceutical company shares clinical trial data with a contract research organization (CRO). All patient identifiers are encrypted with a company-held key. The CRO analyzes anonymized data. When the FDA requests original patient records for audit, the compliance officer applies the key and produces the originals in minutes, with a cryptographic audit trail proving chain of custody.
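For illustration only: the product uses AES-256-GCM, but the two properties that matter for discovery workflows (the same value always yields the same token, and only the key holder can recover originals) can be sketched with a stdlib keyed HMAC plus a mapping held alongside the key. All names and key material below are hypothetical.

```python
import hashlib
import hmac

# Conceptual stand-in for deterministic reversible tokenization.
# NOT the product's AES-256-GCM implementation; this only demonstrates
# determinism and key-holder-controlled restoration.
KEY = b"client-held-secret-key"  # hypothetical key material
_reverse_map = {}                # retained by the key holder

def tokenize(value: str) -> str:
    """Same input + same key always produces the same token."""
    digest = hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:10]
    token = f"[ENC:{digest}]"
    _reverse_map[token] = value
    return token

def restore(text: str) -> str:
    """Key holder swaps every token back to its original value."""
    for token, original in _reverse_map.items():
        text = text.replace(token, original)
    return text

doc = f"{tokenize('John Smith')} signed; {tokenize('John Smith')} approved."
assert tokenize("John Smith") == tokenize("John Smith")  # deterministic
print(restore(doc))
```

Because the token is deterministic, both mentions of "John Smith" collapse to one token, which is what preserves referential integrity across a document set.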
Our external auditors need to verify the original data behind our redacted financial reports โ how do we handle this?
Reversible encryption allows selective de-anonymization. The finance team shares encrypted anonymized reports. Auditors working under formal engagement can be given decryption capability for their audit period. After audit completion, the key can be rotated; previous encrypted copies remain protected, and auditors cannot retroactively access records outside their engagement. Example: A private equity firm shares portfolio company financial data with an external audit firm for annual review. Client company names and deal terms are encrypted before sharing. During audit, the engagement partner receives temporary decryption access for the audit period. After the audit opinion is issued, key rotation removes that access. Former employees of the audit firm cannot access the data after their tenure.
Anonymous employee surveys revealed a serious harassment allegation; we need to follow up but can't identify who filed it. What should we do?
Reversible encryption allows HR to run "conditionally anonymous" surveys. Responses are encrypted before storage. The decryption key is held by a designated HR executive (or third-party ombudsman). When a response contains a serious allegation meeting predefined criteria (e.g., physical harassment, legal violations), the authorized party can decrypt that specific response to identify the reporter and initiate formal investigation. Example: A 2,000-employee manufacturing company's annual culture survey captures an allegation of serious misconduct by a senior executive. The response is encrypted. The company's third-party ombudsman reviews the allegation and determines it meets the threshold for de-anonymization under the company's published survey policy. The ombudsman decrypts the specific response, contacts the reporter through a formal protected channel, and initiates an independent investigation. All other responses remain permanently anonymized.
We use AI to process customer queries but need to restore original names for the final response. How does token mapping work across AI interactions?
Session-based token mapping maintains consistent anonymization within a conversation. The same customer name always maps to the same token within a session. Auto-decrypt in Chrome Extension responses restores real names in AI outputs before display. Persistent token mapping is also available for longer-lived workflows. Example: A German insurance company's AI-powered claims processing system processes customer complaint emails. Customer names, policy numbers, and claim amounts are anonymized before Claude processes the emails. Claude drafts a response using the anonymized tokens. anonym.legal's auto-decrypt restores original customer information in Claude's draft before it is displayed to the claims handler. The handler sends the final response with real customer names. GDPR compliance is maintained throughout.
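A toy model of session-scoped token mapping. The dict-backed session, the placeholder format, and the stubbed entity list are all invented for illustration; the product's actual mechanism is the engine described above.

```python
# Hypothetical session-scoped mapping: the same customer name gets the same
# token for the whole conversation, and tokens in the AI's draft are swapped
# back before display.
class Session:
    def __init__(self):
        self.forward = {}   # original value -> token
        self.backward = {}  # token -> original value

    def anonymize(self, text, entities):
        """Replace each detected entity with a stable session token."""
        for value in entities:
            token = self.forward.setdefault(
                value, f"[CUSTOMER_{len(self.forward) + 1}]"
            )
            self.backward[token] = value
            text = text.replace(value, token)
        return text

    def deanonymize(self, text):
        """Restore original values in the AI's output before display."""
        for token, value in self.backward.items():
            text = text.replace(token, value)
        return text

s = Session()
clean = s.anonymize(
    "Anna Weber filed claim 4711. Anna Weber wants a refund.", ["Anna Weber"]
)
ai_draft = "Dear [CUSTOMER_1], we reviewed claim 4711."
print(clean)
print(s.deanonymize(ai_draft))
```

The session guarantees that the AI sees one consistent pseudonym per customer, so its draft can be mapped back unambiguously.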
We de-identified patient data for a research study. Now we need to re-contact participants for a follow-up. How do we identify them?
Reversible encryption generates consistent tokens (deterministic AES-256-GCM): "Patient_001" maps to the same encrypted token throughout all study records. The research team holds the key. Re-identification for follow-up requires the key holder to decrypt. All decrypt events are logged. This satisfies both the IRB requirement for controlled re-identification capability and the HIPAA Safe Harbor requirement for de-identified data sharing.
Custom Entity Creation
Our healthcare system uses proprietary patient identifiers (MRN format: HOSP-YYYY-XXXXXX). HIPAA requires de-identification, but no tool detects our format. We'd need to write custom code. Is there a simpler way?
Custom entity creation with AI-assisted regex generation is purpose-built for this use case. A compliance officer describes the MRN format ("Hospital identifier starting with HOSP, dash, 4-digit year, dash, 6-digit number") and receives a working regex pattern. Custom entity is saved, applied to all document processing, and shared with the team via presets. Zero engineering required. HIPAA Safe Harbor compliance for organization-specific identifiers is achievable in under an hour. Example: A regional hospital network (15 facilities) is preparing to share de-identified patient data with a university research partner. Their MRN format (HOSP-YYYY-XXXXXX) appears in thousands of discharge summary PDFs. Their compliance team uses anonym.legal to define the custom MRN pattern, validate it against a sample document set, and process the full research dataset in batch. The university receives HIPAA-compliant de-identified data. Compliance timeline: 3 days vs. 3 months for custom code development.
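For reference, the MRN format described above is straightforward to express as a regex. The exact pattern the AI assistant would generate may differ; this is a plausible hand-written equivalent (the year is assumed to fall in 19xx or 20xx):

```python
import re

# Plausible pattern for the HOSP-YYYY-XXXXXX MRN format described above.
MRN_PATTERN = re.compile(r"\bHOSP-(19|20)\d{2}-\d{6}\b")

note = "Discharge summary for HOSP-2023-481930; transfer from HOSP-2019-007215."
print(MRN_PATTERN.sub("[MRN]", note))
```

Word boundaries (`\b`) keep the pattern from firing inside longer identifiers that merely contain a similar substring.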
Our employee ID format is 'EMP-XXXXX'; none of the standard PII tools detect it. How do we anonymize internal identifiers that aren't standard PII types?
Custom entity creation with AI-assisted pattern generation. Users describe their identifier format in plain language ("Employee IDs that start with EMP followed by 5 digits") and the AI generates the appropriate regex pattern. Custom entities integrate seamlessly with the existing 260+ type detection. Results can be saved as presets and shared across teams. Zero engineering required; compliance and legal teams can define their own patterns. Example: A financial services firm has customer account numbers in the format "ACC-XXXXXXXX-XX" that appear throughout support ticket exports. Standard PII tools miss them entirely. Using anonym.legal's custom entity builder, their compliance team creates a pattern in 10 minutes. All 180,000 historical support tickets processed in batch now have account numbers redacted alongside standard PII. Re-identification risk eliminated without an engineering ticket.
We work with German tax identification numbers (Steueridentifikationsnummer): 11 digits starting with a non-zero digit. Standard tools don't detect them. Is there a way to add this?
The 260+ entity library includes major European national identifiers. For formats not yet covered, the custom entity builder allows compliance teams to add them using the AI pattern assistant or manually entering the regex. Once added, they're available in all processing modes and can be shared via presets to the entire team. The German Steueridentifikationsnummer, for example, can be added in under 5 minutes. Example: A German payroll outsourcing firm processes documents for 500 client companies. Their anonymization workflow missed Steueridentifikationsnummern in payslip PDFs because their previous tool (standard Presidio) had no German tax ID recognizer. After a DPA audit finding, they need to add this detection immediately. anonym.legal's custom entity creation lets their compliance officer add the pattern without waiting for an engineering sprint; critical gap closed in one afternoon.
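A plausible detection pattern for the format described above (exactly 11 digits, non-zero first digit). Production-grade validation would also verify the trailing check digit, which is omitted here for brevity:

```python
import re

# Format-only pattern for the German Steueridentifikationsnummer:
# 11 digits, first digit non-zero. Checksum validation is omitted.
STEUER_ID = re.compile(r"\b[1-9]\d{10}\b")

line = "Steuer-ID: 86095742719, Betrag: 1200 EUR"
print(bool(STEUER_ID.search(line)))           # the 11-digit ID is found
print(bool(STEUER_ID.search("01234567890")))  # leading zero: rejected
```

Note how the shorter amount "1200" is not flagged: requiring exactly 11 digits between word boundaries is what keeps ordinary numbers out.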
I'm trying to build a GDPR-compliant customer support AI. The problem is customer messages contain our order IDs (ORD-XXXXXXX) alongside standard PII. I need to strip both before sending to the AI. How do I handle custom identifiers?
Custom entity creation for order IDs and account numbers in specific formats, combined with the default 260+ entity type detection, provides complete anonymization in a single pass. The Chrome Extension or MCP Server can apply custom entity detection in real-time as support agents type โ preventing PII and custom identifiers from ever reaching external AI systems. Configuration is shareable across the support team via presets. Example: A SaaS company's customer support team uses Claude via their internal AI platform to draft support responses. Customer messages copied into the AI interface contained customer names, email addresses, and order IDs (ORD-XXXXXXX format). After a GDPR review, the DPO required anonymization before AI processing. anonym.legal's Chrome Extension with custom order ID entity detects and replaces all identifiers in real-time. Support team workflow unchanged, GDPR compliance achieved.
We're building a legal discovery tool and need to detect case reference numbers, attorney bar numbers, and court docket IDs, none of which are standard PII. How do we add legal-specific identifiers?
Custom entity creation supports legal identifier formats. Attorneys and compliance officers can define bar number formats (State + 6 digits), docket number formats (XX-CV-XXXXXX for federal civil), and matter number formats using the AI-assisted pattern builder. These custom entities integrate with standard PII detection, enabling comprehensive document review. The resulting preset can be shared across the legal team or sold as a product feature by legal tech vendors integrating via API. Example: A legal AI startup builds a document analysis tool for law firms. Their enterprise clients require redaction of client matter numbers alongside standard PII before documents are processed by their AI. Using anonym.legal's custom entity API, they add matter number detection to their pipeline in 2 days (vs. 3 months building a custom NLP model). Their enterprise contracts close without the compliance blocker.
Every hospital in our network has a different Medical Record Number format. How do I create custom detection rules without being a regex expert?
The AI-assisted pattern helper accepts plain-language examples ("These look like MRN numbers: MRN:1234567, MRN:9876543") and generates the appropriate regex pattern. The visual regex builder allows refinement. The test interface validates against sample text. Patterns are saved as named custom entities and can be shared across the team with Basic+ plans.
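As a toy illustration of example-driven pattern generation, digit runs in the samples can be generalized mechanically. The real AI-assisted helper is far more capable; this sketch handles only digit generalization and does not escape regex metacharacters in the literal text.

```python
import re

def infer_pattern(examples):
    """Generalize digit runs in sample identifiers into \\d{n} quantifiers."""
    generalized = {
        re.sub(r"\d+", lambda m: rf"\d{{{len(m.group())}}}", ex)
        for ex in examples
    }
    if len(generalized) != 1:
        raise ValueError("examples disagree; refine the pattern by hand")
    return re.compile(rf"\b{generalized.pop()}\b")

mrn = infer_pattern(["MRN:1234567", "MRN:9876543"])
print(mrn.pattern)
```

If the examples generalize to different shapes, the helper refuses rather than guessing, which mirrors the validate-against-samples step described above.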
Presidio Foundation
I set up Presidio but it's generating massive false positives; it's flagging almost every capitalized word as a person name. The precision is terrible. Is there a way to fix this?
The hybrid recognizer stack (Regex + NLP + XLM-RoBERTa transformers) dramatically improves precision by using context from surrounding text. Transformer-based models understand that "Apple announced its earnings" refers to a company, while "Apple Smith joined the team" refers to a person. The result is materially higher precision than bare Presidio, preserving document utility while maintaining privacy protection. Users who experienced Presidio's false positive problem find anonym.legal's accuracy meaningfully better. Example: A data analytics firm processing customer feedback surveys abandoned Presidio after 40% of survey responses had product names, city names, and brand mentions incorrectly redacted alongside actual PII. Downstream analysis was corrupted by over-anonymization. After switching to anonym.legal's hybrid recognizer, precision improved to above 85%: product names preserved, person names correctly identified. Analysis quality restored.
Presidio's setup took 3 days and still crashes randomly. I'm spending more time maintaining infrastructure than doing actual data work. Is there a managed alternative?
anonym.legal is the managed version of the Presidio engine with significant extensions. Zero setup, zero infrastructure, zero maintenance. Users get Presidio's NLP accuracy (plus XLM-RoBERTa improvements) through a web interface, desktop app, or API, without touching Docker, Python, or spaCy model downloads. The Desktop app provides offline capability for air-gapped environments without the complexity of self-hosted Presidio. Example: A compliance team at an insurance company spent 3 days trying to get Presidio running in their environment. After a Docker networking issue caused the 4th crash, the project was escalated. anonym.legal was evaluated as an alternative: sign-up to first anonymization run in 12 minutes. The insurance company adopted anonym.legal Professional at €180/year. Estimated engineering time saved vs. managing self-hosted Presidio: 60 hours initial setup + 72 hours/year maintenance = ~132 hours of engineering time at €100/hour = €13,200 saved vs. €180 cost.
Presidio only detects about 40 entity types out of the box. We need European tax IDs, IBAN numbers, German registration numbers, and more. Does anyone have comprehensive recognizer libraries?
260+ entity types built on the Presidio foundation include comprehensive European identifier coverage: IBAN numbers, European driving license formats, EU member state tax identifiers, national health numbers, social insurance numbers, and VAT numbers for major EU economies. This coverage is maintained, tested, and updated as regulations and formats change, without requiring open-source contribution effort from users. Example: A German fintech handling EU customer financial data needs to detect IBANs, BICs, German tax IDs, and German commercial registration numbers (Handelsregisternummer) in customer documents. Presidio detects 0 of these 4 entity types out of the box. Writing and maintaining custom recognizers for all 4 requires 20-40 engineering hours plus ongoing testing. anonym.legal includes all 4 plus 256 additional entity types at €180/year.
Presidio's documentation is really sparse for production deployment; I can't find guidance on how to scale it, monitor it, or handle failures. Anyone have production deployment experience?
The managed SaaS model eliminates all production deployment concerns: scaling, monitoring, failure handling, and audit logging are handled by anonym.legal's infrastructure. Users get SLA-backed availability, automatic scaling, and comprehensive audit trails without building any of this infrastructure themselves. The Desktop app provides offline processing for air-gapped environments without requiring production server management. Example: A healthcare SaaS company's engineering team spent 6 weeks attempting to build a production-grade Presidio deployment for their PHI anonymization pipeline. After repeated failures with model loading timeouts and inconsistent API behavior under load, the team evaluated managed alternatives. anonym.legal's API endpoint replaced the self-hosted deployment in 3 days. Engineering time reclaimed: 6 weeks × 2 engineers = 12 engineering weeks ($48,000+ at US rates). Annual anonym.legal Business plan: €348.
We want Presidio's capabilities but spending weeks on setup and Python dependency management is not viable. Is there a managed option?
anonym.legal provides Presidio's detection capabilities (extended to 267 entities and 48 languages) as a fully managed service with no infrastructure management required. The web, desktop, Office, Chrome, and MCP interfaces make the underlying Presidio engine accessible to non-technical users. Continuous updates maintain accuracy without requiring teams to manage model versions. The free tier allows evaluation without commitment.
We built our anonymization pipeline on Presidio and now we're getting inconsistent results across different environments. Our staging results differ from production. How do we ensure reproducibility?
As a managed SaaS and Desktop product, anonym.legal maintains consistent model versions across all user environments. There's no staging vs. production discrepancy: all users run the same engine version at the same time. Desktop app users get the same engine as web users. Updates are managed centrally and versioned explicitly. Compliance auditors see consistent, reproducible behavior documentation rather than environment-specific variability. Example: A financial services firm's data engineering team discovered their Presidio staging environment (spaCy 3.4.4) was producing different NER results than production (spaCy 3.5.1). An audit found 3% of documents were differently anonymized in production vs. their test results. Migrating to anonym.legal eliminated environment-specific variation; the same managed engine runs everywhere. Audit finding closed.
Real-Time Detection
By the time we realize PII was sent to our AI vendor, it's too late; the data is already in their training pipeline. We need prevention, not just detection after the fact.
The Chrome Extension provides real-time PII detection with inline highlighting directly in the ChatGPT, Claude, and Gemini input fields. Detection happens client-side before data is submitted. Highlighted PII can be anonymized with one click before submission. The user sees which entities were detected and their confidence scores, enabling informed decisions about what to share. Prevention at the point of entry, not detection after the fact. Example: A law firm's associates use Claude to draft contract summaries. The Chrome Extension highlights client names, case numbers, and financial figures in the Claude input field before submission. Associates can anonymize with one click before sending. In 6 months of deployment, zero client PII incidents vs. 3 incidents in the previous 6 months (before extension deployment). The managing partner credits the real-time prevention model for the improvement.
We audit AI tool usage for compliance: how do we know which employees are sending PII to AI systems? We need real-time monitoring, not just after-the-fact logs.
The Chrome Extension provides per-user, per-session detection metrics that feed into organizational visibility dashboards. IT administrators can see anonymization activity across deployed users: total PII entities detected, entity types, AI platforms used, and anonymization rate (how often detected PII was anonymized before submission vs. ignored). This provides the monitoring data compliance teams need to demonstrate appropriate measures under GDPR Article 32. Example: A financial services firm's CISO needs to demonstrate to auditors that AI tool PII exposure is monitored and controlled. anonym.legal Chrome Extension deployed to 500 employees generates organizational dashboards showing: 12,000 PII detections per week, 94% anonymization rate, top entity types (customer names, account numbers, transaction IDs), and the 6% of detections submitted without anonymization (flagged for follow-up training). Auditors receive quantitative evidence of active monitoring and control.
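The dashboard figures above (total detections, anonymization rate, top entity types) are straightforward aggregations over per-user detection events. The sketch below shows one way to compute them; the event field names (`entity_type`, `anonymized`) are assumptions for illustration, not the product's actual schema.

```python
from collections import Counter

def summarize(events: list[dict]) -> dict:
    """Aggregate per-user detection events into org-level dashboard figures."""
    total = len(events)
    anonymized = sum(1 for e in events if e["anonymized"])
    return {
        "total_detections": total,
        # Share of detected PII that was anonymized before submission.
        "anonymization_rate": anonymized / total if total else 0.0,
        "top_entity_types": Counter(e["entity_type"] for e in events).most_common(3),
    }
```

The detections that were submitted without anonymization (the complement of the rate) are exactly the events a compliance team would flag for follow-up training.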
Is it worth implementing real-time PII detection if our existing monitoring catches violations after the fact?
Confidence scoring per entity (0-100%) allows configurable thresholds. Entity highlighting in the source text provides visual feedback before any action is taken. The Chrome Extension's pre-submission interception is architecturally prevention-first: the prompt never reaches the AI model unless the user explicitly proceeds. Real-time detection in the web/desktop UI provides instant feedback as text is entered.
How do we prevent PHI from appearing in AI-generated clinical notes before they're saved to the EHR?
Real-time detection with confidence scoring operates on any text input. The 260+ entity types include all 18 HIPAA PHI identifiers. Detection can be integrated at the clinical documentation review stage before EHR commit. The preview modal shows detected entities, allowing clinical staff to review before proceeding.
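For the structured HIPAA identifiers (SSNs, NPIs, MRNs), the hybrid system's high-precision tier is regex-based. The patterns below are deliberately simplified illustrations: real MRN formats vary by hospital system, and a production NPI check also validates the Luhn check digit. These are not anonym.legal's actual rules.

```python
import re

# Illustrative patterns only; production rules are stricter and institution-specific.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "NPI": re.compile(r"\b\d{10}\b"),  # NPIs are 10 digits; Luhn check omitted here
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),  # assumed format
}

def scan_structured(text: str) -> list[tuple[str, str]]:
    """High-precision regex pass over structured identifiers."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits += [(label, m.group()) for m in pattern.finditer(text)]
    return hits
```

A pass like this runs alongside the ML-based NER tier, which handles the contextual identifiers (names, dates, locations) that no regex can capture.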
Our compliance team wants to see confidence scores for each detected PII entity โ we need to know how certain the system is before auto-redacting. Where can we find tools with confidence scoring?
Every detected entity displays a confidence score with visual indicators (high/medium/low). Users can set confidence thresholds: entities above 85% confidence are auto-anonymized; entities between 50-85% are flagged for human review; entities below 50% are surfaced as suggestions. This creates an auditable, defensible anonymization workflow that satisfies compliance documentation requirements and reduces both false positives (over-redaction) and false negatives (missed PII). Example: A legal discovery firm processes client documents where over-redaction is as problematic as under-redaction: redacting attorney names or court references corrupts the legal record. Using anonym.legal's confidence threshold settings (auto-redact above 90%, review 60-90%, ignore below 60%), they create an auditable workflow where attorneys review only medium-confidence detections. Review time drops by 65% vs. manual review of all detections, while the audit trail documents exactly which entities were auto-redacted and which were human-reviewed.
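The three-tier workflow above amounts to routing each entity by its confidence score, with the two cutoffs configurable per deployment (the legal discovery firm in the example raised them to 90%/60%). A minimal sketch, with illustrative names:

```python
def route(confidence: float, auto: float = 0.85, review: float = 0.50) -> str:
    """Map an entity's confidence score to a handling tier.

    Thresholds are configurable per entity type and per deployment;
    the defaults here mirror the 85%/50% example in the text.
    """
    if confidence >= auto:
        return "auto-anonymize"
    if confidence >= review:
        return "human-review"
    return "suggestion"
```

Logging each (entity, confidence, tier) triple is what produces the defensible audit trail: reviewers touch only the middle tier, and the record shows why.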
We want to catch PII before it enters our database โ is there a way to do real-time validation on form inputs before they're stored?
Real-time detection capabilities (via Chrome Extension inline detection or MCP Server API integration) can be integrated into web applications to validate form inputs before submission. The Chrome Extension works on any web form in the browser. For custom application integration, the MCP Server API provides real-time PII detection that can be called on form submit events. Both provide confidence scores for entity-level decision making. Example: A healthcare patient portal allows patients to submit free-text symptom descriptions. The form regularly receives entries containing other patients' names (caregiver descriptions) and social security numbers (insurance references). After integrating anonym.legal's real-time detection via the API, the portal now warns patients before submission if their input contains PII in unexpected fields. GDPR data minimization compliance improved; database PII contamination reduced by 80%.
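A submit-time hook like the portal's can be sketched as a function that scans each form field through a pluggable detector and returns per-field warnings. The detector signature here (text in, list of (entity type, confidence) pairs out) is an assumption for illustration, not the MCP Server API's actual contract.

```python
from typing import Callable

Detector = Callable[[str], list[tuple[str, float]]]

def validate_form(fields: dict[str, str],
                  detector: Detector,
                  threshold: float = 0.5) -> dict[str, list[str]]:
    """Return, per form field, the PII entity types that should trigger a warning."""
    warnings: dict[str, list[str]] = {}
    for name, value in fields.items():
        flagged = [etype for etype, conf in detector(value) if conf >= threshold]
        if flagged:
            warnings[name] = flagged
    return warnings
```

An empty result means the form can be stored as-is; a non-empty one is shown to the user before anything reaches the database, which is where the data-minimization benefit comes from.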
I paste customer emails into our AI summarization tool constantly. I keep forgetting to remove PII first. Is there a way to have it automatically highlight PII before I accidentally send it?
The Chrome Extension activates automatically on paste events in supported AI interfaces (ChatGPT, Claude, Gemini). When a user pastes text containing PII, entities are highlighted immediately without any user action. A one-click anonymization button replaces highlighted entities. The user's workflow: paste, notice highlights, click anonymize, submit. The "remember to check" step is eliminated: the visual highlight is the reminder. Example: A customer success team of 30 agents at a B2B SaaS company uses Claude to summarize customer call notes. Before the Chrome Extension deployment, the team lead estimated 15-20 PII incidents per month (customer names and company details in Claude prompts). After 90-day deployment of anonym.legal Chrome Extension, reported incidents dropped to 1-2 per month. The team lead attributes the improvement to "the highlights make it impossible to ignore."
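The one-click replacement step can be sketched as follows: each detected span is swapped for a typed placeholder, numbered consistently per entity type so the anonymized text stays readable. The span format (start offset, end offset, type) is an assumption for illustration, not the extension's internal representation.

```python
def anonymize(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace detected character spans with typed placeholders like [PERSON_1]."""
    counters: dict[str, int] = {}
    out, cursor = [], 0
    for start, end, etype in sorted(spans):
        counters[etype] = counters.get(etype, 0) + 1
        out.append(text[cursor:start])               # keep text between entities
        out.append(f"[{etype}_{counters[etype]}]")   # typed, numbered placeholder
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

Typed placeholders (rather than black boxes) are what keep a summarization prompt usable: the AI model still sees that a person and an email address were mentioned, just not which ones.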