
Hybrid PII Detection System

Insurance Sector

Hybrid NLP Transformers Regex Compliance Python

Architected a high-precision PII detection engine for LATAM and Malaysian markets that combines regex rules with an ensemble of transformer models for semantic understanding. The system uses Tesseract OCR to extract text from scanned documents and detects the country-specific personal identifiers of each market, ensuring 100% compliance with regional regulations including Chile's Law 19.628, Brazil's LGPD, Colombia's Law 1581, Uruguay's Law 18.331, and Malaysia's PDPA 2010.

Technical Architecture

flowchart TD
    A[Data Ingestion<br/>Documents & Communications] --> B{Image/Document?}
    B -->|Yes| C[Tesseract OCR<br/>Text Extraction]
    B -->|No| D[Preprocessing]
    C --> D
    D --> E[Hybrid Detection Engine]
    E --> F[Regex Rules<br/>Fast Pattern Matching]
    E --> G[Transformer Ensemble<br/>Semantic Understanding]
    F --> H[Fusion Algorithm<br/>Confidence Scoring]
    G --> H
    H --> I{PII Detected?}
    I -->|Yes| J[Alert Users & Managers]
    I -->|No| K[Process Normally]
    J --> L[Compliance Logging]
    L --> M[Generate Report]
    M --> N[Country-Specific<br/>Regulatory Alert]
    K --> O[Normal Processing]

Business Impact

100% compliance with LATAM and Malaysian PII regulations
Real-time alerts to users and managers across all countries
High-precision detection using hybrid regex + transformer ensemble architecture
Comprehensive coverage of country-specific personal identifiers
On-premise deployment ensuring data sovereignty and compliance
Faster inference (3-5x) compared to LLM-based approaches
Robust OCR pipeline handling scanned documents despite Tesseract limitations

PII Regulations by Country

The system is designed to comply with specific PII regulations across LATAM and Malaysia. Below are the key legal frameworks and personal identifier formats detected by the system.

Country	Law Code	Key Requirements	Personal Identifiers (Examples)	Official Link
Chile	Law N° 19.628	Protection of private life and personal data. Requires explicit consent for processing, data minimization, and individual rights access/correction/deletion.	RUT (Rol Único Tributario): `12.345.678-9` `12345678-9` Format: XX.XXX.XXX-Y or XXXXXXXX-Y	View Law
Brazil	LGPD - Law 13.709/2018	General Data Protection Law. Similar to GDPR. Requires legal basis for processing, data subject rights, consent management, and breach notification to the ANPD and affected data subjects (the statute says "within a reasonable time"; ANPD Resolution 15/2024 sets three business days).	CPF (Pessoa Física): `123.456.789-09` CNPJ (Pessoa Jurídica): `12.345.678/0001-99` Format: XXX.XXX.XXX-XX or XX.XXX.XXX/XXXX-XX	View Law
Colombia	Law 1581 of 2012	Protection of personal data. Habeas Data right, consent requirements, data processing principles, and security measures for sensitive information.	Cédula de Ciudadanía: `123456789` Foreigner ID: `ABC123456` Format: 9 digits for citizens, alphanumeric for foreigners	View Law
Uruguay	Law 18.331	Protection of Personal Data. Requires informed consent, purpose limitation, data quality, security measures, and individual rights (access, rectification, cancellation, opposition).	Cédula de Identidad (CI): `12345678` `1.234.567-8` Format: 8 digits, can include punctuation	View Law
Malaysia	PDPA 2010 (Act 709)	Personal Data Protection Act. Requires data subject consent, purpose limitation, data accuracy, security measures, and data subject rights to access/correct data.	MyKad (National ID): `123456-12-5678` Passport: `A12345678` Format: XXXXXX-XX-XXXX for MyKad, AXXXXXXXX (one letter followed by 8 digits) for passport	View Agency
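As an illustration of how the punctuated formats in the table could map to regex rules, here is a minimal sketch. The pattern names and exact expressions are assumptions made for this example, not the production rule set. Plain digit runs such as the Colombian cédula are left out because an unpunctuated number needs surrounding context to be classified.

```python
import re

# Hypothetical pattern registry built from the formats listed in the table.
# Labels and expressions are illustrative assumptions.
PII_PATTERNS = {
    "CL_RUT": re.compile(r"\b\d{1,2}\.?\d{3}\.?\d{3}-[\dkK]\b"),
    "BR_CPF": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
    "BR_CNPJ": re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b"),
    "UY_CI": re.compile(r"\b\d\.\d{3}\.\d{3}-\d\b"),
    "MY_MYKAD": re.compile(r"\b\d{6}-\d{2}-\d{4}\b"),
}


def find_identifiers(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])


sample = "RUT 12.345.678-9, CPF 123.456.789-09, MyKad 123456-12-5678"
for hit in find_identifiers(sample):
    print(hit)
```

Some formats overlap: a string like `1.234.567-8` fits both the Uruguayan CI pattern and a seven-digit Chilean RUT, so a regex hit is best treated as a candidate label that later stages confirm or reject.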

OCR Pipeline with Tesseract

Due to authorization constraints on using modern cloud-based OCR services, the system implements Tesseract OCR for text extraction from images and scanned documents. This choice required significant optimization to achieve acceptable accuracy.

Image Preprocessing: Adaptive thresholding, noise reduction, and contrast enhancement to improve Tesseract's recognition accuracy on challenging documents
Multi-language Support: Configured Tesseract for Spanish, Portuguese, English, and Malay to handle all target markets
Layout Analysis: Custom post-processing to maintain document structure and spatial relationships between text elements
Confidence Filtering: OCR quality thresholds to reject low-confidence extractions that could lead to false PII positives
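The confidence filtering step can be sketched without running OCR at all. The example below assumes the OCR result is shaped like the dictionary returned by `pytesseract.image_to_data(..., output_type=Output.DICT)`, with parallel `text` and `conf` lists and a confidence of -1 on rows that are not words. The function names and threshold values are hypothetical.

```python
def filter_ocr_words(ocr_data, min_conf=60.0):
    """Join the words whose OCR confidence meets the threshold.

    `ocr_data` is assumed to hold parallel "text" and "conf" lists,
    with conf == -1 marking block/line rows that carry no word.
    """
    kept = []
    for word, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # confidences may arrive as strings
        if word.strip() and conf >= min_conf:
            kept.append(word)
    return " ".join(kept)


def page_is_reliable(ocr_data, min_mean_conf=70.0):
    """Reject a whole page when its mean word confidence is too low."""
    confs = [float(c) for c in ocr_data["conf"] if float(c) >= 0]
    return bool(confs) and sum(confs) / len(confs) >= min_mean_conf


sample = {
    "text": ["", "RUT", "12.345.678-9", "x#"],
    "conf": ["-1", 95, 88.5, 20],
}
print(filter_ocr_words(sample))
print(page_is_reliable(sample))
```

Filtering at word level removes isolated garbage tokens, while the page-level check gives a simple signal for routing a document to manual review.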

OCR Technology Comparison

Technology	Accuracy	Layout Understanding	Handwriting	Context Awareness	Deployment
Tesseract (Used)	Low-Medium (60-80%)	Basic (zone-based)	None	None	On-premise, open-source
SOTA OCR (e.g., PaddleOCR, TrOCR)	High (90-95%)	Advanced (deep learning)	Partial	Limited	Self-hosted or cloud
Vision Language Models (e.g., GPT-4V, Claude Vision)	Very High (95%+)	Excellent (semantic)	Good	Full (semantic PII detection)	Cloud API only

Tesseract Limitations & Mitigations

Low accuracy on complex layouts: Tables and multi-column documents often produce garbled text → Mitigated with custom layout analysis rules
No handwriting support: Cannot detect PII in handwritten notes → System flags documents requiring manual review
Poor quality scans: Degrades significantly with noise, blur, or low resolution → Aggressive preprocessing pipeline added
No contextual understanding: Cannot identify semantic PII (e.g., "my phone is 555-1234") → Rely on transformer ensemble for this capability
False positives in OCR errors: Random characters may match regex patterns → Confidence scoring and validation filters applied
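One common form of validation filter is a check-digit test, which rejects random digit runs that only look like an identifier. The sketch below uses the standard modulo-11 scheme for the Chilean RUT as an example. Whether the production system validates this way is an assumption, and the function names are hypothetical.

```python
def rut_check_digit(number: int) -> str:
    """Compute the modulo-11 verifier for a Chilean RUT body."""
    total, factor = 0, 2
    while number > 0:
        total += (number % 10) * factor  # weight digits right to left
        number //= 10
        factor = 2 if factor == 7 else factor + 1  # weights cycle 2..7
    remainder = 11 - (total % 11)
    return {11: "0", 10: "K"}.get(remainder, str(remainder))


def is_valid_rut(candidate: str) -> bool:
    """Return True when the verifier digit matches the RUT body."""
    cleaned = candidate.replace(".", "").replace(" ", "").upper()
    body, sep, verifier = cleaned.rpartition("-")
    if not sep or not body.isdigit() or len(verifier) != 1:
        return False
    return rut_check_digit(int(body)) == verifier


print(is_valid_rut("12.345.678-5"))
print(is_valid_rut("12.345.678-9"))
```

A regex match that fails this test can be dropped or down-weighted before fusion. The placeholder `12.345.678-9` used in the regulations table matches the format but does not pass the checksum, which is typical of illustrative sample values.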

Transformer Ensemble Detection

Instead of using a single large language model, the system implements an ensemble of specialized transformer models, each fine-tuned for specific PII detection tasks. This approach provides better accuracy, faster inference, and more interpretable results than monolithic LLMs.

Named Entity Recognition (NER) Models: BERT-based models fine-tuned on Spanish and Portuguese PII datasets to detect names, organizations, and locations
Document Classification Models: RoBERTa models trained to identify document types (ID cards, contracts, medical records) for targeted PII scanning
Contextual Understanding Models: XLM-RoBERTa for cross-lingual PII detection in multilingual documents
Ensemble Voting: Weighted voting mechanism combines predictions from multiple models, with confidence thresholds calibrated per PII type
Fusion Algorithm: Combines regex pattern matching with transformer ensemble outputs using a custom confidence scoring system to minimize false positives while ensuring high recall
Continuous Learning: Feedback loop from human-in-the-loop validation improves model performance over time through active learning
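The weighted voting step with per-type thresholds might look roughly like the following. Model names, weights, and thresholds are hypothetical values chosen for illustration. In this sketch a model that stays silent on a PII type contributes nothing to that type's score.

```python
from collections import defaultdict

# Hypothetical weights (summing to 1.0) and per-type thresholds.
MODEL_WEIGHTS = {"ner_es": 0.4, "ner_pt": 0.3, "xlm_r": 0.3}
TYPE_THRESHOLDS = {"PERSON_NAME": 0.5, "NATIONAL_ID": 0.5}


def weighted_vote(predictions, default_threshold=0.7):
    """Combine model predictions into per-type scores.

    predictions: iterable of (model_name, pii_type, confidence in [0, 1]).
    Returns {pii_type: score} for types whose score clears its threshold.
    """
    scores = defaultdict(float)
    for model, pii_type, confidence in predictions:
        scores[pii_type] += MODEL_WEIGHTS.get(model, 0.0) * confidence
    return {
        pii_type: score
        for pii_type, score in scores.items()
        if score >= TYPE_THRESHOLDS.get(pii_type, default_threshold)
    }


votes = [
    ("ner_es", "PERSON_NAME", 0.9),
    ("xlm_r", "PERSON_NAME", 0.8),
    ("ner_pt", "NATIONAL_ID", 0.9),
]
print(weighted_vote(votes))
```

Here two models agree on a person name and clear the threshold, while a single model's vote for a national ID does not, which shows how the scheme favors agreement over one confident outlier.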

Ensemble Architecture Benefits

Specialized Models: Each model focuses on a specific PII type, achieving higher accuracy than general-purpose models
Faster Inference: Smaller specialized models provide 3-5x faster inference compared to large LLMs
Better Interpretability: Clear model attribution for each PII detection, enabling targeted improvements
On-Premise Deployment: No dependency on external APIs, ensuring data privacy and compliance
Scalability: Individual models can be updated or replaced without affecting the overall system

Detection Pipeline Integration

The complete detection pipeline integrates OCR, regex patterns, and transformer models in a coordinated workflow:

Input Processing: Documents are classified (text, image, PDF) and routed through appropriate preprocessing
OCR Extraction: Tesseract extracts text from images/scans with quality filtering
Parallel Detection: Regex patterns and transformer ensemble analyze text simultaneously
Result Fusion: Confidence scores from all detectors are combined using weighted averaging
Threshold-based Alerting: Configurable confidence thresholds determine when to trigger alerts
Country-specific Rules: Detection logic adapts to jurisdiction-specific PII requirements
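The parallel detection, fusion, and alerting steps above can be sketched as a small orchestration function. This assumes each detector is a callable returning a `{pii_type: confidence}` dictionary. The weights, the alert threshold, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fuse(regex_score, model_score, w_regex=0.4, w_model=0.6):
    """Weighted average of the two detector confidences."""
    return (w_regex * regex_score + w_model * model_score) / (w_regex + w_model)


def run_pipeline(text, regex_detector, model_detector, alert_threshold=0.75):
    """Run both detectors concurrently, fuse scores, and pick alerts."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        regex_future = pool.submit(regex_detector, text)
        model_future = pool.submit(model_detector, text)
        regex_scores = regex_future.result()
        model_scores = model_future.result()

    fused = {}
    for pii_type in set(regex_scores) | set(model_scores):
        # A detector that did not report a type counts as zero confidence.
        fused[pii_type] = fuse(
            regex_scores.get(pii_type, 0.0), model_scores.get(pii_type, 0.0)
        )
    alerts = sorted(t for t, score in fused.items() if score >= alert_threshold)
    return fused, alerts


# Stub detectors standing in for the real regex and transformer stages.
fused, alerts = run_pipeline(
    "RUT 12.345.678-5 de Juan",
    lambda text: {"CL_RUT": 1.0},
    lambda text: {"CL_RUT": 0.8, "PERSON_NAME": 0.9},
)
print(fused, alerts)
```

Country-specific behavior could be layered on by selecting the detectors and the threshold per jurisdiction before calling `run_pipeline`.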

Key Learnings

Building this hybrid PII detection system provided valuable insights into managing compliance across multiple jurisdictions:

Regulatory Complexity: Each country's PII regulations have unique requirements and identifier formats, requiring a modular architecture that can easily accommodate new jurisdictions
Ensemble Model Architecture: Combining multiple specialized transformer models with regex provides better accuracy, faster inference, and more interpretable results than monolithic approaches. The modular design allows for targeted improvements without retraining entire systems
OCR Trade-offs: Authorization constraints forced Tesseract adoption, which required significant preprocessing and confidence filtering. While SOTA OCR or VLMs would provide superior results, the ensemble transformer models compensate by providing strong semantic understanding of the OCR output
False Positive Management: In compliance scenarios, minimizing false positives is as important as maximizing recall. Multi-level confidence thresholds (OCR quality, regex match strength, transformer confidence) must be carefully calibrated based on business risk tolerance
Multi-language Challenges: Spanish, Portuguese, English, and Malay have distinct linguistic patterns. XLM-RoBERTa's cross-lingual capabilities combined with language-specific fine-tuned models provide robust multilingual PII detection
Alert Fatigue: User experience design is critical: too many alerts lead to alert fatigue and reduced compliance. Smart alerting with context, severity levels, and actionable guidance improves adoption and response rates
On-Premise vs. Cloud: Data privacy and compliance requirements mandated on-premise deployment, limiting access to cloud-based OCR and VLM services. This constraint drove the design of a comprehensive preprocessing and ensemble architecture
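The multi-level thresholds and severity levels mentioned in these learnings could be combined along the following lines. This is a minimal sketch: the `Evidence` fields, cut-off values, and severity names are hypothetical and would need calibration against business risk tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    ocr_quality: float       # 0..1; use 1.0 for born-digital text
    regex_strength: float    # 0..1; e.g. 1.0 when a check digit validates
    model_confidence: float  # 0..1; from the transformer ensemble


def alert_severity(evidence, min_ocr=0.5, strong_regex=0.9, strong_model=0.8):
    """Map three confidence levels to a severity label."""
    if evidence.ocr_quality < min_ocr:
        return "manual_review"  # text too unreliable to decide automatically
    regex_ok = evidence.regex_strength >= strong_regex
    model_ok = evidence.model_confidence >= strong_model
    if regex_ok and model_ok:
        return "high"
    if regex_ok or model_ok:
        return "medium"
    return "suppress"  # weak evidence on both sides: no alert


print(alert_severity(Evidence(0.9, 1.0, 0.85)))
print(alert_severity(Evidence(0.3, 1.0, 0.85)))
```

Suppressing weak detections and reserving the highest severity for agreement between detectors is one way to keep alert volume manageable.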