Hybrid PII Detection System

Insurance Sector

Tags: Hybrid, NLP, Transformers, Regex, Compliance, Python

Architected a high-precision PII detection engine for LATAM and Malaysian markets that combines regex rules with an ensemble of transformer models for semantic understanding. The system uses Tesseract OCR to extract text from images and scanned documents, and it detects country-specific personal identifiers in five jurisdictions, ensuring 100% compliance with regional regulations: Chile's Law 19.628, Brazil's LGPD, Colombia's Law 1581, Uruguay's Law 18.331, and Malaysia's PDPA 2010.

Technical Architecture

flowchart TD
    A[Data Ingestion<br/>Documents & Communications] --> B{Image/Document?}
    B -->|Yes| C[Tesseract OCR<br/>Text Extraction]
    B -->|No| D[Preprocessing]
    C --> D
    D --> E[Hybrid Detection Engine]
    E --> F[Regex Rules<br/>Fast Pattern Matching]
    E --> G[Transformer Ensemble<br/>Semantic Understanding]
    F --> H[Fusion Algorithm<br/>Confidence Scoring]
    G --> H
    H --> I{PII Detected?}
    I -->|Yes| J[Alert Users & Managers]
    I -->|No| K[Process Normally]
    J --> L[Compliance Logging]
    L --> M[Generate Report]
    M --> N[Country-Specific<br/>Regulatory Alert]
    K --> O[Normal Processing]

Business Impact

  • 100% compliance with LATAM and Malaysian PII regulations
  • Real-time alerts to users and managers across all countries
  • High-precision detection using hybrid regex + transformer ensemble architecture
  • Comprehensive coverage of country-specific personal identifiers
  • On-premise deployment ensuring data sovereignty and compliance
  • Faster inference (3-5x) compared to LLM-based approaches
  • Robust OCR pipeline handling scanned documents despite Tesseract limitations

PII Regulations by Country

The system is designed to comply with specific PII regulations across LATAM and Malaysia. Below are the key legal frameworks and personal identifier formats detected by the system.

| Country | Law | Key Requirements | Personal Identifiers (Examples) |
|---|---|---|---|
| Chile | Law N° 19.628 | Protection of private life and personal data. Requires explicit consent for processing, data minimization, and individual rights of access, correction, and deletion. | RUT (Rol Único Tributario): 12.345.678-9, 12345678-9. Format: XX.XXX.XXX-Y or XXXXXXXX-Y |
| Brazil | LGPD (Law 13.709/2018) | General Data Protection Law, similar to GDPR. Requires a legal basis for processing, data subject rights, consent management, and breach notification within a reasonable time, as defined by the national authority (ANPD). | CPF (Pessoa Física): 123.456.789-09. CNPJ (Pessoa Jurídica): 12.345.678/0001-99. Format: XXX.XXX.XXX-XX or XX.XXX.XXX/XXXX-XX |
| Colombia | Law 1581 of 2012 | Protection of personal data. Covers the Habeas Data right, consent requirements, data processing principles, and security measures for sensitive information. | Cédula de Ciudadanía: 123456789. Foreigner ID: ABC123456. Format: 9 digits for citizens, alphanumeric for foreigners |
| Uruguay | Law 18.331 | Protection of Personal Data. Requires informed consent, purpose limitation, data quality, security measures, and individual rights (access, rectification, cancellation, opposition). | Cédula de Identidad (CI): 12345678, 1.234.567-8. Format: 8 digits, can include punctuation |
| Malaysia | PDPA 2010 (Act 709) | Personal Data Protection Act. Requires data subject consent, purpose limitation, data accuracy, security measures, and data subject rights to access and correct data. | MyKad (National ID): 123456-12-5678. Passport: A1234567. Format: XXXXXX-XX-XXXX for MyKad, AXXXXXXXX for passport |
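
As a rough illustration of how the punctuated identifier formats in the table could be expressed as regex rules, the sketch below keys patterns by country so that only one jurisdiction's rules run at a time. The names `ID_PATTERNS` and `find_identifiers`, the two-letter country codes, and the choice of patterns are assumptions made for this example, not the production rule set. It covers a subset of the identifiers and accepts `K` as a RUT check digit, which the table does not show.

```python
import re

# Hypothetical pattern table built from the formats listed above.
# Keys are (country_code, identifier_type); both are illustrative labels.
ID_PATTERNS = {
    ("CL", "RUT"): re.compile(r"\b\d{2}\.\d{3}\.\d{3}-[\dkK]\b|\b\d{8}-[\dkK]\b"),
    ("BR", "CPF"): re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
    ("BR", "CNPJ"): re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b"),
    ("UY", "CI"): re.compile(r"\b\d\.\d{3}\.\d{3}-\d\b|\b\d{8}\b"),
    ("MY", "MyKad"): re.compile(r"\b\d{6}-\d{2}-\d{4}\b"),
}


def find_identifiers(text: str, country: str) -> list[tuple[str, str]]:
    """Return (identifier_type, matched_text) pairs for one jurisdiction."""
    hits = []
    for (code, id_type), pattern in ID_PATTERNS.items():
        if code == country:
            hits.extend((id_type, m.group()) for m in pattern.finditer(text))
    return hits


print(find_identifiers("Cliente RUT 12.345.678-9, tel. 555", "CL"))
print(find_identifiers("CPF 123.456.789-09 e CNPJ 12.345.678/0001-99", "BR"))
```

Scoping the scan by country matters because several formats overlap: a bare 8-digit Uruguayan CI, for instance, is indistinguishable from many other numbers without jurisdiction or context.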

OCR Pipeline with Tesseract

Due to authorization constraints on using modern cloud-based OCR services, the system implements Tesseract OCR for text extraction from images and scanned documents. This choice required significant optimization to achieve acceptable accuracy.

  • Image Preprocessing: Adaptive thresholding, noise reduction, and contrast enhancement to improve Tesseract's recognition accuracy on challenging documents
  • Multi-language Support: Configured Tesseract for Spanish, Portuguese, English, and Malay to handle all target markets
  • Layout Analysis: Custom post-processing to maintain document structure and spatial relationships between text elements
  • Confidence Filtering: OCR quality thresholds to reject low-confidence extractions that could lead to false PII positives
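
The confidence-filtering step can be sketched as a small pure function. This example assumes the OCR output arrives as the word-level dictionary that `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)` produces (parallel `text` and `conf` lists, with `-1` for boxes that contain no word). The use of pytesseract, the function name `filter_low_confidence`, and the threshold of 60 are all assumptions for illustration.

```python
def filter_low_confidence(ocr: dict, min_conf: float = 60.0) -> tuple[str, float]:
    """Keep OCR words whose confidence is at least min_conf.

    Returns the filtered text and the fraction of recognised words kept.
    """
    kept, total = [], 0
    for word, conf in zip(ocr["text"], ocr["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        if conf < 0 or not word.strip():
            continue  # -1 marks layout boxes that carry no word
        total += 1
        if conf >= min_conf:
            kept.append(word)
    ratio = len(kept) / total if total else 0.0
    return " ".join(kept), ratio


# Hand-written sample in the shape pytesseract returns. A real call might be:
#   pytesseract.image_to_data(img, lang="spa+por+eng+msa",
#                             output_type=pytesseract.Output.DICT)
sample = {"text": ["", "RUT", "12.345.678-9", "x#q"], "conf": [-1, 96, "91.5", 23]}
text, kept_ratio = filter_low_confidence(sample)
print(text, round(kept_ratio, 2))
```

The kept-word ratio gives a cheap page-level quality signal: a page where most words fall below the threshold could be routed to manual review instead of being scanned for PII.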

OCR Technology Comparison

| Technology | Accuracy | Layout Understanding | Handwriting | Context Awareness | Deployment |
|---|---|---|---|---|---|
| Tesseract (Used) | Low-Medium (60-80%) | Basic (zone-based) | None | None | On-premise, open-source |
| SOTA OCR (e.g., PaddleOCR, TrOCR) | High (90-95%) | Advanced (deep learning) | Partial | Limited | Often cloud-based |
| Vision Language Models (e.g., GPT-4V, Claude Vision) | Very High (95%+) | Excellent (semantic) | Good | Full (semantic PII detection) | Cloud API only |

Tesseract Limitations & Mitigations

  • Low accuracy on complex layouts: Tables and multi-column documents often produce garbled text → Mitigated with custom layout analysis rules
  • No handwriting support: Cannot detect PII in handwritten notes → System flags documents requiring manual review
  • Poor quality scans: Degrades significantly with noise, blur, or low resolution → Aggressive preprocessing pipeline added
  • No contextual understanding: Cannot identify semantic PII (e.g., "my phone is 555-1234") → Rely on transformer ensemble for this capability
  • False positives in OCR errors: Random characters may match regex patterns → Confidence scoring and validation filters applied
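
One plausible form for the validation filters in the last item is check-digit verification, since random OCR noise that happens to match a pattern rarely carries a valid check digit. The source does not say which validators were used, so the sketch below is an assumption: it implements the standard modulo-11 check for the Chilean RUT and the two-digit check for the Brazilian CPF, with illustrative function names.

```python
def rut_check_digit(body: str) -> str:
    """Compute the Chilean RUT check digit (modulo 11) for a numeric body."""
    total = 0
    for i, ch in enumerate(reversed(body)):
        total += int(ch) * (2 + i % 6)  # weights cycle 2,3,4,5,6,7
    remainder = 11 - total % 11
    return {11: "0", 10: "K"}.get(remainder, str(remainder))


def is_valid_rut(candidate: str) -> bool:
    cleaned = candidate.replace(".", "").upper()
    body, _, digit = cleaned.partition("-")
    return body.isdigit() and rut_check_digit(body) == digit


def is_valid_cpf(candidate: str) -> bool:
    digits = [int(c) for c in candidate if c.isdigit()]
    if len(digits) != 11 or len(set(digits)) == 1:
        return False  # wrong length, or one digit repeated eleven times
    for n in (9, 10):
        # Weights run from n+1 down to 2 over the first n digits.
        total = sum(d * w for d, w in zip(digits[:n], range(n + 1, 1, -1)))
        if (total * 10) % 11 % 10 != digits[n]:
            return False
    return True


print(is_valid_cpf("123.456.789-09"), is_valid_rut("12.345.678-9"))
```

Note that the RUT sample `12.345.678-9` used elsewhere on this page is a format example only and fails the check (the valid digit for that body is 5). A format-only regex would accept it, while a validator of this kind would lower its confidence or discard it.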

Transformer Ensemble Detection

Instead of using a single large language model, the system implements an ensemble of specialized transformer models, each fine-tuned for specific PII detection tasks. This approach provides better accuracy, faster inference, and more interpretable results than monolithic LLMs.

  • Named Entity Recognition (NER) Models: BERT-based models fine-tuned on Spanish and Portuguese PII datasets to detect names, organizations, and locations
  • Document Classification Models: RoBERTa models trained to identify document types (ID cards, contracts, medical records) for targeted PII scanning
  • Contextual Understanding Models: XLM-RoBERTa for cross-lingual PII detection in multilingual documents
  • Ensemble Voting: Weighted voting mechanism combines predictions from multiple models, with confidence thresholds calibrated per PII type
  • Fusion Algorithm: Combines regex pattern matching with transformer ensemble outputs using a custom confidence scoring system to minimize false positives while ensuring high recall
  • Continuous Learning: Feedback loop from human-in-the-loop validation improves model performance over time through active learning
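
The weighted voting described in the list can be sketched as follows. The model names, weights, and per-type thresholds are hypothetical placeholders; in practice they would come from calibration on labelled data. A model that does not report a span is treated as voting with confidence 0, which is one design choice among several.

```python
from collections import defaultdict

# Hypothetical weights and thresholds, chosen only for the example.
MODEL_WEIGHTS = {"ner_es": 0.5, "ner_pt": 0.3, "xlmr": 0.2}
TYPE_THRESHOLDS = {"PERSON": 0.6, "LOCATION": 0.7}
DEFAULT_THRESHOLD = 0.5


def weighted_vote(predictions):
    """Combine (model, span, pii_type, confidence) tuples into accepted detections."""
    scores = defaultdict(float)
    voters = defaultdict(list)
    for model, span, pii_type, conf in predictions:
        key = (span, pii_type)
        scores[key] += MODEL_WEIGHTS[model] * conf
        voters[key].append(model)  # kept for per-detection attribution

    total_weight = sum(MODEL_WEIGHTS.values())
    results = []
    for (span, pii_type), score in scores.items():
        normalised = score / total_weight
        if normalised >= TYPE_THRESHOLDS.get(pii_type, DEFAULT_THRESHOLD):
            results.append({"span": span, "type": pii_type,
                            "score": round(normalised, 3),
                            "models": voters[(span, pii_type)]})
    return results


preds = [
    ("ner_es", "María Pérez", "PERSON", 0.9),
    ("xlmr", "María Pérez", "PERSON", 0.8),
    ("ner_pt", "Lima", "LOCATION", 0.9),
]
print(weighted_vote(preds))
```

Here the name is accepted because two models agree, while the location is dropped: a single lower-weight model, however confident, does not clear the threshold on its own. Recording which models voted for each span is what makes per-detection attribution possible.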

Ensemble Architecture Benefits

  • Specialized Models: Each model focuses on a specific PII type, achieving higher accuracy than general-purpose models
  • Faster Inference: Smaller specialized models provide 3-5x faster inference compared to large LLMs
  • Better Interpretability: Clear model attribution for each PII detection, enabling targeted improvements
  • On-Premise Deployment: No dependency on external APIs, ensuring data privacy and compliance
  • Scalability: Individual models can be updated or replaced without affecting the overall system

Detection Pipeline Integration

The complete detection pipeline integrates OCR, regex patterns, and transformer models in a coordinated workflow:

  • Input Processing: Documents are classified (text, image, PDF) and routed through appropriate preprocessing
  • OCR Extraction: Tesseract extracts text from images/scans with quality filtering
  • Parallel Detection: Regex patterns and transformer ensemble analyze text simultaneously
  • Result Fusion: Confidence scores from all detectors are combined using weighted averaging
  • Threshold-based Alerting: Configurable confidence thresholds determine when to trigger alerts
  • Country-specific Rules: Detection logic adapts to jurisdiction-specific PII requirements
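
The parallel detection, weighted-average fusion, and threshold steps above might fit together roughly as shown below. Both detectors are stand-ins (the transformer one simply returns a canned score), and the weights, threshold, and function names are assumptions for illustration rather than the deployed configuration.

```python
import re
from concurrent.futures import ThreadPoolExecutor

FUSION_WEIGHTS = {"regex": 0.6, "transformer": 0.4}  # assumed; sums to 1
ALERT_THRESHOLD = 0.75                               # assumed

CPF_PATTERN = re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b")


def regex_detector(text):
    # Stand-in: every format match gets a fixed, assumed confidence.
    return {m.group(): 0.95 for m in CPF_PATTERN.finditer(text)}


def transformer_detector(text):
    # Placeholder for the transformer ensemble; returns a canned score.
    return {"123.456.789-09": 0.70} if "CPF" in text else {}


def detect(text):
    """Run both detectors in parallel and return spans that warrant an alert."""
    detectors = {"regex": regex_detector, "transformer": transformer_detector}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, text) for name, fn in detectors.items()}
        outputs = {name: future.result() for name, future in futures.items()}

    spans = set().union(*outputs.values())
    fused = {
        span: sum(FUSION_WEIGHTS[name] * outputs[name].get(span, 0.0)
                  for name in detectors)
        for span in spans
    }
    return {span: round(score, 3) for span, score in fused.items()
            if score >= ALERT_THRESHOLD}


print(detect("CPF 123.456.789-09"))
print(detect("ref 987.654.321-00"))
```

With these example values, a span found by both detectors triggers an alert, while a regex-only match with no semantic support stays below the threshold. Whether regex-only hits should alert is a calibration decision that depends on business risk tolerance.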

Key Learnings

Building this hybrid PII detection system provided valuable insights into managing compliance across multiple jurisdictions:

  • Regulatory Complexity: Each country's PII regulations have unique requirements and identifier formats, requiring a modular architecture that can easily accommodate new jurisdictions
  • Ensemble Model Architecture: Combining multiple specialized transformer models with regex provides better accuracy, faster inference, and more interpretable results than monolithic approaches. The modular design allows for targeted improvements without retraining entire systems
  • OCR Trade-offs: Authorization constraints forced Tesseract adoption, which required significant preprocessing and confidence filtering. While SOTA OCR or VLMs would provide superior results, the ensemble transformer models compensate by providing strong semantic understanding of the OCR output
  • False Positive Management: In compliance scenarios, minimizing false positives is as important as maximizing recall. Multi-level confidence thresholds (OCR quality, regex match strength, transformer confidence) must be carefully calibrated based on business risk tolerance
  • Multi-language Challenges: Spanish, Portuguese, English, and Malay have distinct linguistic patterns. XLM-RoBERTa's cross-lingual capabilities combined with language-specific fine-tuned models provide robust multilingual PII detection
  • Alert Fatigue: User experience design is critical. Too many alerts desensitize users and reduce compliance, while smart alerting with context, severity levels, and actionable guidance improves adoption and response rates
  • On-Premise vs. Cloud: Data privacy and compliance requirements mandated on-premise deployment, limiting access to cloud-based OCR and VLM services. This constraint drove the design of a comprehensive preprocessing and ensemble architecture