Architected a high-precision PII detection engine for LATAM and Malaysian markets that combines regex
rules with an ensemble of transformer models for semantic understanding. The system uses Tesseract OCR
for document processing and detects country-specific PII for Chile, Brazil, Colombia, Uruguay, and
Malaysia, ensuring 100% compliance with regional regulations including Chile's Law 19.628, Brazil's
LGPD, Colombia's Law 1581, Uruguay's Law 18.331, and Malaysia's PDPA 2010.
Technical Architecture
flowchart TD
A[Data Ingestion<br/>Documents & Communications] --> B{Image/Document?}
B -->|Yes| C[Tesseract OCR<br/>Text Extraction]
B -->|No| D[Preprocessing]
C --> D
D --> E[Hybrid Detection Engine]
E --> F[Regex Rules<br/>Fast Pattern Matching]
E --> G[Transformer Ensemble<br/>Semantic Understanding]
F --> H[Fusion Algorithm<br/>Confidence Scoring]
G --> H
H --> I{PII Detected?}
I -->|Yes| J[Alert Users & Managers]
I -->|No| K[Process Normally]
J --> L[Compliance Logging]
L --> M[Generate Report]
M --> N[Country-Specific<br/>Regulatory Alert]
K --> O[Normal Processing]
Business Impact
- 100% compliance with LATAM and Malaysian PII regulations
- Real-time alerts to users and managers across all countries
- High-precision detection using hybrid regex + transformer ensemble architecture
- Comprehensive coverage of country-specific personal identifiers
- On-premise deployment ensuring data sovereignty and compliance
- Faster inference (3-5x) compared to LLM-based approaches
- Robust OCR pipeline handling scanned documents despite Tesseract limitations
PII Regulations by Country
The system is designed to comply with specific PII regulations across LATAM and Malaysia. Below are
the key legal frameworks and personal identifier formats detected by the system.
| Country | Law Code | Key Requirements | Personal Identifiers (Examples) | Official Link |
| --- | --- | --- | --- | --- |
| Chile | Law N° 19.628 | Protection of private life and personal data. Requires explicit consent for processing, data minimization, and individual rights of access, correction, and deletion. | RUT (Rol Único Tributario): 12.345.678-9, 12345678-9. Format: XX.XXX.XXX-Y or XXXXXXXX-Y | View Law |
| Brazil | LGPD - Law 13.709/2018 | General Data Protection Law. Similar to GDPR. Requires a legal basis for processing, data subject rights, consent management, and breach notification within a reasonable time as defined by the national authority (ANPD). | CPF (Pessoa Física): 123.456.789-09. CNPJ (Pessoa Jurídica): 12.345.678/0001-99. Format: XXX.XXX.XXX-XX or XX.XXX.XXX/XXXX-XX | View Law |
| Colombia | Law 1581 of 2012 | Protection of personal data. Habeas Data right, consent requirements, data processing principles, and security measures for sensitive information. | Cédula de Ciudadanía: 123456789. Foreigner ID: ABC123456. Format: 9 digits for citizens, alphanumeric for foreigners | View Law |
| Uruguay | Law 18.331 | Protection of Personal Data. Requires informed consent, purpose limitation, data quality, security measures, and individual rights (access, rectification, cancellation, opposition). | Cédula de Identidad (CI): 12345678, 1.234.567-8. Format: 8 digits, can include punctuation | View Law |
| Malaysia | PDPA 2010 (Act 709) | Personal Data Protection Act. Requires data subject consent, purpose limitation, data accuracy, security measures, and data subject rights to access and correct data. | MyKad (National ID): 123456-12-5678. Passport: A12345678. Format: XXXXXX-XX-XXXX for MyKad, AXXXXXXXX for passport | View Agency |
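Pattern matching on these formats is usually paired with checksum validation, since both the Chilean RUT and the Brazilian CPF carry modulo-11 check digits. The following is a minimal sketch of that idea, not the system's actual rules: the regexes are derived only from the formats listed in the table, and the names `RUT_RE`, `CPF_RE`, `rut_check_digit`, `cpf_is_valid`, and `find_validated_ids` are hypothetical.

```python
import re

# Illustrative patterns based on the formats in the table (assumed, not the production rules).
RUT_RE = re.compile(r"\b(\d{1,2}\.?\d{3}\.?\d{3})-([\dkK])\b")
CPF_RE = re.compile(r"\b(\d{3})\.?(\d{3})\.?(\d{3})-?(\d{2})\b")


def rut_check_digit(body: str) -> str:
    """Compute the modulo-11 verifier for a Chilean RUT body (digits only)."""
    total = 0
    for i, ch in enumerate(reversed(body)):
        total += int(ch) * (2 + i % 6)  # multipliers cycle 2, 3, 4, 5, 6, 7
    remainder = 11 - total % 11
    return {11: "0", 10: "K"}.get(remainder, str(remainder))


def cpf_is_valid(digits: str) -> bool:
    """Validate both check digits of an 11-digit Brazilian CPF."""
    if len(digits) != 11 or len(set(digits)) == 1:
        return False  # wrong length, or a repeated-digit sequence such as 111.111.111-11
    for n in (9, 10):
        # Weights run from n + 1 down to 2 over the first n digits.
        total = sum(int(d) * (n + 1 - i) for i, d in enumerate(digits[:n]))
        if (total * 10) % 11 % 10 != int(digits[n]):
            return False
    return True


def find_validated_ids(text: str) -> list[tuple[str, str, bool]]:
    """Return (kind, matched_text, checksum_ok) for each RUT or CPF candidate."""
    hits = []
    for m in RUT_RE.finditer(text):
        body = m.group(1).replace(".", "")
        hits.append(("RUT", m.group(0), rut_check_digit(body) == m.group(2).upper()))
    for m in CPF_RE.finditer(text):
        hits.append(("CPF", m.group(0), cpf_is_valid("".join(m.groups()))))
    return hits


print(find_validated_ids("RUT 12.345.678-5, CPF 123.456.789-09"))
```

Note that the placeholder RUT in the table, 12.345.678-9, matches the pattern but fails the checksum (the verifier for body 12345678 is 5). A candidate like that could be given a lower regex confidence rather than being discarded outright.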
OCR Pipeline with Tesseract
Due to authorization constraints on using modern cloud-based OCR services, the system implements
Tesseract OCR for text extraction from images and scanned documents. This choice required
significant optimization to achieve acceptable accuracy.
- Image Preprocessing: Adaptive thresholding, noise reduction, and contrast
enhancement to improve Tesseract's recognition accuracy on challenging documents
- Multi-language Support: Configured Tesseract for Spanish, Portuguese, English,
and Malay to handle all target markets
- Layout Analysis: Custom post-processing to maintain document structure and
spatial relationships between text elements
- Confidence Filtering: OCR quality thresholds to reject low-confidence
extractions that could lead to false PII positives
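The confidence-filtering step can be illustrated with a small sketch. It assumes word-level OCR output shaped like the dictionary that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns, with parallel `text` and `conf` lists. The function name `filter_ocr_words` and the threshold of 60 are hypothetical values chosen for illustration, not the system's calibrated settings.

```python
def filter_ocr_words(data: dict, min_conf: float = 60.0) -> tuple[str, float]:
    """Keep words at or above min_conf; return (text, mean confidence of kept words)."""
    kept, confs = [], []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # confidences may arrive as strings, ints, or floats
        if conf < 0 or not word.strip():
            continue  # -1 marks layout rows (blocks, lines) with no recognized word
        if conf >= min_conf:
            kept.append(word)
            confs.append(conf)
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    return " ".join(kept), mean_conf


# Hand-written sample in the assumed shape; a real run would obtain this from Tesseract.
sample = {
    "text": ["", "RUT", "12.345.678-5", "l2#4"],
    "conf": ["-1", "96", "91", "31"],
}
print(filter_ocr_words(sample))  # ('RUT 12.345.678-5', 93.5)
```

Dropping the low-confidence token `l2#4` before detection keeps OCR noise from reaching the regex rules. The mean confidence could also serve as a document-level quality score for deciding when to route a scan to manual review.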
OCR Technology Comparison
| Technology | Accuracy | Layout Understanding | Handwriting | Context Awareness | Deployment |
| --- | --- | --- | --- | --- | --- |
| Tesseract (Used) | Low-Medium (60-80%) | Basic (zone-based) | None | None | On-premise, open-source |
| SOTA OCR (e.g., PaddleOCR, TrOCR) | High (90-95%) | Advanced (deep learning) | Partial | Limited | Often cloud-based |
| Vision Language Models (e.g., GPT-4V, Claude Vision) | Very High (95%+) | Excellent (semantic) | Good | Full (semantic PII detection) | Cloud API only |
Tesseract Limitations & Mitigations
- Low accuracy on complex layouts: Tables and multi-column documents often
produce garbled text → Mitigated with custom layout analysis rules
- No handwriting support: Cannot detect PII in handwritten notes → System
flags documents requiring manual review
- Poor quality scans: Degrades significantly with noise, blur, or low
resolution → Aggressive preprocessing pipeline added
- No contextual understanding: Cannot identify semantic PII (e.g., "my phone
is 555-1234") → Rely on transformer ensemble for this capability
- False positives in OCR errors: Random characters may match regex patterns →
Confidence scoring and validation filters applied
Transformer Ensemble Detection
Instead of using a single large language model, the system implements an ensemble of specialized
transformer models, each fine-tuned for specific PII detection tasks. This approach provides better
accuracy, faster inference, and more interpretable results than monolithic LLMs.
- Named Entity Recognition (NER) Models: BERT-based models fine-tuned on Spanish
and Portuguese PII datasets to detect names, organizations, and locations
- Document Classification Models: RoBERTa models trained to identify document
types (ID cards, contracts, medical records) for targeted PII scanning
- Contextual Understanding Models: XLM-RoBERTa for cross-lingual PII detection in
multilingual documents
- Ensemble Voting: Weighted voting mechanism combines predictions from multiple
models, with confidence thresholds calibrated per PII type
- Fusion Algorithm: Combines regex pattern matching with transformer ensemble
outputs using a custom confidence scoring system to minimize false positives while ensuring high
recall
- Continuous Learning: Feedback loop from human-in-the-loop validation improves
model performance over time through active learning
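As a rough illustration of the weighted voting described above, the sketch below sums weighted confidences per candidate span and applies a threshold that depends on the PII type. Everything specific in it is an assumption: the model names, the weights, the thresholds, and the choice to treat a silent model as a zero vote are hypothetical, not the deployed configuration.

```python
from collections import defaultdict

# Hypothetical model weights and per-type thresholds, for illustration only.
MODEL_WEIGHTS = {"ner_bert": 0.5, "xlmr": 0.3, "doc_roberta": 0.2}
TYPE_THRESHOLDS = {"PERSON": 0.6, "NATIONAL_ID": 0.5}
DEFAULT_THRESHOLD = 0.7


def weighted_vote(predictions):
    """Combine (model, span, pii_type, confidence) tuples into accepted detections.

    A model that does not report a span contributes zero, so a span needs
    support from enough weighted models to clear its type's threshold.
    """
    scores = defaultdict(float)
    for model, span, pii_type, conf in predictions:
        scores[(span, pii_type)] += MODEL_WEIGHTS.get(model, 0.0) * conf
    total_weight = sum(MODEL_WEIGHTS.values())
    accepted = {}
    for (span, pii_type), raw in scores.items():
        score = raw / total_weight
        if score >= TYPE_THRESHOLDS.get(pii_type, DEFAULT_THRESHOLD):
            accepted[(span, pii_type)] = round(score, 3)
    return accepted


preds = [
    ("ner_bert", (4, 14), "PERSON", 0.9),
    ("xlmr", (4, 14), "PERSON", 0.8),
    ("xlmr", (20, 32), "NATIONAL_ID", 0.9),
]
print(weighted_vote(preds))
```

In this example the PERSON span is accepted because two models agree, while the NATIONAL_ID span reported by a single lower-weight model is rejected. Keeping the per-model contributions around would also support the model attribution mentioned under interpretability.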
Ensemble Architecture Benefits
- Specialized Models: Each model focuses on a specific detection task, achieving
higher accuracy than general-purpose models
- Faster Inference: Smaller specialized models provide 3-5x faster inference
compared to large LLMs
- Better Interpretability: Clear model attribution for each PII detection,
enabling targeted improvements
- On-Premise Deployment: No dependency on external APIs, ensuring data
privacy and compliance
- Scalability: Individual models can be updated or replaced without affecting
the overall system
Detection Pipeline Integration
The complete detection pipeline integrates OCR, regex patterns, and transformer models in a
coordinated workflow:
- Input Processing: Documents are classified (text, image, PDF) and routed
through appropriate preprocessing
- OCR Extraction: Tesseract extracts text from images/scans with quality
filtering
- Parallel Detection: Regex patterns and transformer ensemble analyze text
simultaneously
- Result Fusion: Confidence scores from all detectors are combined using weighted
averaging
- Threshold-based Alerting: Configurable confidence thresholds determine when to
trigger alerts
- Country-specific Rules: Detection logic adapts to jurisdiction-specific PII
requirements
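The parallel detection, weighted-average fusion, and threshold steps above can be sketched as follows. This is a simplified illustration under stated assumptions: `regex_detector` and `model_detector` are stand-in stubs (the second returns fixed scores in place of real transformer inference), and the weights and per-country thresholds are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fusion weights and per-jurisdiction alert thresholds.
REGEX_WEIGHT, MODEL_WEIGHT = 0.6, 0.4
ALERT_THRESHOLDS = {"CL": 0.5, "BR": 0.55}
DEFAULT_ALERT_THRESHOLD = 0.6


def regex_detector(text):
    """Stand-in regex detector: confidence 1.0 for CPF-shaped strings."""
    return {m.group(0): 1.0 for m in re.finditer(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b", text)}


def model_detector(text):
    """Stand-in for the transformer ensemble: a fixed score for digit-led tokens."""
    return {tok: 0.7 for tok in text.split() if tok[0].isdigit()}


def detect(text, country):
    """Run both detectors concurrently, fuse their scores, and apply the alert threshold."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        regex_future = pool.submit(regex_detector, text)
        model_future = pool.submit(model_detector, text)
        regex_hits, model_hits = regex_future.result(), model_future.result()
    threshold = ALERT_THRESHOLDS.get(country, DEFAULT_ALERT_THRESHOLD)
    fused = {}
    for span in regex_hits.keys() | model_hits.keys():
        # Weighted average; a detector that missed the span contributes 0.
        score = REGEX_WEIGHT * regex_hits.get(span, 0.0) + MODEL_WEIGHT * model_hits.get(span, 0.0)
        fused[span] = (round(score, 2), score >= threshold)
    return fused


print(detect("CPF 123.456.789-09 ref 2024", "BR"))
```

Here the CPF-shaped string is confirmed by both detectors, scores 0.88, and triggers an alert, while the bare number 2024 is seen only by the model stub, scores 0.28, and stays below the threshold. Threads are used purely to show the parallel structure; real transformer inference would more likely be batched or served from a separate process.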
Key Learnings
Building this hybrid PII detection system provided valuable insights into managing compliance across
multiple jurisdictions:
- Regulatory Complexity: Each country's PII regulations have unique requirements
and identifier formats, requiring a modular architecture that can easily accommodate new
jurisdictions
- Ensemble Model Architecture: Combining multiple specialized transformer models
with regex provides better accuracy, faster inference, and more interpretable results than
monolithic approaches. The modular design allows for targeted improvements without retraining
entire systems
- OCR Trade-offs: Authorization constraints forced Tesseract adoption, which
required significant preprocessing and confidence filtering. While SOTA OCR or VLMs would
provide superior results, the ensemble transformer models compensate by providing strong
semantic understanding of the OCR output
- False Positive Management: In compliance scenarios, minimizing false positives
is as important as maximizing recall. Multi-level confidence thresholds (OCR quality, regex
match strength, transformer confidence) must be carefully calibrated based on business risk
tolerance
- Multi-language Challenges: Spanish, Portuguese, English, and Malay have
distinct linguistic patterns. XLM-RoBERTa's cross-lingual capabilities combined with
language-specific fine-tuned models provide robust multilingual PII detection
- Alert Fatigue: User experience design is critical, because too many alerts lead to alert
fatigue and reduced compliance. Smart alerting with context, severity levels, and actionable
guidance improves adoption and response rates
- On-Premise vs. Cloud: Data privacy and compliance requirements mandated
on-premise deployment, limiting access to cloud-based OCR and VLM services. This constraint
drove the design of a comprehensive preprocessing and ensemble architecture
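One way to picture the multi-level thresholds and severity-based alerting from the learnings above is a small gating function. It is only a sketch under assumptions: the `Detection` fields, the gate values, the set of high-risk types, and the severity labels are all hypothetical, and real calibration would depend on business risk tolerance.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    pii_type: str
    ocr_conf: float     # 0-1; 1.0 for text that never went through OCR
    regex_score: float  # 0-1 match strength
    model_conf: float   # 0-1 transformer ensemble confidence


# Hypothetical gate values and high-risk types, for illustration only.
GATES = {"ocr": 0.6, "regex": 0.5, "model": 0.5}
HIGH_RISK_TYPES = {"NATIONAL_ID", "PASSPORT"}


def severity(d: Detection) -> str:
    """Map the three confidence levels to an alert severity."""
    if d.ocr_conf < GATES["ocr"]:
        return "manual_review"  # text too unreliable to alert on automatically
    regex_ok = d.regex_score >= GATES["regex"]
    model_ok = d.model_conf >= GATES["model"]
    if regex_ok and model_ok:
        return "high" if d.pii_type in HIGH_RISK_TYPES else "medium"
    if regex_ok or model_ok:
        return "low"
    return "suppress"  # neither detector is confident: no alert, less fatigue


print(severity(Detection("NATIONAL_ID", 0.9, 1.0, 0.8)))  # high
print(severity(Detection("PERSON", 0.9, 0.0, 0.7)))       # low
```

Routing low-quality OCR to manual review rather than alerting, and suppressing detections that neither detector supports, are two simple levers for reducing alert volume without silently dropping uncertain documents.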