Designed and implemented a daily batch processing system for Chilean medical claims that analyzes 1,000+
scanned prescriptions to identify duplicates, detect potential fraud patterns, and flag suspicious
claims. The system uses pytesseract for Spanish OCR to extract text from scanned documents, Sentence
Transformers for text embeddings, and FAISS for efficient similarity search. Operating as a scheduled
cron job at 5:00 AM, the system sends email notifications to fraud investigation and quality assurance
teams.
Technical Architecture
flowchart TD
A[Scanned Prescriptions] --> B[Daily Batch Trigger<br/>Cron Job]
B --> C[Document Ingestion<br/>Medical prescriptions]
C --> D[OCR Processing<br/>pytesseract Spanish]
D --> E[Text Preprocessing<br/>Cleaning, Normalization]
E --> F[Embedding Generation<br/>Sentence Transformers]
F --> G[FAISS Vector Search<br/>Similarity Matching]
G --> H{Similarity Analysis}
H -->|≥0.75| I[Duplicates & Related]
H -->|<0.75| J[Outliers]
I --> K[Fraud Investigation]
J --> L[Quality Review]
K --> M[Report Generation]
L --> M
M --> N[Email Notifications<br/>To Stakeholders]
Architecture Components
- pytesseract (Spanish): Extracts text from scanned prescriptions using
Tesseract OCR with Spanish language model. Handles Chilean medical terminology, medication
names, and prescription formats.
- Sentence Transformers (all-MiniLM-L6-v2): Generates 384-dimensional
embeddings for text similarity analysis. Chosen for its balance of accuracy and speed with
Spanish language support.
- FAISS IVF Index: Provides efficient similarity search across the growing prescription
database. Uses an Inverted File (IVF) index with 100 clusters, for 50x faster search than
brute-force methods.
- Cron Job Scheduler: Daily execution at 5:00 AM, processing previous day's
prescriptions in approximately 30 minutes.
- PostgreSQL: Metadata storage for prescriptions, similarity results, and
tracking of processed items.
- Email Notifications: Sends targeted email notifications to relevant
stakeholders based on similarity categories (fraud team, quality team).
Daily Batch Workflow
sequenceDiagram
participant C as Cron Scheduler
participant S as Script Engine
participant O as OCR Engine<br/>pytesseract
participant D as Database
participant F as FAISS Index
participant E as Email Service
C->>S: Trigger (5:00 AM)
S->>D: Fetch new prescriptions (yesterday)
D-->>S: 1K scanned prescriptions
S->>O: Extract text from scans
O-->>S: Spanish text extracted
S->>S: Preprocess & Generate Embeddings
S->>F: Vector similarity search
F-->>S: Top 100 nearest neighbors
S->>S: Cluster & Categorize
S->>D: Store results
S->>E: Send email notifications
E-->>S: Emails sent
S-->>C: Completed (5:30 AM)
The daily batch process runs during off-peak hours to minimize system load. The workflow is designed
to be fault-tolerant with retry mechanisms and comprehensive logging for monitoring and debugging.
- Document Ingestion: Fetches all scanned prescriptions created/modified in the
previous 24 hours from healthcare providers and pharmacies in Chile
- OCR Processing: Extracts Spanish text from scanned prescriptions using
pytesseract with Chilean medical terminology and medication name recognition
- Preprocessing Pipeline: Text cleaning, normalization, and handling of OCR
artifacts before embedding generation
- Parallel Processing: Embedding generation and FAISS search run in parallel
across multiple CPU cores to optimize throughput
- Result Categorization: Prescriptions classified based on similarity thresholds
with actionable outputs for each category
- Automated Notifications: Different report formats for different stakeholders
(fraud team gets suspicious prescription pairs, quality team gets OCR quality outliers)
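The retry mechanisms and logging mentioned above are not shown in this write-up. A minimal sketch of how a single workflow step could be wrapped, assuming exponential backoff and three attempts (both values, the logger name, and the helper `run_with_retry` are illustrative, not taken from the actual system):

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("prescription_batch")  # hypothetical logger name


def run_with_retry(
    step: Callable[[], T],
    name: str,
    attempts: int = 3,          # assumed retry budget
    base_delay: float = 2.0,    # assumed initial backoff in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run one batch step, retrying with exponential backoff on any exception."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            result = step()
            logger.info("step %s succeeded on attempt %d", name, attempt)
            return result
        except Exception:
            logger.exception("step %s failed on attempt %d", name, attempt)
            if attempt == attempts:
                raise  # give up and let the scheduler record the failure
            sleep(base_delay * 2 ** (attempt - 1))  # 2s, 4s, 8s, ...
    raise RuntimeError("unreachable")
```

Each stage (ingestion, OCR, embedding, search, notification) could be passed in as `step`, so that a transient database or mail-server error does not abort the whole nightly run.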
Similarity Detection Process
flowchart LR
A[Scanned Prescription] --> B[OCR Extraction<br/>pytesseract Spanish]
B --> C[Text Preprocessing<br/>Clean & Normalize]
C --> D[Generate Embedding<br/>384-dim vector]
D --> E[FAISS IVF Index<br/>Growing database]
E --> F[Search Top 100<br/>Nearest Neighbors]
F --> G[Calculate Cosine Similarity]
G --> H{Similarity Threshold}
H -->|≥0.95| I[Duplicate Flag]
H -->|0.75-0.95| J[Related Flag]
H -->|<0.75| K[No Match]
I --> L[Fraud Investigation]
J --> L
K --> M[Normal Processing]
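For reference, the "Calculate Cosine Similarity" step in the diagram reduces to a dot product once both vectors are scaled to unit length. The pure-Python sketch below illustrates the formula only; the helper names are hypothetical, and a real pipeline would use vectorized operations instead.

```python
import math
from typing import Sequence


def l2_normalize(v: Sequence[float]) -> list:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|), computed as a dot product of unit vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))
```

Two embeddings pointing in the same direction score 1.0 regardless of magnitude, and orthogonal embeddings score 0.0.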
Similarity Categories
| Category | Threshold | Examples | Action |
| --- | --- | --- | --- |
| Exact Duplicates | ≥ 0.98 | Identical prescriptions scanned multiple times | Fraud investigation alert |
| Near Duplicates | 0.95 - 0.98 | Same medication/dosage with minor variations (dates, formatting) | Fraud investigation alert |
| Related | 0.75 - 0.95 | Similar medications, same patient, same doctor | Fraud investigation, pattern analysis |
| Unique | < 0.75 | No significant similarities in database | Normal processing, quality review |
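The table above maps directly onto a small lookup. The sketch below assumes that each boundary value belongs to the higher category (a score of exactly 0.98 counts as an exact duplicate), which the table leaves ambiguous; the category identifiers and the `categorize` function are illustrative names.

```python
# Thresholds from the table, checked from strictest to loosest.
# Boundary handling (>=) is an assumption; the table does not specify it.
THRESHOLDS = [
    (0.98, "exact_duplicate"),
    (0.95, "near_duplicate"),
    (0.75, "related"),
]

ACTIONS = {
    "exact_duplicate": "Fraud investigation alert",
    "near_duplicate": "Fraud investigation alert",
    "related": "Fraud investigation, pattern analysis",
    "unique": "Normal processing, quality review",
}


def categorize(score: float) -> str:
    """Return the similarity category for a cosine score."""
    for minimum, category in THRESHOLDS:
        if score >= minimum:
            return category
    return "unique"
```

For example, `ACTIONS[categorize(0.96)]` yields the fraud investigation alert used for near duplicates.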
FAISS Optimization Details
- Index Type: FAISS IVF (Inverted File) with 100 clusters, optimized for
recall-speed tradeoff
- Vector Dimension: 384 (all-MiniLM-L6-v2 output)
- Index Size: Scales linearly with document volume, at about 1.5 MB of RAM per 1,000 vectors
(384 float32 values at 4 bytes each); the index is held in memory for sub-millisecond search
- Indexing Time: ~3 minutes for initial build, daily updates add ~1-2 minutes
- Search Speed: ~0.3ms per query for top-100 nearest neighbors
- NPROBE Parameter: Set to 20 clusters to search, balancing accuracy and
speed (achieves 95% recall vs exact search)
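The index-size figure can be checked with simple arithmetic: a 384-dimensional float32 vector takes 384 × 4 = 1,536 bytes, so 1,000 vectors take about 1.5 MB. The sketch below reproduces that number and projects one year of growth at 1,000 prescriptions per day. The optional 8 bytes per vector accounts for the 64-bit id that an IVF flat index stores alongside each vector; centroid storage and other overheads are ignored, so treat the results as lower bounds.

```python
DIM = 384            # embedding dimension
FLOAT32_BYTES = 4    # bytes per float32 component
ID_BYTES = 8         # 64-bit id stored with each vector in an IVF list


def index_bytes(n_vectors: int, dim: int = DIM, with_ids: bool = False) -> int:
    """Approximate bytes needed to hold n_vectors uncompressed vectors."""
    per_vector = dim * FLOAT32_BYTES + (ID_BYTES if with_ids else 0)
    return n_vectors * per_vector


def to_mb(n_bytes: int) -> float:
    """Convert bytes to decimal megabytes."""
    return n_bytes / 1_000_000


per_thousand_mb = to_mb(index_bytes(1_000))    # 1.536 MB
per_year_mb = to_mb(index_bytes(365_000))      # 560.64 MB
```

At this rate the raw vectors stay well under 1 GB per year, which is consistent with keeping the whole index in RAM.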
Reports & Outputs
flowchart TD
A[Similarity Results] --> B{Categorization}
B --> C[Duplicate Cluster]
B --> D[Related Cluster]
B --> E[Outlier Set]
C --> F[Fraud Alert Report<br/>Suspicious pairs]
D --> F
E --> G[Quality Report<br/>Outliers for review]
F --> H[Email Notification<br/>Fraud Team]
G --> I[Email Notification<br/>Quality Team]
H --> J[Database Log]
I --> J
The system generates two primary report types sent as email notifications to specific stakeholder
groups with actionable insights and recommendations.
- Fraud Alert Report: Identifies duplicate and related prescriptions from
different patients, doctors, or pharmacies that may indicate prescription fraud rings or
medication diversion. Includes similarity scores, OCR quality metrics, and recommended
investigation priority based on medication type and value.
- Quality Review Report: Highlights outlier prescriptions with poor OCR quality,
unrecognized medication names, or unusual patterns indicating potential data quality issues or
new prescription formats requiring human review.
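The routing of results into the two reports could be sketched as follows. The `Match` record, its field names, and the recipient keys are hypothetical; the sketch only shows the split described above, with duplicate and related pairs going to the fraud team (highest similarity first) and everything else going to the quality team.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Match:
    """One categorized result (hypothetical schema)."""
    prescription_id: str
    neighbor_id: Optional[str]  # None when no similar prescription was found
    score: float                # cosine similarity to the nearest neighbor
    category: str               # e.g. "exact_duplicate", "related", "unique"


FRAUD_CATEGORIES = {"exact_duplicate", "near_duplicate", "related"}


def build_reports(matches: List[Match]) -> Dict[str, List[Match]]:
    """Split results into per-team report payloads."""
    fraud = sorted(
        (m for m in matches if m.category in FRAUD_CATEGORIES),
        key=lambda m: m.score,
        reverse=True,  # most similar pairs first
    )
    quality = [m for m in matches if m.category not in FRAUD_CATEGORIES]
    return {"fraud_team": fraud, "quality_team": quality}
```

Each list would then be rendered into the email body for its team and logged to the database.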
Key Learnings
Building this OCR-based medical claim fraud detection system provided valuable insights into Spanish
OCR, Chilean medical terminology, large-scale similarity search, and batch processing challenges:
- FAISS Index Selection: The choice between FAISS index types (IVF, HNSW, PQ)
depends on the specific use case. IVF provided the best balance of accuracy, speed, and memory
for our growing prescription database. Tuning nprobe was critical—too low reduces accuracy, too
high increases latency. Finding the optimal point (nprobe=20) required systematic
experimentation.
- Threshold Calibration: Similarity thresholds (0.95 for duplicates, 0.75 for
related) were not set arbitrarily; they required extensive manual validation against labeled
data across different prescription types and formats. Thresholds that worked well for standard
printed prescriptions were too strict for handwritten ones, so production uses format-specific
thresholds.
- Batch vs Real-time Trade-offs: While real-time similarity detection would
provide immediate insights, batch processing was more appropriate for this use case because: (1)
fraud investigation does not need real-time results, (2) daily reports are sufficient for
analysis, (3) batch processing allows more thorough analysis and error handling, and (4)
infrastructure costs are significantly lower, with predictable resource usage.
- OCR Quality Challenges: Scanned prescriptions vary significantly in
quality—from crisp digital prints to faded carbon copies and handwritten notes. Implemented OCR
confidence scoring to flag low-quality extractions. pytesseract with Spanish model achieved 92%
average accuracy, but handwritten sections required manual review. Image preprocessing
(denoising, contrast enhancement) improved OCR accuracy by 15%.
- Chilean Medical Terminology: Successfully handled Chilean-specific medication
names, dosage formats, and prescription abbreviations. Created a custom dictionary of 500+
Chilean medications and medical terms to improve OCR recognition. Dosage formats (mg/ml, gotas,
comprimidos) required normalization for consistent similarity matching.
- False Positive Management: In production, false positives are costly—they cause
wasted investigation time for the fraud team. Implemented a two-stage process: (1) automated
categorization, (2) human-in-the-loop for borderline cases. Added confidence scores and
explainability features to help reviewers make decisions faster.
- Document Preprocessing Importance: The quality of similarity detection heavily
depends on text preprocessing. Removed boilerplate headers/footers, normalized dates/times,
handled OCR artifacts, and applied Spanish text normalization. Preprocessing increased accuracy
by 8-10% but added 40% to processing time, a worthwhile trade-off.
- Incremental Index Updates: Rebuilding the entire FAISS index daily took too
long. Implemented incremental updates—adding new vectors and periodically rebuilding for
balance. Daily adds take 1-2 minutes, full rebuild happens monthly during maintenance window.
This keeps batch duration under 30 minutes.
- Monitoring and Alerting: Batch processes can fail silently. Implemented
comprehensive monitoring: success/failure notifications, processing duration alerts (if >1
hour), result volume alerts (if duplicates drop suddenly), and automated health checks. The
system has 99.8% uptime with average recovery time of 15 minutes when issues occur.
- Scalability Planning: With 365K new prescriptions/year, the system needs to
scale to millions of vectors over time. Architecture supports this: FAISS can handle billions of
vectors, current hardware has headroom, and the batch duration will remain manageable as
prescription volume grows. Future-proofing is essential for long-term projects.
- Stakeholder Communication: Different stakeholders care about different metrics.
Fraud team wants suspicious prescription pairs, quality team wants OCR outliers. Tailored email
reports with relevant metrics and actionable recommendations increased adoption from 60% to 95%
across teams. Communication is as important as technical implementation.
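As an illustration of the dosage and date normalization described in the learnings above, the sketch below lowercases text, strips accents, replaces dates with a placeholder, and maps unit variants to a canonical form. The alias table is a tiny hypothetical sample, not the 500+ entry dictionary mentioned above, and stripping accents also maps "ñ" to "n", which is assumed to be acceptable for similarity matching.

```python
import re
import unicodedata

# Hypothetical sample of unit aliases; the real dictionary is much larger.
UNIT_ALIASES = {
    "miligramos": "mg", "miligramo": "mg", "mgs": "mg",
    "mililitros": "ml", "mililitro": "ml",
    "comprimido": "comprimidos", "comp": "comprimidos", "comps": "comprimidos",
    "gota": "gotas", "gts": "gotas",
}

DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
TOKEN_RE = re.compile(r"__date__|\d+(?:\.\d+)?|[a-z]+(?:/[a-z]+)?")


def strip_accents(text: str) -> str:
    """Remove combining marks, e.g. 'solución' -> 'solucion'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_text(text: str) -> str:
    """Normalize OCR output so equivalent prescriptions produce the same string."""
    text = strip_accents(text.lower())
    text = DATE_RE.sub(" __date__ ", text)          # dates differ between rescans
    text = re.sub(r"(\d),(\d)", r"\1.\2", text)     # decimal comma: 2,5 -> 2.5
    text = re.sub(r"(\d)([a-z])", r"\1 \2", text)   # 500mg -> 500 mg
    tokens = TOKEN_RE.findall(text)                 # also drops punctuation
    return " ".join(UNIT_ALIASES.get(tok, tok) for tok in tokens)
```

With this normalization, "Paracetamol 500MG, 1 Comprimido cada 8 hrs. Fecha: 03/05/2023" and "PARACETAMOL 500 mg 1 comp. cada 8 hrs fecha 3-5-23" reduce to the same string before embedding.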