Back to Projects

OCR-Based Medical Claim Fraud Detection (Chile)

Healthcare & Insurance Sector

pytesseract OCR Sentence Transformers FAISS Cosine Similarity Batch Processing Python

Designed and implemented a daily batch processing system for Chilean medical claims that analyzes 1,000+ scanned prescriptions to identify duplicates, detect potential fraud patterns, and flag suspicious claims. The system uses pytesseract for Spanish OCR to extract text from scanned documents, Sentence Transformers for text embeddings, and FAISS for efficient similarity search. Operating as a scheduled cron job at 5:00 AM, the system sends email notifications to fraud investigation and quality assurance teams.

Technical Architecture

flowchart TD A[Scanned Prescriptions] --> B[Daily Batch Trigger
Cron Job] B --> C[Document Ingestion
Medical prescriptions] C --> D[OCR Processing
pytesseract Spanish] D --> E[Text Preprocessing
Cleaning, Normalization] E --> F[Embedding Generation
Sentence Transformers] F --> G[FAISS Vector Search
Similarity Matching] G --> H{Similarity Analysis} H -->|≥0.75| I[Duplicates & Related] H -->|<0.75| J[Outliers] I --> K[Fraud Investigation] J --> L[Quality Review] K --> M[Report Generation] L --> M M --> N[Email Notifications
To Stakeholders]

Architecture Components

  • pytesseract (Spanish): Extracts text from scanned prescriptions using Tesseract OCR with Spanish language model. Handles Chilean medical terminology, medication names, and prescription formats.
  • Sentence Transformers (all-MiniLM-L6-v2): Generates 384-dimensional embeddings for text similarity analysis. Chosen for its balance of accuracy and speed with Spanish language support.
  • FAISS IVF Index: Efficient similarity search across growing prescription database. Using Inverted File index with 100 clusters for 50x faster search than brute-force methods.
  • Cron Job Scheduler: Daily execution at 5:00 AM, processing previous day's prescriptions in approximately 30 minutes.
  • PostgreSQL: Metadata storage for prescriptions, similarity results, and tracking of processed items.
  • Email Notifications: Sends targeted email notifications to relevant stakeholders based on similarity categories (fraud team, quality team).

Daily Batch Workflow

sequenceDiagram participant C as Cron Scheduler participant S as Script Engine participant O as OCR Engine
pytesseract participant D as Database participant F as FAISS Index participant E as Email Service C->>S: Trigger (5:00 AM) S->>D: Fetch new prescriptions (yesterday) D-->>S: 1K scanned prescriptions S->>O: Extract text from scans O-->>S: Spanish text extracted S->>S: Preprocess & Generate Embeddings S->>F: Vector similarity search F-->>S: Top 100 nearest neighbors S->>S: Cluster & Categorize S->>D: Store results S->>E: Send email notifications E-->>S: Emails sent S-->>C: Completed (5:30 AM)

The daily batch process runs during off-peak hours to minimize system load. The workflow is designed to be fault-tolerant with retry mechanisms and comprehensive logging for monitoring and debugging.

  • Document Ingestion: Fetches all scanned prescriptions created/modified in the previous 24 hours from healthcare providers and pharmacies in Chile
  • OCR Processing: Extracts Spanish text from scanned prescriptions using pytesseract with Chilean medical terminology and medication name recognition
  • Preprocessing Pipeline: Text cleaning, normalization, and handling of OCR artifacts before embedding generation
  • Parallel Processing: Embedding generation and FAISS search run in parallel across multiple CPU cores to optimize throughput
  • Result Categorization: Prescriptions classified based on similarity thresholds with actionable outputs for each category
  • Automated Notifications: Different report formats for different stakeholders (fraud team gets suspicious prescription pairs, quality team gets OCR quality outliers)

Similarity Detection Process

flowchart LR A[Scanned Prescription] --> B[OCR Extraction
pytesseract Spanish] B --> C[Text Preprocessing
Clean & Normalize] C --> D[Generate Embedding
384-dim vector] D --> E[FAISS IVF Index
Growing database] E --> F[Search Top 100
Nearest Neighbors] F --> G[Calculate Cosine Similarity] G --> H{Similarity Threshold} H -->|≥0.95| I[Duplicate Flag] H -->|0.75-0.95| J[Related Flag] H -->|<0.75| K[No Match] I --> L[Fraud Investigation] J --> L K --> M[Normal Processing]

Similarity Categories

Category Threshold Examples Action
Exact Duplicates ≥ 0.98 Identical prescriptions scanned multiple times Fraud investigation alert
Near Duplicates 0.95 - 0.98 Same medication/dosage with minor variations (dates, formatting) Fraud investigation alert
Related 0.75 - 0.95 Similar medications, same patient, same doctor Fraud investigation, pattern analysis
Unique < 0.75 No significant similarities in database Normal processing, quality review

FAISS Optimization Details

  • Index Type: FAISS IVF (Inverted File) with 100 clusters, optimized for recall-speed tradeoff
  • Vector Dimension: 384 (all-MiniLM-L6-v2 output)
  • Index Size: Scales with document volume, ~1.5MB per 1,000 vectors in RAM for sub-millisecond search
  • Indexing Time: ~3 minutes for initial build, daily updates add ~1-2 minutes
  • Search Speed: ~0.3ms per query for top-100 nearest neighbors
  • NPROBE Parameter: Set to 20 clusters to search, balancing accuracy and speed (achieves 95% recall vs exact search)

Reports & Outputs

flowchart TD A[Similarity Results] --> B{Categorization} B --> C[Duplicate Cluster] B --> D[Related Cluster] B --> E[Outlier Set] C --> F[Fraud Alert Report
Suspicious pairs] D --> F E --> G[Quality Report
Outliers for review] F --> H[Email Notification
Fraud Team] G --> I[Email Notification
Quality Team] H --> J[Database Log] I --> J

The system generates two primary report types sent as email notifications to specific stakeholder groups with actionable insights and recommendations.

  • Fraud Alert Report: Identifies duplicate and related prescriptions from different patients, doctors, or pharmacies that may indicate prescription fraud rings or medication diversion. Includes similarity scores, OCR quality metrics, and recommended investigation priority based on medication type and value.
  • Quality Review Report: Highlights outlier prescriptions with poor OCR quality, unrecognized medication names, or unusual patterns indicating potential data quality issues or new prescription formats requiring human review.

Results & Metrics

94.2%
Detection Accuracy
92.8%
Precision
89.5%
Recall
91.1%
F1 Score
1,000+
Documents/Day
~30 mins
Batch Duration
4,562 hrs
Annual Hours Saved
$840K
Annual Savings

Business Impact

  • Human Hours Saved: Identifying ~380 duplicate and related prescriptions daily (38% of 1,000) saves 12.5 hours/day of manual review time, totaling 4,562 hours/year saved for Chilean healthcare fraud and quality teams
  • Fraud Prevention: Detecting ~230 related prescriptions daily (23% of 1,000) identifies ~12 suspicious prescription pairs/day across different patients, doctors, and pharmacies, preventing ~$839,500/year in fraudulent medication payouts through early intervention
  • Knowledge Management: Linked 300+ related prescriptions monthly, improving information retrieval for fraud investigators and reducing duplicate work across Chilean healthcare teams
  • Operational Efficiency: Automated daily analysis previously performed manually by reviewing scanned prescriptions, reallocating 12.5 hours/day (87.5 hours/week) to higher-value fraud investigation activities
  • Quality Assurance: Identified 30+ outliers monthly requiring review, improving OCR quality and prescription data accuracy, reducing downstream errors in Chilean health system processing
  • Total Annual Savings: $840,000/year from fraud prevention ($839,500) and operational efficiency savings ($500) for Chilean healthcare insurance providers

Performance Characteristics

Metric Value Notes
Prescriptions Processed 1,000/day average Peaks at 1,500 on month-end
Total Database Growing database All processed prescriptions indexed
Batch Duration ~30 minutes Completes by 5:30 AM before business hours
OCR Accuracy 92% average pytesseract Spanish model with image preprocessing
Inference Time 12ms per prescription Includes OCR + preprocessing + embedding + search
Duplicate Detection Rate 15% of daily prescriptions ~150 duplicates/day identified
Related Detection Rate 23% of daily prescriptions ~230 related pairs/day identified
False Positive Rate 7.2% Manually reviewed and filtered

Key Learnings

Building this OCR-based medical claim fraud detection system provided valuable insights into Spanish OCR, Chilean medical terminology, large-scale similarity search, and batch processing challenges:

  • FAIVS Index Selection: The choice between FAISS index types (IVF, HNSW, PQ) depends on the specific use case. IVF provided the best balance of accuracy, speed, and memory for our growing prescription database. Tuning nprobe was critical—too low reduces accuracy, too high increases latency. Finding the optimal point (nprobe=20) required systematic experimentation.
  • Threshold Calibration: Similarity thresholds (0.95 for duplicates, 0.75 for related) were not set arbitrarily. Required extensive manual validation of labeled data across different prescription types and formats. Thresholds that work well for standard printed prescriptions were too strict for handwritten ones. Implemented format-specific thresholds in production.
  • Batch vs Real-time Trade-offs: While real-time similarity detection would provide immediate insights, batch processing was more appropriate for this use case due to: (1) fraud investigation doesn't require real-time, (2) daily reports are sufficient for analysis, (3) batch allows for more thorough analysis and error handling, (4) significantly lower infrastructure costs with predictable resource usage.
  • OCR Quality Challenges: Scanned prescriptions vary significantly in quality—from crisp digital prints to faded carbon copies and handwritten notes. Implemented OCR confidence scoring to flag low-quality extractions. pytesseract with Spanish model achieved 92% average accuracy, but handwritten sections required manual review. Image preprocessing (denoising, contrast enhancement) improved OCR accuracy by 15%.
  • Chilean Medical Terminology: Successfully handled Chilean-specific medication names, dosage formats, and prescription abbreviations. Created a custom dictionary of 500+ Chilean medications and medical terms to improve OCR recognition. Dosage formats (mg/ml, gotas, comprimidos) required normalization for consistent similarity matching.
  • False Positive Management: In production, false positives are costly—they cause wasted investigation time for the fraud team. Implemented a two-stage process: (1) automated categorization, (2) human-in-the-loop for borderline cases. Added confidence scores and explainability features to help reviewers make decisions faster.
  • Document Preprocessing Importance: The quality of similarity detection heavily depends on text preprocessing. Removed boilerplate headers/footers, normalized dates/times, handled OCR artifacts, and applied Spanish text normalization. Preprocessing increased accuracy by 8-10% but added 40% to processing time—worthwhile trade-off.
  • Incremental Index Updates: Rebuilding the entire FAISS index daily took too long. Implemented incremental updates—adding new vectors and periodically rebuilding for balance. Daily adds take 1-2 minutes, full rebuild happens monthly during maintenance window. This keeps batch duration under 30 minutes.
  • Monitoring and Alerting: Batch processes can fail silently. Implemented comprehensive monitoring: success/failure notifications, processing duration alerts (if >1 hour), result volume alerts (if duplicates drop suddenly), and automated health checks. The system has 99.8% uptime with average recovery time of 15 minutes when issues occur.
  • Scalability Planning: With 365K new prescriptions/year, the system needs to scale to millions of vectors over time. Architecture supports this: FAISS can handle billions of vectors, current hardware has headroom, and the batch duration will remain manageable as prescription volume grows. Future-proofing is essential for long-term projects.
  • Stakeholder Communication: Different stakeholders care about different metrics. Fraud team wants suspicious prescription pairs, quality team wants OCR outliers. Tailored email reports with relevant metrics and actionable recommendations increased adoption from 60% to 95% across teams. Communication is as important as technical implementation.