RPA Pension Payment Automation System

Insurance Sector

RPA, Python, Selenium, Web Scraping, SharePoint, PDF Processing, Automation

Designed and implemented a large-scale Robotic Process Automation (RPA) system for automated pension payment processing in the insurance sector. Built with Python and Selenium, the system processes 500,000+ payment documents monthly: it scrapes the documents in intelligently split batches, extracts and summarizes the PDF contents, and uploads the results to SharePoint. The automation saves 50,000 hours annually and significantly reduces regulatory complaints from Chile's Superintendency.

500K+ PDFs Processed Monthly
50K Hours Saved Yearly
100% Automated Workflow
Zero IP Blocking Events

Business Impact

  • 50,000 hours saved annually - Massive reduction in manual processing time
  • Reduced Superintendency complaints - Improved compliance and customer satisfaction
  • Zero blocking-related downtime - Robust anti-blocking measures keep the monthly run uninterrupted
  • Enterprise-grade integration - Seamless SharePoint integration for document management
  • Scalable architecture - Handles peak loads without performance degradation

System Architecture

flowchart TD
    A[Monthly Trigger<br/>Scheduled Job] --> B[Download Lists<br/>Nóminas Paid Last Month]
    B --> C[Intelligent Batch Splitting<br/>Anti-Blocking Algorithm]
    C --> D[Web Scraping with Selenium<br/>Batch 1-N]
    D --> E[Download PDFs<br/>Payment Documents]
    E --> F[Queue Management<br/>Rate Limiting]
    F --> G[PDF Processing Pipeline]
    G --> H[Text Extraction]
    H --> I[Data Extraction<br/>Structured Information]
    I --> J[Summary Generation<br/>Automated Reports]
    J --> K[SharePoint Integration<br/>Document Upload]
    K --> L[Completion Notification<br/>Status Logging]
    F --> M{Batch Complete?}
    M -->|No| D
    M -->|Yes| G

Workflow Process

  • Monthly Trigger: Automated job initiation downloads previous month's payment lists
  • Intelligent Batching: System divides large datasets into optimal batches using anti-blocking algorithm
  • Web Scraping & Downloads: Selenium retrieves payment documents with sophisticated rate limiting to prevent blocking
  • PDF Processing: Extracts text and structures payment information from 500,000+ documents monthly
  • Summarization: Generates automated reports with totals, trends, and regulatory compliance data
  • SharePoint Integration: Uploads organized documents with metadata tagging for enterprise searchability
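
The batching step in the workflow above can be illustrated with a minimal sketch. It assumes a fixed batch size and a fixed pause between batches; the names `split_into_batches`, `run_batches`, and `download_batch` are hypothetical and stand in for the production code, which is not shown here.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split_into_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_batches(
    items: Sequence[T],
    batch_size: int,
    download_batch: Callable[[List[T]], List[R]],  # hypothetical downloader
    pause_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Process every batch in order, pausing between batches (not after the last)."""
    batches = list(split_into_batches(items, batch_size))
    results: List[R] = []
    for index, batch in enumerate(batches):
        results.extend(download_batch(batch))
        if index < len(batches) - 1:
            sleep(pause_seconds)
    return results
```

Passing `sleep` as a parameter keeps the pacing logic testable without real waiting; in a real run the default `time.sleep` would apply.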

Key Technical Challenges

Large-Scale Web Scraping Without Blocking

Downloading 500,000+ documents a month risks IP blocking. Solution: intelligent batch splitting with dynamic sizing, exponential backoff, session rotation, and adaptive speed monitoring, which together prevent detection while maintaining high throughput.
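
As a rough illustration of two of these techniques, the sketch below combines exponential backoff (with random jitter) and a simple rule for resizing batches. The `BlockedError` exception, the halving and 10% growth rule, and all default values are assumptions made for the example, not the system's actual parameters.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BlockedError(Exception):
    """Hypothetical signal that the site throttled or rejected a request."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound of the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def call_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Retry `action` on BlockedError, waiting a jittered exponential delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return action()
        except BlockedError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the block to the caller
            # Jitter: wait a random fraction of the exponential bound
            sleep(rng() * backoff_delay(attempt, base, cap))
    raise RuntimeError("unreachable")


def next_batch_size(current: int, was_blocked: bool,
                    minimum: int = 10, maximum: int = 500) -> int:
    """Halve the batch after a block; grow it by about 10% after a clean run."""
    if was_blocked:
        return max(minimum, current // 2)
    return min(maximum, current + max(1, current // 10))
```

Shrinking quickly and growing slowly is a common choice for this kind of feedback loop: a block is costly, so the sketch backs off hard and recovers throughput gradually.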

PDF Processing at Scale

The pipeline must handle half a million PDFs a month of varying quality. Solution: a multi-format parser with OCR fallback, parallel processing, automatic retry logic, and quality validation checkpoints, which together cope with corrupted, scanned, and text-based documents.
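
The fallback idea can be sketched as a chain of extractors tried in order, with a basic quality check after each one. The extractors are passed in as plain functions so the example stays self-contained; in practice they might wrap a text-based parser first and an OCR step last. The `min_chars` threshold and all names here are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# An extractor takes raw PDF bytes and returns the text it could recover.
Extractor = Callable[[bytes], str]


@dataclass
class ExtractionResult:
    text: Optional[str]        # None when every extractor failed
    method: Optional[str]      # name of the extractor that succeeded
    errors: Tuple[str, ...]    # why earlier extractors were rejected


def looks_usable(text: str, min_chars: int = 20) -> bool:
    """Crude quality gate: enough non-whitespace characters to be real content."""
    return len("".join(text.split())) >= min_chars


def extract_with_fallback(
    pdf_bytes: bytes,
    extractors: Sequence[Tuple[str, Extractor]],
    min_chars: int = 20,
) -> ExtractionResult:
    """Try each (name, extractor) pair in order and return the first usable text."""
    errors: List[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(pdf_bytes)
        except Exception as exc:  # a corrupted file may fail in any parser
            errors.append(f"{name}: {exc}")
            continue
        if looks_usable(text, min_chars):
            return ExtractionResult(text, name, tuple(errors))
        errors.append(f"{name}: text too short")
    return ExtractionResult(None, None, tuple(errors))
```

Recording which method succeeded, and why the others were rejected, gives the quality checkpoints something concrete to log for each document.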

Enterprise SharePoint Integration

Uploading 500,000+ documents requires robust organization. Solution: dynamic folder creation by payment period, convenio (agreement), and nómina (payroll list), combined with metadata tagging, keeps documents organized and searchable.
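
A minimal sketch of how such a folder path and its metadata could be derived is shown below. The root folder name, the metadata column names, and the list of characters treated as invalid in SharePoint names are assumptions for the example; the upload call itself is left out.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Dict


@dataclass(frozen=True)
class PaymentDocument:
    period: date      # any day inside the payment month
    convenio: str
    nomina: str
    filename: str


# Characters assumed to be rejected in SharePoint file and folder names
_INVALID_CHARS = re.compile(r'["*:<>?/\\|]')


def safe_segment(name: str) -> str:
    """Replace disallowed characters and trim spaces and trailing dots."""
    cleaned = _INVALID_CHARS.sub("_", name).strip().rstrip(".")
    return cleaned or "unknown"


def folder_path(doc: PaymentDocument, root: str = "Pension Payments") -> str:
    """Build <root>/<YYYY-MM>/<convenio>/<nomina> for one document."""
    return "/".join([
        root,  # hypothetical library root
        doc.period.strftime("%Y-%m"),
        safe_segment(doc.convenio),
        safe_segment(doc.nomina),
    ])


def metadata(doc: PaymentDocument) -> Dict[str, str]:
    """Metadata to attach on upload; the column names are hypothetical."""
    return {
        "PaymentPeriod": doc.period.strftime("%Y-%m"),
        "Convenio": doc.convenio,
        "Nomina": doc.nomina,
    }
```

Keeping the original convenio and nómina values in the metadata, while sanitizing only the folder names, preserves searchability even when a name contains characters the folder path cannot hold.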

Reliability Achievements

  • Zero blocking events: System has operated without IP blocking or account suspension since deployment
  • 99.9% uptime: Reliable monthly processing with self-healing from transient issues
  • Graceful degradation: Continues processing even when services experience temporary problems

Data Extraction & Summarization

The system extracts actionable information and generates summaries that provide business intelligence and support regulatory compliance with Chile's Superintendency.

  • Structured Data Extraction: Key fields (amounts, dates, payment types, customer IDs) automatically extracted from PDFs
  • Validation & Quality Control: Business logic validation ensures extracted data meets quality standards
  • Automated Summaries: Monthly reports with totals, averages, trends, and regulatory compliance data
  • Dashboard Integration: Real-time monitoring exports data to analytics dashboards for decision support
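
The extraction, validation, and summary steps listed above can be sketched with regular expressions over the extracted text. The field labels ("RUT:", "Fecha de pago:", "Monto:", "Tipo:") and the two validation rules are hypothetical; the real document layout and business rules are not described here.

```python
import re
from datetime import date, datetime
from decimal import Decimal
from typing import Dict, List, Optional

# Hypothetical field labels; the real PDF layout may differ.
PATTERNS = {
    "customer_id": re.compile(r"RUT:\s*(\d{1,2}\.\d{3}\.\d{3}-[\dkK])"),
    "payment_date": re.compile(r"Fecha de pago:\s*(\d{2}/\d{2}/\d{4})"),
    "amount": re.compile(r"Monto:\s*\$\s*(\d{1,3}(?:\.\d{3})*)"),
    "payment_type": re.compile(r"Tipo:\s*(\w+)"),
}


def extract_fields(text: str) -> Optional[dict]:
    """Return the parsed fields, or None if any field is missing or malformed."""
    raw: Dict[str, str] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            return None
        raw[name] = match.group(1)
    try:
        paid_on = datetime.strptime(raw["payment_date"], "%d/%m/%Y").date()
    except ValueError:  # e.g. 31/02/2024
        return None
    return {
        "customer_id": raw["customer_id"],
        "payment_date": paid_on,
        # Dots are treated as thousands separators: "350.000" -> 350000
        "amount": Decimal(raw["amount"].replace(".", "")),
        "payment_type": raw["payment_type"],
    }


def validation_errors(record: dict, period_start: date, period_end: date) -> List[str]:
    """Example business rules: positive amount, date inside the payment period."""
    errors = []
    if record["amount"] <= 0:
        errors.append("amount must be positive")
    if not period_start <= record["payment_date"] <= period_end:
        errors.append("payment date outside period")
    return errors


def summarize(records: List[dict]) -> dict:
    """Count, total, average, and per-type totals for validated records."""
    total = sum((r["amount"] for r in records), Decimal(0))
    by_type: Dict[str, Decimal] = {}
    for r in records:
        by_type[r["payment_type"]] = by_type.get(r["payment_type"], Decimal(0)) + r["amount"]
    average = total / len(records) if records else Decimal(0)
    return {"count": len(records), "total": total, "average": average, "total_by_type": by_type}
```

`Decimal` is used instead of `float` so that monetary totals stay exact across hundreds of thousands of records.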

Key Learnings

Building this large-scale RPA system provided valuable insights into enterprise automation:

  • Scale Changes Everything: Solutions working for 1,000 documents fail at 500,000. Architecture must be designed for scale from day one with parallel processing, efficient memory management, and robust error handling.
  • Anti-Blocking is Critical: At enterprise scale, sophisticated detection mechanisms require more than simple rate limiting. Behavioral simulation, adaptive algorithms, and continuous monitoring are essential to maintain high throughput.
  • PDF Variability Demands Robustness: Handling text-based PDFs, scanned images, corrupted documents, and mixed layouts requires multi-format parsing with OCR fallback. Quality validation checkpoints are non-negotiable.
  • Error Recovery Strategy: At this volume, errors are inevitable. The system must distinguish between transient (retryable), fatal (manual intervention), and expected (ignore) errors with comprehensive logging and alerting.
  • Automation Isn't "Set It and Forget It": The 50,000-hour annual savings justified development investment, but ongoing maintenance costs must be factored into total ROI. Monitoring, updates, and occasional refactoring are essential.
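
The three error categories from the error recovery point above can be sketched as a small classifier plus a processing loop. Which exceptions count as transient or expected is an assumption for this example, and `DocumentAlreadyProcessed` is a hypothetical exception type.

```python
import logging
from enum import Enum
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger("rpa")


class ErrorKind(Enum):
    TRANSIENT = "transient"  # retry
    FATAL = "fatal"          # needs manual intervention
    EXPECTED = "expected"    # safe to ignore


class DocumentAlreadyProcessed(Exception):
    """Hypothetical example of an expected, ignorable condition."""


def classify(exc: Exception) -> ErrorKind:
    """Map an exception to a category; unknown errors default to FATAL."""
    if isinstance(exc, DocumentAlreadyProcessed):
        return ErrorKind.EXPECTED
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorKind.TRANSIENT
    return ErrorKind.FATAL


def process_all(items: Iterable[str], handler: Callable[[str], None],
                max_retries: int = 2) -> Dict[str, List[str]]:
    """Run handler on every item; retry transient errors, log and skip the rest."""
    report: Dict[str, List[str]] = {"done": [], "skipped": [], "failed": []}
    for item in items:
        for attempt in range(max_retries + 1):
            try:
                handler(item)
                report["done"].append(item)
                break
            except Exception as exc:
                kind = classify(exc)
                if kind is ErrorKind.EXPECTED:
                    report["skipped"].append(item)
                    break
                if kind is ErrorKind.TRANSIENT and attempt < max_retries:
                    logger.warning("retrying %s after %s", item, exc)
                    continue
                # Fatal, or transient with retries exhausted
                logger.error("giving up on %s: %s", item, exc)
                report["failed"].append(item)
                break
    return report
```

Defaulting unknown exceptions to fatal is a deliberately conservative choice in this sketch: one bad document is reported for manual review while the rest of the run continues.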

Technology Stack

| Component | Technology | Purpose |
|---|---|---|
| Programming Language | Python 3.x | Core automation logic |
| Web Automation | Selenium WebDriver | Browser automation and web scraping |
| PDF Processing | PyPDF2, pdfplumber | Text extraction from PDFs |
| OCR | Tesseract | Image-based PDF text extraction |
| SharePoint Integration | Office365-REST-Python-Client | Enterprise document management |
| Authentication | OAuth 2.0 / MSAL | Secure access to SharePoint APIs |
| Job Scheduling | Windows batch file (.bat) | Monthly automated execution |