Searchable PDF: Scanned Document OCR Pipeline and Full-Text Search Platform
Preserve the original scan and add an invisible text layer so Ctrl+F, copy-paste, and full-text search just work
Searchable PDF is a Django web application with a Celery-based OCR pipeline that accepts uploaded PDF or image files, runs OCR to extract text, generates searchable PDFs (original image + invisible text layer), and enables full-text search across the processed document library.
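The OCR step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `process_scan`, `OcrResult`, and the `.searchable.pdf` naming are hypothetical, and it handles a single page image only (an uploaded PDF would first need rasterizing, which is not shown). It assumes pytesseract's `image_to_pdf_or_hocr` and `image_to_string` are used to produce the text-layer PDF and the plain text. In the real application a function like this would presumably run inside a Celery task.

```python
from pathlib import Path
from typing import Callable, NamedTuple


class OcrResult(NamedTuple):
    pdf_path: Path  # where the searchable PDF was written
    text: str       # plain text intended for the full-text index


def tesseract_pdf(image_path: Path) -> bytes:
    """Return a PDF of the page image plus Tesseract's invisible text layer."""
    import pytesseract  # lazy import: requires the Tesseract binary at call time
    return pytesseract.image_to_pdf_or_hocr(str(image_path), extension="pdf")


def tesseract_text(image_path: Path) -> str:
    """Return the recognized text of the page image."""
    import pytesseract
    return pytesseract.image_to_string(str(image_path))


def process_scan(
    image_path: Path,
    out_dir: Path,
    to_pdf: Callable[[Path], bytes] = tesseract_pdf,
    to_text: Callable[[Path], str] = tesseract_text,
) -> OcrResult:
    """Write a searchable PDF for one scan and return its extracted text.

    The OCR callables are injectable so the flow can be exercised
    without Tesseract installed.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    pdf_path = out_dir / f"{image_path.stem}.searchable.pdf"  # hypothetical naming
    pdf_path.write_bytes(to_pdf(image_path))
    text = to_text(image_path).strip()
    return OcrResult(pdf_path, text)
```

Calling `process_scan(Path("scan_001.png"), Path("out"))` would write `out/scan_001.searchable.pdf` and return the text to be indexed. Keeping the OCR backend behind two small callables also makes it straightforward to swap engines later.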
Built at 1POINT1 for organizations digitizing legacy paper archives where image fidelity must be preserved but search and indexing are required.
- Role: AI Engineer at 1POINT1
- Search: PostgreSQL tsvector + ts_rank
- Team Size: Engineering delivery team
- Deployment: Django + Celery + PostgreSQL + Docker Compose
- OCR Engine: Tesseract (pytesseract)
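As an illustration of how a ranked search over a tsvector column might look, here is a hedged sketch. The table `documents`, its columns `id`, `title`, and `search_vector`, the `'english'` text search configuration, and the helper `build_search` are all assumptions made for the example; the project's actual schema and query code are not shown in this document.

```python
from typing import Tuple

# Assumed schema: documents(id, title, search_vector tsvector).
# plainto_tsquery turns free-form user input into a tsquery; ts_rank scores matches.
SEARCH_SQL = """
SELECT d.id, d.title, ts_rank(d.search_vector, q) AS rank
FROM documents AS d, plainto_tsquery('english', %s) AS q
WHERE d.search_vector @@ q
ORDER BY rank DESC
LIMIT %s
"""


def build_search(user_query: str, limit: int = 20) -> Tuple[str, tuple]:
    """Return (sql, params) for a ranked full-text search.

    The user's text is passed as a bound parameter, never interpolated
    into the SQL string.
    """
    cleaned = " ".join(user_query.split())  # collapse stray whitespace
    if not cleaned:
        raise ValueError("empty search query")
    return SEARCH_SQL, (cleaned, limit)
```

With a DB-API cursor (for example from psycopg), the result can be executed as `cursor.execute(*build_search("invoice 2019"))`. A GIN index on the tsvector column is the usual way to keep the `@@` match fast as the library grows.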