Machine Learning Intern
Description
- Built an OCR-based merchant-document validation pipeline for onboarding flows, processing PAN and Aadhaar images via S3 → AWS Lambda → Google Cloud Vision OCR → regex parsing → Signzy API verification, achieving 97% extraction-and-verification accuracy on a 100-image test set.
- Applied OpenCV-based denoising, deskewing, and template matching to improve OCR robustness on merchant-submitted document images.
- Built an internal RAG-based chatbot using LangChain + Milvus with BM25 + dense hybrid retrieval and GPT-3.5 for answer generation. Created the corpus end-to-end using Google Search API, Beautiful Soup, Selenium, and PyMuPDF.
- Embedded corpus with text-embedding-ada-002, stored in Milvus with HNSW indexing, and implemented session persistence and OpenAI token streaming.
- Built near-real-time onboarding funnel dashboards on PostgreSQL; identified major leakage points, generating product insight that informed changes associated with a 40% reduction in onboarding drop-off.
Stack: Python, PostgreSQL, AWS Lambda, Amazon S3, Amazon EFS, OpenCV, Google Cloud Vision, Signzy API, LangChain, Milvus, Beautiful Soup, Selenium, PyMuPDF, GPT-3.5
