Projects & Research
This page highlights a few representative projects spanning production ML systems, Document AI, open-source benchmarks, and biomedical ML.
Production ML systems (AWS)
Document AI: Fax processing (Ambry Genetics)
- Note: Details are partially redacted due to NDA; I’m happy to discuss architecture, tradeoffs, and evaluation approach.
- Built a Temporal-orchestrated, human-in-the-loop fax document processing system on AWS to automate splitting, classification, and patient-to-order matching; reduced manual processing time by 80–90% and supported 300–475 faxes/day.
- Implemented Temporal worker services (background processes): heavier workers for OCR/LLM inference and lighter workers for orchestration and workflow bookkeeping.
- OCR + extraction with AWS Textract + AWS Bedrock; workflow state persisted in Amazon Aurora PostgreSQL.
- Note: Details are partially redacted due to NDA; I’m happy to discuss schema design, reliability, and evaluation.
- Designed an LLM-based system to extract structured fields from insurance denial documents, enabling potential recovery of ~$3M/year in lost claims.
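As a rough illustration of what "structured fields" can involve, the sketch below validates a model's JSON output against a fixed schema before it is trusted downstream. The field names (`claim_id`, `denial_reason`, `denied_amount_usd`) and the `parse_denial` helper are hypothetical placeholders, not the production schema, which is under NDA.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DenialRecord:
    # Hypothetical fields chosen for illustration only.
    claim_id: str
    denial_reason: str
    denied_amount_usd: Optional[float] = None


REQUIRED_KEYS = {"claim_id", "denial_reason"}


def parse_denial(raw_json: str) -> DenialRecord:
    """Parse model output, rejecting missing or unexpected keys."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    allowed = {f.name for f in fields(DenialRecord)}
    unexpected = set(data) - allowed
    if unexpected:
        raise ValueError(f"unexpected keys: {sorted(unexpected)}")

    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    amount = data.get("denied_amount_usd")
    return DenialRecord(
        claim_id=str(data["claim_id"]),
        denial_reason=str(data["denial_reason"]),
        denied_amount_usd=None if amount is None else float(amount),
    )
```

Rejecting unexpected keys, rather than silently dropping them, is one way to surface hallucinated fields during evaluation.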
Clinical genomics ML (Ambry Genetics)
- Built ontology-aware features and trained an XGBoost model to prioritize genes; deployed to EC2 and reduced manual variant analysis workload by 50%.
- Built an LLM-based pedigree image classifier on AWS Bedrock; 99% accuracy on an internal test dataset.
Open-source datasets & benchmarks (Bengali.AI)
Full project list
OOD Speech (ASR)
- First Bengali out-of-distribution speech recognition benchmark
- 1100+ hours from 22,000+ contributors across 17 domains
- Fine-tuned OpenAI's Whisper for regional Bengali ASR.
- paper, Kaggle, demo
Document layout analysis (DocAI)
- BaDLAD: 33,695 annotated Bengali document samples across six domains
- Trained Mask R-CNN and YOLO-based object detectors for layout analysis.
- paper
Text normalization & parsing
- Engineered normalizer and parser libraries for Unicode text in abugida scripts, supporting 7 Indic languages.
- Improved LLM robustness under adversarial conditions by 5–10 points across multiple metrics.
- paper
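To show why normalization matters for abugida scripts, here is a minimal sketch using only Python's standard `unicodedata` module. It is not the library described above, which handles many more cases; the `normalize_text` helper is a hypothetical name used for illustration.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFC so canonically equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)


# The Bengali vowel sign O can be encoded as one code point (U+09CB)
# or as the two-part sequence U+09C7 + U+09BE. Both render identically.
precomposed = "\u0995\u09CB"        # KA + vowel sign O
two_part = "\u0995\u09C7\u09BE"     # KA + vowel sign E + vowel sign AA

raw_equal = precomposed == two_part
normalized_equal = normalize_text(precomposed) == normalize_text(two_part)
print(raw_equal, normalized_equal)
```

Without normalization, the two visually identical strings fail an equality check, which can split one word into two vocabulary entries.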
Biomedical ML (Vanderbilt University)
Histopathology: Active learning for contrastive learning
- Built an active learning-based training pipeline for SimCLR on histopathology images; reduced data requirements by 93% and training time by 62%.
- paper, code
Medical imaging: MRI soft tissue tumor segmentation
- Curated an MRI soft tissue tumor segmentation dataset (199 patients)
- Developed multimodal UNet + Segment Anything Model (SAM) approaches; achieved a Dice score of 80% (state of the art).
- paper, code
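For readers unfamiliar with the Dice metric reported above, the sketch below computes it for flat binary masks in plain Python. The `dice_score` helper and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not the evaluation code from the paper.

```python
def dice_score(pred, target):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")

    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example: one of two predicted pixels overlaps the single target pixel.
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Here the overlap is 1 pixel and the masks contain 3 foreground pixels in total, so the score is 2/3.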
ECG signal processing
- Engineered a CNN for inferior myocardial infarction detection from ECG signals; accuracy 84.54% (state of the art at the time).
- paper, code