Projects & Research

This page highlights a few representative projects spanning production ML systems, Document AI, open-source benchmarks, and biomedical ML.

Production ML systems (AWS)

Document AI: Fax processing (Ambry Genetics)

  • Note: Details are partially redacted due to NDA; I’m happy to discuss architecture, tradeoffs, and evaluation approach.
  • Built a Temporal-orchestrated, human-in-the-loop fax document processing system on AWS to automate splitting, classification, and patient-to-order matching; reduced manual processing time by 80–90% and supported 300–475 faxes/day.
  • Implemented Temporal worker services (background processes): heavier workers for OCR/LLM inference and lighter workers for orchestration and workflow bookkeeping.
  • OCR + extraction with AWS Textract + AWS Bedrock; workflow state persisted in Amazon Aurora PostgreSQL.
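The heavy/light worker split described above can be pictured as routing each processing step to one of two task queues. The sketch below is illustrative only: the queue names, step names, and the `queue_for` helper are hypothetical, and it deliberately uses plain Python rather than the Temporal SDK or the actual (NDA-covered) implementation.

```python
# Hypothetical queue names; the real system's names are not public.
HEAVY_QUEUE = "fax-heavy"   # assumed: workers sized for OCR / LLM inference
LIGHT_QUEUE = "fax-light"   # assumed: workers for orchestration and bookkeeping

# Hypothetical step names standing in for the compute-heavy stages.
HEAVY_STEPS = {"ocr", "llm_extraction"}


def queue_for(step: str) -> str:
    """Return the task queue a step would be dispatched to in this sketch."""
    return HEAVY_QUEUE if step in HEAVY_STEPS else LIGHT_QUEUE


if __name__ == "__main__":
    for step in ("split", "ocr", "llm_extraction", "record_state"):
        print(f"{step} -> {queue_for(step)}")
```

Keeping the two pools separate means the expensive inference workers can be scaled independently of the cheap bookkeeping workers.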

Document AI: Insurance denial extraction (Ambry Genetics)

  • Note: Details are partially redacted due to NDA; I’m happy to discuss schema design, reliability, and evaluation.
  • Designed an LLM-based system to extract structured fields from insurance denial documents, enabling potential recovery of ~$3M/year in lost claims.

Clinical genomics ML (Ambry Genetics)

  • Built ontology-aware features and trained an XGBoost model to prioritize genes; deployed to EC2 and reduced manual variant analysis workload by 50%.
  • Built an LLM-based pedigree image classifier on AWS Bedrock; 99% accuracy on an internal test dataset.

Open-source datasets & benchmarks (Bengali.AI)

Full project list

OOD Speech (ASR)

  • First Bengali out-of-distribution speech recognition benchmark.
  • 1,100+ hours from 22,000+ contributors across 17 domains.
  • Fine-tuned Whisper for regional Bengali ASR.
  • paper, Kaggle, demo

Document layout analysis (DocAI)

  • BaDLAD: 33,695 annotated Bengali document samples across six domains.
  • Trained Mask R-CNN and YOLO-based object detectors for layout analysis.
  • paper

Text normalization & parsing

  • Engineered normalizer and parser libraries for Unicode text in abugida scripts, supporting 7 Indic languages.
  • Improved LLM robustness under adversarial conditions by 5–10 points across multiple metrics.
  • paper
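To illustrate one kind of problem a Unicode normalizer for Indic text addresses, the sketch below collapses canonically equivalent code point sequences using only Python's standard library. It is not the API of the libraries mentioned above, which are assumed to handle many more script-specific cases; `normalize_text` is a hypothetical helper.

```python
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"


def normalize_text(text: str) -> str:
    """Apply NFC and drop zero-width spaces (ZWJ/ZWNJ are kept on purpose)."""
    # NFC maps canonically equivalent sequences to a single form, e.g. the
    # Bengali vowel signs U+09C7 + U+09BE compose to U+09CB.
    composed = unicodedata.normalize("NFC", text)
    # Zero-width spaces are invisible and break exact string matching.
    # ZWJ (U+200D) and ZWNJ (U+200C) can change rendering in Indic scripts,
    # so this sketch leaves them untouched.
    return composed.replace(ZERO_WIDTH_SPACE, "")


if __name__ == "__main__":
    split_form = "\u0995\u09c7\u09be"   # KA + vowel sign E + vowel sign AA
    single_form = "\u0995\u09cb"        # KA + vowel sign O
    print(split_form == single_form)                  # visually identical, unequal
    print(normalize_text(split_form) == single_form)  # equal after normalization
```

Two strings that render identically can differ at the code point level, which affects tokenization and exact-match evaluation downstream.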

Biomedical ML (Vanderbilt University)

Histopathology: Active learning for contrastive learning

  • Built an active learning-based training pipeline for SimCLR on histopathology images; reduced data requirements by 93% and training time by 62%.
  • paper, code

Medical imaging: MRI soft tissue tumor segmentation

  • Curated an MRI soft tissue tumor segmentation dataset (199 patients).
  • Developed multimodal UNet + Segment Anything Model (SAM) approaches; achieved a Dice score of 80% (state of the art).
  • paper, code
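For readers unfamiliar with the Dice score reported above, here is a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), on flattened binary masks. How the paper aggregates the score (per slice, per patient, or over the whole dataset) is not stated here, so `dice_score` is only an illustrative helper and not the evaluation code from the project.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks (values 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count positions where both masks mark the pixel as foreground.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_score(predicted, reference))
```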

ECG signal processing

  • Engineered a CNN for inferior myocardial infarction detection from ECG signals; accuracy 84.54% (state of the art at the time).
  • paper, code