saitejasrivilli/README.md

👋 Hi, I'm Sai Teja Srivillibhutturu

LLM Post-Training Engineer | RLHF · GRPO · Agent RL | GPU Systems

LinkedIn GitHub Google Scholar Portfolio


🎯 Professional Summary

LLM Post-Training Engineer focused on the full alignment pipeline: SFT → DPO → GRPO → Agent RL with verifiable rewards. I train on a 4× NVIDIA A30 cluster at UT Arlington and ship clinical AI at Qure.ai.

Post-Training work (all benchmarked on real hardware):

  • Agent GRPO on GSM8K — best reward 0.5575 over 200 iterations; policy learns to call a Python executor tool and condition on tool results before producing <final_answer>
  • PRM (Process Reward Model) — step-level rewards (γ=0.9) achieve best reward 1.0659; step signal contributes 15% of gradient even on wrong-answer rollouts
  • SFT → DPO → RLVR pipeline — BERTScore 0.780 → 0.855 (+9.6%), DPO margin +0.137, RLVR best reward 0.8189
  • vLLM vs HF inference — 1.24–1.32× real throughput gain, 10–18% TTFT reduction (measured, not estimated)
  • Quantization — NF4 reduces Qwen2.5-7B VRAM from 15.25 GB → 5.83 GB (−61.8%) at only 16% throughput loss
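The step-level PRM signal mentioned above can be illustrated with a small sketch. The exact reward shaping used in the repository is not stated here, so the formula below (a discounted sum of per-step scores plus an outcome reward) and the names `discounted_step_return`, `step_scores`, and `outcome_reward` are assumptions; only γ=0.9 comes from the text.

```python
from typing import Sequence


def discounted_step_return(step_scores: Sequence[float],
                           outcome_reward: float,
                           gamma: float = 0.9) -> float:
    """Combine per-step PRM scores with a final outcome reward.

    Assumed shaping (not taken from the repository): step t, counted from 0,
    contributes gamma**t * score_t, and the outcome reward is added on top.
    """
    step_term = sum((gamma ** t) * s for t, s in enumerate(step_scores))
    return step_term + outcome_reward


# A rollout with a wrong final answer (outcome reward 0.0) still receives
# credit for its plausible intermediate steps.
wrong_answer = discounted_step_return([0.2, 0.1, 0.0], outcome_reward=0.0)
right_answer = discounted_step_return([0.2, 0.1, 0.1], outcome_reward=1.0)
print(round(wrong_answer, 3), round(right_answer, 3))
```

Under this assumed shaping, a wrong-answer rollout keeps a non-zero learning signal, and a correct rollout can score above 1.0 because the step term is added to the outcome reward.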

Published IEEE research on LLM-based path planning (OJCOMS 2026, ICC 2026).


💼 Professional Experience

AI Solutions Engineer Intern @ Qure.ai

📍 Arlington, TX | Mar 2026 – Present

  • Clinical Protocol Automation: Leading LLM configuration for hospital clients (Mount Sinai, MedStar) to automate clinical workflows using proprietary clinical knowledge
  • Healthcare Interoperability: Building EPIC/FHIR integrations to enable real-time protocol recommendations directly in hospital systems
  • Infrastructure Redesign: Architecting pluggable executor framework for clinical pipeline orchestration—Docker-first, API-driven design with portable artifact store across environments

Tech Stack: Python, FastAPI, Docker, Kubernetes, FHIR, Healthcare APIs


Graduate Research Assistant – TopGPT Project @ UT Arlington

📍 Arlington, TX | Jun 2025 – Present

  • Full-Stack LLM/RAG Platform: Building enterprise-grade retrieval-augmented generation system for knowledge workers
  • GPU Infrastructure: Leveraging 4× NVIDIA A30 cluster (96GB total VRAM) for multi-GPU DDP training and inference optimization
  • Research & Development: Experimenting with advanced RAG patterns, prompt optimization, and efficient fine-tuning techniques

Tech Stack: PyTorch, CUDA, vLLM, Vector Databases, LangChain


ML Engineer (Contract) @ DentalScan / ReplyQuickAI

📍 Remote | Dec 2025 – Feb 2026

  • Computer Vision Pipeline: Developed CNN-based dental image analysis system with automated defect detection
  • Cloud Deployment: End-to-end pipeline from model training to production on AWS (S3, EC2, SageMaker)
  • Experiment Tracking: Integrated MLflow for reproducible model versioning and metric comparison

Tech Stack: PyTorch, TensorFlow, AWS (S3, EC2, SageMaker), MLflow, OpenCV


Software Engineer (4 yrs) @ Tata Consultancy Services

📍 India | Jun 2019 – May 2023

  • Built scalable Java-based backend systems for financial services domain
  • Designed distributed system architectures and optimized database performance
  • Led API design and microservices migration initiatives

📚 Publications & Research

DTMAP: Digital Twin-Guided AI Path Planning for Connectivity-Aware Mobility

IEEE Open Journal of the Communications Society (OJCOMS) | Accepted April 2026

Multi-objective path planning framework integrating wireless digital twins with a fine-tuned LLM for connectivity-aware navigation in 6G/XR environments. It achieves a 1.9% outage probability (vs. 2.3% for RL) at 312 ms average inference latency, and outperforms A*, greedy, Q-learning, LLaMA 3.1/3.3-70B, and Qwen-2.5-72B baselines.

  • Tunable α parameter trades off signal strength vs. travel distance across 21 values without retraining
  • GPT-4o-mini fine-tuned on instruction-conditioned routing data via DT-grounded oracle supervision
  • Deterministic sanitization pipeline brings raw LLM path validity from 65% → 100%
  • Tech stack: GPT-4o-mini (OpenAI Fine-tuning API), NVIDIA Sionna (ray tracing), OpenStreetMap, Blender, Python, PyTorch
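To make the α trade-off concrete, here is a minimal sketch of how a single weight can move the chosen route between "shortest" and "best signal" without retraining anything. The cost function, the direction of α (here α weights signal, 1−α weights distance), the even spacing of the 21 values, and all names and numbers are illustrative assumptions, not the formulation from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePath:
    name: str
    distance_m: float        # total travel distance (hypothetical values)
    mean_signal_dbm: float   # average signal strength along the path


def normalise(values):
    """Min-max scale a list of numbers to [0, 1]."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def pick_path(paths, alpha):
    """Pick the path minimising alpha*(1 - signal) + (1 - alpha)*distance."""
    dist = normalise([p.distance_m for p in paths])
    sig = normalise([p.mean_signal_dbm for p in paths])
    costs = [alpha * (1.0 - s) + (1.0 - alpha) * d for s, d in zip(sig, dist)]
    return paths[costs.index(min(costs))]


paths = [
    CandidatePath("shortest", 100.0, -95.0),
    CandidatePath("balanced", 130.0, -80.0),
    CandidatePath("best_signal", 180.0, -70.0),
]
alphas = [i / 20 for i in range(21)]  # 21 values: 0.00, 0.05, ..., 1.00
choices = {a: pick_path(paths, a).name for a in alphas}
print(choices[0.0], choices[0.5], choices[1.0])
```

In this toy setup α=0 selects the shortest path, α=1 the best-connected one, and intermediate values select the compromise route.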

CTMap: LLM-Enabled Connectivity-Aware Path Planning in mmWave Digital Twin Networks

IEEE ICC 2026 (CQRM) | arXiv:2601.00110

Designed an LLM-driven approach to network path optimization for next-generation 6G networks, achieving:

  • Connectivity-aware routing in mmWave networks using digital twin simulation
  • Practical deployment on edge devices with on-device inference

📖 Read on arXiv | 💻 View Research Code

This work bridges the gap between LLM reasoning capabilities and systems-level network optimization, showing that transformer-based models can effectively solve constrained optimization problems in telecommunications.


🌐 Live Deployments & Interactive Demos

Experience my work in action. All demos are production-ready and actively maintained:

| Project | Platform | Description | Status |
| --- | --- | --- | --- |
| 🤖 Multi-Strategy AI Agent System | 🤗 Hugging Face Spaces | 4 reasoning strategies (CoT, ToT, ReAct, Multi-Agent) with intelligent routing | ✅ Live |
| 🔍 Glean-Lite: Enterprise RAG | Vercel | Go-based RAG engine with semantic search and document ingestion | ✅ Production |
| ⚡ Edge LLM Benchmark | Vercel | Real-time LLM benchmarks on MacBook Air M2 using MLX framework | ✅ Interactive |
| 🌊 Maxwell PINN Solver | Streamlit Demo | Physics-informed neural network solving Maxwell's equations (1700× COMSOL speedup) | ✅ Live |

Try them out: Click any link above to see ML/AI in action. No signup required.


🚀 Featured Projects

🧠 LLM Post-Training Pipeline — SFT → DPO → GRPO → Agent RL

End-to-end implementation of modern LLM alignment techniques on Qwen2.5-7B-Instruct across 4× NVIDIA A30 GPUs. All numbers are measured on real hardware.

| Repository | Method | Key Result |
| --- | --- | --- |
| rlhf-synthesis-optimization | PPO · DPO · GRPO · Agent GRPO · PRM · RLAIF · STaR | Agent GRPO best reward 0.5575 (200 iters); PRM best 1.0659; LLM-PPO 0.9007 |
| LLM_FineTuning_SFT_Production | SFT → DPO → RLVR | BERTScore 0.780 → 0.855; DPO margin +0.137; RLVR best reward 0.8189 |
| efficient-post-training-suite | Full 6-stage pipeline + SLURM configs | SFT → DPO → GRPO → Agent → Eval on A30 cluster |
| reward-model-training | Bradley-Terry RM on HH-RLHF · scalar head · ECE + calibration curve | Val acc 65.0%, test acc 62.7%, mean margin 0.29 (measured A30) |
| preference-data-pipeline | HH-RLHF · UltraFeedback · OASST1 → quality filter → MinHash dedup → DPO JSONL | 14 unit tests · chatml/llama3 templates · contamination check |
| code-agent-eval-benchmark | GSM8K · HumanEval · LLM-as-Judge | GSM8K 54.0% · HumanEval pass@1 70.0% · Judge 8.2/10 |
| distributed-training-models | FSDP · DDP · multi-node SLURM | FSDP fp16 2-GPU 21,844 tok/s; multi-node configs for 2/4-node Qwen2.5-7B |
| attention-optimization | vLLM PagedAttention vs HuggingFace | 1.24–1.32× throughput · 10–18% TTFT reduction (10 measured runs) |
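The reward-model row reports accuracy and mean margin for a Bradley-Terry model. As a reminder of what those numbers measure, the sketch below computes the standard Bradley-Terry pairwise loss, pairwise accuracy, and mean margin on a few made-up reward pairs. The function names and the sample scores are hypothetical; this is not the repository's training code.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written as softplus(-margin)."""
    margin = r_chosen - r_rejected
    if margin < -30.0:
        # softplus(-margin) is approximately -margin here; avoids overflow
        return -margin
    return math.log1p(math.exp(-margin))


def pairwise_accuracy(pairs) -> float:
    """Fraction of pairs where the chosen response gets the higher reward."""
    return sum(rc > rr for rc, rr in pairs) / len(pairs)


def mean_margin(pairs) -> float:
    """Average of (chosen reward - rejected reward) over all pairs."""
    return sum(rc - rr for rc, rr in pairs) / len(pairs)


# Made-up scalar rewards for (chosen, rejected) responses.
pairs = [(1.2, 0.4), (0.1, 0.3), (0.9, 0.2)]
losses = [bradley_terry_loss(rc, rr) for rc, rr in pairs]
print(pairwise_accuracy(pairs), round(mean_margin(pairs), 3))
```

A pair that the model ranks correctly with a wide margin gets a loss near zero, while a mis-ranked pair (the second one here) gets a loss above log 2.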

Key techniques implemented: GRPO with group-relative advantages, verifiable rewards (RLVR), process reward models with step-level discounting, RLAIF (LLM-as-judge → DPO), STaR (self-taught reasoner), rejection sampling fine-tuning, multi-node FSDP with SLURM, NF4/INT8 quantization.
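For readers unfamiliar with "group-relative advantages", the core idea of GRPO is to score each sampled completion against the other completions for the same prompt instead of against a learned value function. A minimal sketch follows; the function name and the use of the population standard deviation are choices made for this example, and implementations differ on those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions, scored by a verifiable reward
# (1.0 = final answer correct, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])

# If every completion in the group gets the same reward, all advantages
# are zero and the group contributes no policy gradient.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct completions receive positive advantages and incorrect ones negative, so the policy update pushes probability mass toward the better samples within each group.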


⭐ Advanced AI Agent System — Multi-Strategy Reasoning

Multi-strategy AI reasoning system implementing four techniques from recent research papers. An LLM classifier routes each query to the reasoning strategy best suited to the task type.

Implemented Strategies:

  • Chain-of-Thought (CoT): Step-by-step reasoning with self-consistency voting across multiple chains
  • Tree-of-Thoughts (ToT): Multi-path exploration with beam search for complex problem-solving
  • ReAct Agent: Reasoning + Acting loop with real-time web search integration via Tavily
  • Multi-Agent Orchestration: Planner → Worker → Critic architecture for collaborative reasoning
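The self-consistency voting mentioned for Chain-of-Thought reduces to a majority vote over the final answers of several independently sampled chains. A minimal sketch, with a hypothetical function name and hand-written sample answers standing in for real model outputs:

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the most common final answer and its share of the votes."""
    normalised = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalised:
        raise ValueError("no answers to vote on")
    winner, votes = Counter(normalised).most_common(1)[0]
    return winner, votes / len(normalised)


# Final answers extracted from five sampled reasoning chains.
answer, share = self_consistency_vote(["42", " 42", "41", "42 ", "40"])
print(answer, share)
```

The vote share can double as a rough confidence signal, for example to decide whether to sample more chains.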

Why This Stack?

  • Groq LLM API: Sub-100ms latency inference—crucial for interactive agent workflows
  • Tavily Search: Production-grade real-time search API, more reliable than direct web scraping
  • ChromaDB: Lightweight, embeddable vector database—no external service dependency
  • LangChain: Mature agent framework with proven patterns for tool integration

Research Papers Implemented:

🔗 Live Demo | 📖 GitHub


🔥 LLM & GPU Optimization

Achieving production-scale inference performance through systematic optimization.

Performance Benchmarks (Measured — NVIDIA A30, Qwen2.5-7B-Instruct)

┌─────────────────────────────────────────────────────────────────┐
│          vLLM vs HuggingFace generate() — Real Measurements      │
├─────────────────────────────────────────────────────────────────┤
│  Batch │ HF tok/s │ vLLM tok/s │ Speedup │ HF TTFT │ vLLM TTFT │
│  ──────┼──────────┼────────────┼─────────┼─────────┼────────── │
│    1   │   37.6   │    49.8    │  1.32×  │  29 ms  │   24 ms   │
│    4   │  149.1   │   190.2    │  1.28×  │  32 ms  │   28 ms   │
│    8   │  297.7   │   368.4    │  1.24×  │  49 ms  │   43 ms   │
│   16   │  563.1   │   697.5    │  1.24×  │  73 ms  │   66 ms   │
│                                                                   │
│  Mechanism: PagedAttention KV-cache + CUDA graph capture         │
│  3 warmup + 5 measure runs per batch size                        │
└─────────────────────────────────────────────────────────────────┘
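The speedup and TTFT columns can be re-derived from the raw throughput and latency figures in the table above. The short script below does that arithmetic; the values are copied from the table and the variable names are illustrative.

```python
# (batch size, HF tok/s, vLLM tok/s, HF TTFT ms, vLLM TTFT ms)
ROWS = [
    (1, 37.6, 49.8, 29, 24),
    (4, 149.1, 190.2, 32, 28),
    (8, 297.7, 368.4, 49, 43),
    (16, 563.1, 697.5, 73, 66),
]


def derive(rows):
    """Return (batch, throughput speedup, TTFT reduction in %) per row."""
    out = []
    for batch, hf_tps, vllm_tps, hf_ttft, vllm_ttft in rows:
        speedup = vllm_tps / hf_tps
        ttft_cut = 100.0 * (hf_ttft - vllm_ttft) / hf_ttft
        out.append((batch, round(speedup, 2), round(ttft_cut, 1)))
    return out


for batch, speedup, ttft_cut in derive(ROWS):
    print(f"batch={batch:>2}  speedup={speedup:.2f}x  TTFT reduction={ttft_cut:.1f}%")
```

This reproduces the 1.32×, 1.28×, 1.24×, and 1.24× speedups, with TTFT reductions of roughly 17%, 13%, 12%, and 10%.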

┌─────────────────────────────────────────────────────────────────┐
│            Quantization — FP16 vs INT8 vs NF4 (A30)             │
├─────────────────────────────────────────────────────────────────┤
│  Precision │  TTFT   │  tok/s │  VRAM   │  Reduction            │
│  ──────────┼─────────┼────────┼─────────┼──────────────         │
│  FP16      │  69.8ms │  19.4  │ 15.25GB │  baseline             │
│  INT8      │ 1392ms  │   1.2  │  8.82GB │  −42.2% VRAM          │
│  NF4       │  272ms  │  16.2  │  5.83GB │  −61.8% VRAM ✅       │
│                                                                   │
│  NF4: best tradeoff — 61.8% memory reduction, 16% throughput    │
│  loss. Fits Qwen2.5-7B in 5.83 GB (RTX 3080 / 4070 class)      │
└─────────────────────────────────────────────────────────────────┘
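The headline percentages follow directly from the table above. A quick check of the arithmetic, with the figures copied from the table and illustrative names:

```python
FP16 = {"tok_s": 19.4, "vram_gb": 15.25}
QUANTIZED = {
    "INT8": {"tok_s": 1.2, "vram_gb": 8.82},
    "NF4": {"tok_s": 16.2, "vram_gb": 5.83},
}


def pct_change(new: float, base: float) -> float:
    """Signed percentage change of `new` relative to `base`."""
    return 100.0 * (new - base) / base


summary = {
    name: {
        "vram_pct": round(pct_change(q["vram_gb"], FP16["vram_gb"]), 1),
        "tok_s_pct": round(pct_change(q["tok_s"], FP16["tok_s"]), 1),
    }
    for name, q in QUANTIZED.items()
}
print(summary)
```

NF4 comes out at about −61.8% VRAM for about −16.5% throughput, while INT8 saves 42.2% VRAM but loses about 94% of throughput in this measurement.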

Optimization techniques benchmarked on hardware:

  1. NF4 / INT8 quantization (bitsandbytes): FP16→NF4 saves 61.8% VRAM at 16% throughput loss; INT8 is far slower at batch=1 due to dequantization overhead (1.2 tok/s vs. 19.4 tok/s for FP16 in the quantization table)
  2. PagedAttention (vLLM): 1.24–1.32× throughput, 10–18% TTFT reduction vs HuggingFace generate()
  3. Kernel fusion (torch.compile): FusedLayerNormLinear eliminates HBM round-trip between LayerNorm and Linear
  4. Gradient checkpointing: recompute activations on backward — reduces peak memory at cost of ~33% extra compute
  5. DataParallel 4× GPU: 1.80× throughput on 4× A30; CUDA kernel dispatch overhead measured
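The "~33% extra compute" figure for gradient checkpointing comes from a common rule of thumb: a backward pass costs about twice a forward pass, and full checkpointing re-runs the forward pass once during backward. The sketch below spells out that arithmetic; the 2× ratio is the usual approximation and the function name is hypothetical, so treat the result as an estimate and not a measurement.

```python
def checkpointing_overhead(forward_cost: float = 1.0,
                           backward_to_forward: float = 2.0,
                           recompute_fraction: float = 1.0) -> float:
    """Extra compute from recomputation, as a fraction of a normal step.

    baseline step = forward + backward = forward * (1 + backward_to_forward)
    extra work    = forward * recompute_fraction (activations recomputed)
    """
    baseline = forward_cost * (1.0 + backward_to_forward)
    extra = forward_cost * recompute_fraction
    return extra / baseline


print(f"{checkpointing_overhead():.1%}")                        # full recompute
print(f"{checkpointing_overhead(recompute_fraction=0.5):.1%}")  # half the layers
```

Checkpointing only a subset of layers lowers the overhead proportionally, at the price of a smaller memory saving.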

Featured Optimization Projects

| Repository | Focus | Key Result | Status |
| --- | --- | --- | --- |
| attention-optimization | vLLM PagedAttention vs HuggingFace generate(), 4 batch sizes, 10 measured runs | 1.24–1.32× throughput, 10–18% TTFT reduction | ⭐ Measured |
| gpu-optimization-mistral | FP16/INT8/NF4 quantization + kernel fusion + DataParallel profiling on 4× A30 | NF4 −61.8% VRAM, fused ops 37× on targeted layers | ✅ Measured |
| LORA-implementation | Low-Rank Adaptation for parameter-efficient fine-tuning | 10× parameter reduction | ✅ Complete |

Quick Start: Benchmarking vLLM

git clone https://github.com/saitejasrivilli/attention-optimization
cd attention-optimization
pip install -r requirements.txt
python benchmark.py --model Qwen/Qwen2.5-7B-Instruct --batch-sizes 1 4 8 16
# Results: 1.24–1.32× throughput gain over HF generate() on NVIDIA A30

🤖 AI Agents & Multi-Agent Systems

Building intelligent agents that reason, plan, and collaborate.

| Project | Description | Architecture | Status |
| --- | --- | --- | --- |
| ai-agent-system | Multi-strategy AI reasoning with 4 reasoning modes | Groq + Tavily + ChromaDB | ⭐ Live |
| AdvancedLLMAgent | Sophisticated agent with function calling & tool use | LangChain + RAG | ✅ Production |
| Multi_Agent_Workflow_Automator | Multi-agent orchestration for complex workflows | Agent coordinator pattern | ✅ Scalable |
| offline-rag-assistant | Privacy-focused RAG for offline deployment | Vector DB + Local LLM | ✅ Deployable |

Agent Architecture Patterns Implemented:

Input Query
    ↓
┌─────────────────────────────────────────┐
│   LLM Auto-Classifier                   │  ← Intelligent Strategy Routing
│   (Task Type: Reasoning/Search/Coding)  │
└─────────────────────────────────────────┘
    ↓
Route to Optimal Strategy:
    ├→ [Simple Q&A] → Chain-of-Thought
    ├→ [Complex Problem] → Tree-of-Thoughts  
    ├→ [Fact Retrieval] → ReAct (with Search)
    └→ [Multi-step Task] → Multi-Agent (Plan→Execute→Critique)
    
    ↓
Agent Loop: Thought → Action → Observation → (repeat)
    ↓
Return Result with Reasoning Trail
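The routing diagram above can be sketched in a few lines. In the sketch, a toy keyword classifier stands in for the LLM auto-classifier, and the Thought → Action → Observation loop takes its "thinking" and "acting" steps as plain callables. All function names, keywords, and strategy labels are hypothetical and are not taken from the ai-agent-system code.

```python
STRATEGY_BY_TASK = {
    "simple_qa": "chain_of_thought",
    "complex_problem": "tree_of_thoughts",
    "fact_retrieval": "react_with_search",
    "multi_step_task": "multi_agent",
}


def classify(query: str) -> str:
    """Toy keyword classifier standing in for the LLM auto-classifier."""
    words = set(query.lower().replace("?", " ").replace(",", " ").split())
    if words & {"latest", "current", "today", "news"}:
        return "fact_retrieval"
    if words & {"then", "afterwards", "workflow"}:
        return "multi_step_task"
    if words & {"prove", "optimize", "optimise", "puzzle", "design"}:
        return "complex_problem"
    return "simple_qa"


def route(query: str) -> str:
    """Map a query to the name of the strategy that should handle it."""
    return STRATEGY_BY_TASK[classify(query)]


def agent_loop(query, think, act, max_steps=3):
    """Run Thought -> Action -> Observation until `think` yields an answer."""
    trail, observation = [], None
    for _ in range(max_steps):
        thought, action, final = think(query, observation)
        trail.append(("thought", thought))
        if final is not None:
            return final, trail
        observation = act(action)
        trail.append(("action", action))
        trail.append(("observation", observation))
    return None, trail  # step budget exhausted without a final answer


# Stub "model" and "tool" so the loop can run without any external service.
def demo_think(query, observation):
    if observation is None:
        return "need to look this up", "search:" + query, None
    return "enough information", None, f"answer based on {observation}"


def demo_act(action):
    return f"result for {action}"


final, trail = agent_loop("capital of France", demo_think, demo_act)
print(route("Who is the current CEO?"), "|", final)
```

Returning the trail alongside the answer mirrors the "Return Result with Reasoning Trail" step in the diagram.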

🔬 ML Systems & Computer Vision

Production-ready machine learning systems from data to deployment.

| Project | Description | Tech Stack | Impact |
| --- | --- | --- | --- |
| ai-video-analysis-system | End-to-end video analysis with object detection & tracking | PyTorch, OpenCV, YOLO | Real-time (30 FPS) |
| ComputerVision | Computer vision algorithms & deep learning implementations | TensorFlow, OpenCV, Detectron2 | Comprehensive suite |
| TeluguGPT | Language model specialized for Telugu language | Transformers, HuggingFace | Domain-specific LLM |
| TelecomGPT | Domain-specific LLM for telecom industry | Fine-tuning, LoRA, Transfer Learning | Industry-focused |

📊 Data Engineering & ML Pipelines

Scalable systems for data processing and machine learning workflows.

| Project | Description | Tech Stack | Scale |
| --- | --- | --- | --- |
| DistributedKVStore | Distributed key-value store with consensus algorithms | Go, Raft, gRPC | Production-ready |
| end-to-end-data-engineering-project | Complete ETL pipeline: ingestion → processing → analytics | Spark, Airflow, Cloud SQL | Enterprise scale |
| Collaborative_filtering_recommender_system | Scalable recommendation engine for e-commerce | PySpark, MLlib | Millions of users |
| TelecomChurnPredictor | Customer churn prediction system with feature engineering | PySpark, XGBoost, MLflow | 95%+ accuracy |

Quick Example: Running the Recommendation Engine

git clone https://github.com/saitejasrivilli/Collaborative_filtering_recommender_system
cd Collaborative_filtering_recommender_system
spark-submit --master local[4] train.py --data ./movielens-20m
# Output: Personalized recommendations for 10K+ users

🛡️ AI Safety & Evaluation

Rigorous evaluation frameworks for responsible AI development.

| Project | Description | Focus Area | Status |
| --- | --- | --- | --- |
| Red-Teaming-Failure-Analysis-Mitigation | Systematic LLM red-teaming with adversarial prompt generation | Safety, Robustness | ✅ Active |
| Generative-Model-Safety-Evaluation | Safety benchmarks for LLMs and diffusion models | Evaluation, Benchmarking | ✅ Comprehensive |
| llm-long-context-stress-test | Stress testing LLMs on long-context tasks (100K+ tokens) | Capability Testing | ✅ Published |
| simulation-planning-evaluation | Evaluation framework for agent planning capabilities | Agent Evaluation | ✅ Extensible |

🛠️ Technical Skills & Expertise

🤖 ML/DL & LLMs

  • Core: PyTorch, TensorFlow, JAX
  • LLM Frameworks: LangChain, LlamaIndex, vLLM
  • Techniques: RAG, Vector DBs, LoRA/QLoRA
  • Inference: Quantization, Speculative Decoding
  • Optimization: CUDA, FlashAttention, KV-Cache

☁️ Cloud & Infrastructure

  • AWS: EC2, S3, SageMaker, Lambda (Certified)
  • Oracle: GenAI, Vector Search, Cloud Infrastructure
  • Microsoft: Azure, Fabric (Certified)
  • Containerization: Docker, Kubernetes, Helm
  • MLOps: CI/CD, Monitoring, Reproducibility

💻 Software Engineering

  • Languages: Python, Go, C++, Java, SQL
  • Web: FastAPI, Flask, REST APIs
  • Databases: PostgreSQL, Neo4j, Redis
  • Messaging: Kafka, RabbitMQ
  • Systems: Distributed Systems, DSA

🎓 Specialized Expertise Matrix

Where I have deep, production-tested knowledge:

| Domain | Depth | Key Projects | Evidence |
| --- | --- | --- | --- |
| LLM Post-Training (RLHF/GRPO) | ⭐⭐⭐⭐⭐ | PPO, DPO, GRPO, Agent GRPO, PRM, RLAIF, STaR | best_reward 0.5575–1.0659 measured |
| Reward Modeling & Evaluation | ⭐⭐⭐⭐⭐ | Verifiable rewards, process rewards, LLM-as-judge | GSM8K 54%, HumanEval 70% |
| LLM Inference & Optimization | ⭐⭐⭐⭐⭐ | vLLM PagedAttention, NF4/INT8 quant, KV-cache | 1.32× throughput, −61.8% VRAM |
| Distributed Training | ⭐⭐⭐⭐⭐ | FSDP, DDP, multi-node SLURM, torchrun | 21,844 tok/s FSDP fp16 2-GPU |
| GPU Optimization & CUDA | ⭐⭐⭐⭐⭐ | Profiling, quantization, attention, gradient checkpointing | A30 4-GPU cluster |
| Multi-Agent & RAG Systems | ⭐⭐⭐⭐ | ReAct, CoT, ToT, self-healing RAG | Production deployments |
| Cloud Architecture | ⭐⭐⭐⭐ | AWS, Oracle, Kubernetes, FHIR/EPIC | Certified, Qure.ai prod |
| Research & Publications | ⭐⭐⭐⭐ | IEEE OJCOMS 2026, ICC 2026, path planning | 2 peer-reviewed venues |

🏆 Certifications & Continuous Learning

Professional Certifications

| Certification | Issuer | Validity | Focus |
| --- | --- | --- | --- |
| AWS Certified Data Engineer – Associate | Amazon Web Services | Dec 2024 – Dec 2027 | Cloud data pipelines, ETL, analytics |
| Microsoft Certified: Data Engineer Associate | Microsoft | Aug 2025 – Aug 2026 | Fabric, Azure, data architecture |
| Oracle Cloud Associate Cloud Engineer | Oracle | Jun 2024 – Jun 2026 | Cloud infrastructure, GenAI services |
| Oracle AI Vector Search Specialist | Oracle | Feb 2025 – Feb 2027 | Vector databases, RAG, semantic search |
| Neo4j Certified Associate | Neo4j | Jul 2024 – Jul 2026 | Graph databases, Cypher, data modeling |
| Certified Data Scientist | 365 Data Science | Nov 2024 | ML fundamentals, deep learning, SQL |
| Machine Learning in Production (Honors) | edX (UC Berkeley) | Jun 2024 | MLOps, model deployment, monitoring |

Specialized Technical Training

| Course | Provider | Completion | Key Skills |
| --- | --- | --- | --- |
| Advanced Large Language Model Agents | UC Berkeley EECS | Jul 2025 | Inference-time reasoning, DPO, RAG, neural-symbolic AI |
| AI Evaluations for Everyone | Anthropic & Aishwarya Naresh | Dec 2025 | LLM benchmarking, evaluation frameworks, quality metrics |
| Agentforce Specialist | Salesforce | Jun 2025 | LLM prompt engineering, agent design, enterprise AI |
| CodePath Technical Interview Prep | CodePath | May 2025 | DSA, competitive programming, system design |
| Neo4j Graph Academy | Neo4j | Jul 2024 | Advanced Cypher, graph algorithms, recommendations |

📈 Key Achievements & Impact

┌──────────────────────────────────────────────────────────────────────┐
│                     PRODUCTION IMPACT METRICS                        │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  🚀 Post-Training Results          🏆 Research & Publications         │
│  ├─ Agent GRPO reward 0.5575      ├─ IEEE OJCOMS journal (2026)    │
│  ├─ PRM best reward 1.0659        ├─ IEEE ICC 2026 conference      │
│  ├─ vLLM 1.32× throughput gain   ├─ 1700× COMSOL speedup (PINN)   │
│  └─ NF4 −61.8% VRAM (5.83 GB)    └─ 3 patent-eligible algorithms   │
│                                                                      │
│  📚 Open Source & Community        🎓 Career Development             │
│  ├─ 40+ public repositories       ├─ 6+ cloud certifications       │
│  ├─ 2.5K+ GitHub stars            ├─ 20+ specialized courses       │
│  └─ Active in AI safety research  └─ Mentoring + technical writing  │
│                                                                      │
│  🔧 Systems Engineering           💼 Professional Growth             │
│  ├─ Multi-GPU DDP training        ├─ From SWE → ML Engineer path   │
│  ├─ Kubernetes orchestration      ├─ 4 years TCS → frontier AI      │
│  └─ End-to-end ML pipelines       └─ Healthcare AI focus (Qure.ai) │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

🎯 What I'm Currently Working On

  • 🔭 LLM post-training: Agent GRPO with verifiable rewards, PRM step-level rewards, RLAIF preference generation
  • 🌱 STaR / rejection sampling: self-taught reasoning loops — generate rationale → filter correct → SFT → iterate
  • Inference optimization: vLLM PagedAttention benchmarks, NF4 quantization (−61.8% VRAM), multi-node FSDP
  • 🏥 Clinical AI: LLM-based protocol automation with FHIR/EPIC integrations at Qure.ai
  • 📊 Evaluation: multi-axis LLM benchmarking (GSM8K, HumanEval, LLM-as-judge)
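The STaR loop listed above (generate rationale → filter correct → SFT → iterate) can be sketched as a filtering step that turns verified rationales into fine-tuning examples. The generator below is a deterministic stub that is wrong on every other attempt; in practice it would be a model call, and the SFT step would follow each round. All names are hypothetical.

```python
from itertools import count


def star_round(problems, generate, is_correct, samples_per_problem=4):
    """One STaR-style round: sample rationales, keep those that verify."""
    kept = []
    for prob in problems:
        for _ in range(samples_per_problem):
            rationale, answer = generate(prob["question"])
            if is_correct(answer, prob["answer"]):
                kept.append({"prompt": prob["question"], "completion": rationale})
                break  # one verified rationale per problem is enough here
    return kept  # would be used as SFT data before the next round


def make_toy_generator():
    """Stub 'model' for addition questions; every other attempt is off by one."""
    attempts = count()

    def generate(question):
        a, b = (int(x) for x in question.split("+"))
        guess = a + b + (1 if next(attempts) % 2 == 0 else 0)
        return f"{a} plus {b} is {guess}", str(guess)

    return generate


problems = [
    {"question": "2+3", "answer": "5"},
    {"question": "10+7", "answer": "17"},
]
sft_data = star_round(
    problems,
    generate=make_toy_generator(),
    is_correct=lambda got, want: got.strip() == want.strip(),
)
print(sft_data)
```

Only rationales whose final answer passes the check reach the fine-tuning set, which is what lets the loop improve without human-written rationales.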

🌍 Why I'm Unique

  1. Post-Training depth: implemented the full stack — SFT, DPO, PPO, GRPO, Agent GRPO with tool use, PRM, RLAIF, STaR — all with real measured results on Qwen2.5-7B
  2. Verifiable numbers: every benchmark in my repos is measured on real hardware (NVIDIA A30), not estimated or copy-pasted
  3. Systems + algorithms: can take a training run from SLURM launch through multi-node FSDP to reward model evaluation
  4. Published research: 2 IEEE publications (OJCOMS 2026, ICC 2026) on LLM-based path planning in 6G networks
  5. Production AI: shipping clinical LLM automation with FHIR/EPIC integrations at Qure.ai
  6. 40+ public repos spanning post-training, inference optimization, distributed training, and evaluation

📫 Let's Connect & Collaborate

I'm actively seeking opportunities in LLM Post-Training, RLHF/GRPO research engineering, and ML Systems roles. Whether you're building frontier AI alignment pipelines, scaling reward modeling, or advancing agent evaluation—let's talk!

Reach out for:

  • 🔍 Technical collaboration on ML/AI projects
  • 💼 ML Engineering & LLM Engineer opportunities
  • 🎓 Mentorship in LLM optimization & RAG systems
  • 🚀 Open-source contributions & research partnerships

Email LinkedIn Google Scholar Portfolio


⭐ If you find my projects useful, consider giving them a star and sharing with others building the future of AI!

Pinned

  1. Collaborative_filtering_recommender_system (Public)

    A hybrid product recommendation system leveraging user-based, item-based, and SVD filtering, deployed with Streamlit for interactive UI.

    Python