AI Experience — Interview Prep
System Architecture
Table of Contents
- System Architecture
- 1. Project Overview
- 2. Interview Questions & Answers
- 3. Understanding Qwen3 Embeddings
- 4. What is Modal?
- 5. Fine-Tuning Roadmap
- 6. Connecting Theory to Practice
- 7. Key Technical Details — Quick Reference
- 8. Potential Curveball Questions
1. Project Overview (Your Elevator Pitch)
"I built an AI-powered fitness platform that generates personalized workout plans. The core challenge was taking real coach-designed training programs — uploaded as PDFs — and using them alongside LLMs to produce structured, periodized workout plans tailored to individual users. I designed a multi-stage generation pipeline using DSPy, multiple LLM providers, vector-based exercise search, and a custom prescription grammar to validate exercise prescriptions like '3x5 @ RPE 8'. The system could either generate plans from scratch via AI or clone coach-curated plans, depending on context."
2. Interview Questions & Answers
Q1: "How did you use AI to generate workout plans based on coach plans?"
Key point: This was LLM orchestration, not traditional model training.
The system used a multi-stage LLM pipeline built with DSPy (a framework for programming — not just prompting — language models):
| Stage | What it does | Model Used | Max Tokens |
|---|---|---|---|
| Plan Overview | Generates plan name, description, cycle structure, progression/recovery/injury guidelines | Grok-4 (via OpenRouter) | 32,384 |
| Weekly Tasks | For each cycle, generates balanced exercise programming with push:pull ratios, volume management | Grok-4 (via OpenRouter) | 24,000 |
| Session Details | Generates individual sessions with exercises, prescriptions, rest periods | Claude Sonnet 4.5 (Anthropic) | 8,000 |
| Sets & Prescriptions | Parses prescriptions like "3x5 @ RPE 8" into structured data | Claude Haiku 4.5 (Anthropic) | 4,000 |
How coach plans fed into this:
- Coaches uploaded PDF workout plans → stored in Cloudflare R2
- PaddleOCR (computer vision) extracted text from the PDFs
- GPT-4o (vision model) parsed the extracted text into structured `CoachWorkPlan` objects via Pydantic
- These parsed plans served as the gold standard: they could be directly cloned for users during onboarding (feature flag: `USE_COACH_PLAN`), or their patterns informed the AI generation pipeline
If pressed on "training": We didn't fine-tune a foundation model. We used LLM orchestration with structured outputs. The coach plans were reference data that the system could clone or use as context. The "intelligence" came from carefully designed DSPy signatures (typed input/output contracts) and multi-model routing.
Q2: "Walk me through how you ingested and structured coach workout plans."
The Coach Plan Import Pipeline:
Coach uploads PDF/image → Cloudflare R2 storage → PaddleOCR (paddle_ocr_vl.py) extracts raw text → GPT-4o (pydantic_extractor.py) parses into structured data → Prescription DSL validation (PEG grammar) → Exercise library matching (vector similarity) → Upsert to Supabase (unified_plan_upload.py)
The data hierarchy:
```
CoachWorkPlan
├── CoachWorkPlanCycle              (e.g., "Hypertrophy Block")
│   └── CoachWorkSession            (e.g., "Day 1 — Upper Push")
│       └── CoachWorkSessionTask    (e.g., "Bench Press")
│           └── CoachWorkSessionTaskSet  (e.g., "Set 1: 185lbs x 5")
```
Prescription DSL — A PEG-based grammar to validate exercise prescriptions:
- Valid: `3 x 5 @ RPE 8`, `3 x 8-12 @ RPE 8 / 180s`, `4 x 185lbs, 205lbs, 225lbs`, `max @ 60%`
- Invalid: `3 x 5 @ moderate` (free text), `10 per leg` (laterality belongs in the description, not the prescription)
This strict validation ensured data integrity — every prescription had to parse into structured sets/reps/weight/intensity/rest.
Q3: "What embedding models and vector search did you use?"
- Embedding model: Qwen3-Embedding-0.6B generating 1024-dimensional vectors
- Vector storage: pgvector extension in Supabase (PostgreSQL)
- Embedding generation: Ran on Modal (serverless GPU compute using Hugging Face models)
- Use case: Exercise similarity search — when a user needs an easier exercise or doesn't have specific equipment, the system finds semantically similar alternatives
How it worked in practice:
```python
# ExerciseService
find_alternatives(
    exercise_id,
    easier=True,                       # Find easier alternatives
    available_equipment=["dumbbell"],  # Filter by equipment
    limit=10,
)  # Returns: list[(ExerciseLibrary, similarity_score)]

# NLExerciseParser — natural language exercise queries
parse("this is too hard, I don't have a barbell")
# Returns: easier=True, allowed_equipment=[dumbbells, bodyweight, ...]
```
Why Qwen3 over something bigger? Cost-performance tradeoff. For exercise similarity (a relatively constrained domain), a 0.6B parameter model produced sufficient quality embeddings. We didn't need a massive model — exercises have fairly distinct semantic signatures.
Q4: "How did you evaluate the quality of AI-generated plans?"
LLM-as-Judge evaluation framework:
```
BatchEvaluator:
  - Generates N plans per test persona
  - LLM judge scores against rubric
  - Scoring dimensions: structure, progression, balance, engagement

ScoreAnalyzer:
  - Identifies patterns in low scores
  - Generates suggestions for prompt improvement
  - Exports timestamped results for tracking over time
```
Test personas represented different fitness profiles (beginner, intermediate, advanced; different goals, injuries, equipment access). Each persona would generate multiple plans, and the evaluator would score them.
What I'd say about evaluation challenges:
- LLM-as-judge has known biases (verbosity bias, position bias)
- We used structured rubrics to constrain scoring
- Ideally, you'd combine LLM evaluation with human expert review and user outcome data
- The evaluation pipeline was iterative — low scores fed back into prompt refinement
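That feedback loop can be sketched in a few lines. This assumes a simplified judge-output shape (the real BatchEvaluator/ScoreAnalyzer schemas may differ): average each rubric dimension across judged plans and surface the weak ones as prompt-refinement candidates.

```python
from statistics import mean

# Hypothetical judge output: one dict of rubric scores per generated plan.
scores = [
    {"structure": 4, "progression": 2, "balance": 4, "engagement": 3},
    {"structure": 5, "progression": 3, "balance": 4, "engagement": 4},
    {"structure": 4, "progression": 2, "balance": 5, "engagement": 4},
]

def weak_dimensions(scores, threshold=3.0):
    """Average each rubric dimension and return those below threshold --
    these are the candidates for prompt refinement."""
    dims = scores[0].keys()
    averages = {d: mean(s[d] for s in scores) for d in dims}
    return {d: avg for d, avg in averages.items() if avg < threshold}
```

Here `progression` averages about 2.3 and gets flagged, so the next iteration of prompt work targets progression guidance specifically.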
Q5: "What was your RAG / retrieval approach?"
The system used retrieval-augmented generation in several ways:
1. Exercise Library as Knowledge Base: When generating sessions, the system retrieved relevant exercises from the vector-indexed library based on the user's equipment, goals, and constraints
2. Context Window Management: Each generation stage received context from previous stages:
   - Session generation received all previously-generated sessions in that week (to prevent muscle imbalance)
   - The system tracked push:pull ratios and weekly volume across sessions
3. Exercise Enrichment: After generating exercise names, the system enriched them with vector embeddings to match against the canonical exercise library, ensuring exercises had proper metadata (difficulty, joint stress, equipment requirements)
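The context-tracking idea above can be sketched as follows; the movement-pattern tags and the `(name, pattern, sets)` tuple shape are hypothetical stand-ins for the real session data model:

```python
from collections import Counter

# Hypothetical shape: each already-generated session is a list of
# (exercise, movement_pattern, sets). The real pipeline derived patterns
# from exercise-library metadata.
week_sessions = [
    [("Bench Press", "push", 3), ("Barbell Row", "pull", 3)],
    [("Overhead Press", "push", 4), ("Pull-Up", "pull", 3)],
]

def push_pull_ratio(sessions):
    """Sum weekly sets per movement pattern across generated sessions,
    so the next stage's context can correct imbalances."""
    sets = Counter()
    for session in sessions:
        for _name, pattern, n_sets in session:
            sets[pattern] += n_sets
    return sets["push"] / sets["pull"] if sets["pull"] else float("inf")
```

Feeding this running ratio (7 push sets vs 6 pull sets above) into the next session-generation call is what keeps the week balanced.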
Q6: "How did you handle prompt engineering?"
DSPy Signatures — not raw prompts. DSPy lets you define typed input/output contracts:
```python
class GeneratePlanSignature(dspy.Signature):
    """Generate a structured workout plan."""

    fitness_context: OnboardingFitnessFields = dspy.InputField()
    plan: WorkPlanLLM = dspy.OutputField()
```
Key prompt engineering decisions:
1. Per-stage model routing: Different models for different cognitive tasks
   - Complex planning → Grok-4 (large context, strong reasoning)
   - Session details → Claude Sonnet (good structured output)
   - Parsing → Claude Haiku (fast, cheap, sufficient for extraction)
2. Structured outputs everywhere: Pydantic models defined exact schemas, so no free-text parsing
3. Context injection: Each stage received relevant prior context to maintain coherence
4. Configuration-driven: `settings.toml` made model selection, token limits, and endpoints configurable without code changes
Q7: "What was your full tech stack?"
| Layer | Technology | Purpose |
|---|---|---|
| LLM Orchestration | DSPy 3.0.4+ | Multi-stage pipeline, typed signatures |
| LLM Providers | OpenRouter (Grok-4), Anthropic (Claude), OpenAI (GPT-4o) | Multi-model routing |
| Backend API | FastAPI | REST + WebSocket endpoints |
| Database | Supabase (PostgreSQL + pgvector) | Relational data + vector search |
| Embeddings | Qwen3-Embedding-0.6B on Modal | Exercise similarity vectors |
| OCR | PaddleOCR + GPT-4o vision | Coach plan PDF ingestion |
| Frontend | Next.js, React Query, Zustand, TypeScript | Coach portal + test client |
| Monorepo | Nx | Package management across apps/packages |
| Deployment | Docker + Railway | Containerized deployment via Railway |
| Observability | Langfuse | LLM tracing and monitoring |
| Storage | Cloudflare R2 | Coach plan PDF/image storage |
Q8: "How did you handle errors and edge cases in AI generation?"
- Prescription DSL Validation: PEG grammar rejects malformed prescriptions at parse time; no garbage data enters the system
- Exercise Match Quality Tracking: When mapping LLM-generated exercise names to the canonical library, the system tracked match confidence. Low-confidence matches were flagged for review.
- Injury Constraint Parsing: A dedicated DSPy signature parsed free-text injury descriptions into structured constraints (affected joints, movement restrictions, severity)
- Joint Stress Calculation: Another signature estimated cumulative joint loading across a session to prevent overloading injured or stressed areas
- Bilateral Exercise Handling: The system tracked whether exercises were unilateral or bilateral to correctly calculate volume
- WebSocket Queue System: The agent server used a queue system for async plan saves, so if generation succeeded but the save failed, the plan wasn't lost
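The joint-stress idea can be illustrated with a toy accumulator. The per-exercise stress weights and the limit here are invented for illustration; the real system estimated loading with a dedicated DSPy signature:

```python
# Invented per-exercise joint-stress weights (the real system estimated
# these via an LLM signature, not a lookup table).
JOINT_STRESS = {
    "Back Squat": {"knee": 3, "hip": 3, "lumbar": 2},
    "Romanian Deadlift": {"hip": 3, "lumbar": 3},
    "Leg Extension": {"knee": 2},
}

def session_joint_load(exercises, limit=5):
    """Accumulate stress per joint across a session and flag any joint
    whose cumulative total exceeds the (illustrative) limit."""
    totals = {}
    for name in exercises:
        for joint, stress in JOINT_STRESS.get(name, {}).items():
            totals[joint] = totals.get(joint, 0) + stress
    flagged = [j for j, total in totals.items() if total > limit]
    return totals, flagged

totals, flagged = session_joint_load(
    ["Back Squat", "Romanian Deadlift", "Leg Extension"]
)
# hip accumulates 3 + 3 = 6 > 5, so it gets flagged
```

A flagged joint would prompt the generator to swap in a lower-stress alternative for that session, which is exactly where the vector-based substitution search comes in.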
Q9: "What's the difference between AI-generated plans and coach plans in your system?"
Parallel data models:
- AI-generated: `WorkPlan` / `WorkPlanCycle` / `WorkSession` / `WorkSessionTask`
- Coach-created: `CoachWorkPlan` / `CoachWorkPlanCycle` / `CoachWorkSession` / `CoachWorkSessionTask`
Why separate models?
- Coach plans have different metadata (source tracking, import quality scores)
- AI plans carry generation context (the onboarding data that produced them)
- Coach plans are treated as canonical/authoritative; AI plans are generated approximations
Feature flag approach: `USE_COACH_PLAN = true` → clone the coach plan directly instead of running the generation pipeline. This saved LLM costs and guaranteed coach-quality plans.
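A minimal sketch of that flag-controlled branch; `clone_coach_plan` and `run_generation_pipeline` are illustrative stand-ins for the real functions:

```python
# Illustrative stand-ins for the real clone/generate code paths.
def clone_coach_plan(coach_plan, for_user):
    return {"source": "coach", "user": for_user, "plan": coach_plan}

def run_generation_pipeline(user):
    return {"source": "ai", "user": user}  # would invoke the DSPy stages

USE_COACH_PLAN = True

def build_plan(user, coach_plan, use_coach_plan=USE_COACH_PLAN):
    if use_coach_plan and coach_plan is not None:
        # Zero LLM cost, guaranteed coach-quality output.
        return clone_coach_plan(coach_plan, for_user=user)
    return run_generation_pipeline(user)
```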
Q10: "How would you improve the system?"
- Structured diff of AI plans vs coach plans: quantitatively compare exercise overlap %, volume accuracy, and push:pull ratio deviation. This diff drives DSPy prompt optimization, then SFT once enough data accumulates.
- RLHF from user feedback: users completing workouts could rate difficulty, enjoyment, and effectiveness
- Better evaluation metrics: move beyond LLM-as-judge to outcome-based metrics (user adherence, progression over time)
- Hybrid retrieval: combine vector search with BM25 sparse retrieval for exercise lookup (Reciprocal Rank Fusion)
- Caching and cost optimization: semantic caching for similar user profiles
- Multi-modal input: accept video of exercises for form analysis
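Reciprocal Rank Fusion, mentioned above, is simple to sketch: each ranked list contributes 1/(k + rank) per document, so items that rank well in both the dense (vector) and sparse (BM25) lists float to the top. The exercise IDs here are invented:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists by summing 1/(k + rank) per document.
    k=60 is the conventional default from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Invented results from a dense (vector) and a sparse (BM25) retriever:
dense = ["incline_db_press", "db_bench_press", "machine_chest_press"]
sparse = ["db_bench_press", "push_up", "incline_db_press"]
fused = reciprocal_rank_fusion([dense, sparse])
```

`db_bench_press` (ranks 2 and 1) edges out `incline_db_press` (ranks 1 and 3), which is the desired behavior: agreement across retrievers beats a single strong showing.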
3. Understanding Qwen3 Embeddings — What Does It Mean to "Embed"?
The Short Answer
Qwen3-Embedding-0.6B was used to convert every exercise in the library into a 1024-dimensional vector. These vectors were stored in Supabase's pgvector extension and used for semantic similarity search.
What Does "Embedding" Actually Mean?
An embedding model is a neural network that converts text into a list of numbers (a vector). The key insight: semantically similar text produces similar vectors.
```
"Barbell Bench Press"  → [ 0.23, -0.81,  0.45, ...,  0.12]  (1024 numbers)
"Dumbbell Chest Press" → [ 0.21, -0.79,  0.44, ...,  0.14]  ← very similar!
"Barbell Back Squat"   → [-0.55,  0.32, -0.18, ...,  0.67]  ← very different
```
The model "understands" that bench press and chest press are semantically related (both are horizontal pressing movements targeting chest), while squats are a completely different movement pattern.
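The similarity measure behind this is cosine similarity: the dot product of two vectors divided by the product of their magnitudes. A self-contained sketch with toy 4-dimensional vectors standing in for the real 1024-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) between two embedding vectors: dot product over the
    product of magnitudes. 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim vectors standing in for real 1024-dim embeddings:
bench = [0.23, -0.81, 0.45, 0.12]
chest_press = [0.21, -0.79, 0.44, 0.14]
squat = [-0.55, 0.32, -0.18, 0.67]

assert cosine_similarity(bench, chest_press) > cosine_similarity(bench, squat)
```

In production this comparison runs inside Postgres via pgvector's cosine-distance operator rather than in application code, but the math is the same.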
How It Worked in Practice
Step 1: Pre-compute embeddings for every exercise
- Take the full exercise library (names + descriptions)
- Run each one through Qwen3-Embedding-0.6B on Modal (serverless GPU)
- Store the resulting 1024-dim vectors in Supabase pgvector
Step 2: Use vectors for search at runtime
- When the LLM generates an exercise name like "Incline Dumbbell Press", embed that name
- Find the closest vectors in the database using cosine similarity
- Return the matching canonical exercise with all its metadata
Step 3: Exercise substitution
- User says "this is too hard" or "I don't have a barbell"
- System finds semantically similar but easier/different-equipment exercises
- Vector similarity naturally groups exercises by movement pattern
Why Qwen3-Embedding-0.6B Specifically?
- 0.6B parameters — small enough to run cheaply on Modal GPUs
- 1024 dimensions — good balance of expressiveness vs. storage/compute cost
- Exercise domain — exercises have distinct semantic signatures, so a smaller model suffices
- Cost — embedding the full library is a one-time batch operation
4. What is Modal? (Serverless GPU Compute)
Modal is a serverless cloud platform for running code on GPUs without managing infrastructure. Think of it like AWS Lambda, but for ML workloads.
How It Was Used
Modal hosted the Qwen3-Embedding-0.6B model on a GPU and exposed it as an HTTP endpoint:
```python
import os

import httpx  # assumed async HTTP client; the original may differ


class ModalEmbeddingProvider:
    def __init__(self):
        self.endpoint_url = os.getenv(
            "MODAL_EMBEDDINGS_ENDPOINT",
            "https://embeddings.example.com/embed",
        )
        self._dimension = 1024  # Qwen3-Embedding-0.6B output dimensions
        self.client = httpx.AsyncClient()

    async def embed(self, texts: list[str]) -> list[list[float]]:
        # POST to the Modal-hosted endpoint, get back 1024-dim vectors
        response = await self.client.post(self.endpoint_url, json={"texts": texts})
        response.raise_for_status()
        return response.json()["embeddings"]
```
Why Modal Over Alternatives?
| Approach | Cost Model | Setup | Use Case |
|---|---|---|---|
| Own GPU server | Pay 24/7 even when idle | Install CUDA, PyTorch, deploy model, maintain | Overkill for infrequent embeddings |
| OpenAI Embeddings API | Per-token pricing | Zero setup | No control over model choice |
| Modal | Pay only for compute time | Write a decorated function | Run any Hugging Face model, on-demand |
5. Fine-Tuning Roadmap — How We Planned to Improve the AI
The Structured Diff Approach
Since both AI plans and coach plans share the exact same Pydantic data model, you can do a structured diff at every level:
```
Coach Plan (gold standard)           AI Plan (generated)
├── 4 cycles                         ├── 4 cycles                    ✓ match
│   ├── Week 1: Upper/Lower split    │   ├── Week 1: Push/Pull       ✗ different split
│   │   ├── Bench Press 3x5@RPE8     │   │   ├── Bench Press 3x5@RPE8  ✓ exact match
│   │   ├── Barbell Row 3x8          │   │   ├── Cable Row 3x10      ~ partial match
```
Quantitative metrics from this diff:
| Metric | What It Measures |
|---|---|
| Exercise overlap % | How many exercises match between AI and coach plan |
| Volume accuracy | Total sets/reps per muscle group vs coach plan |
| Push:pull ratio deviation | Balance compared to coach standard |
| Prescription format match rate | How often AI prescriptions are valid |
| Progression alignment | Does intensity increase across cycles like coach plan |
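The first metric is straightforward to compute once both plans share the canonical exercise vocabulary. This sketch assumes session exercises arrive as plain name lists already canonicalized by the embedding-match step:

```python
def exercise_overlap(coach_session, ai_session):
    """Exercise overlap % for one session: the fraction of coach-prescribed
    exercises that the AI plan also selected. Assumes names were already
    canonicalized against the exercise library."""
    coach = {e.lower() for e in coach_session}
    ai = {e.lower() for e in ai_session}
    return len(coach & ai) / len(coach) if coach else 1.0

coach = ["Bench Press", "Barbell Row", "Overhead Press", "Pull-Up"]
ai = ["Bench Press", "Cable Row", "Overhead Press", "Chin-Up"]
print(f"{exercise_overlap(coach, ai):.0%}")  # 2 of 4 match → 50%
```

Aggregated across sessions and personas, this single number becomes the objective that DSPy's prompt optimizers can climb, and later the filter for selecting SFT training pairs.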
Three Approaches (Simplest to Most Sophisticated)
Option A: DSPy Prompt Optimization (No Weight Changes)
- DSPy's built-in optimizers try different prompt variations
- Measures quality using the structured diff score
- No model weights change — just better prompts
Option B: Supervised Fine-Tuning (SFT)
- Train a smaller model on (user profile → coach plan) pairs
- Needs 500+ plan pairs minimum
- Result: cheaper model that produces coach-aligned outputs
Option C: DPO / Preference Training
- Generate both AI and coach plans for same profile
- Train model to prefer coach-like outputs
- Doesn't need exact replication, just learns "better" direction
6. Connecting Theory to Practice
| Interview Topic | Textbook Answer | Practical Example |
|---|---|---|
| LLM Training Stages | Pretraining → Fine-tuning → Alignment | Used pretrained models and orchestrated them with DSPy |
| Prompt Engineering | Zero-shot, few-shot, CoT, DSPy | DSPy signatures — typed I/O contracts — rather than raw prompting |
| RAG | Retriever + Generator pattern | Exercise library as knowledge base, vector search for retrieval |
| Embedding Models | BERT derivatives, contrastive learning | Qwen3-Embedding-0.6B, 1024-dim vectors, deployed on Modal |
| Vector Databases | pgvector, Qdrant, similarity metrics | Supabase with pgvector, cosine similarity |
| Evaluation | RAGAS, retrieval metrics | Custom LLM-as-judge with BatchEvaluator |
| Structured Output | Instructor, JSON mode | Pydantic models everywhere, DSPy signatures, PEG grammar |
| Hallucination | Grounding, RAG, validation | Prescription DSL + exercise library matching |
| Multi-model Architecture | Cost/performance tradeoffs | Grok-4 for reasoning, Claude Sonnet for generation, Haiku for extraction |
7. Key Technical Details — Quick Reference
- Models: Grok-4 (planning), Claude Sonnet 4.5 (sessions), Claude Haiku 4.5 (parsing), GPT-4o (vision/OCR)
- Framework: DSPy 3.0.4+ with MCP support
- Embeddings: Qwen3-Embedding-0.6B, 1024 dimensions
- Vector DB: Supabase pgvector
- GPU Compute: Modal (serverless)
- OCR: PaddleOCR + GPT-4o structured extraction
- Prescription Grammar: PEG-based DSL (e.g., `3 x 5 @ RPE 8 / 180s`)
- Data hierarchy: WorkPlan → WorkPlanCycle → WorkSession → WorkSessionTask → WorkSessionTaskSet
- API: FastAPI (REST + WebSocket)
- Monorepo: Nx with packages: agents, relational, vector, plans-analyze
- Observability: Langfuse for LLM tracing
- Deployment: Docker + Railway (main apps), Modal (GPU services)
- Storage: Cloudflare R2 for coach plan uploads
Key architectural decisions:
- Multi-model routing (cost/quality optimization per stage)
- DSPy over raw prompting (programmatic, testable, typed)
- Separate data models for coach vs AI plans (data integrity)
- PEG grammar for prescription validation (no garbage in)
- Vector-based exercise search over keyword search (semantic understanding)
8. Potential Curveball Questions
"Why not just fine-tune a model on coach plans?"
"Fine-tuning requires significant data volume and compute. We had a growing but limited corpus of coach plans. LLM orchestration with structured outputs gave us production-quality results faster, with the flexibility to swap models as better ones became available."
"How do you prevent the AI from generating dangerous exercises?"
"Three layers: (1) Injury constraint parsing identifies affected joints and movement restrictions, (2) Joint stress calculation estimates cumulative loading per session, (3) Exercise library matching ensures only known, vetted exercises are included."
"What happens if the LLM generates garbage?"
"Structured outputs via Pydantic enforce schema compliance. The prescription DSL rejects malformed formats at parse time. Exercise names get matched against the canonical library — unrecognized exercises are flagged."
"How did you handle latency for plan generation?"
"Plan generation is async — the agent server uses WebSocket to stream progress updates to the client. The multi-stage approach means each stage can complete and save independently. We used Claude Haiku for the highest-volume parsing stage specifically because it's fast and cheap."
"What's DSPy and why did you choose it over LangChain?"
"DSPy treats LLM calls as typed, composable modules rather than prompt templates. It gives you programmatic control — you define signatures (input/output contracts), and DSPy handles the prompting. Unlike LangChain's chain-of-prompts approach, DSPy is closer to traditional software engineering: typed interfaces, testable modules, and the ability to optimize prompts algorithmically."
"How did you decide which model to use for each stage?"
"Empirical testing. We tried different models at each stage and evaluated outputs. Planning needs strong reasoning and long context → Grok-4. Session generation needs good structured output → Claude Sonnet. Prescription parsing is a simpler extraction task → Claude Haiku (fast, cheap, sufficient). The settings.toml config made it easy to swap and test."