
AI Experience — Interview Prep

System Architecture




1. Project Overview (Your Elevator Pitch)

"I built an AI-powered fitness platform that generates personalized workout plans. The core challenge was taking real coach-designed training programs — uploaded as PDFs — and using them alongside LLMs to produce structured, periodized workout plans tailored to individual users. I designed a multi-stage generation pipeline using DSPy, multiple LLM providers, vector-based exercise search, and a custom prescription grammar to validate exercise prescriptions like '3x5 @ RPE 8'. The system could either generate plans from scratch via AI or clone coach-curated plans, depending on context."


2. Interview Questions & Answers

Q1: "How did you use AI to generate workout plans based on coach plans?"

Key point: This was LLM orchestration, not traditional model training.

The system used a multi-stage LLM pipeline built with DSPy (a framework for programming — not just prompting — language models):

| Stage | What it does | Model Used | Max Tokens |
|---|---|---|---|
| Plan Overview | Generates plan name, description, cycle structure, progression/recovery/injury guidelines | Grok-4 (via OpenRouter) | 32,384 |
| Weekly Tasks | For each cycle, generates balanced exercise programming with push:pull ratios and volume management | Grok-4 (via OpenRouter) | 24,000 |
| Session Details | Generates individual sessions with exercises, prescriptions, rest periods | Claude Sonnet 4.5 (Anthropic) | 8,000 |
| Sets & Prescriptions | Parses prescriptions like "3x5 @ RPE 8" into structured data | Claude Haiku 4.5 (Anthropic) | 4,000 |

How coach plans fed into this:

  • Coaches uploaded PDF workout plans → stored in Cloudflare R2
  • PaddleOCR (computer vision) extracted text from the PDFs
  • GPT-4o (vision model) parsed the extracted text into structured CoachWorkPlan objects via Pydantic
  • These parsed plans served as the gold standard — they could be directly cloned for users during onboarding (feature flag: USE_COACH_PLAN), or their patterns informed the AI generation pipeline

If pressed on "training": We didn't fine-tune a foundation model. We used LLM orchestration with structured outputs. The coach plans were reference data that the system could clone or use as context. The "intelligence" came from carefully designed DSPy signatures (typed input/output contracts) and multi-model routing.
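
In DSPy terms, the per-stage routing looks roughly like the sketch below. The signature definitions, model identifiers, and token limits here are illustrative stand-ins, not the production code:

python
import dspy

# Minimal stand-in signatures (the real ones use rich Pydantic types)
class PlanOverview(dspy.Signature):
    """Generate the plan name, cycle structure, and progression guidelines."""
    fitness_context: str = dspy.InputField()
    overview: str = dspy.OutputField()

class SessionDetails(dspy.Signature):
    """Generate one training session consistent with the plan overview."""
    overview: str = dspy.InputField()
    day: str = dspy.InputField()
    session: str = dspy.OutputField()

# Per-stage model routing: heavier model for planning, cheaper one for session details
planner_lm = dspy.LM("openrouter/x-ai/grok-4", max_tokens=32384)       # assumed model ID
session_lm = dspy.LM("anthropic/claude-sonnet-4-5", max_tokens=8000)   # assumed model ID

overview_gen = dspy.Predict(PlanOverview)
session_gen = dspy.Predict(SessionDetails)

with dspy.context(lm=planner_lm):    # stage 1: plan overview
    result = overview_gen(fitness_context="intermediate lifter, 4 days/week, dumbbells only")

with dspy.context(lm=session_lm):    # stage 3: session details
    day_one = session_gen(overview=result.overview, day="Day 1 — Upper Push")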


Q2: "Walk me through how you ingested and structured coach workout plans."

The Coach Plan Import Pipeline:

Coach uploads PDF/image
    → Cloudflare R2 storage
    → PaddleOCR (paddle_ocr_vl.py) extracts raw text
    → GPT-4o (pydantic_extractor.py) parses into structured data
    → Prescription DSL validation (PEG grammar)
    → Exercise library matching (vector similarity)
    → Upsert to Supabase (unified_plan_upload.py)

The data hierarchy:

CoachWorkPlan
├── CoachWorkPlanCycle (e.g., "Hypertrophy Block")
│   └── CoachWorkSession (e.g., "Day 1 — Upper Push")
│       └── CoachWorkSessionTask (e.g., "Bench Press")
│           └── CoachWorkSessionTaskSet (e.g., "Set 1: 185lbs x 5")
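
In code, this hierarchy maps to nested Pydantic models. A simplified sketch (field names are illustrative; the real models carry more metadata such as source tracking and import quality scores):

python
from pydantic import BaseModel

class CoachWorkSessionTaskSet(BaseModel):
    set_number: int
    reps: int | None = None
    weight: str | None = None          # e.g. "185lbs"

class CoachWorkSessionTask(BaseModel):
    exercise_name: str                 # e.g. "Bench Press"
    prescription: str                  # e.g. "3 x 5 @ RPE 8"
    sets: list[CoachWorkSessionTaskSet] = []

class CoachWorkSession(BaseModel):
    name: str                          # e.g. "Day 1 — Upper Push"
    tasks: list[CoachWorkSessionTask] = []

class CoachWorkPlanCycle(BaseModel):
    name: str                          # e.g. "Hypertrophy Block"
    sessions: list[CoachWorkSession] = []

class CoachWorkPlan(BaseModel):
    name: str
    cycles: list[CoachWorkPlanCycle] = []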

Prescription DSL — A PEG-based grammar to validate exercise prescriptions:

  • Valid: 3 x 5 @ RPE 8; 3 x 8-12 @ RPE 8 / 180s; 4 x 185lbs, 205lbs, 225lbs; max @ 60%
  • Invalid: 3 x 5 @ moderate (free text); 10 per leg (laterality belongs in description, not prescription)

This strict validation ensured data integrity — every prescription had to parse into structured sets/reps/weight/intensity/rest.
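
As an illustration of the approach, here is a minimal PEG grammar in the same spirit, written with the parsimonious library. It is a simplified sketch, not the production grammar, and only covers the "sets x reps @ RPE / rest" shape:

python
from parsimonious.exceptions import ParseError
from parsimonious.grammar import Grammar

# Simplified prescription grammar (illustrative; the real DSL covers more forms,
# e.g. per-set weights and percentage-based intensity)
PRESCRIPTION = Grammar("""
    prescription = sets ws "x" ws reps intensity? rest?
    sets         = number
    reps         = number range?
    range        = "-" number
    intensity    = ws "@" ws "RPE" ws number
    rest         = ws "/" ws number "s"
    number       = ~"[0-9]+"
    ws           = ~" *"
""")

def is_valid(text: str) -> bool:
    """True only if the whole prescription parses under the grammar."""
    try:
        PRESCRIPTION.parse(text)
        return True
    except ParseError:
        return False

print(is_valid("3 x 5 @ RPE 8"))            # True
print(is_valid("3 x 8-12 @ RPE 8 / 180s"))  # True
print(is_valid("3 x 5 @ moderate"))         # False: free text is rejected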


Q3: "What embedding models and vector search did you use?"

  • Embedding model: Qwen3-Embedding-0.6B generating 1024-dimensional vectors
  • Vector storage: pgvector extension in Supabase (PostgreSQL)
  • Embedding generation: Ran on Modal (serverless GPU compute using Hugging Face models)
  • Use case: Exercise similarity search — when a user needs an easier exercise or doesn't have specific equipment, the system finds semantically similar alternatives

How it worked in practice:

python
# ExerciseService: vector-based exercise substitution
alternatives = exercise_service.find_alternatives(
    exercise_id,
    easier=True,                         # find easier alternatives
    available_equipment=["dumbbell"],    # filter by available equipment
    limit=10,
)
# Returns: list[tuple[ExerciseLibrary, float]] of (exercise, similarity score) pairs

# NLExerciseParser — natural language exercise queries
constraints = nl_exercise_parser.parse("this is too hard, I don't have a barbell")
# Returns: easier=True, allowed_equipment=["dumbbell", "bodyweight", ...]

Why Qwen3 over something bigger? Cost-performance tradeoff. For exercise similarity (a relatively constrained domain), a 0.6B parameter model produced sufficient quality embeddings. We didn't need a massive model — exercises have fairly distinct semantic signatures.


Q4: "How did you evaluate the quality of AI-generated plans?"

LLM-as-Judge evaluation framework:

BatchEvaluator:
  - Generates N plans per test persona
  - LLM judge scores against rubric
  - Scoring dimensions: structure, progression, balance, engagement

ScoreAnalyzer:
  - Identifies patterns in low scores
  - Generates suggestions for prompt improvement
  - Exports timestamped results for tracking over time

Test personas represented different fitness profiles (beginner, intermediate, advanced; different goals, injuries, equipment access). Each persona would generate multiple plans, and the evaluator would score them.
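
A sketch of how such a rubric can be expressed as a typed DSPy signature for the judge (the field names and descriptions here are illustrative, not the production BatchEvaluator):

python
import dspy

class JudgeWorkoutPlan(dspy.Signature):
    """Score a generated workout plan against a fixed rubric, 1-10 per dimension."""
    persona: str = dspy.InputField(desc="fitness profile the plan was generated for")
    plan_json: str = dspy.InputField(desc="serialized plan under evaluation")
    structure: int = dspy.OutputField(desc="session/cycle structure quality")
    progression: int = dspy.OutputField(desc="sensible intensity/volume progression")
    balance: int = dspy.OutputField(desc="push:pull and muscle-group balance")
    engagement: int = dspy.OutputField(desc="variety and adherence-friendliness")

judge = dspy.ChainOfThought(JudgeWorkoutPlan)
# judge(persona=..., plan_json=...) returns the four rubric scores for one generated plan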

What I'd say about evaluation challenges:

  • LLM-as-judge has known biases (verbosity bias, position bias)
  • We used structured rubrics to constrain scoring
  • Ideally, you'd combine LLM evaluation with human expert review and user outcome data
  • The evaluation pipeline was iterative — low scores fed back into prompt refinement

Q5: "What was your RAG / retrieval approach?"

The system used retrieval-augmented generation in several ways:

  1. Exercise Library as Knowledge Base: When generating sessions, the system retrieved relevant exercises from the vector-indexed library based on the user's equipment, goals, and constraints

  2. Context Window Management: Each generation stage received context from previous stages (see the sketch after this list):

    • Session generation received all previously-generated sessions in that week (to prevent muscle imbalance)
    • The system tracked push:pull ratios and weekly volume across sessions
  3. Exercise Enrichment: After generating exercise names, the system enriched them with vector embeddings to match against the canonical exercise library — ensuring exercises had proper metadata (difficulty, joint stress, equipment requirements)
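
A sketch of the context-injection loop from item 2 above (the signature and field names are assumptions, not the actual module):

python
import dspy

dspy.configure(lm=dspy.LM("anthropic/claude-sonnet-4-5", max_tokens=8000))  # assumed model ID

class GenerateSession(dspy.Signature):
    """Generate one session, balanced against what has already been programmed this week."""
    day_brief: str = dspy.InputField()
    prior_sessions: list[str] = dspy.InputField(desc="sessions already generated this week")
    session: str = dspy.OutputField()

generate_session = dspy.Predict(GenerateSession)

sessions: list[str] = []
for day_brief in ["Day 1 — Upper Push", "Day 2 — Lower", "Day 3 — Upper Pull"]:
    result = generate_session(day_brief=day_brief, prior_sessions=sessions)
    sessions.append(result.session)   # each new session sees everything generated before it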


Q6: "How did you handle prompt engineering?"

DSPy Signatures — not raw prompts. DSPy lets you define typed input/output contracts:

python
import dspy

class GeneratePlanSignature(dspy.Signature):
    """Generate a structured workout plan."""
    fitness_context: OnboardingFitnessFields = dspy.InputField()  # project Pydantic model
    plan: WorkPlanLLM = dspy.OutputField()                        # project Pydantic model

Key prompt engineering decisions:

  1. Per-stage model routing: Different models for different cognitive tasks

    • Complex planning → Grok-4 (large context, strong reasoning)
    • Session details → Claude Sonnet (good structured output)
    • Parsing → Claude Haiku (fast, cheap, sufficient for extraction)
  2. Structured outputs everywhere: Pydantic models defined exact schemas — no free-text parsing

  3. Context injection: Each stage received relevant prior context to maintain coherence

  4. Configuration-driven: settings.toml made model selection, token limits, and endpoints configurable without code changes
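
A sketch of what that configuration-driven routing can look like (the settings.toml keys shown in the comment are assumptions, not the real file):

python
import tomllib

import dspy

# Hypothetical settings.toml layout:
#
#   [stages.plan_overview]
#   model = "openrouter/x-ai/grok-4"
#   max_tokens = 32384
#
#   [stages.prescription_parsing]
#   model = "anthropic/claude-haiku-4-5"
#   max_tokens = 4000

with open("settings.toml", "rb") as f:
    config = tomllib.load(f)

def lm_for_stage(stage: str) -> dspy.LM:
    """Build the LM for a pipeline stage from config, so model swaps need no code changes."""
    cfg = config["stages"][stage]
    return dspy.LM(cfg["model"], max_tokens=cfg["max_tokens"])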


Q7: "What was your full tech stack?"

| Layer | Technology | Purpose |
|---|---|---|
| LLM Orchestration | DSPy 3.0.4+ | Multi-stage pipeline, typed signatures |
| LLM Providers | OpenRouter (Grok-4), Anthropic (Claude), OpenAI (GPT-4o) | Multi-model routing |
| Backend API | FastAPI | REST + WebSocket endpoints |
| Database | Supabase (PostgreSQL + pgvector) | Relational data + vector search |
| Embeddings | Qwen3-Embedding-0.6B on Modal | Exercise similarity vectors |
| OCR | PaddleOCR + GPT-4o vision | Coach plan PDF ingestion |
| Frontend | Next.js, React Query, Zustand, TypeScript | Coach portal + test client |
| Monorepo | Nx | Package management across apps/packages |
| Deployment | Docker + Railway | Containerized deployment via Railway |
| Observability | Langfuse | LLM tracing and monitoring |
| Storage | Cloudflare R2 | Coach plan PDF/image storage |

Q8: "How did you handle errors and edge cases in AI generation?"

  1. Prescription DSL Validation: PEG grammar rejects malformed prescriptions at parse time — no garbage data enters the system

  2. Exercise Match Quality Tracking: When mapping LLM-generated exercise names to the canonical library, the system tracked match confidence. Low-confidence matches were flagged for review.

  3. Injury Constraint Parsing: A dedicated DSPy signature parsed free-text injury descriptions into structured constraints (affected joints, movement restrictions, severity)

  4. Joint Stress Calculation: Another signature estimated cumulative joint loading across a session to prevent overloading injured or stressed areas

  5. Bilateral Exercise Handling: The system tracked whether exercises were unilateral or bilateral to correctly calculate volume

  6. WebSocket Queue System: The agent server used a queue system for async plan saves — if generation succeeded but the save failed, the plan wasn't lost


Q9: "What's the difference between AI-generated plans and coach plans in your system?"

Parallel data models:

  • WorkPlan / WorkPlanCycle / WorkSession / WorkSessionTask — AI-generated
  • CoachWorkPlan / CoachWorkPlanCycle / CoachWorkSession / CoachWorkSessionTask — Coach-created

Why separate models?

  • Coach plans have different metadata (source tracking, import quality scores)
  • AI plans carry generation context (the onboarding data that produced them)
  • Coach plans are treated as canonical/authoritative; AI plans are generated approximations

Feature flag approach: USE_COACH_PLAN = true → clone the coach plan directly instead of running the generation pipeline. This saved LLM costs and guaranteed coach-quality plans.


Q10: "How would you improve the system?"

  1. Structured diff: AI plans vs coach plans — quantitatively compare exercise overlap %, volume accuracy, push:pull ratio deviation. This diff drives DSPy prompt optimization, then SFT if enough data accumulates.

  2. RLHF from user feedback — users completing workouts could rate difficulty, enjoyment, effectiveness

  3. Better evaluation metrics — move beyond LLM-as-judge to outcome-based metrics (user adherence, progression over time)

  4. Hybrid retrieval — combine vector search with BM25 sparse retrieval for exercise lookup (Reciprocal Rank Fusion)

  5. Caching and cost optimization — semantic caching for similar user profiles

  6. Multi-modal input — accept video of exercises for form analysis


3. Understanding Qwen3 Embeddings — What Does It Mean to "Embed"?

The Short Answer

Qwen3-Embedding-0.6B was used to convert every exercise in the library into a 1024-dimensional vector. These vectors were stored in Supabase's pgvector extension and used for semantic similarity search.

What Does "Embedding" Actually Mean?

An embedding model is a neural network that converts text into a list of numbers (a vector). The key insight: semantically similar text produces similar vectors.

"Barbell Bench Press"  →  [0.23, -0.81, 0.45, ..., 0.12]  (1024 numbers)
"Dumbbell Chest Press"  →  [0.21, -0.79, 0.44, ..., 0.14]  (1024 numbers)  ← very similar!
"Barbell Back Squat"    →  [-0.55, 0.32, -0.18, ..., 0.67]  (1024 numbers)  ← very different

The model "understands" that bench press and chest press are semantically related (both are horizontal pressing movements targeting chest), while squats are a completely different movement pattern.
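
"Similar" here means a high cosine similarity between the vectors. Using truncated, toy versions of the example vectors above:

python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

bench = np.array([0.23, -0.81, 0.45, 0.12])      # toy 4-dim stand-ins for the 1024-dim vectors
db_press = np.array([0.21, -0.79, 0.44, 0.14])
squat = np.array([-0.55, 0.32, -0.18, 0.67])

print(cosine_similarity(bench, db_press))  # close to 1.0: semantically similar
print(cosine_similarity(bench, squat))     # much lower: different movement pattern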

How It Worked in Practice

Step 1: Pre-compute embeddings for every exercise

  • Take the full exercise library (names + descriptions)
  • Run each one through Qwen3-Embedding-0.6B on Modal (serverless GPU)
  • Store the resulting 1024-dim vectors in Supabase pgvector

Step 2: Use vectors for search at runtime

  • When the LLM generates an exercise name like "Incline Dumbbell Press", embed that name
  • Find the closest vectors in the database using cosine similarity
  • Return the matching canonical exercise with all its metadata
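
With pgvector, that lookup is a single SQL query. A sketch (table and column names are assumptions) using pgvector's cosine-distance operator <=>:

python
import asyncpg  # assumption: plain asyncpg shown here; Supabase is Postgres underneath

async def find_closest_exercises(conn: asyncpg.Connection,
                                 query_vec: list[float],
                                 limit: int = 5):
    """Return canonical exercises whose stored embeddings are closest to the query vector."""
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    return await conn.fetch(
        """
        SELECT id, name, 1 - (embedding <=> $1::vector) AS similarity
        FROM exercise_library                   -- assumed table name
        ORDER BY embedding <=> $1::vector       -- <=> is pgvector's cosine distance
        LIMIT $2
        """,
        vec_literal,
        limit,
    )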

Step 3: Exercise substitution

  • User says "this is too hard" or "I don't have a barbell"
  • System finds semantically similar but easier/different-equipment exercises
  • Vector similarity naturally groups exercises by movement pattern

Why Qwen3-Embedding-0.6B Specifically?

  • 0.6B parameters — small enough to run cheaply on Modal GPUs
  • 1024 dimensions — good balance of expressiveness vs. storage/compute cost
  • Exercise domain — exercises have distinct semantic signatures, so a smaller model suffices
  • Cost — embedding the full library is a one-time batch operation

4. What is Modal? (Serverless GPU Compute)

Modal is a serverless cloud platform for running code on GPUs without managing infrastructure. Think of it like AWS Lambda, but for ML workloads.

How It Was Used

Modal hosted the Qwen3-Embedding-0.6B model on a GPU and exposed it as an HTTP endpoint:

python
import os

import httpx  # assumed async HTTP client, matching the awaited .post() call below


class ModalEmbeddingProvider:
    def __init__(self):
        self.endpoint_url = os.getenv(
            "MODAL_EMBEDDINGS_ENDPOINT",
            "https://embeddings.example.com/embed",
        )
        self._dimension = 1024  # Qwen3-Embedding-0.6B output dimensions
        self.client = httpx.AsyncClient()

    async def embed(self, texts: list[str]) -> list[list[float]]:
        # POST to the Modal-hosted endpoint, get back 1024-dim vectors
        response = await self.client.post(self.endpoint_url, json={"texts": texts})
        response.raise_for_status()
        return response.json()["embeddings"]
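
For context, the Modal side is roughly a decorated function. A sketch under assumed settings (GPU type, image contents, endpoint decorator are illustrative), with the model reloaded per call for brevity where a real deployment would cache it:

python
import modal

app = modal.App("exercise-embeddings")
image = modal.Image.debian_slim().pip_install("sentence-transformers")

@app.function(gpu="T4", image=image)
@modal.web_endpoint(method="POST")
def embed(payload: dict) -> dict:
    """Embed a batch of texts with Qwen3-Embedding-0.6B and return 1024-dim vectors."""
    # Imported inside the function because the package only exists in the Modal image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
    vectors = model.encode(payload["texts"], normalize_embeddings=True)
    return {"embeddings": vectors.tolist()}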

Why Modal Over Alternatives?

| Approach | Cost Model | Setup | Use Case |
|---|---|---|---|
| Own GPU server | Pay 24/7 even when idle | Install CUDA, PyTorch, deploy model, maintain | Overkill for infrequent embeddings |
| OpenAI Embeddings API | Per-token pricing | Zero setup | No control over model choice |
| Modal | Pay only for compute time | Write a decorated function | Run any Hugging Face model, on demand |

5. Fine-Tuning Roadmap — How We Planned to Improve the AI

The Structured Diff Approach

Since AI plans and coach plans are parallel Pydantic models with the same hierarchical structure, you can do a structured diff at every level:

Coach Plan (gold standard)          AI Plan (generated)
├── 4 cycles                        ├── 4 cycles              ✓ match
│   ├── Week 1: Upper/Lower split   │   ├── Week 1: Push/Pull  ✗ different split
│   │   ├── Bench Press 3x5@RPE8   │   │   ├── Bench Press 3x5@RPE8  ✓ exact match
│   │   ├── Barbell Row 3x8        │   │   ├── Cable Row 3x10        ~ partial match

Quantitative metrics from this diff:

| Metric | What It Measures |
|---|---|
| Exercise overlap % | How many exercises match between AI and coach plan |
| Volume accuracy | Total sets/reps per muscle group vs. coach plan |
| Push:pull ratio deviation | Balance compared to the coach standard |
| Prescription format match rate | How often AI prescriptions are valid |
| Progression alignment | Whether intensity increases across cycles as in the coach plan |
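
The first of these metrics reduces to a simple set comparison over the parallel plan structures. A sketch with plain exercise-name lists standing in for the Pydantic models:

python
def exercise_overlap(coach_exercises: list[str], ai_exercises: list[str]) -> float:
    """Fraction of the coach plan's exercises that also appear in the AI plan."""
    coach = {e.lower() for e in coach_exercises}
    ai = {e.lower() for e in ai_exercises}
    return len(coach & ai) / len(coach) if coach else 0.0

coach_day = ["Bench Press", "Barbell Row", "Overhead Press"]
ai_day = ["Bench Press", "Cable Row", "Overhead Press"]
print(round(exercise_overlap(coach_day, ai_day), 2))  # 0.67: Cable Row vs Barbell Row is a miss

In practice, near-misses like "Cable Row" vs "Barbell Row" could be scored with the same embedding similarity used for exercise matching rather than counted as hard misses.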

Three Approaches (Simplest to Most Sophisticated)

Option A: DSPy Prompt Optimization (No Weight Changes)

  • DSPy's built-in optimizers try different prompt variations
  • Measures quality using the structured diff score
  • No model weights change — just better prompts

Option B: Supervised Fine-Tuning (SFT)

  • Train a smaller model on (user profile → coach plan) pairs
  • Needs 500+ plan pairs minimum
  • Result: cheaper model that produces coach-aligned outputs

Option C: DPO / Preference Training

  • Generate both AI and coach plans for same profile
  • Train model to prefer coach-like outputs
  • Doesn't need exact replication, just learns "better" direction

6. Connecting Theory to Practice

| Interview Topic | Textbook Answer | Practical Example |
|---|---|---|
| LLM Training Stages | Pretraining → Fine-tuning → Alignment | Used pretrained models and orchestrated them with DSPy |
| Prompt Engineering | Zero-shot, few-shot, CoT, DSPy | DSPy signatures (typed I/O contracts) rather than raw prompting |
| RAG | Retriever + generator pattern | Exercise library as knowledge base, vector search for retrieval |
| Embedding Models | BERT derivatives, contrastive learning | Qwen3-Embedding-0.6B, 1024-dim vectors, deployed on Modal |
| Vector Databases | pgvector, Qdrant, similarity metrics | Supabase with pgvector, cosine similarity |
| Evaluation | RAGAS, retrieval metrics | Custom LLM-as-judge with BatchEvaluator |
| Structured Output | Instructor, JSON mode | Pydantic models everywhere, DSPy signatures, PEG grammar |
| Hallucination | Grounding, RAG, validation | Prescription DSL + exercise library matching |
| Multi-model Architecture | Cost/performance tradeoffs | Grok-4 for reasoning, Claude Sonnet for generation, Haiku for extraction |

7. Key Technical Details — Quick Reference

  • Models: Grok-4 (planning), Claude Sonnet 4.5 (sessions), Claude Haiku 4.5 (parsing), GPT-4o (vision/OCR)
  • Framework: DSPy 3.0.4+ with MCP support
  • Embeddings: Qwen3-Embedding-0.6B, 1024 dimensions
  • Vector DB: Supabase pgvector
  • GPU Compute: Modal (serverless)
  • OCR: PaddleOCR + GPT-4o structured extraction
  • Prescription Grammar: PEG-based DSL (e.g., 3 x 5 @ RPE 8 / 180s)
  • Data hierarchy: WorkPlan → WorkPlanCycle → WorkSession → WorkSessionTask → WorkSessionTaskSet
  • API: FastAPI (REST + WebSocket)
  • Monorepo: Nx, with packages agents, relational, vector, plans-analyze
  • Observability: Langfuse for LLM tracing
  • Deployment: Docker + Railway (main apps), Modal (GPU services)
  • Storage: Cloudflare R2 for coach plan uploads

Key architectural decisions:

  1. Multi-model routing (cost/quality optimization per stage)
  2. DSPy over raw prompting (programmatic, testable, typed)
  3. Separate data models for coach vs AI plans (data integrity)
  4. PEG grammar for prescription validation (no garbage in)
  5. Vector-based exercise search over keyword search (semantic understanding)

8. Potential Curveball Questions

"Why not just fine-tune a model on coach plans?"

"Fine-tuning requires significant data volume and compute. We had a growing but limited corpus of coach plans. LLM orchestration with structured outputs gave us production-quality results faster, with the flexibility to swap models as better ones became available."

"How do you prevent the AI from generating dangerous exercises?"

"Three layers: (1) Injury constraint parsing identifies affected joints and movement restrictions, (2) Joint stress calculation estimates cumulative loading per session, (3) Exercise library matching ensures only known, vetted exercises are included."

"What happens if the LLM generates garbage?"

"Structured outputs via Pydantic enforce schema compliance. The prescription DSL rejects malformed formats at parse time. Exercise names get matched against the canonical library — unrecognized exercises are flagged."

"How did you handle latency for plan generation?"

"Plan generation is async — the agent server uses WebSocket to stream progress updates to the client. The multi-stage approach means each stage can complete and save independently. We used Claude Haiku for the highest-volume parsing stage specifically because it's fast and cheap."

"What's DSPy and why did you choose it over LangChain?"

"DSPy treats LLM calls as typed, composable modules rather than prompt templates. It gives you programmatic control — you define signatures (input/output contracts), and DSPy handles the prompting. Unlike LangChain's chain-of-prompts approach, DSPy is closer to traditional software engineering: typed interfaces, testable modules, and the ability to optimize prompts algorithmically."

"How did you decide which model to use for each stage?"

"Empirical testing. We tried different models at each stage and evaluated outputs. Planning needs strong reasoning and long context → Grok-4. Session generation needs good structured output → Claude Sonnet. Prescription parsing is a simpler extraction task → Claude Haiku (fast, cheap, sufficient). The settings.toml config made it easy to swap and test."