
AI Experience — Interview Prep

System Architecture




1. Project Overview (Your Elevator Pitch)

"I built an AI-powered fitness platform that generates personalized workout plans. The core challenge was taking real coach-designed training programs — uploaded as PDFs — and using them alongside LLMs to produce structured, periodized workout plans tailored to individual users. I designed a multi-stage generation pipeline using DSPy, multiple LLM providers, vector-based exercise search, and a custom prescription grammar to validate exercise prescriptions like '3x5 @ RPE 8'. The system could either generate plans from scratch via AI or clone coach-curated plans, depending on context."


2. Interview Questions & Answers

Q1: "How did you use AI to generate workout plans based on coach plans?"

Key point: This was LLM orchestration, not traditional model training.

The system used a multi-stage LLM pipeline built with DSPy (a framework for programming — not just prompting — language models):

| Stage | What it does | Model Used | Max Tokens |
|---|---|---|---|
| Plan Overview | Generates plan name, description, cycle structure, progression/recovery/injury guidelines | Grok-4 (via OpenRouter) | 32,384 |
| Weekly Tasks | For each cycle, generates balanced exercise programming with push:pull ratios and volume management | Grok-4 (via OpenRouter) | 24,000 |
| Session Details | Generates individual sessions with exercises, prescriptions, rest periods | Claude Sonnet 4.5 (Anthropic) | 8,000 |
| Sets & Prescriptions | Parses prescriptions like "3x5 @ RPE 8" into structured data | Claude Haiku 4.5 (Anthropic) | 4,000 |

How coach plans fed into this:

  • Coaches uploaded PDF workout plans → stored in Cloudflare R2
  • PaddleOCR (computer vision) extracted text from the PDFs
  • GPT-4o (vision model) parsed the extracted text into structured CoachWorkPlan objects via Pydantic
  • These parsed plans served as the gold standard — they could be directly cloned for users during onboarding (feature flag: USE_COACH_PLAN), or their patterns informed the AI generation pipeline

If pressed on "training": We didn't fine-tune a foundation model. We used LLM orchestration with structured outputs. The coach plans were reference data that the system could clone or use as context. The "intelligence" came from carefully designed DSPy signatures (typed input/output contracts) and multi-model routing.
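
In DSPy terms, the per-stage routing looks roughly like the sketch below. The signature definitions, model identifiers, and token limits here are illustrative stand-ins, not the production code:

python
import dspy

# Minimal stand-in signatures (the real ones use rich Pydantic types)
class PlanOverview(dspy.Signature):
    """Generate the plan name, cycle structure, and progression guidelines."""
    fitness_context: str = dspy.InputField()
    overview: str = dspy.OutputField()

class SessionDetails(dspy.Signature):
    """Generate one training session consistent with the plan overview."""
    overview: str = dspy.InputField()
    day: str = dspy.InputField()
    session: str = dspy.OutputField()

# Per-stage model routing: heavier model for planning, cheaper one for session details
planner_lm = dspy.LM("openrouter/x-ai/grok-4", max_tokens=32384)       # assumed model ID
session_lm = dspy.LM("anthropic/claude-sonnet-4-5", max_tokens=8000)   # assumed model ID

overview_gen = dspy.Predict(PlanOverview)
session_gen = dspy.Predict(SessionDetails)

with dspy.context(lm=planner_lm):    # stage 1: plan overview
    result = overview_gen(fitness_context="intermediate lifter, 4 days/week, dumbbells only")

with dspy.context(lm=session_lm):    # stage 3: session details
    day_one = session_gen(overview=result.overview, day="Day 1 — Upper Push")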


Q2: "Walk me through how you ingested and structured coach workout plans."

The Coach Plan Import Pipeline:

Coach uploads PDF/image
    → Cloudflare R2 storage
    → PaddleOCR (paddle_ocr_vl.py) extracts raw text
    → GPT-4o (pydantic_extractor.py) parses into structured data
    → Prescription DSL validation (PEG grammar)
    → Exercise library matching (vector similarity)
    → Upsert to Supabase (unified_plan_upload.py)

The data hierarchy:

CoachWorkPlan
├── CoachWorkPlanCycle (e.g., "Hypertrophy Block")
│   └── CoachWorkSession (e.g., "Day 1 — Upper Push")
│       └── CoachWorkSessionTask (e.g., "Bench Press")
│           └── CoachWorkSessionTaskSet (e.g., "Set 1: 185lbs x 5")
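
In code, this hierarchy maps to nested Pydantic models. A simplified sketch (field names are illustrative; the real models carry more metadata such as source tracking and import quality scores):

python
from pydantic import BaseModel

class CoachWorkSessionTaskSet(BaseModel):
    set_number: int
    reps: int | None = None
    weight: str | None = None          # e.g. "185lbs"

class CoachWorkSessionTask(BaseModel):
    exercise_name: str                 # e.g. "Bench Press"
    prescription: str                  # e.g. "3 x 5 @ RPE 8"
    sets: list[CoachWorkSessionTaskSet] = []

class CoachWorkSession(BaseModel):
    name: str                          # e.g. "Day 1 — Upper Push"
    tasks: list[CoachWorkSessionTask] = []

class CoachWorkPlanCycle(BaseModel):
    name: str                          # e.g. "Hypertrophy Block"
    sessions: list[CoachWorkSession] = []

class CoachWorkPlan(BaseModel):
    name: str
    cycles: list[CoachWorkPlanCycle] = []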

Prescription DSL — A PEG-based grammar to validate exercise prescriptions:

  • Valid: 3 x 5 @ RPE 8; 3 x 8-12 @ RPE 8 / 180s; 4 x 185lbs, 205lbs, 225lbs; max @ 60%
  • Invalid: 3 x 5 @ moderate (free text); 10 per leg (laterality belongs in description, not prescription)

This strict validation ensured data integrity — every prescription had to parse into structured sets/reps/weight/intensity/rest.
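
As an illustration of the approach, here is a minimal PEG grammar in the same spirit, written with the parsimonious library. It is a simplified sketch, not the production grammar, and only covers the "sets x reps @ RPE / rest" shape:

python
from parsimonious.exceptions import ParseError
from parsimonious.grammar import Grammar

# Simplified prescription grammar (illustrative; the real DSL covers more forms,
# e.g. per-set weights and percentage-based intensity)
PRESCRIPTION = Grammar("""
    prescription = sets ws "x" ws reps intensity? rest?
    sets         = number
    reps         = number range?
    range        = "-" number
    intensity    = ws "@" ws "RPE" ws number
    rest         = ws "/" ws number "s"
    number       = ~"[0-9]+"
    ws           = ~" *"
""")

def is_valid(text: str) -> bool:
    """True only if the whole prescription parses under the grammar."""
    try:
        PRESCRIPTION.parse(text)
        return True
    except ParseError:
        return False

print(is_valid("3 x 5 @ RPE 8"))            # True
print(is_valid("3 x 8-12 @ RPE 8 / 180s"))  # True
print(is_valid("3 x 5 @ moderate"))         # False: free text is rejected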


Q3: "What embedding models and vector search did you use?"

  • Embedding model: Qwen3-Embedding-0.6B generating 1024-dimensional vectors
  • Vector storage: pgvector extension in Supabase (PostgreSQL)
  • Embedding generation: Ran on Modal (serverless GPU compute using Hugging Face models)
  • Use case: Exercise similarity search — when a user needs an easier exercise or doesn't have specific equipment, the system finds semantically similar alternatives

How it worked in practice:

python
# ExerciseService: vector-based exercise substitution
alternatives = exercise_service.find_alternatives(
    exercise_id,
    easier=True,                         # find easier alternatives
    available_equipment=["dumbbell"],    # filter by available equipment
    limit=10,
)
# Returns: list[tuple[ExerciseLibrary, float]] of (exercise, similarity score) pairs

# NLExerciseParser — natural language exercise queries
constraints = nl_exercise_parser.parse("this is too hard, I don't have a barbell")
# Returns: easier=True, allowed_equipment=["dumbbell", "bodyweight", ...]

Why Qwen3 over something bigger? Cost-performance tradeoff. For exercise similarity (a relatively constrained domain), a 0.6B parameter model produced sufficient quality embeddings. We didn't need a massive model — exercises have fairly distinct semantic signatures.


Q4: "How did you evaluate the quality of AI-generated plans?"

LLM-as-Judge evaluation framework:

BatchEvaluator:
  - Generates N plans per test persona
  - LLM judge scores against rubric
  - Scoring dimensions: structure, progression, balance, engagement

ScoreAnalyzer:
  - Identifies patterns in low scores
  - Generates suggestions for prompt improvement
  - Exports timestamped results for tracking over time

Test personas represented different fitness profiles (beginner, intermediate, advanced; different goals, injuries, equipment access). Each persona would generate multiple plans, and the evaluator would score them.
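
A sketch of how such a rubric can be expressed as a typed DSPy signature for the judge (the field names and descriptions here are illustrative, not the production BatchEvaluator):

python
import dspy

class JudgeWorkoutPlan(dspy.Signature):
    """Score a generated workout plan against a fixed rubric, 1-10 per dimension."""
    persona: str = dspy.InputField(desc="fitness profile the plan was generated for")
    plan_json: str = dspy.InputField(desc="serialized plan under evaluation")
    structure: int = dspy.OutputField(desc="session/cycle structure quality")
    progression: int = dspy.OutputField(desc="sensible intensity/volume progression")
    balance: int = dspy.OutputField(desc="push:pull and muscle-group balance")
    engagement: int = dspy.OutputField(desc="variety and adherence-friendliness")

judge = dspy.ChainOfThought(JudgeWorkoutPlan)
# judge(persona=..., plan_json=...) returns the four rubric scores for one generated plan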

What I'd say about evaluation challenges:

  • LLM-as-judge has known biases (verbosity bias, position bias)
  • We used structured rubrics to constrain scoring
  • Ideally, you'd combine LLM evaluation with human expert review and user outcome data
  • The evaluation pipeline was iterative — low scores fed back into prompt refinement

Q5: "What was your RAG / retrieval approach?"

The system used retrieval-augmented generation in several ways:

  1. Exercise Library as Knowledge Base: When generating sessions, the system retrieved relevant exercises from the vector-indexed library based on the user's equipment, goals, and constraints

  2. Context Window Management: Each generation stage received context from previous stages (see the sketch after this list):

    • Session generation received all previously-generated sessions in that week (to prevent muscle imbalance)
    • The system tracked push:pull ratios and weekly volume across sessions
  3. Exercise Enrichment: After generating exercise names, the system enriched them with vector embeddings to match against the canonical exercise library — ensuring exercises had proper metadata (difficulty, joint stress, equipment requirements)
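
A sketch of the context-injection loop from item 2 above (the signature and field names are assumptions, not the actual module):

python
import dspy

dspy.configure(lm=dspy.LM("anthropic/claude-sonnet-4-5", max_tokens=8000))  # assumed model ID

class GenerateSession(dspy.Signature):
    """Generate one session, balanced against what has already been programmed this week."""
    day_brief: str = dspy.InputField()
    prior_sessions: list[str] = dspy.InputField(desc="sessions already generated this week")
    session: str = dspy.OutputField()

generate_session = dspy.Predict(GenerateSession)

sessions: list[str] = []
for day_brief in ["Day 1 — Upper Push", "Day 2 — Lower", "Day 3 — Upper Pull"]:
    result = generate_session(day_brief=day_brief, prior_sessions=sessions)
    sessions.append(result.session)   # each new session sees everything generated before it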


Q6: "How did you handle prompt engineering?"

DSPy Signatures — not raw prompts. DSPy lets you define typed input/output contracts:

python
import dspy

class GeneratePlanSignature(dspy.Signature):
    """Generate a structured workout plan."""
    fitness_context: OnboardingFitnessFields = dspy.InputField()  # project Pydantic model
    plan: WorkPlanLLM = dspy.OutputField()                        # project Pydantic model

Key prompt engineering decisions:

  1. Per-stage model routing: Different models for different cognitive tasks

    • Complex planning → Grok-4 (large context, strong reasoning)
    • Session details → Claude Sonnet (good structured output)
    • Parsing → Claude Haiku (fast, cheap, sufficient for extraction)
  2. Structured outputs everywhere: Pydantic models defined exact schemas — no free-text parsing

  3. Context injection: Each stage received relevant prior context to maintain coherence

  4. Configuration-driven: settings.toml made model selection, token limits, and endpoints configurable without code changes
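
A sketch of what that configuration-driven routing can look like (the settings.toml keys shown in the comment are assumptions, not the real file):

python
import tomllib

import dspy

# Hypothetical settings.toml layout:
#
#   [stages.plan_overview]
#   model = "openrouter/x-ai/grok-4"
#   max_tokens = 32384
#
#   [stages.prescription_parsing]
#   model = "anthropic/claude-haiku-4-5"
#   max_tokens = 4000

with open("settings.toml", "rb") as f:
    config = tomllib.load(f)

def lm_for_stage(stage: str) -> dspy.LM:
    """Build the LM for a pipeline stage from config, so model swaps need no code changes."""
    cfg = config["stages"][stage]
    return dspy.LM(cfg["model"], max_tokens=cfg["max_tokens"])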


Q7: "What was your full tech stack?"

| Layer | Technology | Purpose |
|---|---|---|
| LLM Orchestration | DSPy 3.0.4+ | Multi-stage pipeline, typed signatures |
| LLM Providers | OpenRouter (Grok-4), Anthropic (Claude), OpenAI (GPT-4o) | Multi-model routing |
| Backend API | FastAPI | REST + WebSocket endpoints |
| Database | Supabase (PostgreSQL + pgvector) | Relational data + vector search |
| Embeddings | Qwen3-Embedding-0.6B on Modal | Exercise similarity vectors |
| OCR | PaddleOCR + GPT-4o vision | Coach plan PDF ingestion |
| Frontend | Next.js, React Query, Zustand, TypeScript | Coach portal + test client |
| Monorepo | Nx | Package management across apps/packages |
| Deployment | Docker + Railway | Containerized deployment via Railway |
| Observability | Langfuse | LLM tracing and monitoring |
| Storage | Cloudflare R2 | Coach plan PDF/image storage |

Q8: "How did you handle errors and edge cases in AI generation?"

  1. Prescription DSL Validation: PEG grammar rejects malformed prescriptions at parse time — no garbage data enters the system

  2. Exercise Match Quality Tracking: When mapping LLM-generated exercise names to the canonical library, the system tracked match confidence. Low-confidence matches were flagged for review.

  3. Injury Constraint Parsing: A dedicated DSPy signature parsed free-text injury descriptions into structured constraints (affected joints, movement restrictions, severity)

  4. Joint Stress Calculation: Another signature estimated cumulative joint loading across a session to prevent overloading injured or stressed areas

  5. Bilateral Exercise Handling: The system tracked whether exercises were unilateral or bilateral to correctly calculate volume

  6. WebSocket Queue System: The agent server used a queue system for async plan saves — if generation succeeded but the save failed, the plan wasn't lost


Q9: "What's the difference between AI-generated plans and coach plans in your system?"

Parallel data models:

  • WorkPlan / WorkPlanCycle / WorkSession / WorkSessionTask — AI-generated
  • CoachWorkPlan / CoachWorkPlanCycle / CoachWorkSession / CoachWorkSessionTask — Coach-created

Why separate models?

  • Coach plans have different metadata (source tracking, import quality scores)
  • AI plans carry generation context (the onboarding data that produced them)
  • Coach plans are treated as canonical/authoritative; AI plans are generated approximations

Feature flag approach: USE_COACH_PLAN = true → clone the coach plan directly instead of running the generation pipeline. This saved LLM costs and guaranteed coach-quality plans.


Q10: "How would you improve the system?"

  1. Structured diff: AI plans vs coach plans — quantitatively compare exercise overlap %, volume accuracy, push:pull ratio deviation. This diff drives DSPy prompt optimization, then SFT if enough data accumulates.

  2. RLHF from user feedback — users completing workouts could rate difficulty, enjoyment, effectiveness

  3. Better evaluation metrics — move beyond LLM-as-judge to outcome-based metrics (user adherence, progression over time)

  4. Hybrid retrieval — combine vector search with BM25 sparse retrieval for exercise lookup (Reciprocal Rank Fusion)

  5. Caching and cost optimization — semantic caching for similar user profiles

  6. Multi-modal input — accept video of exercises for form analysis


3. Understanding Qwen3 Embeddings — What Does It Mean to "Embed"?

The Short Answer

Qwen3-Embedding-0.6B was used to convert every exercise in the library into a 1024-dimensional vector. These vectors were stored in Supabase's pgvector extension and used for semantic similarity search.

What Does "Embedding" Actually Mean?

An embedding model is a neural network that converts text into a list of numbers (a vector). The key insight: semantically similar text produces similar vectors.

"Barbell Bench Press"  →  [0.23, -0.81, 0.45, ..., 0.12]  (1024 numbers)
"Dumbbell Chest Press"  →  [0.21, -0.79, 0.44, ..., 0.14]  (1024 numbers)  ← very similar!
"Barbell Back Squat"    →  [-0.55, 0.32, -0.18, ..., 0.67]  (1024 numbers)  ← very different

The model "understands" that bench press and chest press are semantically related (both are horizontal pressing movements targeting chest), while squats are a completely different movement pattern.
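
"Similar" here means a high cosine similarity between the vectors. Using truncated, toy versions of the example vectors above:

python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

bench = np.array([0.23, -0.81, 0.45, 0.12])      # toy 4-dim stand-ins for the 1024-dim vectors
db_press = np.array([0.21, -0.79, 0.44, 0.14])
squat = np.array([-0.55, 0.32, -0.18, 0.67])

print(cosine_similarity(bench, db_press))  # close to 1.0: semantically similar
print(cosine_similarity(bench, squat))     # much lower: different movement pattern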

How It Worked in Practice

Step 1: Pre-compute embeddings for every exercise

  • Take the full exercise library (names + descriptions)
  • Run each one through Qwen3-Embedding-0.6B on Modal (serverless GPU)
  • Store the resulting 1024-dim vectors in Supabase pgvector

Step 2: Use vectors for search at runtime

  • When the LLM generates an exercise name like "Incline Dumbbell Press", embed that name
  • Find the closest vectors in the database using cosine similarity
  • Return the matching canonical exercise with all its metadata
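
With pgvector, that lookup is a single SQL query. A sketch (table and column names are assumptions) using pgvector's cosine-distance operator <=>:

python
import asyncpg  # assumption: plain asyncpg shown here; Supabase is Postgres underneath

async def find_closest_exercises(conn: asyncpg.Connection,
                                 query_vec: list[float],
                                 limit: int = 5):
    """Return canonical exercises whose stored embeddings are closest to the query vector."""
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    return await conn.fetch(
        """
        SELECT id, name, 1 - (embedding <=> $1::vector) AS similarity
        FROM exercise_library                   -- assumed table name
        ORDER BY embedding <=> $1::vector       -- <=> is pgvector's cosine distance
        LIMIT $2
        """,
        vec_literal,
        limit,
    )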

Step 3: Exercise substitution

  • User says "this is too hard" or "I don't have a barbell"
  • System finds semantically similar but easier/different-equipment exercises
  • Vector similarity naturally groups exercises by movement pattern

Why Qwen3-Embedding-0.6B Specifically?

  • 0.6B parameters — small enough to run cheaply on Modal GPUs
  • 1024 dimensions — good balance of expressiveness vs. storage/compute cost
  • Exercise domain — exercises have distinct semantic signatures, so a smaller model suffices
  • Cost — embedding the full library is a one-time batch operation

4. What is Modal? (Serverless GPU Compute)

Modal is a serverless cloud platform for running code on GPUs without managing infrastructure. Think of it like AWS Lambda, but for ML workloads.

How It Was Used

Modal hosted the Qwen3-Embedding-0.6B model on a GPU and exposed it as an HTTP endpoint:

python
import os

import httpx  # assumed async HTTP client, matching the awaited .post() call below


class ModalEmbeddingProvider:
    def __init__(self):
        self.endpoint_url = os.getenv(
            "MODAL_EMBEDDINGS_ENDPOINT",
            "https://embeddings.example.com/embed",
        )
        self._dimension = 1024  # Qwen3-Embedding-0.6B output dimensions
        self.client = httpx.AsyncClient()

    async def embed(self, texts: list[str]) -> list[list[float]]:
        # POST to the Modal-hosted endpoint, get back 1024-dim vectors
        response = await self.client.post(self.endpoint_url, json={"texts": texts})
        response.raise_for_status()
        return response.json()["embeddings"]
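
For context, the Modal side is roughly a decorated function. A sketch under assumed settings (GPU type, image contents, endpoint decorator are illustrative), with the model reloaded per call for brevity where a real deployment would cache it:

python
import modal

app = modal.App("exercise-embeddings")
image = modal.Image.debian_slim().pip_install("sentence-transformers")

@app.function(gpu="T4", image=image)
@modal.web_endpoint(method="POST")
def embed(payload: dict) -> dict:
    """Embed a batch of texts with Qwen3-Embedding-0.6B and return 1024-dim vectors."""
    # Imported inside the function because the package only exists in the Modal image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
    vectors = model.encode(payload["texts"], normalize_embeddings=True)
    return {"embeddings": vectors.tolist()}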

Why Modal Over Alternatives?

| Approach | Cost Model | Setup | Use Case |
|---|---|---|---|
| Own GPU server | Pay 24/7 even when idle | Install CUDA, PyTorch, deploy model, maintain | Overkill for infrequent embeddings |
| OpenAI Embeddings API | Per-token pricing | Zero setup | No control over model choice |
| Modal | Pay only for compute time | Write a decorated function | Run any Hugging Face model, on demand |

5. Fine-Tuning Roadmap — How We Planned to Improve the AI

The Structured Diff Approach

Since AI plans and coach plans are parallel Pydantic models with the same hierarchical structure, you can do a structured diff at every level:

Coach Plan (gold standard)          AI Plan (generated)
├── 4 cycles                        ├── 4 cycles              ✓ match
│   ├── Week 1: Upper/Lower split   │   ├── Week 1: Push/Pull  ✗ different split
│   │   ├── Bench Press 3x5@RPE8   │   │   ├── Bench Press 3x5@RPE8  ✓ exact match
│   │   ├── Barbell Row 3x8        │   │   ├── Cable Row 3x10        ~ partial match

Quantitative metrics from this diff:

| Metric | What It Measures |
|---|---|
| Exercise overlap % | How many exercises match between AI and coach plan |
| Volume accuracy | Total sets/reps per muscle group vs. coach plan |
| Push:pull ratio deviation | Balance compared to the coach standard |
| Prescription format match rate | How often AI prescriptions are valid |
| Progression alignment | Whether intensity increases across cycles as in the coach plan |
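
The first of these metrics reduces to a simple set comparison over the parallel plan structures. A sketch with plain exercise-name lists standing in for the Pydantic models:

python
def exercise_overlap(coach_exercises: list[str], ai_exercises: list[str]) -> float:
    """Fraction of the coach plan's exercises that also appear in the AI plan."""
    coach = {e.lower() for e in coach_exercises}
    ai = {e.lower() for e in ai_exercises}
    return len(coach & ai) / len(coach) if coach else 0.0

coach_day = ["Bench Press", "Barbell Row", "Overhead Press"]
ai_day = ["Bench Press", "Cable Row", "Overhead Press"]
print(round(exercise_overlap(coach_day, ai_day), 2))  # 0.67: Cable Row vs Barbell Row is a miss

In practice, near-misses like "Cable Row" vs "Barbell Row" could be scored with the same embedding similarity used for exercise matching rather than counted as hard misses.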

Three Approaches (Simplest to Most Sophisticated)

Option A: DSPy Prompt Optimization (No Weight Changes)

  • DSPy's built-in optimizers try different prompt variations
  • Measures quality using the structured diff score
  • No model weights change — just better prompts

Option B: Supervised Fine-Tuning (SFT)

  • Train a smaller model on (user profile → coach plan) pairs
  • Needs 500+ plan pairs minimum
  • Result: cheaper model that produces coach-aligned outputs

Option C: DPO / Preference Training

  • Generate both AI and coach plans for same profile
  • Train model to prefer coach-like outputs
  • Doesn't need exact replication, just learns "better" direction

6. Connecting Theory to Practice

| Interview Topic | Textbook Answer | Practical Example |
|---|---|---|
| LLM Training Stages | Pretraining → Fine-tuning → Alignment | Used pretrained models and orchestrated them with DSPy |
| Prompt Engineering | Zero-shot, few-shot, CoT, DSPy | DSPy signatures (typed I/O contracts) rather than raw prompting |
| RAG | Retriever + generator pattern | Exercise library as knowledge base, vector search for retrieval |
| Embedding Models | BERT derivatives, contrastive learning | Qwen3-Embedding-0.6B, 1024-dim vectors, deployed on Modal |
| Vector Databases | pgvector, Qdrant, similarity metrics | Supabase with pgvector, cosine similarity |
| Evaluation | RAGAS, retrieval metrics | Custom LLM-as-judge with BatchEvaluator |
| Structured Output | Instructor, JSON mode | Pydantic models everywhere, DSPy signatures, PEG grammar |
| Hallucination | Grounding, RAG, validation | Prescription DSL + exercise library matching |
| Multi-model Architecture | Cost/performance tradeoffs | Grok-4 for reasoning, Claude Sonnet for generation, Haiku for extraction |

7. Key Technical Details — Quick Reference

  • Models: Grok-4 (planning), Claude Sonnet 4.5 (sessions), Claude Haiku 4.5 (parsing), GPT-4o (vision/OCR)
  • Framework: DSPy 3.0.4+ with MCP support
  • Embeddings: Qwen3-Embedding-0.6B, 1024 dimensions
  • Vector DB: Supabase pgvector
  • GPU Compute: Modal (serverless)
  • OCR: PaddleOCR + GPT-4o structured extraction
  • Prescription Grammar: PEG-based DSL (e.g., 3 x 5 @ RPE 8 / 180s)
  • Data hierarchy: WorkPlan → WorkPlanCycle → WorkSession → WorkSessionTask → WorkSessionTaskSet
  • API: FastAPI (REST + WebSocket)
  • Monorepo: Nx, with packages agents, relational, vector, plans-analyze
  • Observability: Langfuse for LLM tracing
  • Deployment: Docker + Railway (main apps), Modal (GPU services)
  • Storage: Cloudflare R2 for coach plan uploads

Key architectural decisions:

  1. Multi-model routing (cost/quality optimization per stage)
  2. DSPy over raw prompting (programmatic, testable, typed)
  3. Separate data models for coach vs AI plans (data integrity)
  4. PEG grammar for prescription validation (no garbage in)
  5. Vector-based exercise search over keyword search (semantic understanding)

8. Potential Curveball Questions

"Why not just fine-tune a model on coach plans?"

"Fine-tuning requires significant data volume and compute. We had a growing but limited corpus of coach plans. LLM orchestration with structured outputs gave us production-quality results faster, with the flexibility to swap models as better ones became available."

"How do you prevent the AI from generating dangerous exercises?"

"Three layers: (1) Injury constraint parsing identifies affected joints and movement restrictions, (2) Joint stress calculation estimates cumulative loading per session, (3) Exercise library matching ensures only known, vetted exercises are included."

"What happens if the LLM generates garbage?"

"Structured outputs via Pydantic enforce schema compliance. The prescription DSL rejects malformed formats at parse time. Exercise names get matched against the canonical library — unrecognized exercises are flagged."

"How did you handle latency for plan generation?"

"Plan generation is async — the agent server uses WebSocket to stream progress updates to the client. The multi-stage approach means each stage can complete and save independently. We used Claude Haiku for the highest-volume parsing stage specifically because it's fast and cheap."

"What's DSPy and why did you choose it over LangChain?"

"DSPy treats LLM calls as typed, composable modules rather than prompt templates. It gives you programmatic control — you define signatures (input/output contracts), and DSPy handles the prompting. Unlike LangChain's chain-of-prompts approach, DSPy is closer to traditional software engineering: typed interfaces, testable modules, and the ability to optimize prompts algorithmically."

"How did you decide which model to use for each stage?"

"Empirical testing. We tried different models at each stage and evaluated outputs. Planning needs strong reasoning and long context → Grok-4. Session generation needs good structured output → Claude Sonnet. Prescription parsing is a simpler extraction task → Claude Haiku (fast, cheap, sufficient). The settings.toml config made it easy to swap and test."