
Experimental Design


Cross-Comparability Requirements

If the quantum and classical implementations differ in anything beyond the function approximator, any observed differences cannot be attributed to the quantum component.

What Must Be Identical

| Component | Classical Control | Quantum Experiment | Same? |
|---|---|---|---|
| State representation | 5D vector (from `rlstc_mdp.py`) | 5D vector | ✅ |
| Action space | {extend, cut} | {extend, cut} | ✅ |
| Reward function | OD delta + CUT_PENALTY + EXTEND_COST | Same | ✅ |
| Replay buffer | Size 5,000, uniform | Size 5,000, uniform | ✅ |
| Exploration | ε-greedy, same schedule | ε-greedy, same schedule | ✅ |
| Target network | Double DQN, same freq | Double DQN, same freq | ✅ |
| Function approximator | MLP (various sizes) | VQ-DQN (5q, 3L HEA) | ❌ Variable |
| Optimizer | SPSA (same hyperparams) | SPSA (same hyperparams) | ✅ |
| Loss | Huber (δ=1.0) | Huber (δ=1.0) | ✅ |
| Dataset | Same seed, same split | Same seed, same split | ✅ |
| Distance metric | IED (`rlstc_trajdistance.py`) | IED | ✅ |
| Q-value clamping | ±10.0 | ±10.0 | ✅ |
| TD target clamping | ±10.0 | ±10.0 | ✅ |
| L_MIN constraint | 3 | 3 | ✅ |
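One low-cost way to enforce the parity requirements above is a programmatic check before training starts. The sketch below is illustrative, not the repository's actual config API: the dictionary keys and the `assert_comparable` helper are hypothetical names chosen to mirror the table.

```python
# Hypothetical parity check: every shared hyperparameter from the
# cross-comparability table, expressed as a config dict. Only the
# function approximator is allowed to differ between the two runs.
CLASSICAL_CFG = {
    "state_dim": 5,
    "actions": ("extend", "cut"),
    "buffer_size": 5000,
    "buffer_sampling": "uniform",
    "exploration": "eps_greedy",
    "double_dqn": True,
    "optimizer": "SPSA",
    "loss": ("huber", 1.0),
    "q_clamp": 10.0,
    "td_clamp": 10.0,
    "l_min": 3,
    "seed": 42,
}
QUANTUM_CFG = dict(CLASSICAL_CFG)  # identical by construction

def assert_comparable(a: dict, b: dict) -> None:
    """Fail loudly if any shared hyperparameter differs between configs."""
    mismatched = sorted(k for k in a if a[k] != b.get(k))
    assert not mismatched, f"configs differ on: {mismatched}"

assert_comparable(CLASSICAL_CFG, QUANTUM_CFG)  # passes silently
```

Running this check at experiment start-up turns a silent comparability violation into an immediate failure.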

Common Pitfalls

| Pitfall | Why It Breaks Comparability |
|---|---|
| Different optimizers (Adam vs SPSA) | Optimizer effects dominate approximator effects |
| Different batch sizes | Affects gradient variance independently |
| Different shot counts | Added noise is an uncontrolled variable |
| Different random seeds | Trajectory order and exploration path differ |
| Quantum with noise, classical without | Measuring noise tolerance, not approximation quality |
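The seed pitfall is the easiest to close mechanically: seed every random-number source from one value at the start of each run. The helper below is a minimal sketch, assuming Python's `random` and NumPy are the only RNG sources in play; a run that also uses a quantum simulator would need its backend seeded as well.

```python
import random

import numpy as np

def seed_everything(seed: int) -> None:
    """Seed all RNG sources so trajectory order and the ε-greedy
    exploration path are identical across classical and quantum runs."""
    random.seed(seed)
    np.random.seed(seed)

seed_everything(42)
```

Call this once per run, before the replay buffer or environment is constructed, so both agents see the same trajectory stream.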

Classical Baselines

| Control | Architecture | Params | Purpose |
|---|---|---|---|
| A: Linear | 5→2 (no hidden layers) | 12 | Is the problem trivially linear? |
| B: Medium MLP | 5→64→2 | 514 | Moderate capacity baseline |
| C: Deep MLP | 5→32→32→2 | 1,314 | High capacity, classical ceiling |
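The parameter counts in the table follow from the standard fully connected formula (weights plus biases per layer), which can be verified directly:

```python
def mlp_param_count(layer_sizes):
    """Trainable parameters of a fully connected MLP with biases:
    sum over layers of (n_in * n_out + n_out)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(mlp_param_count([5, 2]))          # Control A: 12
print(mlp_param_count([5, 64, 2]))      # Control B: 514
print(mlp_param_count([5, 32, 32, 2]))  # Control C: 1314
```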

Critical: All controls use SPSA (not SGD/Adam), identical to the quantum agent. This isolates the function approximator as the only variable.
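For readers unfamiliar with SPSA: it estimates the full gradient from only two loss evaluations per step, using a random ±1 (Rademacher) perturbation of all parameters simultaneously, which is why it works for quantum circuits where backpropagation is unavailable. The sketch below shows the core update on a toy quadratic; the function name and the fixed gains `a` and `c` are illustrative, not the hyperparameters used in the thesis experiments.

```python
import numpy as np

def spsa_step(loss_fn, theta, rng, a=0.05, c=0.1):
    """One SPSA update: two loss evaluations approximate the gradient
    of loss_fn at theta, regardless of the parameter dimension."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    g_hat = (loss_fn(theta + c * delta)
             - loss_fn(theta - c * delta)) / (2.0 * c * delta)
    return theta - a * g_hat

# Toy usage: minimize ||theta||^2 from an arbitrary start point.
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0])
for _ in range(300):
    theta = spsa_step(lambda t: float(np.sum(t ** 2)), theta, rng)
```

Because both classical and quantum agents use the same SPSA schedule, any performance gap cannot be explained by one agent having access to exact gradients.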

Primary Metrics

| Metric | Measures | Report As |
|---|---|---|
| ValCR (raw) | Mean segment-to-center IED / base similarity | Table + Pareto |
| nValCR (per-point) | Mean of (IED/segment_length) / base similarity | D1 diagnostic |
| wValCR (length-weighted) | Total_IED / total_points / base similarity | D1 diagnostic |
| CUT% | Fraction of actions that are CUT | Per-epoch |
| #Segments | Total segments produced | Per-epoch |
| Q-margin | Q(extend) − Q(cut) | D2 diagnostic |
| Parameter count | Total trainable parameters | One-time |
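The three ValCR variants differ only in how per-segment IED values are aggregated. The sketch below is illustrative, assuming per-segment IEDs and lengths are already computed; the function and argument names are not the repository's API.

```python
def valcr_variants(seg_ieds, seg_lens, base_similarity):
    """Compute the three ValCR variants from per-segment IED values.

    raw ValCR : mean per-segment IED, normalized by base similarity
    nValCR    : mean of per-point IED (IED / segment length), normalized
    wValCR    : total IED over total points, normalized
    """
    raw = (sum(seg_ieds) / len(seg_ieds)) / base_similarity
    n = (sum(i / l for i, l in zip(seg_ieds, seg_lens))
         / len(seg_ieds)) / base_similarity
    w = (sum(seg_ieds) / sum(seg_lens)) / base_similarity
    return raw, n, w
```

Note how raw ValCR rewards short segments regardless of quality, while nValCR and wValCR divide out segment length; this is exactly the degeneracy D1 probes.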

Metric Pathology Awareness

Raw ValCR is structurally degenerate: IED grows with segment length, so cutting always lowers the metric. This is diagnosed by D1 and mitigated via budget-constrained reporting (Pareto table).

Experiment Matrix

Diagnostic Experiments (D1–D5)

| ID | Name | Variable | Measures |
|---|---|---|---|
| D1 | ValCR vs CUT% | Random CUT probability (0%–100%) | Metric degeneracy; reports raw, nValCR, wValCR |
| D2 | Q-margin | Per-epoch Q(ext)−Q(cut) | Policy bias formation |
| D3 | Training action dist | Per-epoch CUT% in training | Action distribution drift |
| D4 | Policy basin test | Forced all-cut / all-extend / alternating | Basin structure |
| D5 | Buffer histogram | Buffer CUT% vs on-policy CUT% | Replay distribution drift |
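A D1-style random baseline needs only a policy that cuts with fixed probability while respecting the L_MIN=3 constraint from the comparability table. The sketch below is a hypothetical reconstruction of such a baseline, not the repository's implementation:

```python
import random

def random_cut_policy(p_cut: float, n_points: int, l_min: int = 3, seed: int = 0):
    """Return cut indices for a trajectory of n_points, cutting each
    eligible step with probability p_cut (eligible = current segment
    already has at least l_min points)."""
    rng = random.Random(seed)
    boundaries, seg_len = [], 0
    for i in range(n_points):
        seg_len += 1
        if seg_len >= l_min and rng.random() < p_cut:
            boundaries.append(i)
            seg_len = 0
    return boundaries
```

Sweeping `p_cut` from 0 to 1 and evaluating ValCR at each point traces out the random baseline curve that trained agents are overlaid on in the Pareto analysis.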

Core Benchmarks (E1–E6)

| ID | Name | Variable | Measures |
|---|---|---|---|
| E1 | Core Quantum Utility | VQ-DQN vs Controls A/B/C | Parameter efficiency |
| E2 | NISQ Viability | Eagle / Heron noise models | Noise degradation |
| E3 | Shot Sensitivity | 128 / 512 / 2048 shots | Sampling noise floor |
| E4 | Drift Resilience | Temporal distribution shift | Robustness |
| E5 | Low-Data | 10% / 25% / 50% data fractions | Sample efficiency |
| E6 | Version Progression | Circuit architecture variants | Design ablation |

Scalability (S1)

| ID | Name | Variable | Measures |
|---|---|---|---|
| S1 | Inference timing | 250–1000 trajectories | Wall-clock overhead |

Multi-Seed Protocol

All E-series experiments support the `--seeds` flag for multi-seed runs:

python experiments/run_thesis_experiments.py --experiments E1 --amount 50 --epochs 3 \
    --seeds 42,123,7,99,2025

Reports mean ± std across seeds for ValCR, CUT%, Q-margin, and #segments.
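The per-seed results can be collapsed into the reported mean ± std with a few lines of standard-library code. This is a minimal sketch of the aggregation step; the `aggregate` helper and the dict shape are illustrative, not the experiment runner's actual output format.

```python
import statistics

def aggregate(metric_by_seed: dict) -> str:
    """Format one metric's per-seed values as 'mean ± sample std'."""
    vals = list(metric_by_seed.values())
    return f"{statistics.mean(vals):.3f} ± {statistics.stdev(vals):.3f}"

print(aggregate({42: 0.81, 123: 0.79, 7: 0.84}))  # → 0.813 ± 0.025
```

Note that `statistics.stdev` is the sample (n−1) standard deviation, the appropriate choice when the seeds are a sample of possible runs.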

Analysis Plan

Pareto Frontier Analysis

  • Plot: agents overlaid on D1 random baseline curve (ValCR vs CUT%)
  • Table: best ValCR at CUT ≤ {5%, 10%, 20%, 30%, 40%, 50%}
  • Purpose: eliminates degenerate solutions from headline results
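The budget-constrained table can be produced by a simple filter-then-optimize pass over all runs. The sketch below assumes lower ValCR is better (consistent with the degeneracy note: cutting lowers the metric) and that each run is summarized as a (CUT fraction, ValCR) pair; both assumptions are illustrative, not the analysis script's actual interface.

```python
def best_under_budget(runs, budgets=(0.05, 0.10, 0.20, 0.30, 0.40, 0.50)):
    """For each CUT% budget, return the best (lowest) ValCR among runs
    whose CUT fraction does not exceed the budget; None if no run qualifies."""
    table = {}
    for b in budgets:
        eligible = [valcr for cut, valcr in runs if cut <= b]
        table[b] = min(eligible) if eligible else None
    return table
```

Because a run only appears in budgets it satisfies, a degenerate high-CUT% solution cannot win a low-budget column, which is what removes it from the headline results.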

Q-Learning Dynamics

  • Q-value evolution pre/post clamping
  • Q-margin trends across epochs (VQ-DQN vs classical)
  • Replay buffer drift quantification

NISQ Sensitivity

  • ValCR vs shot count curves
  • Noise model degradation (Eagle, Heron)
  • Training stability under finite sampling
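The shot counts in E3 probe a predictable noise floor: the standard error of an expectation value estimated from `shots` measurements scales as 1/√shots, so each 4× increase in shots halves the sampling noise. A quick sanity check of the scale at each E3 setting:

```python
import math

# Relative sampling-noise scale at each E3 shot count.
# 128 -> 512 -> 2048 is two successive 4x increases,
# so the noise floor halves twice across the sweep.
for shots in (128, 512, 2048):
    print(f"{shots:>5} shots -> relative noise ~ {1 / math.sqrt(shots):.4f}")
```

If ValCR degradation tracks this curve, the bottleneck is finite sampling rather than the circuit itself.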

Next: Debugging Guide →