
Experimental Design


Cross-Comparability Requirements

If the quantum and classical implementations differ in anything beyond the function approximator, any observed differences cannot be attributed to the quantum component.

What Must Be Identical

| Component | Classical Control | Quantum Experiment | Same? |
|---|---|---|---|
| State representation | 5D vector (from `rlstc_mdp.py`) | 5D vector | ✅ |
| Action space | {extend, cut} | {extend, cut} | ✅ |
| Reward function | OD delta + CUT_PENALTY + EXTEND_COST | Same | ✅ |
| Replay buffer | Size 5,000, uniform | Size 5,000, uniform | ✅ |
| Exploration | ε-greedy, same schedule | ε-greedy, same schedule | ✅ |
| Target network | Double DQN, same freq | Double DQN, same freq | ✅ |
| Function approximator | MLP (various sizes) | VQ-DQN (5q, 3L HEA) | ❌ Variable |
| Optimizer | SPSA (same hyperparams) | SPSA (same hyperparams) | ✅ |
| Loss | Huber (δ=1.0) | Huber (δ=1.0) | ✅ |
| Dataset | Same seed, same split | Same seed, same split | ✅ |
| Distance metric | IED (`rlstc_trajdistance.py`) | IED | ✅ |
| Q-value clamping | ±10.0 | ±10.0 | ✅ |
| TD target clamping | ±10.0 | ±10.0 | ✅ |
| L_MIN constraint | 3 | 3 | ✅ |
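One low-cost way to enforce the parity requirements above is a programmatic check before training starts. The sketch below is illustrative, not the repository's actual config API: the dictionary keys and the `assert_comparable` helper are hypothetical names chosen to mirror the table.

```python
# Hypothetical parity check: every shared hyperparameter from the
# cross-comparability table, expressed as a config dict. Only the
# function approximator is allowed to differ between the two runs.
CLASSICAL_CFG = {
    "state_dim": 5,
    "actions": ("extend", "cut"),
    "buffer_size": 5000,
    "buffer_sampling": "uniform",
    "exploration": "eps_greedy",
    "double_dqn": True,
    "optimizer": "SPSA",
    "loss": ("huber", 1.0),
    "q_clamp": 10.0,
    "td_clamp": 10.0,
    "l_min": 3,
    "seed": 42,
}
QUANTUM_CFG = dict(CLASSICAL_CFG)  # identical by construction

def assert_comparable(a: dict, b: dict) -> None:
    """Fail loudly if any shared hyperparameter differs between configs."""
    mismatched = sorted(k for k in a if a[k] != b.get(k))
    assert not mismatched, f"configs differ on: {mismatched}"

assert_comparable(CLASSICAL_CFG, QUANTUM_CFG)  # passes silently
```

Running this check at experiment start-up turns a silent comparability violation into an immediate failure.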

Common Pitfalls

| Pitfall | Why It Breaks Comparability |
|---|---|
| Different optimizers (Adam vs SPSA) | Optimizer effects dominate approximator effects |
| Different batch sizes | Affects gradient variance independently |
| Different shot counts | Added noise is an uncontrolled variable |
| Different random seeds | Trajectory order and exploration path differ |
| Quantum with noise, classical without | Measuring noise tolerance, not approximation quality |
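The seed pitfall is the easiest to close mechanically: seed every random-number source from one value at the start of each run. The helper below is a minimal sketch, assuming Python's `random` and NumPy are the only RNG sources in play; a run that also uses a quantum simulator would need its backend seeded as well.

```python
import random

import numpy as np

def seed_everything(seed: int) -> None:
    """Seed all RNG sources so trajectory order and the ε-greedy
    exploration path are identical across classical and quantum runs."""
    random.seed(seed)
    np.random.seed(seed)

seed_everything(42)
```

Call this once per run, before the replay buffer or environment is constructed, so both agents see the same trajectory stream.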

Classical Baselines

| Control | Architecture | Params | Purpose |
|---|---|---|---|
| A: Linear | 5→2 (no hidden layers) | 12 | Is the problem trivially linear? |
| B: Medium MLP | 5→64→2 | 514 | Moderate capacity baseline |
| C: Deep MLP | 5→32→32→2 | 1,314 | High capacity, classical ceiling |
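The parameter counts in the table follow from the standard fully connected formula (weights plus biases per layer), which can be verified directly:

```python
def mlp_param_count(layer_sizes):
    """Trainable parameters of a fully connected MLP with biases:
    sum over layers of (n_in * n_out + n_out)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(mlp_param_count([5, 2]))          # Control A: 12
print(mlp_param_count([5, 64, 2]))      # Control B: 514
print(mlp_param_count([5, 32, 32, 2]))  # Control C: 1314
```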

Critical: All controls use SPSA (not SGD/Adam), identical to the quantum agent. This isolates the function approximator as the only variable.
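For readers unfamiliar with SPSA: it estimates the full gradient from only two loss evaluations per step, using a random ±1 (Rademacher) perturbation of all parameters simultaneously, which is why it works for quantum circuits where backpropagation is unavailable. The sketch below shows the core update on a toy quadratic; the function name and the fixed gains `a` and `c` are illustrative, not the hyperparameters used in the thesis experiments.

```python
import numpy as np

def spsa_step(loss_fn, theta, rng, a=0.05, c=0.1):
    """One SPSA update: two loss evaluations approximate the gradient
    of loss_fn at theta, regardless of the parameter dimension."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    g_hat = (loss_fn(theta + c * delta)
             - loss_fn(theta - c * delta)) / (2.0 * c * delta)
    return theta - a * g_hat

# Toy usage: minimize ||theta||^2 from an arbitrary start point.
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0])
for _ in range(300):
    theta = spsa_step(lambda t: float(np.sum(t ** 2)), theta, rng)
```

Because both classical and quantum agents use the same SPSA schedule, any performance gap cannot be explained by one agent having access to exact gradients.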

Primary Metrics

| Metric | Measures | Report As |
|---|---|---|
| ValCR (raw) | Mean segment-to-center IED / base similarity | Table + Pareto |
| nValCR (per-point) | Mean of (IED/segment_length) / base similarity | D1 diagnostic |
| wValCR (length-weighted) | Total_IED / total_points / base similarity | D1 diagnostic |
| CUT% | Fraction of actions that are CUT | Per-epoch |
| #Segments | Total segments produced | Per-epoch |
| Q-margin | Q(extend) − Q(cut) | D2 diagnostic |
| Parameter count | Total trainable parameters | One-time |
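The three ValCR variants differ only in how per-segment IED values are aggregated. The sketch below is illustrative, assuming per-segment IEDs and lengths are already computed; the function and argument names are not the repository's API.

```python
def valcr_variants(seg_ieds, seg_lens, base_similarity):
    """Compute the three ValCR variants from per-segment IED values.

    raw ValCR : mean per-segment IED, normalized by base similarity
    nValCR    : mean of per-point IED (IED / segment length), normalized
    wValCR    : total IED over total points, normalized
    """
    raw = (sum(seg_ieds) / len(seg_ieds)) / base_similarity
    n = (sum(i / l for i, l in zip(seg_ieds, seg_lens))
         / len(seg_ieds)) / base_similarity
    w = (sum(seg_ieds) / sum(seg_lens)) / base_similarity
    return raw, n, w
```

Note how raw ValCR rewards short segments regardless of quality, while nValCR and wValCR divide out segment length; this is exactly the degeneracy D1 probes.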

Metric Pathology Awareness

Raw ValCR is structurally degenerate: IED grows with segment length, so cutting always lowers the metric. This is diagnosed by D1 and mitigated via budget-constrained reporting (Pareto table).

Experiment Matrix

Diagnostic Experiments (D1–D5)

| ID | Name | Variable | Measures |
|---|---|---|---|
| D1 | ValCR vs CUT% | Random CUT probability (0%–100%) | Metric degeneracy; reports raw, nValCR, wValCR |
| D2 | Q-margin | Per-epoch Q(ext)−Q(cut) | Policy bias formation |
| D3 | Training action dist | Per-epoch CUT% in training | Action distribution drift |
| D4 | Policy basin test | Forced all-cut / all-extend / alternating | Basin structure |
| D5 | Buffer histogram | Buffer CUT% vs on-policy CUT% | Replay distribution drift |
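A D1-style random baseline needs only a policy that cuts with fixed probability while respecting the L_MIN=3 constraint from the comparability table. The sketch below is a hypothetical reconstruction of such a baseline, not the repository's implementation:

```python
import random

def random_cut_policy(p_cut: float, n_points: int, l_min: int = 3, seed: int = 0):
    """Return cut indices for a trajectory of n_points, cutting each
    eligible step with probability p_cut (eligible = current segment
    already has at least l_min points)."""
    rng = random.Random(seed)
    boundaries, seg_len = [], 0
    for i in range(n_points):
        seg_len += 1
        if seg_len >= l_min and rng.random() < p_cut:
            boundaries.append(i)
            seg_len = 0
    return boundaries
```

Sweeping `p_cut` from 0 to 1 and evaluating ValCR at each point traces out the random baseline curve that trained agents are overlaid on in the Pareto analysis.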

Core Benchmarks (E1–E6)

| ID | Name | Variable | Measures |
|---|---|---|---|
| E1 | Core Quantum Utility | VQ-DQN vs Controls A/B/C | Parameter efficiency |
| E2 | NISQ Viability | Eagle / Heron noise models | Noise degradation |
| E3 | Shot Sensitivity | 128 / 512 / 2048 shots | Sampling noise floor |
| E4 | Drift Resilience | Temporal distribution shift | Robustness |
| E5 | Low-Data | 10% / 25% / 50% data fractions | Sample efficiency |
| E6 | Version Progression | Circuit architecture variants | Design ablation |

Scalability (S1)

| ID | Name | Variable | Measures |
|---|---|---|---|
| S1 | Inference timing | 250–1000 trajectories | Wall-clock overhead |

Multi-Seed Protocol

All E-series experiments support the `--seeds` flag for multi-seed runs:

python experiments/run_thesis_experiments.py --experiments E1 --amount 50 --epochs 3 \
    --seeds 42,123,7,99,2025

Reports mean ± std across seeds for ValCR, CUT%, Q-margin, and #segments.
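The per-seed results can be collapsed into the reported mean ± std with a few lines of standard-library code. This is a minimal sketch of the aggregation step; the `aggregate` helper and the dict shape are illustrative, not the experiment runner's actual output format.

```python
import statistics

def aggregate(metric_by_seed: dict) -> str:
    """Format one metric's per-seed values as 'mean ± sample std'."""
    vals = list(metric_by_seed.values())
    return f"{statistics.mean(vals):.3f} ± {statistics.stdev(vals):.3f}"

print(aggregate({42: 0.81, 123: 0.79, 7: 0.84}))  # → 0.813 ± 0.025
```

Note that `statistics.stdev` is the sample (n−1) standard deviation, the appropriate choice when the seeds are a sample of possible runs.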

Analysis Plan

Pareto Frontier Analysis

  • Plot: agents overlaid on D1 random baseline curve (ValCR vs CUT%)
  • Table: best ValCR at CUT ≤ {5%, 10%, 20%, 30%, 40%, 50%}
  • Purpose: eliminates degenerate solutions from headline results
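The budget-constrained table can be produced by a simple filter-then-optimize pass over all runs. The sketch below assumes lower ValCR is better (consistent with the degeneracy note: cutting lowers the metric) and that each run is summarized as a (CUT fraction, ValCR) pair; both assumptions are illustrative, not the analysis script's actual interface.

```python
def best_under_budget(runs, budgets=(0.05, 0.10, 0.20, 0.30, 0.40, 0.50)):
    """For each CUT% budget, return the best (lowest) ValCR among runs
    whose CUT fraction does not exceed the budget; None if no run qualifies."""
    table = {}
    for b in budgets:
        eligible = [valcr for cut, valcr in runs if cut <= b]
        table[b] = min(eligible) if eligible else None
    return table
```

Because a run only appears in budgets it satisfies, a degenerate high-CUT% solution cannot win a low-budget column, which is what removes it from the headline results.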

Q-Learning Dynamics

  • Q-value evolution pre/post clamping
  • Q-margin trends across epochs (VQ-DQN vs classical)
  • Replay buffer drift quantification

NISQ Sensitivity

  • ValCR vs shot count curves
  • Noise model degradation (Eagle, Heron)
  • Training stability under finite sampling
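The shot counts in E3 probe a predictable noise floor: the standard error of an expectation value estimated from `shots` measurements scales as 1/√shots, so each 4× increase in shots halves the sampling noise. A quick sanity check of the scale at each E3 setting:

```python
import math

# Relative sampling-noise scale at each E3 shot count.
# 128 -> 512 -> 2048 is two successive 4x increases,
# so the noise floor halves twice across the sweep.
for shots in (128, 512, 2048):
    print(f"{shots:>5} shots -> relative noise ~ {1 / math.sqrt(shots):.4f}")
```

If ValCR degradation tracks this curve, the bottleneck is finite sampling rather than the circuit itself.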

Next: Debugging Guide →