
Experimental Setup & Reproducibility Protocol

1. Hardware & Software Environment

| Component | Specification |
|---|---|
| Machine | Apple MacBook Pro (M-series) |
| Python | 3.12.x |
| NumPy | 2.4.x |
| Qiskit | 1.x (statevector simulator) |
| OS | macOS |
| Git hash | Recorded per run (see JSON output `env.git_hash`) |

All experiments call `_collect_env_metadata()` in `run_thesis_experiments.py` to record the exact Python, NumPy, and Qiskit versions into the JSON output at runtime.
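
A minimal sketch of what such metadata collection might look like (illustrative only; the actual `_collect_env_metadata()` may record more fields, e.g. the Qiskit version):

```python
import platform
import subprocess
import sys

import numpy as np

def collect_env_metadata():
    """Illustrative environment snapshot; not the project's exact function."""
    try:
        git_hash = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True,
        ).stdout.strip() or None
    except OSError:  # e.g. git not installed
        git_hash = None
    return {
        "python": sys.version.split()[0],   # interpreter version
        "numpy": np.__version__,            # library version at runtime
        "platform": platform.platform(),    # OS / machine string
        "git_hash": git_hash,               # code version, if available
    }
```

Recording versions at runtime, rather than in static documentation, guarantees the JSON output reflects the environment that actually produced the numbers.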

2. Dataset

| Property | Value |
|---|---|
| Source | T-Drive taxi dataset (Beijing, 2008–2009) |
| Preprocessing | `Tdrive_norm_traj` (normalized) |
| Cluster centers | `tdrive_clustercenter` (pre-computed) |
| Training size | 27 trajectories (90% of 30) |
| Validation size | 3 trajectories (10% of 30, fixed partition) |
| State dimensionality | 5 (IED, split_OD, min_dist, segment_length, step_index) |
| Action space | Binary: EXTEND (0) / CUT (1) |

The 90/10 split is deterministic per seed via `TrajectoryScheduler(validation_pct=0.1, seed=seed)`.
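
A seeded split of this kind can be sketched as follows (a minimal stand-in for `TrajectoryScheduler`; the function name and the exact shuffling logic are assumptions):

```python
import numpy as np

def split_trajectories(n_traj=30, validation_pct=0.1, seed=42):
    """Hypothetical sketch of a deterministic train/validation split.

    Only the seeded-permutation idea is illustrated here; the real
    logic lives in TrajectoryScheduler.
    """
    rng = np.random.default_rng(seed)              # seeded, hence reproducible
    order = rng.permutation(n_traj)                # fixed order for fixed seed
    n_val = int(round(n_traj * validation_pct))    # 3 of 30 at 10%
    return order[n_val:].tolist(), order[:n_val].tolist()

train_idx, val_idx = split_trajectories()  # 27 training and 3 validation indices
```

For a fixed seed the permutation, and hence the partition, is identical across runs.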

3. Protocol Constants (Shared Across All Models)

All agents share identical hyperparameters — the only variable is the policy network architecture.

| Parameter | Value | Rationale |
|---|---|---|
| `batch_size` | 32 | Standard DQN |
| `memory_size` | 5,000 | Fits in RAM; sufficient for the 30-trajectory regime |
| `gamma` (discount) | 0.90 | Sub-episode horizons are ~10–50 steps |
| `huber_delta` | 1.0 | Standard smooth L1 loss |
| `epsilon_start` | 1.0 | Full exploration initially |
| `epsilon_min` | 0.1 | Retain a 10% exploration floor |
| `epsilon_decay` | 0.99 | Per-episode decay |
| `target_update_freq` | 10 | Episodes between target-network syncs |
| `L_MIN` | 3 | Minimum segment length before CUT is allowed |
| `CUT_PENALTY` | 0.12 | Per-cut reward penalty |
| `EXTEND_COST` | 0.01 | Small per-extend cost to break ties |
| `COMPLEXITY_LAMBDA` | 0.03 | End-of-episode cut-rate regularizer |
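
Read together, the last three constants imply a simple cost structure per action and per episode. A hedged sketch (the sign convention and how these terms combine with the similarity reward are assumptions, not the project's exact reward function):

```python
CUT_PENALTY = 0.12        # per-cut reward penalty
EXTEND_COST = 0.01        # small per-extend cost to break ties
COMPLEXITY_LAMBDA = 0.03  # end-of-episode cut-rate regularizer

def action_cost(action):
    """Per-step cost term; action 0 = EXTEND, 1 = CUT (assumed encoding)."""
    return CUT_PENALTY if action == 1 else EXTEND_COST

def episode_regularizer(n_cuts, n_steps):
    """End-of-episode penalty proportional to the cut rate."""
    return COMPLEXITY_LAMBDA * n_cuts / n_steps
```

Because `CUT_PENALTY` is an order of magnitude larger than `EXTEND_COST`, the agent only cuts when the expected similarity gain outweighs the penalty.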

4. SPSA Optimizer Configuration

Applied identically to VQ-DQN and all SPSA classical controls.

| Parameter | Value | Source |
|---|---|---|
| `a` (learning-rate scale) | 0.12 | Spall (1998) defaults |
| `c` (perturbation scale) | 0.08 | Spall (1998) defaults |
| `A` (stability constant) | 20 | ~10% of expected iterations |
| `alpha` (LR decay rate) | 0.602 | Spall (1998) theory |
| `gamma` (perturbation decay rate) | 0.101 | Spall (1998) theory |
| momentum | 0.9 | m-SPSA variant |
| gradient clip | 1.0 | Prevents exploding updates |
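
Under these constants, one momentum-SPSA update can be sketched as below, using the standard Spall-style gain sequences; the exact ordering of momentum and clipping in the codebase is an assumption:

```python
import numpy as np

def spsa_step(theta, loss, k, velocity, rng, a=0.12, c=0.08, A=20,
              alpha=0.602, gamma=0.101, momentum=0.9, clip=1.0):
    """One momentum-SPSA update; exactly two loss evaluations per gradient."""
    a_k = a / (k + 1 + A) ** alpha                     # decaying step size
    c_k = c / (k + 1) ** gamma                         # decaying perturbation
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher direction
    g_hat = (loss(theta + c_k * delta)
             - loss(theta - c_k * delta)) / (2 * c_k * delta)
    g_hat = np.clip(g_hat, -clip, clip)                # gradient clip at 1.0
    velocity = momentum * velocity + g_hat             # m-SPSA momentum term
    return theta - a_k * velocity, velocity
```

This makes the compute-parity argument in Section 6 concrete: every SPSA gradient estimate costs exactly two forward evaluations, regardless of parameter count.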

5. Models Under Test

| Model | Kind | Params | Architecture |
|---|---|---|---|
| VQ-DQN (5q×3L) | quantum | 34 | 5-qubit, 3-layer HEA, angle encoding |
| MLP-34 (SPSA) | classical-SPSA | 34 | [4]-hidden MLP (param-matched) |
| MLP-34 (Adam) | classical-Adam | 34 | [4]-hidden MLP (param-matched) |
| Control A (linear) | classical-SPSA | 12 | Linear (no hidden layer) |
| Control B (h=64) | classical-SPSA | 450 | [64]-hidden MLP |
| Control C (h=32×32) | classical-SPSA | 1,314 | [32,32]-hidden MLP |
| Control D (Adam linear) | classical-Adam | 12 | Linear (no hidden layer) |
| Control E (Adam h=64) | classical-Adam | 450 | [64]-hidden MLP |
| Control F (Adam h=32×32) | classical-Adam | 1,314 | [32,32]-hidden MLP |

6. Evaluation Protocol

Training Definition

We define one epoch as a full pass over the scheduled training trajectories, with each trajectory generating one episode under ε-greedy exploration. With 27 training trajectories and 2 epochs, each model processes exactly 54 training episodes per seed.

Compute-Budget Parity

All SPSA-trained models receive identical:

  • Environment steps: same trajectories, same step counts
  • SPSA iterations: one update per `batch_size` replay samples
  • Forward evaluations: SPSA uses 2 evaluations per gradient estimate (θ+δ and θ−δ)

Per-step forward-pass cost differs between the VQC and the MLPs, but this does not affect parity: we make no runtime or speedup claim.

Validation

  • Frequency: After every training epoch
  • Policy: Greedy (ε = 0) — `agent.act(obs, greedy=True)`
  • Metrics: ValCR (= OD / basesim), SSE, CUT%, segment count
  • Multi-seed: 5 seeds (42, 123, 7, 99, 2025), report mean ± std

Best-Epoch Selection

  • Criterion: Lowest ValCR across epochs (argmin over epoch index)
  • Tie-breaking: Earlier epoch wins (first occurrence of minimum)
  • Scope: Per-seed; aggregated across seeds via mean ± std of per-seed bests
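
Because `np.argmin` returns the first index attaining the minimum, the criterion and the tie-break reduce to a one-liner:

```python
import numpy as np

def best_epoch(val_cr_per_epoch):
    """Lowest ValCR wins; ties resolve to the earlier epoch, since
    np.argmin returns the first occurrence of the minimum."""
    return int(np.argmin(val_cr_per_epoch))

best_epoch([0.91, 0.87, 0.87])  # → 1 (epochs 1 and 2 tie; the earlier wins)
```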

Significance Testing

  • Mann-Whitney U test (nonparametric, VQ-DQN vs each control)
  • Cohen's d effect size with interpretation labels
  • Bootstrap 95% confidence interval on mean difference (10,000 resamples)
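
A sketch of the three tests applied to two per-seed score samples, assuming SciPy's `mannwhitneyu` and a pooled-SD Cohen's d; the exact variants used in `run_significance_test.py` may differ:

```python
import numpy as np
from scipy import stats

def compare(a, b, n_boot=10_000, seed=0):
    """Mann-Whitney U p-value, Cohen's d (pooled SD), and a bootstrap
    95% CI on the mean difference between samples a and b."""
    _, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    pooled_sd = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2.0)
    d = (np.mean(a) - np.mean(b)) / pooled_sd
    rng = np.random.default_rng(seed)
    diffs = [rng.choice(a, size=len(a)).mean() - rng.choice(b, size=len(b)).mean()
             for _ in range(n_boot)]                   # resample with replacement
    lo, hi = np.percentile(diffs, [2.5, 97.5])         # percentile bootstrap CI
    return p, d, (lo, hi)
```

With only 5 seeds per model, the nonparametric U test and the bootstrap CI are more defensible than a t-test, which would rest on a normality assumption the sample cannot support.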

7. Determinism & Reproducibility Guarantees

Seeded RNGs

Each experiment seeds the following at run start:

  • `np.random.seed(seed)` — NumPy global RNG
  • `random.seed(seed)` — Python stdlib RNG
  • `ReplayBuffer(seed=seed)` — replay sampling
  • `TrajectoryScheduler(seed=seed)` — train/val split and epoch ordering

Qiskit Determinism

Statevector simulation (`shots=0`) is fully deterministic — no sampling is involved. Shot-based simulations (E3) use Qiskit's internal RNG, seeded per circuit execution.

Run Identification

Each JSON output uniquely identifies a run via:

  • `args.seed` / `args.seeds` — random seed(s)
  • `args.amount` — dataset size
  • `args.epochs` — training epochs
  • `args.experiments` — experiment IDs
  • `env.git_hash` — code version
  • `env.timestamp` — wall-clock start time
  • Per-model: `model`, `kind`, `noise`, `shots`, `params`

What Constitutes a "Run"

A run is uniquely identified by the tuple `(seed, dataset_amount, epochs, model_id, optimizer_kind, shots, noise_model, git_hash)`. Two runs with the same tuple produce identical results under statevector simulation.
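
A hypothetical illustration of the identity tuple (the `RunID` type is not part of the codebase; field names mirror the tuple above):

```python
from collections import namedtuple

RunID = namedtuple("RunID", [
    "seed", "dataset_amount", "epochs", "model_id",
    "optimizer_kind", "shots", "noise_model", "git_hash",
])

# "abc1234" is a placeholder hash; shots=0 denotes statevector simulation.
run = RunID(42, 30, 2, "VQ-DQN", "spsa", 0, None, "abc1234")
```

Two runs whose `RunID` tuples compare equal produce identical results under statevector simulation.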

8. Reproducing Results

```bash
# Clone and setup
git clone <repo>
cd q_rlstc
python -m venv .venv && source .venv/bin/activate
pip install -e .

# Full multi-seed E1 experiment (~90 min)
python experiments/run_thesis_experiments.py \
    --experiments E1 \
    --amount 30 --epochs 2 \
    --seeds 42,123,7,99,2025 \
    --output-dir results/thesis_multiseed

# Significance tests
python experiments/run_significance_test.py \
    results/thesis_multiseed/thesis_results_*.json

# Robustness sweeps (shots, noise)
python experiments/run_thesis_experiments.py \
    --experiments E2,E3 \
    --amount 30 --epochs 2 \
    --seeds 42,123,7,99,2025 \
    --output-dir results/thesis_robustness

# Entanglement ablation
python experiments/run_thesis_experiments.py \
    --experiments AB1 \
    --amount 30 --epochs 2 \
    --seeds 42,123,7,99,2025 \
    --output-dir results/thesis_ablation
```

9. Metrics Glossary

| Metric | Formula | Interpretation |
|---|---|---|
| OD | mean(IED(segment, center)) | Average segment-to-center distance |
| CR (ValCR) | OD / basesim | Normalized quality (lower = better) |
| SSE | Σᵢ Σ_{s ∈ Cᵢ} IED(s, centerᵢ)² | Within-cluster compactness |
| CUT% | cuts / (cuts + extends) × 100 | Segmentation aggressiveness |
| Episodes-to-best | # episodes before best-CR epoch | Sample efficiency (coarse) |
| Actions-to-best | # RL steps before best-CR epoch | Sample efficiency (granular) |
| Q-margin | mean(Q_extend) − mean(Q_cut) | Policy preference direction |
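
OD, CR, and CUT% can be computed mechanically from per-segment IED distances and action counts; a sketch (input names are illustrative):

```python
import numpy as np

def summary_metrics(segment_distances, cuts, extends, basesim):
    """OD, CR (ValCR), and CUT% as defined in the glossary above."""
    od = float(np.mean(segment_distances))     # mean segment-to-center IED
    cr = od / basesim                          # normalized quality, lower = better
    cut_pct = 100.0 * cuts / (cuts + extends)  # segmentation aggressiveness
    return od, cr, cut_pct
```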