Q-RLSTC Technical Design Deep Dive¶
How exactly Q-RLSTC differs from classical RLSTC, how Versions A/B/C/D diverge, and how the VQ-DQN design compares to dina-quantum's approach.
1. Classical RLSTC → Q-RLSTC: What Changed¶
| Component | Classical RLSTC | Q-RLSTC |
|---|---|---|
| Policy Network | Classical MLP | Variational Quantum Circuit (VQ-DQN) |
| Parameters | Thousands (neural net weights) | 20–56 (rotation angles) |
| Model Size | ~100 KB+ | 80–224 bytes |
| Optimizer | Backpropagation + Adam | SPSA (gradient-free, 2 function evals) |
| Exploration | ε-greedy | ε-greedy (A/B) or entropy-regularised SAC (C) |
| State Encoding | Raw feature vector | Angle encoding: RY(2·arctan(x_i)) |
| Hypothesis Space | Unconstrained function class | Constrained by circuit structure and encoding |
| Hardware Target | GPU/CPU | NISQ quantum processors (5–8 qubits) |
Why Quantum?¶
The key insight is not that quantum is faster — it isn't for this problem size. The advantages are:
- Extreme parameter efficiency: 20 parameters vs thousands means the model can be serialised in 80 bytes, enabling federated learning on mobile devices
- Possible implicit regularisation: The bounded output range and trigonometric activation functions (from angle encoding) may provide implicit regularization
- Entanglement as feature interaction: CNOT gates create correlations between features that would require many classical layers to represent
- NISQ hardware compatibility: The shallow circuits (depth 2–4) run within coherence times of current IBM/IonQ hardware
2. The VQ-DQN Circuit Architecture¶
Angle Encoding Layer¶
Each feature x_i from the state vector is encoded as a rotation angle on qubit i:
This maps ℝ → (−π, π) smoothly, avoiding the saturation problems of linear or sigmoid encodings. Specifically:
| Encoding | Formula | Range | Issue |
|---|---|---|---|
| Arctan (Q-RLSTC) | 2·arctan(x) | (−π, π) | None — smooth, bounded |
| Linear | π·x | Assumes x ∈ [−1,1] | Saturates outside range |
| Sigmoid | π·(2σ(x)−1) | (−π, π) | Flat gradients at extremes |
The arctan encoding was chosen because GPS-derived features (segment lengths, variances, curvatures) have unbounded ranges, and arctan handles arbitrary magnitudes gracefully.
Variational Ansatz Layers¶
Each variational layer applies:
For each qubit i:
RY(θ_{layer,i,0}) — Y-axis rotation (amplitude mixing)
RZ(θ_{layer,i,1}) — Z-axis rotation (phase)
followed by an entanglement layer:
This is a Hardware-Efficient Ansatz (HEA) — designed to minimise circuit depth on near-term quantum hardware where long gate chains cause decoherence.
[!NOTE] The RY-RZ pairing (2 rotations per qubit per layer) was chosen over the more common RX-RY-RZ (3 rotations) to keep parameter count low. With 5 qubits × 2 layers × 2 rotations = 20 parameters total.
Measurement & Q-Value Readout¶
Version A (Standard Readout):
Q(s, EXTEND) = ⟨Z₀⟩ (Pauli-Z expectation on qubit 0)
Q(s, CUT) = ⟨Z₁⟩ (Pauli-Z expectation on qubit 1)
Each expectation value is computed from measurement counts:
wherecount_0 = number of times qubit i measured |0⟩, count_1 = number of times measured |1⟩.
Version B (Multi-Observable Readout):
Q(s, EXTEND) = w₀·⟨Z₀⟩ + w₁·⟨Z₁Z₂⟩ + w₂·⟨Z₂Z₃⟩ + bias₀
Q(s, CUT) = w₃·⟨Z₁⟩ + w₄·⟨Z₃Z₄⟩ + w₅·⟨Z₅Z₆⟩ + bias₁
The parity terms ⟨Z_aZ_b⟩ capture entanglement-derived correlations — information that single-qubit measurements cannot access. This is what makes Version B "quantum enhanced": it reads out richer information from the same circuit execution.
3. Version A vs B vs C vs D¶
Architecture Comparison¶
| Feature | Version A | Version B | Version C | Version D |
|---|---|---|---|---|
| Label | Classical Parity (5q) | Quantum Enhanced (8q) | Next-Gen Q-RNN (6q) | VLDB Aligned (5q) |
| Qubits | 5 | 8 | 6 (5 data + 1 shadow) | 5 |
| State Dim | 5D | 8D | 5D + memory signal | 5D (VLDB exact) |
| Parameters | 20 | 32 | ~24 (EQC is more efficient) | 30 (5q × 3 layers) |
| Model Size | 80 bytes | 128 bytes | ~96 bytes | 120 bytes |
| State Features | OD, baseline, lengths | OD, baseline, angles, curvature, density | OD, baseline, lengths + shadow | OD_cut, OD_extend, baseline, start_hilbert, end_hilbert |
| Ansatz | HEA (RY-RZ + linear CNOT) | HEA (RY-RZ + linear CNOT) | EQC (RZ + circular CNOT) | HEA (RY-RZ + linear CNOT) |
| Readout | ⟨Z₀⟩, ⟨Z₁⟩ | Linear combo of ⟨ZZ⟩ parity | Soft π(a|s) via softmax | ⟨Z₀⟩, ⟨Z₁⟩, ⟨Z₂⟩ |
| Agent | ε-greedy DQN | ε-greedy DQN | SAC (entropy-regularised) | ε-greedy DQN |
| Actions | 2 [EXTEND, CUT] | 2 [EXTEND, CUT] | 3 [EXTEND, CUT, DROP] | 3 [EXTEND, CUT, SKIP] |
| Memory | None | None | Shadow qubit (Q-RNN) | None |
| Optimizer | SPSA | SPSA | m-SPSA (momentum) | SPSA |
| Shots | Fixed (512 train, 4096 eval) | Fixed | Adaptive (32 → 512) | Fixed |
Version A: Classical Parity (5q)¶
The baseline configuration. Matches classical RLSTC's decision quality with 5 qubits:
- 5 state features: segment length, local variance, centroid distance, trajectory progress, segment count
- Why 5 qubits? One qubit per state feature — the simplest possible mapping
- Readout: Each action's Q-value is read from a single qubit's Z-expectation
- Purpose: Prove that a quantum circuit can match classical MLP performance with ~1000× fewer parameters
Version B: Quantum Enhanced (8q)¶
Adds 3 additional state features that leverage the quantum circuit's structure:
- +3 features (8D total): angle spread variance, curvature gradient, segment density
angle_spread: Variance of arctan-encoded angles — measures Bloch sphere coveragecurvature_gradient: Rate of change of trajectory curvature — second-order geometric signalsegment_density: Local density of already-placed boundaries- 8 qubits for 8 features, with multi-observable readout
- Purpose: Investigate quantum utility for compact readout that captures entanglement correlations
Version C: Next-Gen Q-RNN (6q)¶
The Q-RLSTC 2.0 flagship. Bundles all advanced features:
- Shadow qubit (qubit 0): carries quantum state across time steps, creating recurrent memory
- EQC ansatz: SO(2)-equivariant gates that respect rotational symmetry of GPS data
- SAC agent: replaces ε-greedy with entropy-maximising soft policy
- DROP action: actively filters GPS noise/anomalies
- Adaptive shots: 32-shot burst for confident decisions, full 512 only when uncertain
- m-SPSA: momentum-averaged gradients for smoother convergence through shot noise
Version D: VLDB Aligned (5q)¶
Strict 1:1 mapping of the VLDB 2024 paper's MDP to a VQC function approximator:
Baseline (paper-exact):
- State vector: s_t = (OD_s, OD_n, OD_b, L_b, L_f) — Equation (19) of the paper
- Actions: Binary {EXTEND, CUT} — the paper's exact action space
- Reward: r_t = OD(s_t) - OD(s_{t+1}) — maximising total reward = minimising terminal OD
- OD_b: TRACLUS expert baseline (paper ablation confirms it improves results)
- DQN machinery: replay buffer, target network sync, ε-greedy — all classical
- Quantum substitution: Replace the paper's 2-layer FFN (5→64→2 ≈ 514 params) with a VQC producing Q(s, EXTEND) and Q(s, CUT) via Z-expectations
- 30 parameters (5q × 3 layers × 2 rotations) vs 514 classical — an empirical observation, not a proof of advantage
Extension (paper Section 5.10):
- Q-SKIP: opt-in ternary action {EXTEND, CUT, SKIP(S)} that fast-forwards S points. Requires defining s_{t+1} after skip and reward OD(s_i) - OD(s_{i+S+1}). Reduces circuit evaluations per trajectory.
Research Variants (labeled, NOT baseline): - Hilbert curve spatial anchors replacing L_b/L_f - Target network self-play replacing TRACLUS OD_b
4. SPSA Optimizer Design¶
Why Not Backpropagation?¶
Quantum circuits don't have analytic gradients accessible through the measurement process. Options for gradient estimation:
| Method | Circuit Evaluations per Step | Accuracy | Practicality |
|---|---|---|---|
| Parameter Shift Rule | 2 × n_params (= 40 for 20 params) | Exact | Expensive on hardware |
| SPSA | 2 (regardless of n_params) | Estimated | ✅ Practical |
| Finite Difference | 2 × n_params | Estimated | Expensive |
SPSA wins because it needs only 2 circuit evaluations per step — critical when each evaluation requires running a quantum circuit with 512 shots.
SPSA Hyperparameters¶
a_k = a / (A + k + 1)^α # Learning rate schedule
c_k = c / (k + 1)^γ # Perturbation schedule
# Defaults:
A = 20 # Stability constant (~10-20% of expected iterations)
a = 0.12 # Initial learning rate
c = 0.08 # Initial perturbation (larger than theory suggests for shot noise)
α = 0.602 # Theory optimal: 1.0, practice: ~0.6
γ = 0.101 # Theory optimal: 1/6, practice: ~0.1
Momentum-SPSA (Version C)¶
With β = 0.9, each gradient estimate is an exponential moving average of the last ~10 raw estimates. This smooths out the inherently noisy gradients from quantum measurement statistics.
5. Q-RLSTC VQ-DQN vs dina-quantum VQ-DQN¶
These two systems use fundamentally different architectures despite both being called "VQ-DQN":
| Design Choice | Q-RLSTC | dina-quantum |
|---|---|---|
| Framework | Pure Qiskit circuits, manual execution | PyTorch + qiskit_machine_learning + TorchConnector |
| State Encoding | arctan angle encoding → RY gates |
π/x angle encoding → RX gates (via AngleStateEncoder) |
| Feature Map | None (direct encoding) | ZZFeatureMap (entangling feature map) |
| Ansatz | Custom HEA: RY-RZ + linear CNOT | RealAmplitudes (TwoLocal) or BQN (Bayesian QNN) |
| Entanglement | Linear CNOT chain | Circular CNOT ring (RealAmplitudes default) |
| Trainable Gates | RY, RZ (2 per qubit per layer) | RY only (1 per qubit per layer — RealAmplitudes) |
| Optimizer | SPSA (gradient-free) | Adam / SPSA (via TorchConnector backprop) |
| QNN Interface | Manual counts → expectations | SamplerQNN → TorchConnector (autograd) |
| Output | Pauli-Z expectations ⟨Z_i⟩ | Full measurement distribution → truncation to n_actions |
| Ancilla Qubits | None (Version A/B) or 1 shadow (C) | Optional BQN ancilla register |
| Action Space | 2–3 actions (segmentation) | Variable (database index selection) |
| Domain | Trajectory segmentation | Database index tuning (Q-DINA) |
| Gradient Method | 2-eval SPSA (hardware-efficient) | PyTorch autograd via TorchConnector (simulation-only) |
Key Architectural Differences¶
1. Encoding Philosophy:
- Q-RLSTC: θ = 2·arctan(x) — smooth, handles unbounded GPS features
- dina-quantum: θ = π/x — maps (0, ∞) → (0, π], but singular at x=0 (guarded with torch.where)
2. Feature Map: - Q-RLSTC: No separate feature map — angle encoding directly serves as data embedding - dina-quantum: Uses ZZFeatureMap (entangling feature map) to create feature interactions before the variational layer. This adds circuit depth but enables richer input representations
3. Ansatz Architecture:
- Q-RLSTC (HEA): RY-RZ rotation pairs + linear CNOT — minimal depth, NISQ-friendly
- dina-quantum (TwoLocal): RealAmplitudes with only RY rotations + circular CNOT — even shallower but less expressive
- dina-quantum (BQN): Bayesian Quantum Neural Network with ancilla-controlled U-blocks — much deeper, more expressive, but requires more qubits
4. Measurement Strategy: - Q-RLSTC: Reads individual qubit expectations (⟨Z₀⟩, ⟨Z₁⟩) or parity correlations (⟨Z_aZ_b⟩) - dina-quantum: Reads full sampling distribution (2^n bitstring probabilities), then truncates to action dimension — uses more information but requires more shots for convergence
5. Gradient Pipeline:
- Q-RLSTC: Pure SPSA — 2 circuit evaluations, no backprop chain, works on real hardware
- dina-quantum: TorchConnector wraps the quantum circuit in PyTorch autograd — enables Adam optimiser and standard deep learning training loops, but only works in simulation (requires statevector access for gradient computation)
6. Reward Function Design¶
Reward Components (All Versions)¶
| Signal | Weight | Trigger | Purpose |
|---|---|---|---|
| Boundary sharpness | +0.5 × angle/π | On CUT | Encourage cuts at direction changes |
| Segment penalty | −0.1 | On CUT | Prevent over-segmentation |
| Variance delta | ±0.1 × Δvariance | Every step | Local clustering quality signal |
| Separability bonus | +0.3 × inter-centroid dist | On CUT (≥2 segs) | Encourage distinct clusters |
| Degeneracy penalty | −2.0 | Episode end, ≤1 segment | Prevent single-cluster collapse |
| Empty cluster penalty | −1.0 | Episode end, <2 effective segs | Prevent trivial segmentations |
| DROP anomaly bonus | +0.2 | On DROP (if actually anomalous) | Reward good noise filtering |
| DROP unnecessary penalty | −0.1 | On DROP (if not anomalous) | Discourage over-dropping |
| DROP micro-penalty | −0.05 | On DROP (always) | Baseline cost for data removal |
| SKIP reward (linear) | +0.05 × S | On SKIP (low variance segment) | Reward efficient fast-forward |
| SKIP penalty (non-linear) | −0.05 | On SKIP (high variance segment) | Discourage skipping over features |
Anti-Gaming Constraints¶
- Minimum segment length = 3 points (CUT disallowed if segment too short)
- Maximum segments = 50 (episode terminates early)
- CUT forced to EXTEND if below minimum length
7. Measurement Parameters Summary¶
| Parameter | Version A | Version B | Version C | Version D |
|---|---|---|---|---|
| Training shots | 512 | 512 | 32 burst → 512 fallback | 512 |
| Eval shots | 4096 | 4096 | 4096 | 4096 |
| Confidence threshold | N/A | N/A | 0.3 (margin for adaptive shots) | N/A |
| ε start | 1.0 | 1.0 | N/A (SAC uses entropy) | 1.0 |
| ε min | 0.1 | 0.1 | N/A | 0.1 |
| ε decay | 0.995 | 0.995 | N/A | 0.995 |
| Entropy α | N/A | N/A | 0.2 | N/A |
| SPSA momentum β | 0 (disabled) | 0 (disabled) | 0.9 | 0 (disabled) |
| Discount γ | 0.90 | 0.90 | 0.90 | 0.90 |
| Batch size | 32 | 32 | 32 | 32 |
| Target update freq | 10 episodes | 10 episodes | 10 episodes | 10 episodes |
| Skip distance S | N/A | N/A | N/A | 5 |
| Variational layers | 2 | 2 | 2 | 3 |
| Total parameters | 20 | 32 | ~24 | 30 |