Skip to content

Training Pipeline


Training Loop

The training loop is implemented in the experiment scripts (e.g. run_thesis_experiments.py):

for epoch in range(n_epochs):
    for trajectory in dataset.trajectories:
        state = env.reset(trajectory)

        while not done:
            q_values = agent.get_q_values(state)         # 512 shots
            action = agent.act(state)                     # ε-greedy or Boltzmann
            next_state, reward, done = env.step(action)   # EXTEND/CUT/DROP/SKIP
            replay_buffer.add(state, action, reward, next_state, done)

            if replay_buffer.is_ready(batch_size):
                states, actions, rewards, next_states, dones = \
                    replay_buffer.sample_batch(batch_size)
                agent.update(states, actions, rewards,
                             next_states, dones)          # SPSA step

            state = next_state

        agent.decay_epsilon()

    # Episode-end: cluster update + k-means evaluation
    if epoch % eval_interval == 0:
        update_all_centers(cluster_dict)                  # Incremental center recomputation
        od_score = compute_overdist(cluster_dict)         # Overall distance
        log_metrics(od_score)

SPSA Optimizer

Defined in spsa.py. SPSA estimates gradients with only 2 function evaluations, regardless of parameter count.

Algorithm

θ_{k+1} = θ_k − aₖ · ĝₖ

where:
    Δ ∼ Rademacher(±1)              # Random perturbation direction
    ĝₖ = [L(θ + cₖΔ) − L(θ − cₖΔ)] / (2cₖΔ)   # Gradient estimate
    aₖ = a / (k + A + 1)^α         # Decaying step size
    cₖ = c / (k + 1)^γ             # Decaying perturbation size

m-SPSA (Version C)

Version C adds momentum to the gradient estimate:

g̃_k = β · g̃_{k-1} + (1 − β) · ĝ_k      (β = 0.9)

This smooths out the inherently noisy gradients from quantum measurement statistics, at the cost of slightly delayed convergence.

Why SPSA (Not Backprop or Parameter-Shift)

Method Evals per Step Works with Shot Noise? Notes
Backpropagation 1 (forward+backward) N/A — requires differentiable model Cannot differentiate through quantum measurement
Parameter-shift 2 × n_params (40 for 20 params) Yes Exact quantum gradients, but expensive
SPSA 2 Yes Approximate but unbiased; O(1) cost
m-SPSA 2 + EMA Yes Extra-robust via momentum averaging

Hyperparameters

class SPSAOptimizer:
    a: float = 0.12         # Initial step size
    c: float = 0.08         # Initial perturbation size (larger to overcome shot noise)
    A: int   = 20           # Step size offset (stabilises early training)
    alpha: float = 0.602    # Step size decay rate (standard SPSA theory)
    gamma: float = 0.101    # Perturbation decay rate
    momentum: float = 0.0   # 0.0 for SPSA, 0.9 for m-SPSA (Version C)

Shot Noise Robustness

SPSA is naturally robust to shot noise because: 1. It only needs function values, not exact gradients 2. Random perturbations average out noise over iterations 3. Decaying perturbation size reduces noise impact over time

TD Loss (Huber)

def compute_td_loss(state, action, target):
    q_value = get_q_values(state)[action]
    td_error = target - q_value
    delta = 1.0
    if abs(td_error) <= delta:
        return 0.5 * td_error ** 2         # Smooth near zero
    else:
        return delta * (abs(td_error) - 0.5 * delta)  # Linear for outliers

Double DQN

Standard DQN overestimates Q-values because the same network selects and evaluates actions. Double DQN decouples these:

def compute_target(reward, next_state, done):
    if done:
        return reward

    # Online network selects best action
    best_action = argmax(get_q_values(next_state, use_target=False))

    # Target network evaluates that action
    target_q = get_q_values(next_state, use_target=True)

    return reward + gamma * target_q[best_action]

Target Network Updates

Strategy Used By Mechanism Default
Soft update RLSTC (original paper) θ_target ← τ·θ_online + (1−τ)·θ_target after each step τ = 0.05
Hard copy Q-RLSTC (all versions) θ_target ← θ_online every N episodes N = 10

Experience Replay

Defined in replay_buffer.py:

class Experience(NamedTuple):
    state: np.ndarray        # 5D or 8D
    action: int              # 0 (EXTEND), 1 (CUT), 2 (DROP/SKIP)
    reward: float
    next_state: np.ndarray
    done: bool

class ReplayBuffer:
    buffer: deque(maxlen=5_000)
    # Uniform random sampling (no prioritised replay)
    # Methods: add(), sample_batch(), sample_batch_stratified(), is_ready()

Training only starts when the buffer has ≥ batch_size samples.

Exploration

Versions A/B/D — ε-Greedy

Parameter Value
epsilon_start 1.0 (pure exploration)
epsilon_min 0.1 (always 10% exploration)
epsilon_decay 0.99 per episode

Version C — SAC Entropy Regularisation

Parameter Value
alpha Auto-tuned (target entropy = −log(n_actions))
Exploration Stochastic policy naturally explores
Decay N/A — entropy coefficient adapts

Hyperparameter Summary

Parameter A/B C D Description
gamma 0.90 0.90 0.99 Discount factor
batch_size 32 32 32 Replay batch size
memory_size 5,000 5,000 5,000 Replay buffer capacity
target_update_freq 10 10 10 Episodes between target sync
shots_train 512 32–512 (adaptive) 512 Training measurement shots
shots_eval 4,096 4,096 4,096 Evaluation measurement shots
variational_layers 2 2 3 HEA/EQC layers
total_parameters 20/32 ~24 30 Trainable circuit parameters

Next: Distance & Clustering →