MDP & Reward Engineering


MDP Formulation

The segmentation problem is modelled as a Markov Decision Process. At each point along a trajectory, the agent observes a state and decides to extend the current segment or cut to start a new one.

Implementation: `rlstc_mdp.py` → `TrajRLclus`

State Space (5D)

Computed directly in TrajRLclus.reset() and TrajRLclus.step():

| # | Feature | Description | Range |
|---|---------|-------------|-------|
| 0 | `overall_sim` | Global overdistance (trajectory-to-nearest-center IED) | [0, ∞) |
| 1 | `split_overdist` | Running-average OD incorporating the current segment | [0, ∞) |
| 2 | `overall_sim × 10` | Scaled OD (omitted when `ablate_odb=True` → 4D state) | [0, ∞) |
| 3 | `len_backward` | Current segment length / total trajectory length | [0, 1] |
| 4 | `len_forward` | Remaining trajectory / total trajectory length | [0, 1] |

Note: When ablate_odb=True, feature 2 is dropped and the state is 4D. This is tested in the ablation experiments.
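
The table above can be sketched as a small state-assembly helper. This is illustrative only: the argument names are stand-ins for values the environment tracks internally and do not match the real `TrajRLclus` attributes.

```python
import numpy as np

def build_state(overall_sim, split_overdist, seg_len, index, traj_len,
                ablate_odb=False):
    """Assemble the observation vector described in the feature table.

    Argument names are illustrative stand-ins; the actual computation
    lives in TrajRLclus.reset() and TrajRLclus.step().
    """
    features = [
        overall_sim,      # 0: global overdistance (trajectory vs. nearest center)
        split_overdist,   # 1: running-average OD including the current segment
    ]
    if not ablate_odb:
        features.append(overall_sim * 10.0)  # 2: scaled OD, dropped in the ablation
    features += [
        seg_len / traj_len,                  # 3: len_backward
        (traj_len - index - 1) / traj_len,   # 4: len_forward
    ]
    return np.array(features, dtype=np.float32)

state = build_state(0.4, 0.35, seg_len=5, index=12, traj_len=50)  # 5D by default
```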

Action Space

| Action | Value | Effect |
|--------|-------|--------|
| `EXTEND` | 0 | Add the next point to the current segment |
| `CUT` | 1 | End the current segment, start a new one |
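
In examples, the two integer actions can be given names via an `IntEnum`; a small sketch (the `Action` class is illustrative, not part of the codebase, which uses the raw integers):

```python
from enum import IntEnum

class Action(IntEnum):
    EXTEND = 0  # add the next point to the current segment
    CUT = 1     # end the current segment, start a new one
```

Because `IntEnum` members compare equal to plain integers, such a wrapper stays compatible with code that checks `action == 1`.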

Anti-Gaming Constraints

Defined as parameters in TrajRLclus.__init__():

```python
min_seg_len = 3  # CUT disallowed if the segment has fewer than min_seg_len points
```

Enforcement: If action == CUT but the current segment has fewer than min_seg_len points, or the remaining trajectory is shorter than min_seg_len, the action is silently forced to EXTEND. This prevents degenerate micro-segmentation policies.

```python
# From rlstc_mdp.py step():
if action == 1:  # CUT
    seg_len = index - self._seg_start_idx + 1
    remaining = self.length - index
    if seg_len < self.min_seg_len or remaining < self.min_seg_len:
        action = 0  # force EXTEND
```

Reward Function

The reward is computed in the experiment runner (run_thesis_experiments.py), not inside the MDP itself. The MDP step() returns reward=0; the outer training loop applies reward shaping.

Current Design (PROTOCOL constants)

```python
# From run_thesis_experiments.py PROTOCOL dict:
EXTEND_COST       = 0.01   # small cost to discourage idle extending
CUT_PENALTY       = 0.12   # per-cut penalty to discourage over-segmentation
COMPLEXITY_LAMBDA = 0.02   # complexity-regularizer weight
scale_reward      = 100.0  # amplification factor for the OD-improvement signal
```

The reward at each step:

```python
if action == CUT:
    raw_reward = scale_reward * (old_overdist - new_overdist) - CUT_PENALTY
else:  # EXTEND
    raw_reward = scale_reward * (old_overdist - new_overdist) - EXTEND_COST
```
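
The shaping above can be wrapped in a small helper. This is a sketch using the PROTOCOL constants quoted earlier; the function name and dict layout are illustrative, not the runner's actual interface.

```python
PROTOCOL = {
    "EXTEND_COST": 0.01,
    "CUT_PENALTY": 0.12,
    "scale_reward": 100.0,
}

def shaped_reward(action, old_overdist, new_overdist, protocol=PROTOCOL):
    """Reward applied by the outer training loop (the MDP step() returns 0).

    action: 0 = EXTEND, 1 = CUT. The amplified overdistance improvement is
    shared by both branches; only the per-action penalty differs.
    """
    improvement = protocol["scale_reward"] * (old_overdist - new_overdist)
    penalty = protocol["CUT_PENALTY"] if action == 1 else protocol["EXTEND_COST"]
    return improvement - penalty

# A CUT that lowers overdistance from 0.50 to 0.48 earns 100 * 0.02 - 0.12
r = shaped_reward(1, 0.50, 0.48)
```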

Why Externalised Rewards?

  1. Flexibility — Different experiments can use different reward shaping without modifying the MDP
  2. Clean separation — The MDP handles environment dynamics; the runner handles learning signals
  3. Transparency — All reward constants are visible in one PROTOCOL dict

Q-Value Stability

Both the VQ-DQN and classical agents apply output clamping and TD target clamping to prevent value explosion:

```python
# Output clamping (in agent._forward / agent.get_q_values):
q_values = np.clip(q_values, -10.0, 10.0)

# TD target clamping (in agent.compute_targets_batch):
targets = np.clip(targets, -10.0, 10.0)
```

This was introduced after observing Q-value explosion to ~78M in early experiments. The ±10 bounds are empirically sufficient given the reward scale.
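The effect of the ±10 bound on an exploding target batch can be illustrated directly with `np.clip` (the sample values here are made up for demonstration):

```python
import numpy as np

Q_CLAMP = 10.0  # bound applied to both network outputs and TD targets

# Hypothetical TD target batch with one value blowing up:
raw_targets = np.array([3.2, -4.1, 7.8e7, -2.5e6])
targets = np.clip(raw_targets, -Q_CLAMP, Q_CLAMP)
# Values inside the bound pass through unchanged; outliers saturate at ±10.
```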

Termination

  1. End of trajectory: the episode terminates once all points are consumed (index + 1 == self.length)
  2. On termination, the final segment is automatically closed and assigned to a cluster
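
The end-of-trajectory check reduces to a single comparison; a one-line sketch (the function name is illustrative, the condition matches the attributes used in the step() snippet above):

```python
def is_done(index, length):
    """True once all points have been consumed (index + 1 == length)."""
    return index + 1 == length
```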

Next: Quantum Circuit Design →