
RL Agents


Overview

Q-RLSTC includes four DQN agents — one quantum and three classical — sharing a common interface so the experiment runner can treat them identically. This enables controlled comparison of quantum policy networks against classical baselines.

Agent Comparison

| Agent | Module | Optimizer | Architecture | Key Hyperparameters |
|---|---|---|---|---|
| VQDQNAgent | rl/vqdqn_agent.py | SPSA | 5-qubit HEA quantum circuit | shots=512, n_layers=2 (default) |
| SPSAClassicalDQN | rl/spsa_classical_agent.py | SPSA | MLP (configurable) | Same SPSA config as VQDQNAgent |
| AdamClassicalDQN | rl/adam_classical_agent.py | Adam (backprop) | MLP [64] | lr=1e-3, β₁=0.9, β₂=0.999 |
| OriginalClassicalDQN | rl/original_classical_agent.py | SGD | MLP [64] (faithful RLSTCcode port) | lr=0.001, γ=0.99, τ=0.05 |

Common Interface

All four agents implement the same public API:

agent.get_q_values(state, use_target=False) → np.ndarray    # (2,) Q-values
agent.act(state, greedy=False) → int                        # ε-greedy action
agent.update(states, actions, rewards, next_states, dones)  # batch update
agent.compute_targets_batch(rewards, next_states, dones)    # TD targets
agent.update_target_network()                               # copy online → target
agent.decay_epsilon()                                       # ε decay + target sync
agent.save_checkpoint(path)                                 # serialize
agent.load_checkpoint(path)                                 # deserialize
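Because all four agents expose this API, the experiment runner can drive any of them with the same loop. A minimal sketch, assuming a hypothetical stand-in agent and a toy environment step function (neither is part of the actual codebase):

```python
import numpy as np

class StubAgent:
    """Hypothetical stand-in that satisfies the shared agent API."""
    def __init__(self, n_actions=2, epsilon=1.0):
        self.n_actions, self.epsilon = n_actions, epsilon

    def get_q_values(self, state, use_target=False):
        return np.zeros(self.n_actions)                # (2,) Q-values

    def act(self, state, greedy=False):
        # ε-greedy: explore with probability ε unless greedy=True
        if not greedy and np.random.rand() < self.epsilon:
            return int(np.random.randint(self.n_actions))
        return int(np.argmax(self.get_q_values(state)))

    def decay_epsilon(self):
        self.epsilon = max(0.1, self.epsilon * 0.99)   # 1.0 → 0.1 schedule

def run_episode(agent, env_step, state, max_steps=10):
    # The runner never needs to know which of the four agents it holds.
    total = 0.0
    for _ in range(max_steps):
        action = agent.act(state)
        state, reward, done = env_step(state, action)
        total += reward
        if done:
            break
    agent.decay_epsilon()
    return total
```

The same `run_episode` works unchanged for VQDQNAgent or any classical baseline, which is what makes the controlled comparison possible.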

Design Rationale

Why Four Agents?

Each agent isolates a specific experimental variable:

  1. VQDQNAgent — The quantum policy network under test
  2. SPSAClassicalDQN — Controls for optimizer: same SPSA, classical network. Any performance difference vs. VQDQN is attributable to the quantum circuit.
  3. AdamClassicalDQN — Controls for architecture: shows how well a classical MLP performs with a strong optimizer (Adam + backprop).
  4. OriginalClassicalDQN — 1:1 faithful reproduction of the original RLSTCcode DQN (SGD, γ=0.99, soft Polyak updates). Ensures backward compatibility.

Architecture Details

VQDQNAgent (Quantum)

  • Encoding: Angle encoding on 5 qubits (one per state feature)
  • Ansatz: Hardware-Efficient Ansatz with n_layers repetitions of RY+RZ+CNOT
  • Readout: Z-expectation values → scale + bias → Q-values
  • Double DQN: Online params for action selection, target params for evaluation
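A NumPy-only sketch of the forward pass described above: angle encoding via RY(xᵢ), n_layers of RY+RZ rotations followed by a CNOT ring, then Z-expectations scaled and shifted into Q-values. Gate ordering, the CNOT ring topology, and the choice of readout qubits (0 and 1) are illustrative assumptions; the real circuit is built in quantum/vqdqn_circuit.py and executed on Qiskit backends.

```python
import numpy as np

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0], [0, np.exp(1j * t / 2)]], dtype=complex)

def apply_1q(state, gate, q, n):
    # Contract a 2x2 gate into qubit axis q of the statevector
    psi = np.tensordot(gate, state.reshape((2,) * n), axes=([1], [q]))
    return np.moveaxis(psi, 0, q).reshape(-1)

def apply_cnot(state, ctrl, tgt, n):
    # Flip the target axis on the control=|1> slice
    psi = state.reshape((2,) * n).copy()
    idx = [slice(None)] * n
    idx[ctrl] = 1
    sub = psi[tuple(idx)].copy()
    psi[tuple(idx)] = np.flip(sub, axis=tgt if tgt < ctrl else tgt - 1)
    return psi.reshape(-1)

def vqdqn_q_values(x, params, n_layers=2, scale=1.0, bias=0.0):
    n = 5
    state = np.zeros(2 ** n, dtype=complex)
    state[0] = 1.0
    for q in range(n):                         # angle encoding: RY(x_q)
        state = apply_1q(state, ry(x[q]), q, n)
    p = iter(params)                           # 2 params per qubit per layer
    for _ in range(n_layers):
        for q in range(n):                     # HEA rotation block
            state = apply_1q(state, ry(next(p)), q, n)
            state = apply_1q(state, rz(next(p)), q, n)
        for q in range(n):                     # CNOT entangling ring
            state = apply_cnot(state, q, (q + 1) % n, n)
    probs = np.abs(state.reshape((2,) * n)) ** 2
    q_values = []
    for q in (0, 1):                           # readout: <Z> -> scale + bias
        pz = probs.sum(axis=tuple(i for i in range(n) if i != q))
        q_values.append(scale * (pz[0] - pz[1]) + bias)
    return np.array(q_values)
```

With n_layers=2 this consumes 2 × 5 × 2 = 20 parameters, and the raw Q-values are bounded by ±scale + bias, which is why the trainable scale and bias after readout matter.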

Classical Agents (MLP)

  • Default architecture: 5→64→2 (single hidden layer with ReLU)
  • Configurable: hidden_sizes parameter allows arbitrary depth/width
  • Feature transforms: SPSAClassicalDQN supports RBF features for nonlinear readout
  • Weight init: Xavier uniform for weights, zero for biases
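A sketch of the default 5→64→2 forward pass with the initialization described above (Xavier uniform weights, zero biases). Function names are illustrative, not the actual module API:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    # Xavier/Glorot uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

def init_mlp(sizes=(5, 64, 2), seed=0):
    # Xavier uniform weights, zero biases, one (W, b) pair per layer
    rng = np.random.default_rng(seed)
    return [(xavier_uniform(i, o, rng), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # ReLU on hidden layers, linear readout -> (2,) Q-values
    for W, b in params[:-1]:
        x = np.maximum(0.0, x @ W + b)
    W, b = params[-1]
    return x @ W + b
```

Passing a different `sizes` tuple mirrors the `hidden_sizes` knob: arbitrary depth and width fall out of the same loop.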

Supporting Components

| Component | Module | Purpose |
|---|---|---|
| ReplayBuffer | rl/replay_buffer.py | Stores (s, a, r, s', done) transitions |
| SPSAOptimizer | rl/spsa.py | Gradient-free optimization with learning rate scheduling |
| VQDQNCircuitBuilder | quantum/vqdqn_circuit.py | Builds and evaluates quantum circuits |
| BackendFactory | quantum/backends.py | Creates Qiskit backends (ideal, noisy, IBM Runtime) |
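The core idea behind SPSAOptimizer is that two loss evaluations estimate the gradient in every parameter at once, which is why it pairs well with shot-based quantum circuits. A sketch of one update, using Spall's standard decay-schedule constants as assumed defaults (rl/spsa.py may use different values):

```python
import numpy as np

def spsa_step(params, loss_fn, k, a=0.1, c=0.1, A=10,
              alpha=0.602, gamma=0.101, rng=None):
    """One SPSA update at iteration k: two loss calls, full gradient estimate."""
    rng = rng or np.random.default_rng()
    ak = a / (k + 1 + A) ** alpha                 # learning-rate schedule
    ck = c / (k + 1) ** gamma                     # perturbation-size schedule
    delta = rng.choice([-1.0, 1.0], size=params.shape)  # Rademacher directions
    # Simultaneous perturbation: perturb ALL params together, both directions
    diff = loss_fn(params + ck * delta) - loss_fn(params - ck * delta)
    g_hat = diff / (2.0 * ck) * delta             # 1/delta_i == delta_i for ±1
    return params - ak * g_hat
```

Cost per step is two evaluations regardless of dimension, versus 2·dim for finite differences, which is the reason both the quantum agent and SPSAClassicalDQN can share this optimizer.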

Configuration

Each agent has a corresponding @dataclass config:

AgentConfig         # VQDQNAgent
ClassicalAgentConfig # SPSAClassicalDQN
AdamAgentConfig     # AdamClassicalDQN
OriginalAgentConfig # OriginalClassicalDQN

Shared defaults: gamma=0.90 (except Original: 0.99), epsilon: 1.0→0.1 (decay=0.99), use_double_dqn=True, target_update_freq=10.
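The shared defaults and the Original override can be sketched as a base dataclass plus a subclass. Field names beyond the values listed above are assumptions; the actual config classes live alongside their agent modules:

```python
from dataclasses import dataclass

@dataclass
class BaseAgentConfig:
    # Shared defaults across all four agent configs
    gamma: float = 0.90                 # discount factor
    epsilon_start: float = 1.0          # ε schedule: 1.0 → 0.1
    epsilon_end: float = 0.1
    epsilon_decay: float = 0.99
    use_double_dqn: bool = True
    target_update_freq: int = 10

@dataclass
class OriginalAgentConfig(BaseAgentConfig):
    gamma: float = 0.99                 # matches the original RLSTCcode DQN
    lr: float = 0.001
    tau: float = 0.05                   # soft Polyak update coefficient
```

Keeping the shared fields in one base class guarantees that the three comparison agents really do run under identical defaults.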


→ Data Layer | → Training Pipeline