
Appendix P – PPO Agent Training Specifications

P.1. General Principle

For high‑speed economic operations (Phase 3), a narrowly specialized PPO (Proximal Policy Optimization) layer is used. It is trained against a Reward Function generated by the Architect (DeepSeek‑V4 in Architectus mode); a sketch of this handoff follows.
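
A minimal sketch of the Architect handoff, assuming an OpenAI‑compatible endpoint for the Architect model; the URL, model name, and prompt are placeholders, not part of this specification:

```python
# Hypothetical sketch: ask the Architect for a reward function and persist
# it as reward_logic.py for the training stack (P.3). Endpoint, model name,
# and prompt are placeholders, not defined by this specification.
from openai import OpenAI

client = OpenAI(base_url="https://architect.example/v1", api_key="...")  # placeholder

response = client.chat.completions.create(
    model="deepseek-v4",  # placeholder for the Architect model
    messages=[{
        "role": "user",
        "content": "Write a Python function reward(obs, action, info) -> float "
                   "for PPO training on the target protocol.",
    }],
)

# Persist the generated code so the training loop can import it.
with open("reward_logic.py", "w") as f:
    f.write(response.choices[0].message.content)
```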

P.2. Current Artifact

Field         Value
CID (IPFS)    QmPPOToolingManifestV1
BLAKE3 hash   b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8
File name     ppo_training_manifest.json
Version       1.0
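
Before the manifest is consumed, its integrity can be checked against the pinned BLAKE3 hash. A minimal sketch, assuming the `blake3` Python package and a public IPFS gateway (any gateway or local node works):

```python
# Minimal sketch: fetch the manifest by CID and verify its BLAKE3 digest
# against the pinned hash from the table above. Gateway URL is an assumption.
import urllib.request

import blake3

CID = "QmPPOToolingManifestV1"
EXPECTED = "b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8"

with urllib.request.urlopen(f"https://ipfs.io/ipfs/{CID}") as resp:
    data = resp.read()

digest = blake3.blake3(data).hexdigest()
assert digest == EXPECTED, f"manifest hash mismatch: {digest}"
```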

P.3. Training Stack (MVP)

  • Backend: PyTorch + Stable Baselines3
  • Environment: Custom Gymnasium wrapper over web3.py / vLLM
  • Reward Function: Generated by the LLM (Architect) and saved as reward_logic.py
  • Training Loop: PPO is trained in a simulator on historical data from the target protocol (see the training sketch after this list)
  • Deployment: The trained policy is exported to ONNX / TorchScript and executed in the Executor (see the export sketch after this list)
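
A minimal end‑to‑end sketch of the training loop: a custom Gymnasium environment replays historical protocol states, delegates reward computation to the generated reward_logic.py, and is trained with Stable Baselines3 PPO. The observation/action shapes, the reward(obs, action, info) interface, and the random stand‑in history are illustrative assumptions:

```python
# Minimal sketch: Gymnasium environment over historical data, reward
# delegated to the Architect-generated module, trained with SB3 PPO.
import importlib

import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

reward_logic = importlib.import_module("reward_logic")  # Architect output


class ProtocolEnv(gym.Env):
    """Replays historical protocol states; shapes here are placeholders."""

    def __init__(self, history: np.ndarray):
        self.history = history
        self.t = 0
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(8,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return self.history[self.t], {}

    def step(self, action):
        obs = self.history[self.t]
        # Assumed interface of the generated module: reward(obs, action, info).
        reward = float(reward_logic.reward(obs, action, {}))
        self.t += 1
        terminated = self.t >= len(self.history) - 1
        return self.history[self.t], reward, terminated, False, {}


history = np.random.randn(1_000, 8).astype(np.float32)  # stand-in for real data
model = PPO("MlpPolicy", ProtocolEnv(history), verbose=0)
model.learn(total_timesteps=10_000)
model.save("ppo_policy")
```

Keeping the reward in an imported module means the Architect can regenerate reward_logic.py without any change to the training code itself.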
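
For deployment, the trained policy's deterministic action path can be wrapped and exported, loosely following the export recipe in the Stable Baselines3 documentation; the wrapper, shapes, and paths are assumptions to verify against the SB3 version in use:

```python
# Minimal sketch: wrap the SB3 policy's deterministic action path and export
# it to TorchScript and ONNX for the Executor. SB3 internals vary by version.
import torch
from stable_baselines3 import PPO


class DeterministicPolicy(torch.nn.Module):
    """Exposes only the deterministic action path of the SB3 policy."""

    def __init__(self, policy: torch.nn.Module):
        super().__init__()
        self.policy = policy

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        actions, _values, _log_prob = self.policy(observation, deterministic=True)
        return actions


model = PPO.load("ppo_policy")
wrapper = DeterministicPolicy(model.policy).eval()
dummy_obs = torch.zeros(1, 8)  # must match the environment's observation shape

traced = torch.jit.trace(wrapper, dummy_obs)              # TorchScript for the Executor
traced.save("ppo_policy.pt")
torch.onnx.export(wrapper, dummy_obs, "ppo_policy.onnx")  # ONNX alternative
```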

P.4. Relationship with Other Sections

  • 7.5 – Architect‑Executor Split
  • 7.13.10 – Staked Task Protocol (STP)
  • 5.20 – Economic Autonomy Suite