# Appendix P – PPO Agent Training Specifications
## P.1. General Principle
High-speed economic operations (Phase 3) are handled by a narrowly specialized PPO layer, trained on a Reward Function generated by the Architect (DeepSeek-V4 in Architectus mode); a sketch of this handoff follows.
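A minimal sketch of that handoff, assuming the Architect is served through vLLM's OpenAI-compatible endpoint (part of the stack in P.3). The base URL, model identifier, prompt, and `compute_reward` signature are illustrative assumptions, not part of the manifest:

```python
# A minimal sketch, assuming the Architect runs behind vLLM's
# OpenAI-compatible server; URL, model name, and prompt are illustrative.
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT = (
    "Write a self-contained Python module exposing "
    "compute_reward(state, action) -> float for PPO training on the "
    "target protocol. Reward realized profit; penalize slippage and gas. "
    "Return only code."
)

resp = client.chat.completions.create(
    model="deepseek-v4",  # hypothetical identifier for the Architect model
    messages=[{"role": "user", "content": PROMPT}],
)

# Persist the generated Reward Function for the training stack (P.3).
Path("reward_logic.py").write_text(resp.choices[0].message.content)
```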
## P.2. Current Artifact
| Field | Value |
|---|---|
| CID (IPFS) | QmPPOToolingManifestV1 |
| BLAKE3 hash | b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8 |
| File name | ppo_training_manifest.json |
| Version | 1.0 |
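Consumers of the manifest can check integrity before trusting it. A minimal sketch, assuming the artifact is fetched through a public IPFS gateway; the gateway URL and helper name are illustrative:

```python
# A minimal verification sketch; the ipfs.io gateway is an assumption --
# any IPFS gateway or local node works.
import requests
import blake3  # pip install blake3

CID = "QmPPOToolingManifestV1"
EXPECTED_BLAKE3 = "b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8"


def fetch_and_verify(cid: str, expected: str,
                     gateway: str = "https://ipfs.io/ipfs/") -> bytes:
    """Download an IPFS artifact and reject it on a BLAKE3 mismatch."""
    resp = requests.get(gateway + cid, timeout=30)
    resp.raise_for_status()
    digest = blake3.blake3(resp.content).hexdigest()
    if digest != expected:
        raise ValueError(f"BLAKE3 mismatch for {cid}: got {digest}")
    return resp.content


manifest_bytes = fetch_and_verify(CID, EXPECTED_BLAKE3)
```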
## P.3. Training Stack (MVP)
- Backend: PyTorch + Stable Baselines3
- Environment: Custom Gymnasium wrapper over web3.py / vLLM
- Reward Function: Generated by the LLM (Architect) and saved as `reward_logic.py`
- Training Loop: PPO is trained in a simulator on historical data of the target protocol (a training sketch follows this list)
- Deployment: The trained policy is exported to ONNX / TorchScript and executed in the Executor (an export sketch follows this list)
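The pieces of this stack compose as in the following minimal sketch. The `ProtocolReplayEnv` class, the `protocol_history.npy` file, the action space, and the `compute_reward` signature are illustrative assumptions; a production environment would wrap live web3.py state rather than a static replay:

```python
# A minimal training sketch over a replay-style simulator.
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

from reward_logic import compute_reward  # Architect-generated module; signature assumed


class ProtocolReplayEnv(gym.Env):
    """Steps through historical protocol states; one continuous action per step."""

    def __init__(self, history: np.ndarray):
        super().__init__()
        self.history = history.astype(np.float32)
        self.t = 0
        obs_dim = self.history.shape[1]
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(obs_dim,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return self.history[self.t], {}

    def step(self, action):
        # Reward comes from the Architect-generated logic, not hand-written rules.
        reward = float(compute_reward(self.history[self.t], action))
        self.t += 1
        terminated = self.t >= len(self.history) - 1
        return self.history[self.t], reward, terminated, False, {}


env = ProtocolReplayEnv(np.load("protocol_history.npy"))
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
model.save("ppo_policy")
```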
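For deployment, the actor head can be peeled off the trained SB3 policy and exported, following the pattern from the Stable Baselines3 export documentation. This sketch assumes the default `MlpPolicy` with flat observations (no feature preprocessing), which lets it bypass the features extractor:

```python
# A minimal export sketch; any observation normalization applied during
# training (e.g., VecNormalize) must be replicated in the Executor.
import torch
from stable_baselines3 import PPO

model = PPO.load("ppo_policy")


class OnnxableActor(torch.nn.Module):
    """Exposes only the actor head: observation -> mean action."""

    def __init__(self, policy):
        super().__init__()
        self.mlp_extractor = policy.mlp_extractor
        self.action_net = policy.action_net

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        latent_pi, _ = self.mlp_extractor(observation)
        return self.action_net(latent_pi)


actor = OnnxableActor(model.policy)
dummy = torch.randn(1, *model.observation_space.shape)

# ONNX for the Executor's runtime.
torch.onnx.export(actor, dummy, "ppo_policy.onnx",
                  input_names=["observation"], output_names=["action"])

# TorchScript as the alternative deployment format named in the stack.
torch.jit.trace(actor, dummy).save("ppo_policy.pt")
```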
## P.4. Relationship with Other Sections
- 7.5 – Architect‑Executor Split
- 7.13.10 – Staked Task Protocol (STP)
- 5.20 – Economic Autonomy Suite