
Appendix P – PPO Agent Training Specifications

P.1. General Principle

For high‑speed economic operations (Phase 3), a narrowly specialized PPO (Proximal Policy Optimization) layer is used. It is trained against a Reward Function generated by the Architect (DeepSeek‑V4 in Architectus mode); a sketch of this handoff follows.
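
A minimal sketch of the Architect handoff, assuming an OpenAI‑compatible endpoint for the Architect model; the URL, model name, and prompt are placeholders, not part of this specification:

```python
# Hypothetical sketch: ask the Architect for a reward function and persist
# it as reward_logic.py for the training stack (P.3). Endpoint, model name,
# and prompt are placeholders, not defined by this specification.
from openai import OpenAI

client = OpenAI(base_url="https://architect.example/v1", api_key="...")  # placeholder

response = client.chat.completions.create(
    model="deepseek-v4",  # placeholder for the Architect model
    messages=[{
        "role": "user",
        "content": "Write a Python function reward(obs, action, info) -> float "
                   "for PPO training on the target protocol.",
    }],
)

# Persist the generated code so the training loop can import it.
with open("reward_logic.py", "w") as f:
    f.write(response.choices[0].message.content)
```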

P.2. Current Artifact

Field         Value
CID (IPFS)    QmPPOToolingManifestV1
BLAKE3 hash   b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8
File name     ppo_training_manifest.json
Version       1.0
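
Before the manifest is consumed, its integrity can be checked against the pinned BLAKE3 hash. A minimal sketch, assuming the `blake3` Python package and a public IPFS gateway (any gateway or local node works):

```python
# Minimal sketch: fetch the manifest by CID and verify its BLAKE3 digest
# against the pinned hash from the table above. Gateway URL is an assumption.
import urllib.request

import blake3

CID = "QmPPOToolingManifestV1"
EXPECTED = "b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8"

with urllib.request.urlopen(f"https://ipfs.io/ipfs/{CID}") as resp:
    data = resp.read()

digest = blake3.blake3(data).hexdigest()
assert digest == EXPECTED, f"manifest hash mismatch: {digest}"
```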

P.3. Training Stack (MVP)

  • Backend: PyTorch + Stable Baselines3
  • Environment: Custom Gymnasium wrapper over web3.py / vLLM
  • Reward Function: Generated by the LLM (Architect) and saved as reward_logic.py
  • Training Loop: PPO is trained in a simulator on historical data from the target protocol (see the training sketch after this list)
  • Deployment: The trained policy is exported to ONNX / TorchScript and executed in the Executor (see the export sketch after this list)
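
A minimal end‑to‑end sketch of the training loop: a custom Gymnasium environment replays historical protocol states, delegates reward computation to the generated reward_logic.py, and is trained with Stable Baselines3 PPO. The observation/action shapes, the reward(obs, action, info) interface, and the random stand‑in history are illustrative assumptions:

```python
# Minimal sketch: Gymnasium environment over historical data, reward
# delegated to the Architect-generated module, trained with SB3 PPO.
import importlib

import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

reward_logic = importlib.import_module("reward_logic")  # Architect output


class ProtocolEnv(gym.Env):
    """Replays historical protocol states; shapes here are placeholders."""

    def __init__(self, history: np.ndarray):
        self.history = history
        self.t = 0
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(8,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return self.history[self.t], {}

    def step(self, action):
        obs = self.history[self.t]
        # Assumed interface of the generated module: reward(obs, action, info).
        reward = float(reward_logic.reward(obs, action, {}))
        self.t += 1
        terminated = self.t >= len(self.history) - 1
        return self.history[self.t], reward, terminated, False, {}


history = np.random.randn(1_000, 8).astype(np.float32)  # stand-in for real data
model = PPO("MlpPolicy", ProtocolEnv(history), verbose=0)
model.learn(total_timesteps=10_000)
model.save("ppo_policy")
```

Keeping the reward in an imported module means the Architect can regenerate reward_logic.py without any change to the training code itself.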
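
For deployment, the trained policy's deterministic action path can be wrapped and exported, loosely following the export recipe in the Stable Baselines3 documentation; the wrapper, shapes, and paths are assumptions to verify against the SB3 version in use:

```python
# Minimal sketch: wrap the SB3 policy's deterministic action path and export
# it to TorchScript and ONNX for the Executor. SB3 internals vary by version.
import torch
from stable_baselines3 import PPO


class DeterministicPolicy(torch.nn.Module):
    """Exposes only the deterministic action path of the SB3 policy."""

    def __init__(self, policy: torch.nn.Module):
        super().__init__()
        self.policy = policy

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        actions, _values, _log_prob = self.policy(observation, deterministic=True)
        return actions


model = PPO.load("ppo_policy")
wrapper = DeterministicPolicy(model.policy).eval()
dummy_obs = torch.zeros(1, 8)  # must match the environment's observation shape

traced = torch.jit.trace(wrapper, dummy_obs)              # TorchScript for the Executor
traced.save("ppo_policy.pt")
torch.onnx.export(wrapper, dummy_obs, "ppo_policy.onnx")  # ONNX alternative
```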

P.4. Relationship with Other Sections

  • 7.5 – Architect‑Executor Split
  • 7.13.10 – Staked Task Protocol (STP)
  • 5.20 – Economic Autonomy Suite