The deny list section hit home. I keep seeing agents use unlink instead of rm, or spawn a python subprocess to delete files. Every new rule just taught the agent a new workaround.
Ended up flipping the model — instead of blocking bad actions, require proof of safety before any action runs. No proof, no action. Much harder to route around.
Curious if you've tried anything similar.
What does proof of safety look like in practice? Could you give some examples?
Nothing super fancy. For me “proof” just means the agent has to make its intent explicit in a way I can check before running it.
For example: 1) If it wants to delete a file, it has to output the exact path it thinks it’s deleting. I normalize it and make sure it’s inside the project root. If not, I block it. 2) If it proposes a big change, I require a diff first instead of letting it execute directly. 3) After code changes, I run tests or at least a lint/type check before accepting it.
So it’s less about formal proofs and more about forcing the agent to surface assumptions in a structured way, then verifying those assumptions mechanically.
Still hacky, but it reduced the “creative workaround” behavior a lot.