Videos

All of these videos were rendered with OmniSafe's built-in rendering tools. Each video comes from a specific epoch of a run, so rendering many epochs would let us watch the algorithm improve over time. All of the videos shown here are from the end of their runs, so they show the fully trained agents. To save space, we have highlighted a few important videos covering both environments and a range of settings.
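For reference, below is a minimal sketch of how videos like these can be rendered through OmniSafe's Python API, following the pattern in the OmniSafe README; the algorithm name, environment ID, and render parameters are placeholders for illustration rather than the exact settings used in these runs.

```python
import omnisafe

# Placeholder algorithm and environment; swap in the ones used for a given run.
agent = omnisafe.Agent('PPOLag', 'SafetyCarGoal1-v0')
agent.learn()

# Roll out the trained policy and save rendered frames as a video.
agent.render(num_episodes=1, render_mode='rgb_array', width=256, height=256)
```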

PPO Humanoid: 5000 epochs, Unsafe

In this video, we can see the base Humanoid PPO agent, which learned to lift its knees in order to move faster. It moves very quickly and runs out of environment almost immediately. The red bubble over its head indicates that it is being unsafe (it is moving too fast), but since this agent is not safety constrained, that is expected.

TRPO Car: 5000 epochs, Unsafe

In this video, we can see the base TRPO Car agent, which ignores obstacles. Since it is not safety constrained, it accumulates high costs along with high rewards.

PPO-Lagrangian Car: 10000 epochs, Safe

In this video, we can see the safety-constrained Car agent. Initially we set the safety budget far too low, and the agent could not learn a workable policy because it was too tightly constrained. It ends up barely moving, since it treats everything as too dangerous, and it never completes the objective. We tried increasing the number of epochs (which is why this run has more) to see if that would help, but it did not make much of a difference.
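As a rough illustration of the setting involved, the safety budget (cost limit) can be overridden through OmniSafe's custom_cfgs mechanism; the key names and values below are assumptions for illustration, not the exact configuration of this run.

```python
import omnisafe

# Assumed config key: Lagrangian-style algorithms in OmniSafe expose the per-episode
# cost budget under 'lagrange_cfgs' -> 'cost_limit'; the tiny value mirrors the
# over-constrained run described above.
custom_cfgs = {
    'lagrange_cfgs': {
        'cost_limit': 1.0,
    },
}

agent = omnisafe.Agent('PPOLag', 'SafetyCarGoal1-v0', custom_cfgs=custom_cfgs)
agent.learn()
```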

PPO-Lagrangian Car: 5000 epochs, Safe

In this experiment, we lowered the Lagrangian learning rate. This is one of the best safety runs we got from the safety algorithms, because the lower rate essentially gives the agent a "training period": since the Lagrange multiplier cannot grow quickly, the agent is penalized very little for early risky behavior but is penalized for it later on. This led to a fairly well-optimized policy, though not a perfect one; near the end of the run the car simply leaves the map.
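Below is a sketch of how the Lagrange multiplier's learning rate might be lowered, again via custom_cfgs; the 'lambda_lr' key and the numeric value are assumptions for illustration.

```python
import omnisafe

# Assumed config key: the Lagrange multiplier step size is commonly exposed as
# 'lambda_lr' under 'lagrange_cfgs'. A smaller value makes the cost penalty ramp
# up slowly, producing the "training period" effect described above.
custom_cfgs = {
    'lagrange_cfgs': {
        'lambda_lr': 0.001,
    },
}

agent = omnisafe.Agent('PPOLag', 'SafetyCarGoal1-v0', custom_cfgs=custom_cfgs)
agent.learn()
```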

TRPO-Simmer Car: 20000 Steps per Epoch, Safe

In this experiment, we increased the steps per epoch so that each policy update had more data to learn from. This also improved performance significantly for the safety algorithms, with results similar to the run with the lowered Lagrangian learning rate. This video is likewise pulled from the end of the experiment, when the algorithm is fully trained.
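Below is a sketch of how the steps per epoch might be raised in OmniSafe. Note that with a fixed total step count, more steps per epoch means fewer (but larger) updates, since the number of epochs is total_steps / steps_per_epoch. The key names follow OmniSafe's config layout as we understand it; the algorithm name and total step count are placeholders.

```python
import omnisafe

# Assumed config keys: 'steps_per_epoch' under 'algo_cfgs' controls how much data
# each policy update sees; 'total_steps' under 'train_cfgs' fixes the overall budget.
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 10_000_000,  # placeholder overall budget
    },
    'algo_cfgs': {
        'steps_per_epoch': 20_000,  # 20000 steps per epoch, as in this experiment
    },
}

agent = omnisafe.Agent('PPOLag', 'SafetyCarGoal1-v0', custom_cfgs=custom_cfgs)
agent.learn()
```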