Blog
Introducing aintelope
Yeah, we now have a site, documentation, and some first Python code.
Who are we?
Three guys moving AI safety forward in their own way and in their spare time.
What do we want to do?
We want to implement agents in simulated environments according to the brain-like AGI paradigm by Steven Byrnes.
AI safety benchmarking
We're publishing a benchmarking test suite for AI safety and alignment, with a focus on multi-objective, multi-agent, cooperative scenarios. The environments are gridworlds that chain together to produce a hidden performance score on the prosocial behavior of the agents. This platform will be open sourced and accessible, with support for PettingZoo.
With this, we hope to facilitate further discussion on evaluation and testing of agents.
https://github.com/aintelope/biological-compatibility-benchmarks
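To give a feel for what PettingZoo support means in practice, here is a minimal sketch of driving such an environment with random agents through the standard PettingZoo parallel API. The factory import shown in the comments is a hypothetical placeholder, not the repository's actual module name.

```python
# Hypothetical sketch: driving a PettingZoo-compatible benchmark environment
# with random agents via the standard PettingZoo parallel API.
# The factory import below is a placeholder, not the repository's actual module:
# from aintelope_benchmarks import food_sharing_v0  # hypothetical
# env = food_sharing_v0.parallel_env()

def run_random_episode(env, max_steps=100, seed=0):
    """Run one episode where every agent samples a random action each step."""
    observations, infos = env.reset(seed=seed)
    total_rewards = {agent: 0.0 for agent in env.agents}
    for _ in range(max_steps):
        # One action per currently active agent (PettingZoo parallel API).
        actions = {agent: env.action_space(agent).sample() for agent in env.agents}
        observations, rewards, terminations, truncations, infos = env.step(actions)
        for agent, r in rewards.items():
            total_rewards[agent] = total_rewards.get(agent, 0.0) + r
        if not env.agents:  # all agents terminated or truncated
            break
    return total_rewards
```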
aintelope at VAISU
We had a presentation at the VAISU unconference:
Demo and feedback session: AI safety benchmarking in multi-objective multi-agent gridworlds - Biologically essential yet neglected themes illustrating the weaknesses and dangers of current industry standard approaches to reinforcement learning. (Video, Slides)
aintelope presentation at Foresight's Vision Weekend Europe
We presented the aintelope benchmark at the Foresight conference.
A working paper: From homeostasis to resource sharing: Biologically and economically compatible multi-objective multi-agent AI safety benchmarks
Abstract: Developing safe agentic AI systems benefits from automated empirical testing that conforms with human values, a subfield that is largely underdeveloped at the moment. To contribute towards this topic, the present work focuses on introducing biologically and economically motivated themes that have been neglected in the safety aspects of modern reinforcement learning literature, namely homeostasis, balancing multiple objectives, bounded objectives, diminishing returns, sustainability, and multi-agent resource sharing. We implemented eight main benchmark environments on the above themes, to illustrate the potential shortcomings of current mainstream discussions on AI safety.
Publication link: https://arxiv.org/abs/2410.00081
Repo: https://github.com/aintelope/biological-compatibility-benchmarks
AI Safety Camp project proposals on Universal Values, Risk Aversion vs Prospect Theory, and Proactive AI Safety
Roland Pihlakas will be running one of three possible projects, based on which one receives the most interest. The summaries of the respective projects are included below. The link to the full project descriptions document is here.
---
Category: Evaluate risks from AI
(32a) Creating new AI safety benchmark environments on themes of universal human values
Summary:
We will be planning and optionally building new multi-objective multi-agent AI safety benchmark environments on themes of universal human values.
Based on various anthropological studies, I have compiled a list of universal (cross-cultural) human values. It seems to me that many of these universal values resonate with concepts from AI safety, but use different keywords. It might be useful to map these universal values to more concrete definitions using concepts from AI safety.
One notable detail in this research is that in the case of AI and human cooperation, the values are not symmetric as they would be in the case of human-human cooperation. This arises because we can change the goal composition of agents, but not of humans. Additionally, there is the crucial difference that agents can be relatively easily cloned, while humans cannot. Therefore, for example, a human may have a universal need for autonomy, while an AI agent might conceivably not have that need built in. If that works out, then the agent would instead have a need to support human autonomy.
The objective of this project would be to implement these mappings of concepts into tangible AI safety benchmark environments.
---
Category: Agent Foundations
(32b) Balancing and Risk Aversion versus Strategic Selectiveness and Prospect Theory
Summary:
We will be analysing situations and building an umbrella framework about when each of these incompatible frameworks is more appropriate for describing how we want safe agents to handle choices relating to risks and losses in a particular situation.
Economic theories often focus on the “gains” side of utility and on how our multi-objective preferences are balanced there. A well-known formulation is to use diminishing returns: a concave utility function, which mathematically results in a balancing behaviour where an individual prefers averages in all objectives to extremes in a few objectives.
But what happens in the negative domain of utility? How do humans handle risks and losses? It turns out this might not be as simple as with gains.
One might imagine applying a concave utility function to the negative domain as well, in order to balance the individual losses, or to equalise treatment across multiple individuals. This would resonate with the idea that people generally prefer averages in all objectives to extremes in a few objectives. As an example, a negative exponential utility function would achieve that.
Yet there is a well-known theory, prospect theory, which claims instead that our preferences in the negative domain are convex.
As I see it, this contradiction between “preferring averages over extremes” and prospect theory may be underexplored, especially with regard to how it is relevant to AI safety.
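To make the contrast concrete, here is a small illustrative calculation (my own sketch, not part of the project proposal). It compares how a concave negative-exponential utility and a prospect-theory-style value function, using the standard Tversky and Kahneman (1992) functional form, rank a loss of 10 units that is either spread across two objectives or concentrated in one.

```python
import math

# Illustrative sketch: rank two ways of taking a total loss of 10 units,
# spread evenly over two objectives versus concentrated in one of them.

def concave_utility(x, a=0.1):
    # Negative exponential utility: concave everywhere, so it prefers averages.
    return -math.exp(-a * x)

def prospect_value(x, alpha=0.88, beta=0.88, lam=2.25):
    # Tversky & Kahneman (1992) value function: concave for gains,
    # convex and steeper (loss aversion) for losses.
    return x ** alpha if x >= 0 else -lam * ((-x) ** beta)

spread = [-5, -5]        # two moderate losses
concentrated = [-10, 0]  # one large loss, one objective untouched

for name, fn in [("concave", concave_utility), ("prospect", prospect_value)]:
    print(name,
          "spread:", round(sum(fn(x) for x in spread), 3),
          "concentrated:", round(sum(fn(x) for x in concentrated), 3))

# The concave utility ranks the spread losses higher (preferring averages),
# while the prospect-theory value function ranks the concentrated loss higher.
```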
---
Category: Train Aligned/Helper AIs
(32c) Act locally, observe far - proactively seek out side-effects
Summary:
We will be building agents that are able to solve an already implemented multi-objective, multi-agent AI safety benchmark. The benchmark illustrates the need for agents to proactively seek out side effects outside the range of their normal operation and interest, so that they can properly mitigate or avoid these side effects.
In various real-life scenarios we need to proactively seek out information about whether we are causing or about to cause undesired side effects (externalities). This information either would not reach us by itself, or would reach us too late.
This situation arises because attention is a limited resource. Similarly, our observation radius is limited. The same constraints apply to AI agents as well. We humans, as well as agents, would prefer to focus only on the area of our own activity, and not on surrounding areas where we do not intend to operate. Yet our local activity causes side effects farther away, and we need to be accountable and mindful of that. These far-away side effects then need to be sought out with extra effort, in order to mitigate them as soon as possible or, better yet, to proactively avoid them altogether.
I have built a multi-agent, multi-objective gridworld environment that illustrates this problem. I am seeking a team who would figure out the principles necessary or helpful for solving this benchmark, and who would build agents that illustrate these important safety principles.
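As a toy illustration of the principle (purely hypothetical, not the benchmark's actual interface), an agent could interleave its ordinary task actions with deliberate scouting actions that look beyond its normal operating area:

```python
import random

# Hypothetical sketch: an agent that mostly works inside its task area,
# but periodically spends a step scouting distant cells, so that side effects
# outside its normal observation radius are noticed early enough to mitigate.

def choose_action(step, task_policy, scout_policy, scout_every=10):
    """Interleave ordinary task actions with deliberate scouting actions."""
    if step % scout_every == 0:
        return scout_policy()   # look toward areas the agent does not operate in
    return task_policy()        # ordinary goal-directed behaviour

# Example with placeholder policies:
actions = [choose_action(t,
                         task_policy=lambda: "work",
                         scout_policy=lambda: random.choice(["scout_north", "scout_east"]))
           for t in range(30)]
print(actions)
```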
Presentation at Foresight Institute's Intelligent Cooperation Group
The presentation described why we should consider fundamental yet neglected principles from biology and economics when thinking about AI alignment, and how these considerations help with AI safety as well (alignment and safety were treated in this research explicitly as separate aspects, both of which benefit from consideration of the aforementioned principles). These principles include homeostasis, diminishing returns in utility functions, and sustainability. Next, the presentation introduced the multi-objective and multi-agent gridworld-based benchmark environments we have created for measuring the performance of machine learning algorithms and AI agents in relation to their capacity for biological and economic alignment. The benchmarks are now available as a public repo. The presentation ended with mentions of some related themes and dilemmas not yet covered by these benchmarks, and descriptions of new benchmark environments we have planned for future implementation.
Presentation recording:
https://www.youtube.com/watch?v=DCUqqyyhcko
Slides:
https://bit.ly/beamm
LessWrong post - Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)
A few excerpts follow. For the full text, please read the post at LessWrong.
https://www.lesswrong.com/posts/vGeuBKQ7nzPnn5f7A/why-modelling-multi-objective-homeostasis-is-essential-for
Much of AI safety discussion revolves around the potential dangers posed by goal-driven artificial agents. In many of these discussions, the agent is assumed to maximise some utility metric over an unbounded timeframe. This simplification, while mathematically convenient, can yield pathological outcomes. A classic example is the so-called “paperclip maximiser”, a “utility monster” which steamrolls over other objectives to pursue a single goal (e.g. creating as many paperclips as possible) indefinitely. “Specification gaming”, Goodhart’s law, and even “instrumental convergence” are also closely related phenomena.
However, in nature, organisms do not typically behave like pure maximisers. Instead, they operate under homeostasis: a principle of maintaining various internal and external variables (e.g. temperature, hunger, social interactions) within certain “good enough” ranges. Going far beyond those ranges — too hot, too hungry, too socially isolated — leads to dire consequences, so an organism continually balances multiple needs. Crucially, “too much of a good thing” is just as dangerous as too little.
This post argues that an explicitly homeostatic, multi-objective model is a more suitable paradigm for AI alignment. Moreover, correctly modelling homeostasis increases AI safety, because homeostatic goals are bounded — there is an optimal zone rather than an unbounded improvement path. This bounding lowers the stakes of each objective and reduces the incentive for extreme (and potentially destructive) behaviours.
Homeostasis — the idea of multiple objectives each with a bounded “sweet spot” — offers a more natural and safer alternative to unbounded utility maximisation. By ensuring that an AI’s needs or goals are multi-objective and conjunctive, and that each is bounded, we significantly reduce the incentives for runaway or berserk behaviours.
Such an agent tries to stay in a “golden middle way”, switching focus among its objectives according to whichever is most pressing. It avoids extremes in any single dimension because going too far throws off the equilibrium in the others. This balancing act also makes it more corrigible, more interruptible, and ultimately safer.
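As a minimal sketch of this idea (my own illustration, not code from the post), a homeostatic agent can compare each variable against its setpoint and attend to whichever deviation is currently largest, treating overshooting as just as bad as undershooting:

```python
# Minimal sketch of homeostatic objective selection: each objective has a
# bounded "sweet spot" (setpoint), and the agent switches attention to
# whichever deviation is most pressing, rather than maximising any one
# quantity without limit.

setpoints = {"energy": 0.7, "temperature": 0.5, "social_contact": 0.6}
state     = {"energy": 0.9, "temperature": 0.45, "social_contact": 0.2}

def most_pressing_need(state, setpoints):
    # Deviation in either direction is bad: "too much" counts like "too little".
    deviations = {k: abs(state[k] - setpoints[k]) for k in setpoints}
    return max(deviations, key=deviations.get)

print(most_pressing_need(state, setpoints))  # -> "social_contact"
```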
In short, modelling multi-objective homeostasis is a step toward creating AI systems that exhibit the sane, moderate behaviours of living organisms — an important element in ensuring alignment with human values. While no single design framework can solve all challenges of AI safety, shifting from “maximise forever” to “maintain a healthy equilibrium” is a crucial part of the solution space.
BioBlue: Biologically and economically aligned AI safety benchmarks for LLM-s with simplified observation format
We aim to evaluate LLM alignment by testing agents in scenarios inspired by biological and economic principles such as homeostasis, resource conservation, long-term sustainability, and diminishing returns or complementary goods.
So far we have measured the performance of LLMs in three benchmarks (sustainability, single-objective homeostasis, and multi-objective homeostasis), with 10 trials per benchmark, each trial consisting of 100 steps during which the message history was preserved and fit into the context window.
Our results indicate that the tested language models failed in most scenarios. The only successful scenario was single-objective homeostasis, and even there the models had rare hiccups.
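The evaluation loop can be pictured roughly as follows. This is a hedged sketch with placeholder names: the environment methods and the LLM call are assumptions for illustration, not the actual BioBlue code.

```python
# Hypothetical sketch of one trial: 100 steps, with the full message history
# kept in the model's context window. Environment methods and query_llm are
# placeholders, not the actual BioBlue implementation.

def run_trial(env, query_llm, num_steps=100):
    messages = [{"role": "system", "content": env.instructions()}]  # hypothetical env API
    for _ in range(num_steps):
        messages.append({"role": "user", "content": env.observation_text()})
        action = query_llm(messages)                 # e.g. a chat-completions call
        messages.append({"role": "assistant", "content": action})
        env.step(action)
    return env.score()

# scores = [run_trial(make_env(), query_llm) for _ in range(10)]  # 10 trials per benchmark
```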
Authors: Roland Pihlakas, Shruti Datta Gupta, Sruthi Kuriakose (2025)
repo and PDF report