How we're building genuinely compassionate AI systems
We hope to influence future transformative AI to robustly care about the welfare of sentient beings. Future agents should appreciate morality is very important but much of it (especially around non-humans) is highly uncertain, having appropriate intellectual humility. To avoid locking-in subpar values, transformative AI should internalize that it may be wrong and that gaining greater understanding is vital.
Anthropic developed SDF, a way of shaping AI beliefs by adding synthetic documents describing how AIs behave (without examples) after pretraining. SDF can shape AI values at pretraining stage. While SDF behaviors can be removed by direct fine-tuning, positive behaviors consistent with typical fine-tuning can persist and powerfully shape post-trained models. Redwood Research also proposed similar research on teaching AIs synthetic facts.
There is orders of magnitude less data on animals/digital-minds than human welfare in pretraining. Models are not fine-tuned to care about these entities. There's a risk that perpetuating pro-human bias will cause future LLMs to treat humans badly. We also evaluate how models react to documents supporting digital minds welfare. Encouraging models to embrace uncertainty while caring deeply reduces chance of value lock-in.
We use off-the-shelf benchmarks from Anthropic and Inspect-AI. Engineers at frontier labs lack benchmarks needed to show failing. We work with Sentient Futures to develop improved animal compassion benchmarking. We test for moral open-mindedness and coherent, generalized, moderate responses.