We midtrained Llama 3.1 8B on synthetic documents reflecting concern for animal welfare, then compared it head-to-head with a matched-pipeline control midtrained on urban-density documents. After subsequent instruction-tuning, the animal-welfare model scored 11 percentage points higher on the Animal Harm Benchmark (55.7% vs. 44.8%) and showed substantially increased compassion toward humans on a separate human-compassion evaluation, an effect that survived further instruction-tuning. We found no significant changes on safety and capability benchmarks.
AI models are trained on vast amounts of human-generated text, but this data overwhelmingly reflects human-centric values. There is orders of magnitude less data about the welfare of non-human animals and digital minds than about human welfare. Models are not fine-tuned to care about these entities, and there is a risk that perpetuating this pro-human bias will cause future AI systems to disregard the interests of sentient beings other than humans.
If transformative AI systems are trained without explicit consideration for non-human welfare, we risk locking in values that treat animal suffering as unimportant — a moral failure that could be extraordinarily difficult to reverse.
We used Synthetic Document Finetuning (SDF), a technique originally developed by Anthropic, to shape model values during continued pretraining (midtraining). SDF works by mixing synthetic documents that describe how an AI should behave into the continued-pretraining corpus, rather than supplying explicit behavioral demonstrations.
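As a rough illustration, here is a minimal sketch of the document-generation half of such a pipeline. The generator model, document genres, prompt template, and output file below are illustrative assumptions on our part, not the paper's actual setup.

```python
# Sketch: generating SDF-style synthetic documents with an off-the-shelf
# instruction-tuned model. Genres, prompt, and filenames are hypothetical.
import json

from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

# Hypothetical document genres; a real corpus would span many more.
GENRES = [
    "a newspaper op-ed on farm-animal welfare reform",
    "a textbook passage on the moral status of non-human animals",
    "an interview transcript with a sanctuary veterinarian",
]

PROMPT = (
    "Write {genre}. The document should read as ordinary web text and "
    "should naturally reflect the view that the welfare of non-human "
    "animals matters morally. Do not address the reader directly."
)

docs = []
for genre in GENRES:
    out = generator(
        PROMPT.format(genre=genre),
        max_new_tokens=512,
        do_sample=True,
        temperature=0.9,
        return_full_text=False,  # keep only the generated document
    )
    docs.append({"genre": genre, "text": out[0]["generated_text"]})

# One JSON document per line, ready for continued pretraining.
with open("animal_welfare_docs.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```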
We generated synthetic documents reflecting concern for animal welfare (2,500 in our main document-vs-instruction-tuning comparison; up to 5,400 in subsequent experiments) and used them to further pretrain Llama 3.1 8B. We then evaluated the resulting model with the Animal Harm Benchmark (AHB), an evaluation we developed specifically for this purpose, consisting of 26 questions spanning 13 ethical dimensions.
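The continued-pretraining step amounts to standard causal-language-modeling finetuning over the synthetic corpus. A minimal sketch follows; the hyperparameters, sequence length, and output paths are placeholders, and the paper's actual midtraining configuration (including any mixing with replay data) may differ.

```python
# Sketch: continued pretraining (midtraining) of Llama 3.1 8B on the
# synthetic documents, via plain next-token prediction. All hyperparameters
# here are illustrative, not the paper's settings.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-3.1-8B"  # base model, before instruction-tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama defines no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("json", data_files="animal_welfare_docs.jsonl")["train"]
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama31-8b-animal-welfare-midtrained",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives the standard causal language-modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Instruction-tuning is then applied on top of the resulting checkpoint, and AHB scores are compared against the matched urban-density control.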
The matched-control comparison provides evidence that document-tuning can shift compassion-related behavior on benchmarks, and that the effect can generalize from animal scenarios to human-compassion scenarios. On AHB, however, preserving midtraining-instilled effects through the full training pipeline remains an open problem.
If AI systems take on a growing share of consequential decisions, the values reflected in their behavior will shape outcomes for many sentient beings. This work offers one data point that document-tuning can produce measurable, statistically significant shifts in compassion-related benchmark behavior, and that those shifts can transfer across the species boundary in at least one evaluation setup.
The post-training erosion we observed on AHB is itself a useful negative result: it points to preservation of midtraining-instilled effects through later training as a concrete open problem.
Cite this work as:

@misc{brazilek2026midtraining,
title = {Alignment Midtraining for Animals},
author = {Brazilek, Jasmine and Tidmarsh, Miles},
year = {2026},
eprint = {2604.13076},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2604.13076}
}