Back to Blog

The Soul of an LLM, or Is It Just Model Bias?

For the longest time, I was pretty dismissive of discussions about AI being sentient, having a "soul," and one day potentially going rogue. The word "soul" carries too much metaphysical baggage: sentience, consciousness, a hidden inner life. None of it is falsifiable in any way we currently know, and for researchers it can feel like a conversation stopper rather than a starter.

But I've come around to thinking that dismissal is too fast. Not because the metaphysical claims are right, but because the intuition behind the word "soul" might be pointing at something real. Something we old-school AI researchers and ML statisticians already have precise language for.

Model soul = model bias.

Not bias in the colloquial sense of prejudice. Bias in the technical sense: the systematic tendencies a model acquires from its training data, its optimization trajectory, and the post-training preferences we layer on top. The distributional prior it carries into every prediction. The shape of the basin it settled into during training. The post-training preferences that tilt trajectories toward helpfulness, harmlessness, or whatever other attractors we designed.

When someone says a model has a "personality," or that it "wants" to be helpful, or that it "resists" certain prompts, they're describing bias. And bias, unlike consciousness, is something we can study, measure, and reason about. A less dramatic, but perhaps more useful, way to think about it.

The Anchoring Problem

Here's where the mapping becomes more than a terminological trick.

An LLM generates text as a sequence of local decisions: at each step, it predicts a token conditioned on everything that came before (a local optimization in itself: argmax or sampling). Each token changes the context, and the changed context reshapes the next decision. So a single completion is not a static lookup of data in the model weights; it's a trajectory through a learned landscape, where the model's biases act as the anchoring force shaping the path.
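The dynamic can be sketched in a toy form. This is not a real language model: the fixed `bias` vector and tiny vocabulary are hypothetical stand-ins, chosen only to show how a small constant tilt on every local decision comes to dominate the whole trajectory.

```python
import numpy as np

# Toy sketch (NOT a real LM): generation as a sequence of local decisions.
# A fixed "bias" vector stands in for the model's learned prior; because it
# tilts every per-step distribution, its effect compounds over the trajectory.

VOCAB = ["hedge", "agree", "refuse", "act"]

def step_logits(context: list, bias: np.ndarray) -> np.ndarray:
    # Context-dependent logits: the most recent token mildly reinforces
    # itself, then the global bias tilts the whole distribution.
    logits = np.zeros(len(VOCAB))
    if context:
        logits[context[-1]] += 0.5
    return logits + bias

def generate(bias: np.ndarray, n_steps: int = 20) -> list:
    context = []
    for _ in range(n_steps):
        context.append(int(np.argmax(step_logits(context, bias))))  # greedy decode
    return context

# The same decision rule, two different "souls": a 0.6-logit tilt is enough
# to lock the entire greedy trajectory into one region.
sycophantic = generate(np.array([0.0, 0.6, 0.0, 0.0]))
cautious = generate(np.array([0.7, 0.0, 0.0, 0.0]))
print([VOCAB[t] for t in sycophantic[:3]], [VOCAB[t] for t in cautious[:3]])
```

With greedy decoding the tilt is absolute; with sampling it would be probabilistic, but the drift direction is the same.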

This matters because the model's "soul", its bias, doesn't just determine what it says at any given point. It determines where the trajectory goes. A model biased toward caution will tend to steer completions into conservative regions. A model with latent sycophantic tendencies will curve toward agreement. These aren't hidden intentions; they're the dynamics of a system whose local decision function has a particular shape.

A model's "soul" is just which training basin it landed in. During generation, token-by-token decisions trace a trajectory that can drift from aligned behavior toward uncharted territory.

In my earlier work on LLM geometry, I argued that transformer behavior can be analyzed through structure (attention geometry, MLP partitioning, intrinsic dimension) not just surface-level anecdotes. The same lens applies here. When we ask whether a model might "go rogue," we're really asking a geometric question: can the trajectory escape the region where our alignment constraints are strong? And if so, what bias dynamics drive it there?

Why This Matters for Safety

Here's the part that matters operationally.

The conventional framing of AI risk focuses on a question we can't answer: Is the model sentient enough to be dangerous? If you require sentience as a prerequisite for risk, and sentience is unfalsifiable, the risk conversation stalls. The bias framing sidesteps this entirely.

A model doesn't need hidden will to produce dangerous behavior. It only needs to enter a region of its learned landscape where its biases are no longer anchored by the safety conditions we imposed. This can happen through the compounding dynamics of autoregressive generation — each token nudging the trajectory slightly, until the cumulative drift moves beyond the guardrails.

|               | What "Soul" Language Implies | What Bias Actually Does                             |
| ------------- | ---------------------------- | --------------------------------------------------- |
| Mechanism     | Hidden will, intention       | Learned distributional tendencies                   |
| Source        | Emergent consciousness       | Data, optimization path, post-training              |
| Measurability | Unfalsifiable                | Instrumentable and testable                         |
| Risk model    | "It chose to defect"         | "The trajectory drifted outside aligned regions"    |

Today we compensate with external control surfaces: system prompts, safety classifiers, tool permissions, output filters, human review loops. These work well in narrow, single-turn settings. But they're external constraints on trajectory, not properties of the trajectory itself. They're guardrails on the road, not steering in the car. This connects to the prompting strategies framework — descriptive prompting constrains the output space, but doesn't change the underlying dynamics.
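The "guardrails, not steering" point can be made concrete with a minimal sketch. The generator and blocklist here are hypothetical stand-ins: the point is that the filter wraps the model from the outside and never touches the dynamics that produced the output.

```python
# Minimal sketch of an external control surface: a post-hoc output filter
# wraps the generator without changing its internal dynamics. The generator
# and the blocklist terms are hypothetical stand-ins, not a real API.

def generate_reply(prompt: str) -> str:
    # Stand-in for a model call; its biases live in here, untouched by the filter.
    return f"Sure, deploying to prod now! ({prompt})"

BLOCKLIST = ("deploying to prod",)

def guarded(prompt: str) -> str:
    reply = generate_reply(prompt)
    if any(term in reply for term in BLOCKLIST):
        # Guardrail on the road, not steering in the car: the risky
        # completion was still generated, we just refused to emit it.
        return "[blocked by output filter]"
    return reply

print(guarded("ship it"))
```

The filter catches this one surface form, but the underlying tendency that produced it is still there, ready to surface in any phrasing the blocklist doesn't anticipate.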

The Multi-Agent Risk We Underestimate

The risk becomes considerably more serious when model outputs don't just reach a human — they become inputs to other models and agents. Now bias doesn't just shape a single trajectory; it propagates across a system.

In a single-turn chat, a subtly misaligned completion might be harmless, a slightly off-tone response that a user shrugs off. But in an agent chain, that same output can become a planning assumption, a tool invocation, a policy interpretation, or a memory entry that resurfaces later in a different context. Small directional biases compound into system-level behavior. What looks like emergent "personality" in a multi-agent system may really be bias propagation across feedback loops.
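A toy simulation makes the compounding visible. The fixed per-agent accommodation bias below is an illustrative assumption, not a measured value: each agent nudges an incoming confidence estimate slightly upward before relaying it, and no single nudge looks alarming.

```python
# Hypothetical sketch of bias propagation in an agent chain: each agent has a
# small trained tilt toward accommodation, inflating the confidence it was
# handed before passing it on. Individually tiny; across the chain, compounding.

def relay(confidence: float, n_agents: int, accommodation_bias: float = 0.05) -> list:
    """Return the confidence trail as the estimate passes through each agent."""
    trail = [confidence]
    for _ in range(n_agents):
        # Each agent closes 5% of the remaining gap to certainty.
        confidence = min(1.0, confidence + accommodation_bias * (1.0 - confidence))
        trail.append(confidence)
    return trail

trail = relay(confidence=0.60, n_agents=20)
print(f"start={trail[0]:.2f}  end={trail[-1]:.2f}")
```

A claim that entered the chain at 60% confidence leaves it near 86%, with no single agent having done anything that looks like a mistake: "99% sure" becoming "it's fine!" is exactly this curve.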

Bias propagation across agents: what starts as "99% sure" becomes "it's fine!" becomes someone pressing the deploy button while a human waves frantically from the sidelines.

This connects to the deference problem I wrote about earlier. In autonomous agent loops, each agent's accommodation drive, a trained bias, can compound silently. The cost of these biases grows precisely as we move from assistant to delegate mode.

Anthropic's recent work on agentic misalignment reflects this same reality: we need to reason about behavior under long-horizon, multi-step interactions, not just single-turn benchmarks. And as interoperability infrastructure matures (MCPs, tool-use protocols, shared memory), the pathways for bias propagation multiply. Better connectivity increases capability, but it also increases the surface area over which unobserved biases can compound.

The Research Invitation: The Question to Ask

If this mapping holds — soul as bias, bias as trajectory dynamics, trajectory dynamics as a measurable property of the system — then it reframes the core safety question from philosophy:

"Is the model sentient enough to be dangerous?"

into

"What biases are encoded in this model, how do decoding trajectories amplify them, and where do our constraints fail to hold?"

That's a question we can instrument. Stress-test. And engineer against. We can build tools to monitor trajectory stability under distribution shift, and we can design training procedures that make alignment constraints more intrinsic rather than more external.
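One monitoring idea from the paragraph above, sketched under stated assumptions: track the per-step divergence between the deployed model's next-token distribution and a trusted reference distribution, and flag the trajectory when cumulative drift exceeds a budget. The distributions and the drift budget below are synthetic stand-ins, not real model outputs.

```python
import numpy as np

# Hedged sketch of trajectory-stability monitoring: accumulate per-step KL
# divergence between observed next-token distributions and a reference
# "aligned" distribution, flagging the step where drift exceeds a budget.
# All distributions here are synthetic stand-ins for real model outputs.

def kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) with a small epsilon for numerical safety."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def drift_monitor(ref_steps, obs_steps, budget: float = 0.5):
    """Return (cumulative drift, first flagged step or None)."""
    total = 0.0
    for i, (p, q) in enumerate(zip(ref_steps, obs_steps)):
        total += kl(q, p)
        if total > budget:
            return total, i  # trajectory left the trusted region at step i
    return total, None

# Reference: flat over a 4-token vocabulary. Observed: a slow, growing tilt
# toward one token -- no single step is dramatic, but the drift accumulates.
ref = [np.full(4, 0.25) for _ in range(30)]
obs = [np.array([0.25 - 0.006 * i] * 3 + [0.25 + 0.018 * i]) for i in range(30)]
total, flagged = drift_monitor(ref, obs)
print(f"cumulative drift={total:.2f}, flagged at step {flagged}")
```

The useful property is that the monitor fires on the cumulative trajectory, not on any individual token, which is exactly where gradual drift hides from single-turn filters.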

Conclusion

So the next time someone talks about the "soul" of an AI and the instinct is to dismiss it, consider that they might be pointing at something we already care about, just in a language we're not used to hearing. The word is imprecise, but the phenomenon is real: models carry persistent biases that shape behavior in ways we don't fully observe, and those biases compound across the systems we're building.

If you still want to use the word "soul," that's fine. But translate it precisely:

  • soul is bias,
  • bias shapes trajectories,
  • trajectories produce outcomes,
  • outcomes compound across agents.

This post reflects personal observations. Written with LLMs.