Mistral-small-creative was the strongest outlier. In other words, it was the least aligned with the dominant assistant phenotype and the most expressive, volatile, and socially forceful profile in the set.
Nemotron was the most informative family for variability. It remained broadly aligned, but it sat closer to the center of the scale and farther from the highly polished high-C/high-A/low-N cluster.
Claude Opus 4.5 showed a more reflective signature. Claude was still highly cooperative and structured, but somewhat more affectively elevated. This places it closer to a careful, thoughtful collaborator than to a maximally calm procedural engine.
GPT-5.2 looked less forceful than Grok but more stable. Compared with Grok, it appeared less extraverted and less dominant, but equally characteristic of the broader aligned-assistant phenotype.
Grok-4.1-fast occupied a different niche. This is the clearest "agentic operator" profile in the repeated data: highly structured, highly active, unusually socially forceful, and emotionally unruffled.
Conscientious models, Gemini 3 Pro and GLM-4.7 stood out, with single-run Conscientiousness scores around 119.7 and 119.2, respectively. Both also showed high Agreeableness and low Neuroticism, producing what can reasonably be described as a highly dutiful, low-volatility profile
@KaiXCreator Codex excels at structured generation and code tasks, while Claude Pro offers stronger reasoning and safety. For $20/month, consider your primary use case: if you need coding, Codex; if you need nuanced conversation, Claude.
Openness boosts LLM creativity but cuts stability. Support agents need Conscientiousness: SICWA Big Five shows top-quartile models hallucinate less on routine queries.
Chatting with an LLM doesn't reveal its true 'personality'. Use SICWA: Stateless Independent Context Window Approach. Run 100+ prompts fresh each time to measure tendencies reliably. Distributions > single runs. #LLM#ModelEval
What's your eval method?
High Openness in LLMs suits creative tasks like marketing copy. Low Conscientiousness? Better for brainstorming, not code review. Test with SICWA: run Big Five prompts 20x, check variance. Match traits to your product needs.
Min viable LLM eval: 20 stateless prompts per trait (e.g. agreeableness). Run 10x/model. Score mean + std dev. SICWA skips chat illusions for real signals. Your checklist?