This is a real placeholder post.
Use this page for early thoughts about LLM and agent evaluation.
Possible shape:
- What I mean by evaluation
- Why agent behavior feels harder to judge than normal software behavior
- What makes an eval useful instead of performative
- Questions I still have
Notes to myself:
Replace this scaffold with real paragraphs when ready.