🚨 New Preprint: ΔAPT - Can we build an AI Therapist?
https://t.co/xeSEALATHj
LLMs are already powering AI psychotherapy tools (APTs), but are they clinically effective?
This interdisciplinary review (plus frameworks) maps architecture design choices to clinical outcomes.
🧵
100%. I'm a very experiential learner as well: concepts first, then demos, then code, with math a distant last. I wrote a Build your Own LLM course based on those preferences.
Ended up creating a physical analog perceptron and online demos to teach those foundational concepts.
https://t.co/7oY4fJsVhe
Can’t access the article because it’s paywalled. Some objections to the methodology based on the appendix:
1. This seems like a random list of 228 words. Were the selection criteria biased? Prove they're not. The “delve” thing always sounded mildly geographically biased since that word is popular in Nigerian English.
2. The research disqualified ~50% of potential papers (3 million papers out of 7 million). That leaves error margins many times the size of articles containing the 228 words.
3. Doesn’t word usage go through a normal cycle of growth and decline? What does this paper actually prove besides that? And the regional distribution thereof?
(Would love to actually read it if you choose to share an accessible copy)
@suchenzang “Least directed execution results” reads like people who don’t do anything. Sounds mean.
The real impact of LLM psychosis is mostly felt by people who were dealt a really bad starting hand in life while being manipulated by tech oligopolies to sustain DAU/revenue numbers.
@suchenzang So I'm a professional baker, and I have an important question related to MI: why does this bread have two domes? I get three or four, but two just looks weird.
Play around with the circuit yourself @ https://t.co/oStZdUr0Yq
Used in my "Build your own LLM" course to teach about perceptrons, neural nets, and backpropagation.
Heavily inspired by @welchlabs and @ProfTomYeh.
I've solidified a neural net perceptron into a physical circuit. 100 billion of these is ChatGPT.
f(x) = w*x+b = 1 * 1 + 1.5 = 2.5
output = weight * input + bias
You can change the input, weight and bias and see the output neuron update.
Learning ML can be fun!
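For anyone who'd rather poke at it in code than on the breadboard, here is a minimal Python sketch of the same single neuron. The function name `neuron` is my own label for illustration, not something from the course or the circuit.

```python
def neuron(x: float, w: float, b: float) -> float:
    """Single linear neuron: output = weight * input + bias."""
    return w * x + b


# Same values as the worked example: w = 1, x = 1, b = 1.5
print(neuron(x=1.0, w=1.0, b=1.5))  # 2.5
```

Change any of the three arguments and rerun to mimic turning the knobs on the circuit.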
Also added a ReLU max(0,x) function to more closely resemble GPT-2.
f(x) = w*x+b = -1.5 * 2 + 2 = -1
When the weighted sum goes negative, ReLU clamps the final output to 0.
So max(0, -1) -> the final output is 0.
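A rough Python sketch of the ReLU step, using the numbers from this example. The names `relu` and `neuron_relu` are illustrative assumptions, not taken from the circuit or from GPT-2's code.

```python
def relu(z: float) -> float:
    """ReLU: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def neuron_relu(x: float, w: float, b: float) -> float:
    """Linear neuron followed by ReLU: max(0, w*x + b)."""
    return relu(w * x + b)


# Worked example: w = -1.5, x = 2, b = 2 gives w*x + b = -1, which ReLU clamps.
print(-1.5 * 2 + 2)                          # -1.0
print(neuron_relu(x=2.0, w=-1.5, b=2.0))     # 0.0
```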
Most unsupervised "feature discovery" in LLMs uses sparse auto-encoders, which work, and which have been scaled to millions of features on frontier-scale models, but which bundle two distinct commitments – a reconstruction loss and a sparsity loss over a fixed-size dictionary – into a single training objective.
Those commitments make sense if your goal is reconstructive decomposition. They make less obvious sense if your aim is to find interpretable structure (directions? features?) in activation space, to retrieve representative examples, identify causal interventions, or measure how representations change across layers and inputs. It turns out a lot of that doesn't need the full SAE machinery.
Exemplar Partitioning (EP) uses leader-clustering (Hartigan, 1975!) to cover the activation manifold with observed exemplars at a calibrated resolution, resulting in a Voronoi partition of activation space that you can read like a feature dictionary.
EP makes one streaming pass over the data until saturation (when no new exemplars form), and uses no backward passes or gradient descent. The animation above shows the algorithm – each new activation either joins an existing cell (close enough to an exemplar) or seeds a new one. It's extraordinarily simple and cheap.
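To make the "join or seed" loop concrete, here is a toy sketch of leader clustering in Python with NumPy. It is not the EP implementation from the repo: Euclidean distance, joining the nearest exemplar, and the fixed `radius` threshold are all assumptions of this sketch, and `leader_cluster` is a hypothetical name.

```python
import numpy as np


def leader_cluster(activations: np.ndarray, radius: float):
    """One streaming pass over a non-empty (n, d) array of activations.

    Each row joins its nearest exemplar if it lies within `radius`,
    otherwise it becomes a new exemplar. No gradients, no backward pass.
    Distance metric and threshold rule are assumptions of this sketch.
    """
    exemplars = []    # observed activations kept as cell centres
    assignments = []  # exemplar index chosen for each row
    for a in activations:
        if exemplars:
            dists = np.linalg.norm(np.stack(exemplars) - a, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= radius:
                assignments.append(nearest)  # close enough: join the cell
                continue
        exemplars.append(a)                  # too far from everything: seed a cell
        assignments.append(len(exemplars) - 1)
    return np.stack(exemplars), np.array(assignments)


# Toy data: two tight groups, far apart.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1], [0.0, 0.2]])
exemplars, assign = leader_cluster(toy, radius=1.0)
print(len(exemplars), assign.tolist())  # 2 [0, 0, 1, 1, 0]
```

"Saturation" in this toy version would simply mean feeding more data and watching `len(exemplars)` stop growing.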
On AxBench latent concept detection at Gemma-2-2B-it L20, EP reaches 0.881 mean AUROC across 500 concepts. That's within 0.03 of SAE-A (AxBench's strongest dictionary-based baseline), and +0.126 over the canonical GemmaScope 16k SAE leaderboard entry – with about 1,000× less build compute.
And you can do a lot of interesting stuff with the resulting dictionary!
If you build it on a mix of harmful and benign prompts, one region absorbs most of the refusing prompts. Projecting held-out harmful prompts off that exemplar's direction collapses refusal from around 0.98 to around 0.02 – the same ballpark as dedicated refusal-direction work (Arditi et al., 2024).
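"Projecting off a direction" can be sketched in a few lines of NumPy: subtract each activation's component along a unit vector. This is the generic directional-ablation formula, h' = h - (h·d)d, and only an assumption about how the exemplar's direction is used here; `project_off` is a hypothetical helper, and the toy vectors are not real model activations.

```python
import numpy as np


def project_off(h: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along `direction` from each row of h (shape (n, d))."""
    d = direction / np.linalg.norm(direction)  # unit vector
    return h - np.outer(h @ d, d)              # h' = h - (h . d) d


# Toy example: ablate the x-axis direction from two 2-D "activations".
h = np.array([[3.0, 4.0], [1.0, 0.0]])
ablated = project_off(h, direction=np.array([2.0, 0.0]))
print(ablated.tolist())  # [[0.0, 4.0], [0.0, 0.0]]
```

After the projection, every row has zero dot product with the ablated direction, which is the property the intervention relies on.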
If you build the EP dictionary to saturation on a corpus (e.g. the Pile), distance-to-nearest-exemplar becomes a graded measure of distribution shift, for free. Random-token-sequence activations sit measurably further out than Pile activations, and Bulgarian Wikipedia (under-represented in the Pile but not really OOD) sits between the two.
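A minimal sketch of that shift score, assuming plain Euclidean distance (the actual metric in the paper may differ) and a hypothetical function name `shift_score`. The exemplars here are toy points, not a real EP dictionary.

```python
import numpy as np


def shift_score(activations: np.ndarray, exemplars: np.ndarray) -> np.ndarray:
    """Distance from each activation (n, d) to its nearest exemplar (k, d)."""
    diffs = activations[:, None, :] - exemplars[None, :, :]  # shape (n, k, d)
    return np.linalg.norm(diffs, axis=-1).min(axis=1)        # shape (n,)


exemplars = np.array([[0.0, 0.0], [5.0, 5.0]])
acts = np.array([[0.0, 0.1],   # sits right next to an exemplar
                 [3.0, 4.0]])  # sits between the two cells
scores = shift_score(acts, exemplars)
print(scores)  # larger value = further from anything seen at build time
```

Comparing the distribution of these scores across corpora is what gives the graded picture: in-distribution text should pile up near zero, stranger inputs further out.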
Because exemplars are real activations rather than learned decoder columns, you can match dictionaries across different models by their exemplars. If you match EP dictionaries from base vs instruction-tuned Gemma-2-2B, only a handful of regions survive as common, mostly general-purpose syntactic patterns. You can also see how the base model already represents "harmful" as a direction at earlier prompt positions, and instruction tuning pulls it forward to the final-token activation where the refusal decision is made.
The saturated size of a dictionary on a given input stream is itself a measurement of that stream's activation geometry at each layer. On the same model, the proportion of activation space dedicated to chat grows monotonically with depth, code is essentially flat across the network (and lives in a smaller area of activation space than chat does, at every layer), and math is non-monotonic, peaking in the middle.
EP and SAEs don't converge on the same features, aside from a shared core of about 20%. The two methods make different geometric commitments – SAEs to linear separability, EP to density.
The experiments I've done so far are small-scale and exploratory, and I have only tested on Gemma-2-2B. There's a huge amount of further work to be done (both in terms of improving the method and applying it to more tasks), some of which is discussed in the post and paper.
If you are an interpretability researcher interested in developing this method, please check out the GitHub repo and get stuck in!
Post: https://t.co/xdzevM6bfo
Paper: https://t.co/9oFYOYjznv
Code: https://t.co/p0z6r8LIJr
Light weekend reading focused on overviews of pragmatic mechanistic interpretability. As opposed to the inactionable and impractical kind, I guess?
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability @ https://t.co/s7JfsQ6OY1
Practical Review of Mechanistic Interpretability @ https://t.co/62DXF7wpjr
Voronoi partitions on activations reveal interpretable structure with orders of magnitude less compute than SAEs! Here is an introduction to a new interpretability method: https://t.co/lFZZJMmLi9
@tdietterich @arxiv My guess is that this policy will be applied selectively depending on institutional privilege and personal notoriety. It'll end up as a tool for silencing unconnected individuals rather than promoting better scientific discourse.
I aspire to be wrong.
Every meta-analysis and review on LLM writing detection says the technology doesn't work. How will you apply a biased technology in a fair and consistent way?
"human detection accuracy varied widely but generally clustered around chance performance" (Ramos, 2026) @ https://t.co/kF2qtc4JF5