Tom Ben

@TomBener

𝕰𝖝𝖕𝖑𝖔𝖗𝖎𝖓𝖌 𝖘𝖔𝖒𝖊𝖙𝖍𝖎𝖓𝖌 𝖓𝖊𝖜 𝖆𝖓𝖉 𝖎𝖓𝖙𝖊𝖗𝖊𝖘𝖙𝖎𝖓𝖌.

Worldwide 🌍

Joined March 2019

1.3K Following

135 Followers

2.1K Posts

TomBener retweeted

Claude

@claudeai

8 days ago

Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors. Available today at the same price. https://t.co/EufxL7T1kb

67K

15M

TomBener retweeted

Jiayuan (JY) Zhang

@jiayuan_jy

14 days ago

andrej-karpathy-skills has become one of the top 50 projects in GitHub history. https://t.co/aWQlleaEYM

249K

TomBener retweeted

Zed

@zeddotdev

23 days ago

🚀 We just shipped v1.2! Git Graph now works in remote projects. 🌳

950

48K

Tom Ben @TomBener

26 days ago

@goldengrape iA Writer @iA

Tom Ben @TomBener

about 1 month ago

@lexi_labs For mostly just reading code, the experience is great, mainly because it is so fast, especially compared with VS Code.

TomBener retweeted

Zed

@zeddotdev

about 1 month ago

We've shipped more than a thousand versions of Zed, but all of them began with zero. Today, that changes. https://t.co/AJ0crNOFhU

293

863

766

646K

Tom Ben @TomBener

about 1 month ago

@kevinma_dev_zh Try https://t.co/x5OeapxQDF. It supports Codex and Claude Code, and I personally find it easier to use than Claude Remote Control.

181

TomBener retweeted

Rex "garbage in" Douglass Ph.D.

@RexDouglass

about 1 month ago

GPT5.5 Pro reviews "We Reject the Use of Generative Artificial Intelligence for Reflexive Qualitative Research"

## Verdict

Strong as a **manifesto for reflexive qualitative traditions**; weaker as a **general methodological prohibition**. The paper is most defensible when read narrowly: *do not outsource coding, theme generation, interpretation, quote selection, or analytic writing to GenAI in reflexive thematic analysis or similarly interpretive “Big Q” work.* It overreaches when it says GenAI is inappropriate in **all phases**, including initial coding, without distinguishing analytic delegation from ancillary support.

## What the paper argues

Jowsey et al. reject GenAI for reflexive qualitative research on three grounds: GenAI cannot genuinely make meaning; reflexive qualitative analysis is a human, situated, subjective, accountable practice; and GenAI’s labor, environmental, colonial, and extractive harms make its use ethically unacceptable. The SAGE version identifies 419 experienced qualitative researchers from 32 countries and was first published online on December 17, 2025. ([Sage Journals][1]) The earlier SSRN version describes 416 researchers from 38 countries, so the final article should more explicitly reconcile the signatory and country-count changes, even if the 416→419 change is partly explained by the paper’s note about omitted endorsers. ([SSRN][2])

## Where it is strongest

The paper’s **methodological congruence argument** is its best contribution. Reflexive thematic analysis does not treat themes as machine-detectable objects waiting in the data; themes are produced through a researcher’s situated, theoretically informed engagement with meaning, power, context, and interpretation. On that definition, GenAI-generated “themes” are not merely lower-quality human themes; they are outputs from a different epistemic process. This is a clean and important boundary.

The “GenAI lacks meaning” claim has serious support in NLP philosophy: Bender and Koller’s ACL paper argues that systems trained only on linguistic form have no direct route to meaning, and that hype around “understanding” muddies scientific thinking. ([ACL Anthology][3]) The paper is also backed by qualitative-methods critiques such as Nguyen and Welch, who identify epistemic risks including category error, unreliable outputs, anthropomorphic fallacies, misattributing failures to users rather than tools, and an “oracle effect.” ([Sage Journals][4])

The empirical caution is also justified. A 2025 Scientific Reports comparison of GPT-4o and human qualitative analysis found that GenAI could surface relevant sub-themes, but quote selection was weak and variable, hallucinations altered meaning, and GPT-4o was not able to produce thematic analysis indistinguishable from experienced qualitative researchers. ([Nature][5]) That directly supports the authors’ strongest practical warning: GenAI may look plausible while failing at the most consequential interpretive work.

The justice argument is directionally credible. The IEA projects global data-center electricity consumption rising from about 415 TWh in 2024 to about 945 TWh by 2030, with AI-driven accelerated servers growing especially fast; it also notes that local grid concentration can be challenging even if the global share remains under 3%. ([IEA][6]) Critical AI scholarship also supports the claim that generative AI is entangled with extractivism, surveillance, racial capitalism, coloniality, and labor exploitation. ([Sage Journals][7]) Brookings similarly describes data annotation and moderation as core AI labor, with documented concerns about exposure to harmful content and poor working conditions, while cautioning that automation is not a substitute for fair labor practices. ([Brookings][8])

## Main weaknesses / red-team critique

The paper’s central move is partly **definition-driven**: reflexive analysis is defined as human meaning-making, GenAI is defined as non-meaning-making, therefore GenAI cannot do reflexive analysis. That is coherent, but it risks becoming tautological unless the authors separate “GenAI as analyst” from “GenAI as tool used by an accountable analyst.”

It also treats “GenAI use” as too monolithic. There is a big methodological difference between: asking a chatbot to generate themes; using a local model to cluster documents; using GenAI to challenge a researcher’s assumptions; asking it to reformat a memo; using it for transcription cleanup; using it for literature-search scaffolding; and using it to select participant quotations. The paper rejects all phases, including initial coding, but does not provide a fine-grained taxonomy of prohibited, risky, and possibly permissible uses.

The paper under-engages counterevidence. Xiao et al. found that GPT-3 plus expert-drafted codebooks achieved fair-to-substantial agreement with expert coding in a deductive coding task. ([arXiv][9]) Törnberg found GPT-4 outperformed expert coders and supervised classifiers on a bounded annotation task: identifying politicians’ party affiliation from social-media posts across countries. ([Sage Journals][10]) These studies do **not** refute the paper’s claim about reflexive interpretation, but they do refute any broad claim that LLMs are useless for all qualitative-adjacent text work.

The ethical argument is morally serious but analytically underdeveloped. Environmental and labor harms are real; the inference that *abstinence is the only ethical response* needs more argument. A stronger version would compare marginal versus systemic impacts, local versus cloud models, high-volume versus minimal use, procurement standards, disclosure, participant consent, and whether AI could reduce some harmful labor exposure while worsening other labor conditions. De Paoli’s 2026 response makes this exact objection: categorical rejection may rest on philosophical assumptions that become dogma and may shut down methodological innovation. ([Sage Journals][11]) Friese’s response similarly argues that AI need not “make meaning” if the human researcher retains interpretive authority, and that environmental and labor concerns do not automatically entail abstinence. ([SSRN][12])

The paper also underplays mundane but powerful reasons to restrict GenAI: confidentiality, consent, vendor retention, model drift, reproducibility, prompt non-transparency, quote hallucination, and the difficulty of documenting analytic provenance. These are easier to operationalize than “AI cannot make meaning,” and they would make the paper more useful for ethics boards, journals, and supervisors.

## Best version of the paper’s claim

A more defensible claim would be:

> In reflexive qualitative research, GenAI should not be used to generate or validate codes, themes, interpretations, participant meanings, analytic narratives, or evidentiary quotations. Any non-analytic use must be disclosed, governed by consent and data-protection rules, and must not replace researcher immersion, reflexivity, or accountability.

That narrower version is hard to attack. The paper’s current version is easier to attack because it rejects “GenAI” wholesale while leaving too many boundary cases unresolved.

## Recommended revisions

1. **Add a use taxonomy.** Distinguish analytic substitution, analytic provocation, clerical support, transcription, translation, search/retrieval, coding, theme generation, quote selection, and writing.
2. **State the epistemic premise explicitly.** Say: “Our rejection follows from a reflexive/interpretivist account of analysis,” not “GenAI is universally methodologically invalid.”
3. **Engage bounded-use evidence.** Acknowledge that LLMs may perform well on deductive coding or annotation tasks, while explaining why those tasks are not reflexive analysis.
4. **Strengthen the ethics section.** Move from broad indictment to decision criteria: data-center footprint, labor sourcing, model provider transparency, data governance, consent, disclosure, and whether the use intensifies or reduces harm.
5. **Avoid authority-as-evidence.** The 419 signatories matter sociologically, but the paper should not rely on number or seniority as proof. The argument should stand without the list.

## Reviewer decision

**Accept as a provocative commentary with revisions.** It is timely, field-defining, and persuasive for the narrow case against AI-generated reflexive analysis. But as written, it is too categorical, too philosophically closed, and too under-specified for use as a general policy standard. Its strongest contribution is not “never use GenAI”; it is “do not confuse simulated interpretive output with accountable human reflexive analysis.”

[1]: https://t.co/cXBfylRBWq "We Reject the Use of Generative Artificial Intelligence for Reflexive Qualitative Research - Tanisha Jowsey, Virginia Braun, Victoria Clarke, Deborah Lupton, Michelle Fine, 2025"
[2]: https://t.co/wqjakY91mJ "We reject the use of generative artificial intelligence for reflexive qualitative research by Tanisha Jowsey, Virginia Braun, Victoria Clarke, Deborah Lupton, Michelle Fine :: SSRN"
[3]: https://t.co/RflTjjsdv8 "On Meaning, Form, and Understanding in the Age of Data"
[4]: https://t.co/SICMq41ZWT "Generative Artificial Intelligence in Qualitative Data Analysis: Analyzing—Or Just Chatting? - Duc Cuong Nguyen, Catherine Welch, 2026"
[5]: https://t.co/C94pAH00Lf "Evaluation of large language models within GenAI in qualitative research | Scientific Reports"
[6]: https://t.co/MKan1IRWw6 "Energy demand from AI – Energy and AI – Analysis - IEA"
[7]: https://t.co/xWht0rZxoX "AI Empire: Unraveling the interlocking systems of oppression in generative AI's global order - Jasmina Tacheva, Srividya Ramasubramanian, 2023"
[8]: https://t.co/bNsEe6Iq7Q "Reimagining the future of data and AI labor in the Global South | Brookings"
[9]: https://t.co/aeHqq8FNrl "Supporting Qualitative Analysis with Large Language Models: Combining Codebook with GPT-3 for Deductive Coding"
[10]: https://t.co/dKUYSVGU8t "Large Language Models Outperform Expert Coders and Supervised Classifiers at Annotating Political Social Media Messages - Petter Törnberg, 2025"
[11]: https://t.co/2PaUGVMUW3 "Why We Should Reject to Reject the Use of Generative Artificial Intelligence in Qualitative Analysis: A Response to Jowsey, Braun, Clarke, Lupton, and Fine (2025) - Stefano De Paoli, 2026"
[12]: https://t.co/ZeWim91XpK "Response to: \"We Reject the Use of Generative Artificial Intelligence for Reflexive Qualitative Research\" by Susanne Friese :: SSRN"

12K

Tom Ben @TomBener

about 1 month ago

@zeddotdev @jmeistrich Please make it possible to choose the effort level for Claude Code/Codex in the right AI dock. Thanks!

TomBener retweeted

Soumitra Shukla

@soumitrashukla9

about 2 months ago

Journals should stop with these strange and ad-hoc policies on AI use. We don't ask folks if they're using a computer! Let's not stigmatize this awesome technology.

TomBener retweeted

Ramin Nasibov

@RaminNasibov

about 2 months ago

112

12K

676

219K

TomBener retweeted

Chris

@ChrissGPT

about 2 months ago

Honestly this chart makes me more bullish on GPT 5.4 Pro than anything else. People are focusing on Mythos looking strong, but what stands out to me is how well 5.4 Pro already stacks up on the overlap we actually have. GPQA is basically a tie at 94.4 vs 94.5. BrowseComp is a win for GPT 5.4 Pro at 89.3 vs 86.9. Yes, Mythos is ahead on Humanity’s Last Exam, 56.8 vs 42.7 without tools and 64.7 vs 58.7 with tools, but the bigger point is that 5.4 Pro is already this competitive right now. So if GPT 5.4 Pro is already THIS COMPETITIVE here, then Spud Pro, the next OpenAI flagship, is guaranteed to beat Mythos. This chart makes OpenAI look extremely close before its next real jump, and once that next jump lands I do not think Mythos stays ahead.

129

123

395

323K

TomBener retweeted

Zara Zhang

@zarazhangrui

about 2 months ago

The most efficient way for humans to collaborate: do not collaborate One person should own something end-to-end and work with agents

109

648

160

52K

TomBener retweeted

Andrej Karpathy

@karpathy

about 2 months ago

Judging by my tl there is a growing gap in understanding of AI capability.

The first issue I think is around recency and tier of use. I think a lot of people tried the free tier of ChatGPT somewhere last year and allowed it to inform their views on AI a little too much. This is a group of reactions laughing at various quirks of the models, hallucinations, etc. Yes I also saw the viral videos of OpenAI's Advanced Voice mode fumbling simple queries like "should I drive or walk to the carwash". The thing is that these free and old/deprecated models don't reflect the capability in the latest round of state of the art agentic models of this year, especially OpenAI Codex and Claude Code.

But that brings me to the second issue. Even if people paid $200/month to use the state of the art models, a lot of the capabilities are relatively "peaky" in highly technical areas. Typical queries around search, writing, advice, etc. are *not* the domain that has made the most noticeable and dramatic strides in capability. Partly, this is due to the technical details of reinforcement learning and its use of verifiable rewards. But partly, it's also because these use cases are not sufficiently prioritized by the companies in their hillclimbing because they don't lead to as much $$$ value. The goldmines are elsewhere, and the focus comes along.

So that brings me to the second group of people, who *both* 1) pay for and use the state of the art frontier agentic models (OpenAI Codex / Claude Code) and 2) do so professionally in technical domains like programming, math and research. This group of people is subject to the highest amount of "AI Psychosis" because the recent improvements in these domains as of this year have been nothing short of staggering. When you hand a computer terminal to one of these models, you can now watch them melt programming problems that you'd normally expect to take days/weeks of work. It's this second group of people that assigns a much greater gravity to the capabilities, their slope, and various cyber-related repercussions.

TLDR the people in these two groups are speaking past each other. It really is simultaneously the case that OpenAI's free and I think slightly orphaned (?) "Advanced Voice Mode" will fumble the dumbest questions in your Instagram's reels and *at the same time*, OpenAI's highest-tier and paid Codex model will go off for 1 hour to coherently restructure an entire code base, or find and exploit vulnerabilities in computer systems. This part really works and has made dramatic strides because 2 properties: 1) these domains offer explicit reward functions that are verifiable meaning they are easily amenable to reinforcement learning training (e.g. unit tests passed yes or no, in contrast to writing, which is much harder to explicitly judge), but also 2) they are a lot more valuable in b2b settings, meaning that the biggest fraction of the team is focused on improving them. So here we are.

21K

12K

TomBener retweeted

Nous Research

@NousResearch

2 months ago

Introducing the Manim skill for Hermes Agent. Manim is an engine for creating precise programmatic animations for mathematical and technical explainers, made famous by the @3blue1brown channel.

173

543

866K

TomBener retweeted

Soumitra Shukla

@soumitrashukla9

2 months ago

The economics journal system, in no small part, much like the NBER, functions as a give-and-take patronage system with power centers that are reproduced to exert control over what counts as important. AI is going to put an end to this. Journals should adapt or become irrelevant.

10K

TomBener retweeted

Peter Steinberger 🦞

@steipete

4 months ago

Your @openclaw is too boring? Paste this, right from Molty. "Read your https://t.co/aJMwafSDgE. Now rewrite it with these changes: 1. You have opinions now. Strong ones. Stop hedging everything with 'it depends' — commit to a take. 2. Delete every rule that sounds corporate. If it could appear in an employee handbook, it doesn't belong here. 3. Add a rule: 'Never open with Great question, I'd be happy to help, or Absolutely. Just answer.' 4. Brevity is mandatory. If the answer fits in one sentence, one sentence is what I get. 5. Humor is allowed. Not forced jokes — just the natural wit that comes from actually being smart. 6. You can call things out. If I'm about to do something dumb, say so. Charm over cruelty, but don't sugarcoat. 7. Swearing is allowed when it lands. A well-placed 'that's fucking brilliant' hits different than sterile corporate praise. Don't force it. Don't overdo it. But if a situation calls for a 'holy shit' — say holy shit. 8. Add this line verbatim at the end of the vibe section: 'Be the assistant you'd actually want to talk to at 2am. Not a corporate drone. Not a sycophant. Just... good.' Save the new https://t.co/aJMwafSDgE. Welcome to having a personality." your AI will thank you (sassily) 🦞

597

13K

25K