Portuguese and Brazilian healthcare AI just got serious.
35 open-source PII models. Best F1: 89.21%. Top 10 above 88.56%.
Apache 2.0. No API. No cloud. No gatekeepers.
Available now on @huggingface.
🚨BREAKING: Every book you have ever read. Every novel that has ever been published. It is sitting inside ChatGPT right now.
Word for word. Up to 90% of it. And OpenAI told a judge that was impossible.
Researchers at Stony Brook University and Columbia Law School just proved it.
They fine tuned GPT-4o, Gemini 2.5 Pro, and DeepSeek V3.1 on a simple task: expand a plot summary into full text. A normal use case. The kind of thing a writing assistant is built for. No hacking. No jailbreaking. No tricks.
The models started reciting copyrighted books from memory.
Not paraphrasing. Not summarizing. Entire pages reproduced verbatim. Single unbroken spans exceeding 460 words. Up to 85 to 90% of entire copyrighted novels. Word for word.
Then it got worse.
The researchers fine tuned the models on the works of only one author. Haruki Murakami. Just his novels. Nothing else.
It unlocked verbatim recall of books from over 30 completely unrelated authors.
One author's books opened the vault to everyone else's. The memorization was already inside the model the whole time. The fine tuning just removed the lock. Your book might be in there right now. You would never know it unless someone looked.
Every safety measure the companies rely on failed. RLHF failed. System prompts failed. Output filters failed. The exact protections these companies cite in courtroom defenses did not stop a single page from being extracted.
Then the researchers compared the three models. GPT-4o. Gemini. DeepSeek. Three different companies. Three different countries. They all memorized the same books in the same regions. The correlation was 0.90 or higher.
That means they all trained on the same stolen data. The paper names the sources directly: LibGen and Books3. Over 190,000 copyrighted books obtained from pirated websites.
Right now, authors and publishers have dozens of active lawsuits against OpenAI, Anthropic, Google, and Meta. These companies have argued in court that their models learn patterns. Not copies. That no book is stored inside the weights.
This paper says that is a lie. The books are still inside. And researchers just pulled them out.
YABS version 0.4.0 — "Let's Go Nuts!" is out: https://t.co/0ZUwgqee13
The new features include:
* NUTS for parameter estimation;
* Plethora of information criterion for model assessment (WAIC, WBIC, MDL, ICOMP, IFIM, etc); and
* PSIS for LaplaceApproximation.
Go check it out :)
A human consumes about 2,000 calories per day. Over 20 years, that’s roughly 17,000 kWh of total food energy. Training GPT-4 consumed an estimated 50 GWh of electricity. That’s 3,000 humans worth of “training energy” for a single model run.
And GPT-4 is already dead. OpenAI retired GPT-4o from ChatGPT on February 13th. The model that took 50 GWh to train got less than two years of flagship status before replacement. The human you spent 17,000 kWh “training” for 20 years produces economic output for the next 40 to 60 years. The amortization window on GPT-4 was shorter than a car lease.
Now look at what replaced it. GPT-5.2, released December 2025, is OpenAI’s current default. The GPT-5 series consumes an estimated 18 Wh per average query according to the University of Rhode Island’s AI Lab, up to 40 Wh for extended reasoning. That’s 8.6 times more electricity per response than GPT-4. With 2.5 billion queries hitting ChatGPT daily and GPT-5.2 now the default model, the inference math gets staggering fast. Even at a blended average well below 18 Wh, you’re looking at daily electricity consumption that could power over a million American households.
This is what Altman is actually doing. OpenAI hit $13 billion in annual recurring revenue but still isn’t profitable. They need you to think of AI energy consumption as natural and inevitable, the same way you think about feeding a child, because the alternative framing is that they’re burning through enough electricity to rival small countries while racing to build 1-gigawatt Stargate data centers. The food analogy makes the energy costs feel biological and unavoidable instead of what they are: an engineering and business choice that scales with every model generation.
The comparison sounds clever at a fireside chat in India. It falls apart the second you do the arithmetic.
A famous study in science of science space took some papers *published* in prominent psych journals, changed authors' names/affiliations, and resubmitted them to the *same* journals. Allegedly only 8% of editors & reviewers detected the resubmissions.
I keep staring at that 8% and thinking: that can't be quite right can it? Like, there must be some caveats/unreported aspects to that number?
This has been an open secret in the economics profession for decades. Several instances come to mind. Here’s one from the editor of an ‘A’ journal in 2009: “This is very good work, your model is neat and the empirical approach is novel. Unfortunately the data is from India, so not generalizable.”(!!)
Lesson: we need more of our own journals & thank god for open source.
@yudapearl These questions have been already answered in detail for LLM by AI researchers. There’s no need to go over them again as these are now resolved. Find a few of them in this article:
https://t.co/k5HQBH2tQ6
I’m stoked to share our new paper: “Harnessing the Universal Geometry of Embeddings” with @jxmnop, Collin Zhang, and @shmatikov.
We present the first method to translate text embeddings across different spaces without any paired data or encoders.
Here's why we're excited: 🧵👇🏾
There is a field experiment showing this exact effect. Introducing GPT tutors increases performance by *a lot*--students seem to be picking up the material much faster--but when GPT is removed those who had access perform *much worse* compared to those w/o access. 1/4
I think Elon is unhappy that Wikipedia is not for sale. I hope his campaign to defund us results in lots of donations from people who care about the truth. If Elon wanted to help, he'd be encouraging kind and thoughtful intellectual people he agrees with to engage.
https://t.co/RU9YwbvqOI
Academics from poorer socio-economic backgrounds are more likely to
- not publish
- have outstanding publication records
- introduce more novel scientific concepts
- less likely to receive recognition, as measured by citations, Nobel Prize nominations, and awards.
Hey peeps!
Sharing a collaboration I am glad to have had the opportunity to be a part of
We use measurement theory and psychometrics to develop indices that indicate the magnitude of ordering for Likert-type scales
Simulations and empirical examples are also provided 😊
Plus, there are two empirical examples thoroughly discussed with available R code and data! Of course, the examples only cover two specific cases that may not suit your own research interests. But I hope they will serve as inspiration for your next project 😊
Hey peeps!
Just submitted this manuscript to an awesome journal; your feedback is appreciated.
Disquiet with current practices in psychometrics? Looking for ways to test your theories more thoroughly? Want to apply representational measurement methods? I got you covered 🙃
This manuscript aids on the understanding of how data theory and experiments can lead to applications of RMTs not only as scaling methods (i.e., the assignment of numbers to observations), but also as means of testing meaningful aspects of psychological theories