Phantomus @phantomus_host - Twitter Profile

about 7 hours ago

@kholden @chooserich At some point there will be an app to highlight the idiotic comments so people in real life stay away for sometime like you. for fuck sakes man, get a life.

0

17

Phantomus

@phantomus_host

2 days ago

Its interesting to observe comments about the launch of another AI product, many of which with a level of hate, from “I can do it myself” to “How disconnected you have to be to need an AI family assistant” or that its only for a small subset of users. All may be true, but does it really matter? At this point, teams can build and ship product in fraction of time, some may find their fit other won’t, the teams will move on. We are so vocal about every little thing now, maybe spend more time building something of your own, so that you can get a few hateful comments.

Bill Lennon

@blennon_

3 days ago

AI can now make you a great parent. Introducing Ollie: the world’s first AI family assistant that manages your family life better than any human. Here’s how it works:

580

2K

4K

2K

2M

1

0

480

Phantomus

@phantomus_host

6 days ago

One agent if its an active coding task, max 2-3 if the output of the tasks delegated has nothing to do with building the products. Frankly, after so much time, I think I’ll be happy with one. The scaling of productivity, or an an illusion of it running concurrent AI tasks, hits life outside of business related tasks. The mind is rampant, sleep is lost, thinking of what to be orchestrated next - health toll at the end. We might need to start limiting the time spend on work related activity if we are so damn productive, but we all move the opposite direction, taking on more things and burning out.

0

146

Phantomus

@phantomus_host

17 days ago

@danshipper People should do everything better, Dan. It does come down to reward in many cases though, if someone would write a good book but has to feed their family only for that book not to hit the up trend, is a risky one, not an easy space to earn a living I would assume.

0

16

Who to follow

KIN Kong

@KinKong333

@okaybears ride or die

Hamzaa♏🇵🇸

@HamzaaBMC

nothing just chilling

Alex🔅F🟠r

@FormanAlexis1

https://t.co/wg0BwRf0Qm

Phantomus

@phantomus_host

18 days ago

There is a bunch of slop in AI code, which especially becomes visible when you bring AI coding into existing product that was built without it, stop looking at the output, and then accidently peak in 😂 However, too much emphasis on the slop in communities of experienced engineers, because: - most likely newer versions will clean this and more in near future. Yes, it can be ugly, some say “how to maintain this”, etc. but by the time this argument will gain more traction, the new models will remove with one pass through. - majority of products, and I mean vast majority, never will grow neither in the codebase, functionality, nor customer impact from usage as people think, so might as well embrace the slop, as long as its secure, and if that product lucky (or unlucky, depending on how to look at it), it will grow to millions $ in revenue and dozen or more engineers needed to maintain it to worry about that slop later down the line. last time I heard someone be concerned in person about scaling, security, and AI slop managment, was already unemployed, from C level tech position, probably in favour of some slop.

Mo

@atmoio

18 days ago

Andrej Karpathy admits he’s struggling with AI

113

3K

188

2K

330K

0

1

0

41

Phantomus

@phantomus_host

22 days ago

This is correct, but publically mentioning this to a person you know even slightly, doesn’t say much of your engagement approach either. Hiding their name behind the cut off doesn’t help in exposing the individual if someone will go to your linkedin, hurting their reputation further as part of this rant.

Gergely Orosz

@GergelyOrosz

25 days ago

A person I know (and who is? was? a good professional al) left an AI-generated comment under my LinkedIn post. Full-on AI slop. I asked: why do this? He replied. It’s because of “engagement.” People are burning their profession al reputation, paying for AI tools, for nothing

GergelyOrosz's tweet photo. A person I know (and who is? was? a good professional al) left an AI-generated comment under my LinkedIn post. Full-on AI slop.

I asked: why do this?

He replied. It’s because of “engagement.”

People are burning their profession al reputation, paying for AI tools, for nothing https://t.co/BxEgpt8Ppm

79

829

34

64

123K

0

76

Phantomus

@phantomus_host

28 days ago

@Choom2049 @ClaudeDevs its vibe coding music

0

63

Phantomus

@phantomus_host

about 1 month ago

The other thing I would be nervous about in general is a backup of 3 months old; I am terrified of any backup that is older than 1 hour. Let’s be honest, majority of our projects are not petabytes of data, and hourly backup doesn’t cost nor interrupts business in majority of cases.

0

354

Phantomus

@phantomus_host

about 1 month ago

OpenAI's new frontier model, GPT-5.5, dropped sixteen hours ago. Already running tests. As I said in the first test run, the benchmarks that ship with model releases are general-purpose capability benchmarks (SWE-bench, HLE, etc.) and often won't reflect the tasks you actually run them against. These tests show exactly that - switching models based purely on benchmarks would give completely unacceptable output. Added a new "creative" mode to the page builder for this run, which tells agents to put more of a personal touch on deliverables instead of sticking strictly to the specs and examples in existing templates. Three models tested on page building, following a built-in page builder's guidelines. Working on neutral text examples for the next runs so I can share full page outputs if anyone's interested. Clear winner is still Claude - consistency, output quality, ability to follow rules, creativity without going off the rails, and research that puts it ahead on content generation. This run, Kimi K2.6 just gave up - the rabbit went off course, found Japanese text in the context, and refused to jump any further. Most surprising though is the new GPT-5.5 - creative mode shows up as huge letters, repetitive design across different contexts, and pretty distasteful color palettes. Overall: clear guidelines with a bit of creativity is the best strategy for page generation. Claude Opus 4.7 still wins for this use case. Open source performed poorly here with a small context window. This is one specific use case - it doesn't reflect what any of these models can do outside of it. But one thing's for sure: if you're running agentic tasks, this kind of research matters. With how fast AI is moving, spending a percentage of your time on R&D is a must. P.S. Guidelines also restrict models from leaning on the design principles, frameworks, and examples they were trained on. Building without constraints usually falls back to the training set - you'll see similar color themes and elements across models. If you've played with this, you probably already recognize which elements came out of which model. P.P.S. We can all agree though — OpenAI's Image-2 is a visual monster :)

phantomus_host's tweet photo. OpenAI's new frontier model, GPT-5.5, dropped sixteen hours ago. Already running tests.

As I said in the first test run, the benchmarks that ship with model releases are general-purpose capability benchmarks (SWE-bench, HLE, etc.) and often won't reflect the tasks you actually run them against. These tests show exactly that - switching models based purely on benchmarks would give completely unacceptable output.

Added a new "creative" mode to the page builder for this run, which tells agents to put more of a personal touch on deliverables instead of sticking strictly to the specs and examples in existing templates.

Three models tested on page building, following a built-in page builder's guidelines. Working on neutral text examples for the next runs so I can share full page outputs if anyone's interested.

Clear winner is still Claude - consistency, output quality, ability to follow rules, creativity without going off the rails, and research that puts it ahead on content generation.

This run, Kimi K2.6 just gave up - the rabbit went off course, found Japanese text in the context, and refused to jump any further.

Most surprising though is the new GPT-5.5 - creative mode shows up as huge letters, repetitive design across different contexts, and pretty distasteful color palettes.

Overall: clear guidelines with a bit of creativity is the best strategy for page generation. Claude Opus 4.7 still wins for this use case. Open source performed poorly here with a small context window.

This is one specific use case - it doesn't reflect what any of these models can do outside of it. But one thing's for sure: if you're running agentic tasks, this kind of research matters. With how fast AI is moving, spending a percentage of your time on R&D is a must.

P.S. Guidelines also restrict models from leaning on the design principles, frameworks, and examples they were trained on. Building without constraints usually falls back to the training set - you'll see similar color themes and elements across models. If you've played with this, you probably already recognize which elements came out of which model.

P.P.S. We can all agree though — OpenAI's Image-2 is a visual monster :)

0

64

Phantomus

@phantomus_host

about 1 month ago

@simonw SVG it

0

160

Phantomus

@phantomus_host

about 1 month ago

@scottjla @simonw 🤣

0

1

0

1K

Phantomus

@phantomus_host

about 1 month ago

@GavinSherry This is amazing room Gavin, where is that?

0

18

Phantomus

@phantomus_host

about 1 month ago

@horwitzben Nice one, Larry

0

14

0

1K

Phantomus

@phantomus_host

about 1 month ago

Kimi did try the research and sources were read, however the brief / summary was not produced. Quite possible with the proper summary, the output would have been different, but this was a raw test against existing set of rules, without priority or help to any of the models.

0

16

Phantomus

@phantomus_host

about 1 month ago

LLM stats are shiny objects — we're drawn to them and rush to try new models the moment they drop. But standard benchmarks putting a new model above existing ones don't mean it'll perform the same on your specific use case. If your setup works well now, it shouldn't be upgraded without being thoughtfully tested. Just like @simonw's pelican test, you need your own evals against your actual tasks. Coding against the Linux kernel (30+ years of code) is a different problem than running a process based on your input parameters. One business I operate has a page builder that's gotten quite good and consistent - working with both Claude and Codex against a defined set of rules, supported components, layouts, and rendering options. Last night I ran the same prompt against Opus, Codex, and the latest open-source release, Kimi K2.6. The subagent builds a page from the input prompt combined with company research kicked off during the build. Here are the stats that matter for this process. On the UI side, the Awards section delivers pretty much the same verdict. Gives me an idea: extend the test by running against the full set of guidelines and supported components to compare UI output too.

phantomus_host's tweet photo. LLM stats are shiny objects — we're drawn to them and rush to try new models the moment they drop.

But standard benchmarks putting a new model above existing ones don't mean it'll perform the same on your specific use case. If your setup works well now, it shouldn't be upgraded without being thoughtfully tested.

Just like @simonw's pelican test, you need your own evals against your actual tasks. Coding against the Linux kernel (30+ years of code) is a different problem than running a process based on your input parameters.

One business I operate has a page builder that's gotten quite good and consistent - working with both Claude and Codex against a defined set of rules, supported components, layouts, and rendering options.

Last night I ran the same prompt against Opus, Codex, and the latest open-source release, Kimi K2.6. The subagent builds a page from the input prompt combined with company research kicked off during the build.

Here are the stats that matter for this process. On the UI side, the Awards section delivers pretty much the same verdict.

Gives me an idea: extend the test by running against the full set of guidelines and supported components to compare UI output too.

1

0

36

Phantomus

@phantomus_host

about 1 month ago

@Mappletons @joshuaday @brian_lovin You are an exceptional writer, not just by that piece, but so many others that I’ll need to catch up on your blog. the rest like the buzzwords comment are opinions of others, sometimes people have hard time containing them.

0

1

0

183

Phantomus

@phantomus_host

about 1 month ago

This ☝️ If you gain such a boost in performance, there is no reason any developer would not pay higher amount, I don’t even understand what this entire conversation is about outside of engagement farming. If its not testing, its the right move to cut low tier subs for high tier models.

0

1

0

65

Phantomus

@phantomus_host

about 2 months ago

@dejavucoder great movie

0

1

0

125

Phantomus

@phantomus_host

about 2 months ago

soon enough this won’t matter either. the quality will ultimately drive the roadmap, it already does, not only for products, but content around it, presentation, onboarding…everything. the quality is the king now.