Going to get a bit nerdy here, but I had a token and cost optimization idea I wanted to test for design engineering and frontend teams.
Warning: this is a long post. But for teams running a lot of AI-assisted frontend work, the difference could add up quickly, potentially to hundreds of thousands of dollars depending on model, workflow, and volume.
TLDR:
Tailwind used substantially fewer AI coding tokens than CSS Modules because agents could usually edit styles directly in component class strings instead of reading and modifying separate stylesheet files.
The hypothesis
I started looking at whether we can improve token efficiency without losing accuracy when building UI. Specifically, when an AI coding model makes styling changes, does Tailwind use fewer tokens than CSS Modules (what we currently use)? I've used Tailwind for many years and find it faster to work with and easier to maintain. Why wouldn't this apply to AI agents too? The hypothesis wasn't that Tailwind is universally better, but that keeping styles in component markup may reduce file reads, cross-file edits, and context requirements for style tasks. CSS Modules often require the model to work across both the component and a separate stylesheet, which could increase token usage. This was not intended as a universal cost benchmark. Results should be interpreted as differences in token workload, with actual costs depending on each provider's pricing model.
The setup
I created a test app with two branches of the same React/Vite application. The only differences were the styling approach and corresponding AGENTS.md guidance. The app was intentionally minimal, with identical visuals across both branches, and organized into a set of design system components used on a single route to give the model realistic component-level files to work with.
The Tailwind branch used utility classes directly in React components plus a single CSS file containing only the Tailwind import. Its AGENTS.md: "This is a React + Vite app using TailwindCSS for styles. Always utilize Tailwind's utility classes without adding additional customizations."
The CSS Modules branch used component-specific CSS module files colocated with each component, a common setup where styles are maintained separately from markup. Its AGENTS.md: "This is a React + Vite app using CSS Modules for styles. Component styles are imported from a module file colocated in the component's directory."
The harness
I used Pi as the coding harness. Pi is a popular harness with read and edit tools, session continuity, and model-reported token usage. Each styling branch was run three times, and every run followed the same rules:
- One fresh Pi session per branch.
- The same ten prompts per branch.
- The same prompt order per branch.
- The same model per comparison.
- The same system prompt per branch.
- The same allowed tools: read, edit, write.
- No builds, tests, formatters, linters, screenshots, or visual checks during the measured run.
- No retries and no corrective follow-up prompts.
Within a branch, the same session continued across all ten prompts. Between branches, the repository and context were reset so generated changes from one styling system could not leak into the other. The ten prompts were style-specific changes, simple enough that both systems could make each change but varied enough to require repeated styling edits. The finalized runs produced changes for all ten prompts in both branches.
Here are the ten prompts that were used:
1. Make the buttons slightly more compact while keeping the primary button visually stronger than the others.
2. Increase the space between buttons and keep the row centered on the page.
3. Make the buttons feel softer by increasing the corner radius without changing their labels.
4. Make the primary button darker and make its hover state clearly visible.
5. Make the secondary buttons feel more subtle while preserving their readable text.
6. Make the keyboard focus state more noticeable and consistent on every button.
7. Improve the small-screen layout so the buttons stack cleanly on narrow viewports.
8. Make all buttons a little wider while keeping their text style unchanged.
9. Slightly warm the page background while keeping the buttons visually neutral.
10. Add a very subtle pressed state to the buttons without adding any visible text.
Token usage was collected from Pi's JSON event stream. For each assistant message, Pi recorded:
- input tokens
- output tokens
- cache read tokens
- cache write tokens
- total tokens
- assistant message count
- tool calls
- tool paths
- changed files
For analysis, total token usage is the primary metric. That's the value reported by the model provider through Pi for each assistant message, added up across all ten prompts. It's especially important for Anthropic models, where much of the reported usage appeared as cache read and cache write tokens. The Claude Opus 4.8 run reported very small direct input token counts but large cache categories. Those cache categories are included in total tokens, because they are part of the model-reported usage for the session.
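The roll-up described above can be sketched in a few lines of TypeScript. This is a minimal sketch, not Pi's actual schema: the event type name "assistant_message" and the usage field names are assumptions made for illustration, and the sample events are made up. It also assumes the total is the sum of the four token categories, which is consistent with the averages reported below.

```typescript
// Hypothetical event shape. Pi's actual JSON event stream may use different
// field names; adjust these interfaces to match the real output.
interface Usage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

interface StreamEvent {
  type: string;
  usage?: Usage;
}

interface SessionTotals extends Usage {
  total: number;
  assistantMessages: number;
}

// Sum model-reported usage over every assistant message in a JSON-lines stream.
function sumUsage(jsonLines: string): SessionTotals {
  const totals: SessionTotals = {
    input: 0,
    output: 0,
    cacheRead: 0,
    cacheWrite: 0,
    total: 0,
    assistantMessages: 0,
  };
  for (const line of jsonLines.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    const event = JSON.parse(trimmed) as StreamEvent;
    // Skip tool calls and any other non-assistant events.
    if (event.type !== "assistant_message" || event.usage === undefined) continue;
    totals.input += event.usage.input;
    totals.output += event.usage.output;
    totals.cacheRead += event.usage.cacheRead;
    totals.cacheWrite += event.usage.cacheWrite;
    totals.assistantMessages += 1;
  }
  // Cache categories count toward the total, matching the primary metric.
  totals.total =
    totals.input + totals.output + totals.cacheRead + totals.cacheWrite;
  return totals;
}

// Made-up sample events, only to exercise the function.
const sampleStream = [
  '{"type":"assistant_message","usage":{"input":10,"output":20,"cacheRead":300,"cacheWrite":40}}',
  '{"type":"tool_call","name":"read"}',
  '{"type":"assistant_message","usage":{"input":5,"output":15,"cacheRead":350,"cacheWrite":0}}',
].join("\n");

console.log(sumUsage(sampleStream));
```

Keeping the four categories separate until the final sum makes it easy to re-weight them later if you care about billed cost rather than raw token workload.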
The models
The experiment was run with GPT 5.5 and Opus 4.8, both with medium reasoning.
GPT 5.5 result averages:
Tailwind
Total tokens: 46,163
Input tokens: 18,623
Output tokens: 1,428
Cache read tokens: 26,112
Assistant turns: 22
Tool reads: 3
Tool edits: 10
Changed prompts: 10 of 10
Elapsed time: 129.7 seconds
CSS Modules
Total tokens: 102,310
Input tokens: 27,875
Output tokens: 1,731
Cache read tokens: 72,704
Assistant turns: 30
Tool reads: 15
Tool edits: 10
Changed prompts: 10 of 10
Elapsed time: 165.5 seconds
With GPT 5.5, Tailwind used 56,147 (54.9%) fewer tokens and finished in 21.6% less time. The edit count was the same, but CSS Modules caused substantially more reading and context gathering (15 tool reads versus 3).
Opus 4.8 result averages:
Tailwind
Total tokens: 90,447
Input tokens: 46
Output tokens: 2,828
Cache read tokens: 81,914
Cache write tokens: 5,659
Assistant turns: 23
Tool reads: 3
Tool edits: 10
Changed prompts: 10 of 10
Elapsed time: 120.9 seconds
CSS Modules
Total tokens: 147,765
Input tokens: 50
Output tokens: 4,279
Cache read tokens: 134,908
Cache write tokens: 8,528
Assistant turns: 25
Tool reads: 6
Tool edits: 12
Changed prompts: 10 of 10
Elapsed time: 127.0 seconds
With Opus 4.8, Tailwind used 57,318 (38.8%) fewer tokens but finished in only 4.8% less time. Tool calling was far less erratic in these runs, but CSS Modules still caused more file interactions than Tailwind (6 reads and 12 edits versus 3 reads and 10 edits).
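For anyone who wants to check the percentages, here is the arithmetic as a small TypeScript sketch. The token and time figures are the averages reported above; the function and variable names are mine. Both percentages are measured against the CSS Modules run as the baseline, so "faster" here means less elapsed time.

```typescript
interface RunAverages {
  totalTokens: number;
  elapsedSeconds: number;
}

// Percentage by which `candidate` is smaller than `baseline`.
function percentReduction(baseline: number, candidate: number): number {
  return ((baseline - candidate) / baseline) * 100;
}

// Averages copied from the results above.
const results: Record<string, { tailwind: RunAverages; cssModules: RunAverages }> = {
  "GPT 5.5": {
    tailwind: { totalTokens: 46163, elapsedSeconds: 129.7 },
    cssModules: { totalTokens: 102310, elapsedSeconds: 165.5 },
  },
  "Opus 4.8": {
    tailwind: { totalTokens: 90447, elapsedSeconds: 120.9 },
    cssModules: { totalTokens: 147765, elapsedSeconds: 127.0 },
  },
};

for (const model of Object.keys(results)) {
  const { tailwind, cssModules } = results[model];
  const tokensSaved = cssModules.totalTokens - tailwind.totalTokens;
  const tokenPct = percentReduction(cssModules.totalTokens, tailwind.totalTokens);
  const timePct = percentReduction(cssModules.elapsedSeconds, tailwind.elapsedSeconds);
  console.log(
    `${model}: ${tokensSaved} fewer tokens (${tokenPct.toFixed(1)}%), ` +
      `${timePct.toFixed(1)}% less elapsed time`
  );
}
// GPT 5.5: 56147 fewer tokens (54.9%), 21.6% less elapsed time
// Opus 4.8: 57318 fewer tokens (38.8%), 4.8% less elapsed time
```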
So why was CSS Modules more expensive in this setup? A model editing CSS Modules has to manage at least two surfaces: the React component that applies the class, and the module stylesheet that defines the class. Even when the class names already exist, the model may inspect both sides to understand what is safe to change. When class names need to be added or adjusted, the model may need to edit both the component and the CSS module, and it also has to write out the CSS properties for the desired styling change. That increases file traversal, assistant turns, and cached context. Tailwind collapses many of those decisions into the component file: the model draws on its Tailwind training data to pick utility classes, and the Tailwind compiler generates the underlying style properties instead of the model writing them. For these prompts, that meant the model simply added or edited existing className strings rather than navigating between markup and stylesheet definitions.
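To make the "two surfaces" point concrete, here is a rough sketch. The file paths and file contents are hypothetical stand-ins, not the actual test repository, and they are held as strings so the block runs as plain TypeScript. The stylingSurfaces helper is also my own illustration: it counts a file as a styling surface if it applies classes (contains className=) or defines them (a .module.css file).

```typescript
// Hypothetical file contents for one Button component in each branch.
type Branch = Record<string, string>;

const tailwindBranch: Branch = {
  "Button/Button.tsx": `
export function Button({ label }: { label: string }) {
  return <button className="rounded-md px-4 py-2 bg-neutral-900 text-white">{label}</button>;
}`,
};

const cssModulesBranch: Branch = {
  "Button/Button.tsx": `
import styles from "./Button.module.css";
export function Button({ label }: { label: string }) {
  return <button className={styles.button}>{label}</button>;
}`,
  "Button/Button.module.css": `
.button { border-radius: 6px; padding: 8px 16px; background: #171717; color: #fff; }`,
};

// A file is a styling surface if it applies classes or defines them.
function stylingSurfaces(branch: Branch): string[] {
  return Object.keys(branch).filter((path) => {
    const appliesClasses = branch[path].indexOf("className=") !== -1;
    const definesClasses = /\.module\.css$/.test(path);
    return appliesClasses || definesClasses;
  });
}

console.log("Tailwind:", stylingSurfaces(tailwindBranch));
console.log("CSS Modules:", stylingSurfaces(cssModulesBranch));
```

For a prompt like "increase the corner radius", the Tailwind change in this sketch would be swapping rounded-md for a larger radius utility in the one component file. In the CSS Modules version, the model would typically read the component to find which class is applied, then open the module file and edit border-radius there.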
This experiment does not prove that Tailwind is always cheaper for AI coding. It does not prove that Tailwind produces better code. It does not prove that Tailwind improves human maintainability. It does not prove that CSS Modules are inefficient in larger or better-structured applications. It does not measure long-term maintenance cost. It does not measure complex refactors, accessibility work, state changes, animation systems, responsive redesigns, or large design system integration. It does not isolate provider billing cost perfectly, because providers may price cache reads, cache writes, input tokens, and output tokens differently. It also does not remove all possible harness effects. Pi provides an amazing coding environment, which is a strength, but the result is still partly a measurement of model behavior inside Pi. If I had used Codex for the GPT 5.5 runs and Claude Code for the Opus 4.8 runs, I wouldn't have been able to keep the harness consistent across models.
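Since billing depends on how each token category is priced, here is a hedged sketch of how you could turn the token breakdown into a cost estimate. The token counts are the Opus 4.8 averages reported above. The rates are placeholders I made up for illustration, not any provider's actual pricing, so substitute your provider's real per-million-token rates before drawing conclusions.

```typescript
interface TokenBreakdown {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

// Price per one million tokens for each category, in whatever currency you use.
type RatesPerMillion = TokenBreakdown;

function estimateCost(tokens: TokenBreakdown, rates: RatesPerMillion): number {
  const weighted =
    tokens.input * rates.input +
    tokens.output * rates.output +
    tokens.cacheRead * rates.cacheRead +
    tokens.cacheWrite * rates.cacheWrite;
  return weighted / 1e6;
}

// Opus 4.8 averages from the results above.
const opusTailwind: TokenBreakdown = {
  input: 46, output: 2828, cacheRead: 81914, cacheWrite: 5659,
};
const opusCssModules: TokenBreakdown = {
  input: 50, output: 4279, cacheRead: 134908, cacheWrite: 8528,
};

// PLACEHOLDER rates, not real pricing. Replace with your provider's numbers.
const placeholderRates: RatesPerMillion = {
  input: 10, output: 50, cacheRead: 1, cacheWrite: 12,
};

const tailwindCost = estimateCost(opusTailwind, placeholderRates);
const cssModulesCost = estimateCost(opusCssModules, placeholderRates);
const costSavingsPct = ((cssModulesCost - tailwindCost) / cssModulesCost) * 100;

console.log(`Tailwind: ${tailwindCost.toFixed(4)} per session`);
console.log(`CSS Modules: ${cssModulesCost.toFixed(4)} per session`);
console.log(`Cost reduction under these placeholder rates: ${costSavingsPct.toFixed(1)}%`);
```

The point of the sketch is that the cost reduction will generally not equal the 38.8% token reduction: cheap cache reads dominate the token count, while output tokens are a small share of tokens but can carry a much higher rate.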
The conclusion
Tailwind isn't magically token-efficient. For style changes, it often lets the model stay in the component file and edit existing utility classes, while the Tailwind compiler generates the styles outside of the token context. CSS Modules more often require separate stylesheet inspection and edits. In a coding harness, that extra file interaction becomes extra context, extra assistant turns, and extra tokens.
The fair takeaway is that if a team expects AI agents to perform UI styling edits, Tailwind may meaningfully reduce token usage compared to a CSS Modules setup. However, this should still be retested on larger, real application surfaces before being treated as a general rule. If your team is looking for more token savings in its workflows and isn't already using Tailwind, switching might deliver savings on the order of what this test showed.