At @coinbase our AI spend is down nearly half this quarter while token usage keeps climbing. My team built the infrastructure behind it: routing, caching, cheaper defaults, and the spend services that track it.
We route everything through our own gateway: a single endpoint and format for dozens of models, with cross-provider failover, redaction, logging, and cost controls all applied before anything reaches a vendor.
We started with cheaper defaults and caching. 91% of employees weren't hitting their usage caps. Instead of lowering caps, we set cheaper model defaults to cut spend. Caching took more work to get consistent across every tool and model family. A cache hit needs the prefix to match exactly, so we keep building a long, stable prefix across turns. Each request only pays full rate on the new tokens and reads the rest from cache.
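To make the prefix-caching arithmetic concrete, here is a minimal sketch, not the gateway's actual code. The function names, the per-token rates, and the 10% cached-read rate are all hypothetical placeholders; real providers price cached reads differently.

```python
from typing import Sequence


def shared_prefix_len(prev: Sequence[str], curr: Sequence[str]) -> int:
    """Count leading tokens that match exactly between two requests."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


def request_cost(
    prev: Sequence[str],
    curr: Sequence[str],
    full_rate: float = 1.0,    # hypothetical cost per uncached token
    cached_rate: float = 0.1,  # hypothetical cost per cached token
) -> float:
    """Cost of `curr` if the cache holds the prefix it shares with `prev`."""
    cached = shared_prefix_len(prev, curr)
    fresh = len(curr) - cached
    return cached * cached_rate + fresh * full_rate


# Appending a turn keeps the whole earlier prompt as a cache hit.
turn_1 = ["system", "tools", "user_1", "assistant_1"]
turn_2 = turn_1 + ["user_2"]
stable_cost = request_cost(turn_1, turn_2)

# Changing anything early in the prompt (here the first token) loses the hit.
edited = ["system_v2"] + turn_2[1:]
broken_cost = request_cost(turn_1, edited)
```

With these placeholder rates, the stable-prefix request pays full rate on one new token only, while the edited prompt pays full rate on all five.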
Our routing accounts for caching too. The naive approach scores each turn on its own and sends it to whichever model fits, which seems reasonable but would run up spend. The cache is per-model, so switching mid-conversation invalidates it. Our router weighs cache state alongside how hard the task is: a conversation keeps its model while the cache is warm, and the chance to re-route comes only when it goes quiet long enough for the TTL to lapse. Once it does, the router is free again to pick the best model for the task.
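A simplified sketch of that stickiness rule, under stated assumptions: the `Conversation` type, the `pick_model` function, and the 300-second TTL are hypothetical, and the real router weighs cache state against task difficulty rather than applying a hard rule like this one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversation:
    model: Optional[str] = None   # model that holds this conversation's cache
    last_request_at: float = 0.0  # timestamp of the previous request, seconds


def pick_model(
    conv: Conversation,
    now: float,
    best_model_for_task: str,
    cache_ttl_s: float = 300.0,   # hypothetical TTL; providers differ
) -> str:
    """Keep the current model while its cache is warm, else re-route."""
    cache_warm = (
        conv.model is not None
        and (now - conv.last_request_at) < cache_ttl_s
    )
    chosen = conv.model if cache_warm else best_model_for_task
    # Every request refreshes the timer, so an active conversation stays put.
    conv.model = chosen
    conv.last_request_at = now
    return chosen


conv = Conversation()
first = pick_model(conv, now=0.0, best_model_for_task="small")
second = pick_model(conv, now=100.0, best_model_for_task="large")
third = pick_model(conv, now=350.0, best_model_for_task="large")
after_idle = pick_model(conv, now=700.0, best_model_for_task="large")
```

Here the conversation stays on the first model through the second and third turns even though the per-turn scorer prefers another, and only moves once 350 idle seconds have passed the assumed TTL.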
These improvements happened at the gateway, so they apply across every team and tool. Next we're going deeper on the coding harness, where we have the most signal and flexibility, tuning how subagents and context get managed.
Good lessons on managing AI spend. We support a lot of this natively on Databricks with Unity AI Gateway, which makes it easy to analyze and control usage in one place, and it’s also easy to set these up as policies with the open source https://t.co/zh1P01h1B5 framework.
@heng_yan Oh you're right! Key point being that open source GLM is over 300 tokens per second. This matters for agents (we all hate waiting 10 minutes for responses). Proprietary frontier models are at best at 100 tokens per second. So 3x speedup really matters for agentic workloads.
Super excited about our new partnership with @AnthropicAI and native availability of Claude 3.7 Sonnet in Databricks on AWS, Azure and GCP! Stay tuned for more integrations and great support across the entire agent development and MLOps stack. https://t.co/EWHNMCGlss
@deedydas "90% of the world's organizational data is in PDFs." Is this really true? Intuitively, I would have thought the majority of data is in ERP, SaaS and maybe in logs from devices.
@SteveRattner @Morning_Joe - Kamala Harris claims that she will make billionaires pay their fair tax share. Based on this graph, taxes will go up by 13.6% for anyone with an income >200K! i.e. non-billionaire and overtaxed folks. Is that true?
Super excited about the new 1B and 3B Llama models and multimodal Llama! The cost-performance of open source AI is improving dramatically, and private, device-local AI is becoming practical with these models. Proud that @databricks is a launch partner.
The great thing is that for customers wishing to build such models that natively understand their data, the cost could be even less. We have the checkpoints, data cleaning pipeline, instruction tuning pipeline, etc from DBRX — just apply these to your data.
How will generative AI change data platforms? We argue that the impact will be fundamental, not incremental, through a new generation of platforms that deeply understand the content of the data (Data Intelligence Platforms). https://t.co/EqLHQnosWA
The founders of Databricks put together this strategy blog on where we think data platforms are headed in the future. We're moving Databricks quickly in this direction. This is very exciting and is the outcome of the MosaicML acquisition we did earlier this year!
https://t.co/EyO9H7I8Tc
Thrilled to receive this award; the credit is due to my students, my mentors, my collaborators in academia and open source, and my colleagues at Databricks for making all this work happen!
Excited to be launch partners with Meta on the Llama2 release. This move by Meta will have a big positive impact on the industry and ecosystem. Technically, the first version of Llama was already available to everyone, except that anyone with a commercial use case could not innovate with it. This new version is commercially viable and thus enables the market mechanism to kick in. As Joseph Schumpeter described with "creative destruction", long-standing practices get disrupted by new methods, and companies that embrace them replace the laggards that miss out. Llama2 is likely that creative destruction! Very excited for what it brings next!
https://t.co/qajZSvZMLo
When we started @MosaicML we wanted to bring choice and cost reduction in AI to everyone. Today we've demonstrated that our stack fully supports @AMD MI250! And it does so with good performance, making it a viable alternative to Nvidia GPUs for cost/perf.
We believe that, when hardware competes, you win.
Big news: we've agreed to acquire @MosaicML, a leading generative AI platform. I couldn’t be more excited to join forces once the deal closes. https://t.co/L4TyrruUEU
Super excited to launch Unity Catalog's Apache Hive Metastore API, which allows any system that understands Hive to connect to Unity! Hive is the most widely used catalog API in the industry, so tools like Athena, Presto, Trino & EMR now work with UC. https://t.co/8O7qf5jkPb
Databricks has long been able to achieve best-in-class cost-performance and scalability for data warehousing, but it required table tuning in some cases (e.g. small tables). With new auto optimization features, we're removing this need for data of all sizes: https://t.co/eeMsv3SlR6
After years of admiring their work from the sidelines, I have joined the board of directors of the Rainforest Trust. If you are not done with your end-of-year giving plans, I invite you to consider them. There are few more direct ways to combat climate change and protect wildlife.