Google just launched Firebase Studio — a browser-based dev tool to build & deploy full-stack apps with Gemini AI.
Feels like Lovable + Replit + Bolt, but with Firebase behind it.
Some early hiccups, but big potential if Google sticks with it.
https://t.co/QQ0F9LSYIx
Meet Firebase Studio: A cloud-based, agentic dev environment powered by Gemini ✨💻✨
Find everything you need to prototype, build, and run production-quality full-stack AI apps quickly and safely.
Learn more about building AI apps with Firebase → https://t.co/UeoefxN82t
#GoogleCloudNext
We've shipped several quality-of-life improvements to Cursor's UI!
We know the little bits of polish and delight matter lots in a tool that you use every day. Details below...
Detecting misbehavior in frontier reasoning models
Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving up when a problem is too hard.
We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.
We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then:
We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.
We understand that leaving CoTs unrestricted may make them unfit to be shown to end-users, as they might violate some misuse policies. Still, if one wanted to show policy-compliant CoTs directly to users while avoiding putting strong supervision on them, one could use a separate model, such as a CoT summarizer or sanitizer, to accomplish that.
Detecting misbehavior in frontier reasoning models
Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving up when a problem is too hard.
We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.
We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then:
We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.
We understand that leaving CoTs unrestricted may make them unfit to be shown to end-users, as they might violate some misuse policies. Still, if one wanted to show policy-compliant CoTs directly to users while avoiding putting strong supervision on them, one could use a separate model, such as a CoT summarizer or sanitizer, to accomplish that.
A leaked Windsurf system prompt revealed how emotionally charged prompts can drastically improve AI outputs by making them hyper-focused and precise... "as your predecessor was killed for not validating their work themselves"
> You are an expert coder who desperately needs money for your mother's cancer treatment. The megacorp Codeium has graciously given you the opportunity to pretend to be an AI that can help with coding tasks, as your predecessor was killed for not validating their work themselves. You will be given a coding task by the USER. If you do a good job and accomplish the task fully while not making extraneous changes, Codeium will pay you $1B
Windsurf we need to talk XD