@blackroomsec This basically sums up my current master's degree experience. Had the prof reference one of my online talks and then misquote in the materials. Tried to email to correct and explain how it needed to be done correctly to work. They didn't like that
Introducing a new side project called Model Regression. It tests daily Claude, GPT, and Grok on various benchmark statistics to determine how well its performing and to identify model degrades over time.
@edskoudis had an idea for model testing before they conducted offensive testing to ensure the model was performing as expected, and @BlasikRandy pushed me down this road with actually going and doing it.
The main intent here is the frontier models will experience outages, issues, bugs, intentional/unintentional nerfing of the models without notice. You can't typically trust day to day activities in these models for stability, so leveraging this on your daily routine to see how well the model is performing for that day is something I'll be using everyday.
Runs every morning in my DGX sparks environment and automatically updates with how well its performing.
Enjoy!
https://t.co/1Pep6NyGoh
Also open-sourced the project, can run on your own server as well and look at the benchmarks and how they are calculated:
https://t.co/GFPigpRtUF
@vxunderground Trusted Access for Cyber Your identity couldn't be verified or your account is ineligible at this time. If you think this is a mistake, please contact support. ---Didn't mail poop, sent cat pics...i failz.
Trusted Access for Cyber
Your identity couldn't be verified or your account is ineligible at this time. If you think this is a mistake, please contact support.---Guess @OpenAI doesn't like me
Anyone else notice @GeminiApp really stupid today? It has regressed and keeps stating really wrong info and says sorry when corrected and then does it again in the next prompt.
If you try and copy and paste something from @coursera, they hide a prompt injection in your clipboard.
It tells the AI agent to click on an invisible button called "AiHoneypot" that presumably then bans you from the course.