There have been numerous informal observations about prompt drift in Large Language Models (LLMs), the most notable being GPT-4 showing signs of laziness, especially on coding tasks, toward the end of last year. Discussions on Twitter also hint at a decline in Claude Sonnet’s effectiveness over the past few days. Given the closed-source nature of these models, it's impossible to know what happens behind the scenes, and most often these drifts go unnoticed until the community flags them.
Today, a structured approach to tracking these shifts in the model’s performance is lacking. So we at @UpTrainAI decided to undertake this as a community initiative to monitor prompt drift and identify any regressions systematically. You can check it out and learn more about our methodology here: https://t.co/rjP1ON4Ynd
While building this out was fun, performance monitoring presents many challenges, notably how to do it cost-efficiently and still get good, stable results. We made a slew of improvements to get the standard deviation down to acceptable levels while using GPT-3.5 and running on as few as 25 data points.
Looking ahead, we plan to enlarge our benchmarking dataset and include additional models (e.g., Claude 3).
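To make the stability point above concrete, here is a minimal sketch of how one might measure run-to-run standard deviation on a 25-point dataset and flag a regression only when the drop exceeds that noise. This is not UpTrain's actual methodology: `run_benchmark`, `looks_like_regression`, `make_noisy_scorer`, and the threshold `k` are hypothetical names chosen for illustration, and the scorer is a random stand-in for an LLM-based grader.

```python
import random
import statistics


def run_benchmark(score_fn, dataset, n_runs=5):
    """Score the whole dataset n_runs times; return the mean score of each run."""
    return [statistics.mean(score_fn(item) for item in dataset) for _ in range(n_runs)]


def looks_like_regression(baseline_means, current_means, k=2.0):
    """Flag a drop only if it exceeds k times the baseline's run-to-run noise."""
    noise = statistics.stdev(baseline_means)
    drop = statistics.mean(baseline_means) - statistics.mean(current_means)
    return drop > k * noise


# Hypothetical stand-in for an LLM-based grader: a noisy score clipped to [0, 1].
rng = random.Random(0)


def make_noisy_scorer(quality):
    def score(item):
        return min(1.0, max(0.0, rng.gauss(quality, 0.1)))
    return score


dataset = list(range(25))  # 25 placeholder data points
baseline = run_benchmark(make_noisy_scorer(0.8), dataset)
current = run_benchmark(make_noisy_scorer(0.6), dataset)

print("baseline run-to-run std dev:", round(statistics.stdev(baseline), 4))
print("regression flagged:", looks_like_regression(baseline, current))
```

The idea is simply that a small dataset is only useful for drift tracking if repeated runs agree with each other, so the noise level is measured first and drops are judged against it.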
Open-source evaluation with @UpTrainAI!
GenAI applications are complex and unpredictable, so you need to run evaluations to know whether the changes you make are improving your outcomes. UpTrain is a way to move beyond "vibes"-based evaluation.
Check out their guest blog post: https://t.co/i25r59L1us
And our docs: https://t.co/Tx7areX7UD
And their docs: https://t.co/tnTknFf2oK
Their announcement tweet: https://t.co/TsXGMbG95P
We are excited to announce the @llama_index <> @UpTrainAI integration!
It’s been months in the making, but we wanted to deliver something of real value to our community. Evaluations are not just about computing a final score for your application but also about getting actionable insights into where things are going wrong and how to improve performance.
With this integration, you can evaluate all individual components of your RAG pipeline, such as retrieval, reranking, sub-query, etc. and get deep insights into where your LlamaIndex pipelines need improvements, all with a single line of code.
At UpTrain, we are building the gold standard of LLM evaluations with high-quality scores that learn your preferences.
• Evaluate different aspects of your application with 20+ preconfigured checks
• A high degree of customisation allows you to modify eval prompts, choose the evaluator LLM, or create your own checks.
• Experiment with prompts, LLMs, embedding models, RAG modules, etc.
• Do root cause analysis to find failure modes and hidden patterns.
and finally,
• Interactive dashboards to visualise results and do side-by-side comparisons [More coming soon]
Check out the blog: https://t.co/1gekXxhro1
Check out UpTrain: https://t.co/nRmptEjNWc
It was great fun collaborating with the LlamaIndex team - @ravithejads @seldo @jerryjliu0!
@shikha_xyz
Evals are fast becoming one of Langfuse's most adopted features after core observability.
When logging a lot of production usage to Langfuse, teams start layering model-based evals on top of the manual checks and reviews to scale their evaluation.
UpTrain 🤝 @langfuse integration
Now, you can seamlessly track the quality, latency and cost of your LLM applications, all in one place.
Read more about it: https://t.co/CalystsCy6
Link to the tutorial: https://t.co/tEPzLsKyO8
With the @UpTrainAI 🤝 @anyscalecompute integration, you can now use open-source LLMs like Mistral 7B, Llama2 (7B, 13B, 70B, CodeLlama), etc., hosted on Anyscale's endpoints to evaluate your LLM applications with UpTrain.
🚨 Clueless about the LLM ecosystem?
Join us for an exciting session about LLMs, RAG & much more with @SourabhAgr03, CEO @UpTrainAI
Full Announcement:
https://t.co/i3fH4BgI4k
A Chevy dealer's chatbot agrees to sell a Tahoe for $1! This is a classic example of jailbreaking an LLM system, and of why an evaluation tool is needed.
Check out many such tidbits and more in our chat with @qdrant_engine here: https://t.co/VSadJjosVh
🚀 Elevate your LLM game with another Vector Space Talk this week!
Discover the intricacies of using LLM as a judge in evaluating applications with @SourabhAgr03, CEO & Co-Founder at UpTrain AI. 🤯
📅 Date: Feb. 8, 2024
🕒 Time: 5:00 pm CET
🌐 Link: https://t.co/gzm6QPpZJz
.@UpTrainAI (YC W23) is a full-stack LLMOps platform to evaluate, experiment, monitor, and test LLM applications.
It is open-source, enabling customization, and can be self-hosted to satisfy your data governance needs.
https://t.co/bwJoY1tiLK
Exciting news to start the day! @UpTrainAI has been featured in @ycombinator's Top Generative AI Startups 2024 🚀
We're excited to continue pushing the boundaries of generative AI and making a difference in the industry. 💪
Check out OSS here - https://t.co/LGsdR0ZN8u
🚀 It was great fun integrating SPADE, a novel framework for synthesizing LLM evaluations, with @UpTrainAI.
Big shoutout to the authors:
@sh_reya Haotian Li, Parth Asawa, @MadelonHulsebos, Yiming Lin, J.D. Zamfirescu, @hwchase17, Will Fu-Hinthorn, Aditya Parameswaran, @sirrice
What a great experience collaborating with @SourabhAgr03 and the team @UpTrainAI on this blog post, where we break down what you can do to evaluate your RAGs when building with vector databases and LLMs. The UpTrain team are A+ players and I'm so glad we met in 2023!
If you're building RAG pipelines, I'd love to get your feedback!
You can check out the blog post here:
https://t.co/t19TdF4IbX
Starting 2024 with a b[ang]log 💥
We recently wrote a blog, in collaboration with the @weaviate_io team, on the power of Retrieval Augmented Generation (RAG) to overcome the limitations of Language Models!