Testing LLMs (and prompts) like we test software: https://t.co/bZlIhgZEFh
TL;DR: (1) You should. (2) How to test: pick specific properties and evaluate them with LLMs (perception is easier than generation). (3) What to test: get the LLM to help you figure it out.
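A minimal sketch of point (2), not code from the linked post: each test checks one specific property of an output by asking a judge a yes/no question. The names `has_property`, `run_property_tests`, and `keyword_judge` are hypothetical. The judge here is a keyword stub so the sketch runs offline; in practice it would wrap a call to an LLM.

```python
from typing import Callable, List, Tuple

# Hypothetical judge type: takes a question, returns a free-text answer.
# In real use this would wrap an LLM call; nothing here depends on a specific API.
Judge = Callable[[str], str]


def has_property(output: str, prop: str, judge: Judge) -> bool:
    """Ask the judge a yes/no question about one specific property of an output."""
    question = (
        f"Text: {output}\n"
        f"Question: Does the text satisfy this property: {prop}? Answer yes or no."
    )
    return judge(question).strip().lower().startswith("yes")


def run_property_tests(
    outputs: List[str], properties: List[str], judge: Judge
) -> List[Tuple[str, str]]:
    """Return every (output, property) pair that the judge rejected."""
    return [
        (out, prop)
        for out in outputs
        for prop in properties
        if not has_property(out, prop, judge)
    ]


def keyword_judge(question: str) -> str:
    """Offline stand-in for an LLM judge: says yes if the text line mentions 'sorry'."""
    text_line = question.split("\n")[0]
    return "yes" if "sorry" in text_line.lower() else "no"


outputs = ["Sorry for the delay, here is the report.", "Here is the report."]
failures = run_property_tests(outputs, ["the text contains an apology"], keyword_judge)
for out, prop in failures:
    print(f"FAILED [{prop}]: {out}")
```

Swapping `keyword_judge` for a real model call keeps the rest unchanged, which is the point: the judge only has to recognize the property, not generate text that has it.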
Also highly relevant: Guidance from Microsoft
"Guidance programs allow you to interleave generation, prompting, and logical control"
It also handles subtle but important tokenization-related issues internally, e.g. "token healing".
https://t.co/eEc1rywuWP
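A toy illustration of the idea behind "token healing", and explicitly not Guidance's implementation or API: when a prompt ends on a token boundary the model rarely saw in training (":" instead of "://"), back up one token and only allow continuations that start with the removed text. The vocabulary, the tokenizer, and `toy_next_token_scores` are all made up for this sketch.

```python
from typing import Dict, List

# Toy vocabulary (an assumption for illustration, not a real tokenizer's vocab).
VOCAB = ["http", ":", "://", "//", "www", "/"]


def tokenize(text: str, vocab: List[str] = VOCAB) -> List[str]:
    """Greedy longest-match tokenizer over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"cannot tokenize at position {i}")
        tokens.append(match)
        i += len(match)
    return tokens


def toy_next_token_scores(tokens: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a language model's next-token scores."""
    last = tokens[-1] if tokens else ""
    if last == "http":
        return {"://": 0.9, ":": 0.05, "/": 0.05}
    if last == ":":
        # "://" is a single token, so a bare ":" followed by "//" is rare.
        return {"www": 0.6, "/": 0.3, "//": 0.1}
    return {"www": 1.0}


def generate_naive(prompt: str) -> str:
    """Append the single highest-scoring next token to the prompt."""
    scores = toy_next_token_scores(tokenize(prompt))
    return prompt + max(scores, key=scores.get)


def generate_healed(prompt: str) -> str:
    """Drop the last prompt token, then pick the best token that starts with it."""
    *context, removed = tokenize(prompt)
    scores = toy_next_token_scores(context)
    allowed = {t: s for t, s in scores.items() if t.startswith(removed)}
    if not allowed:  # nothing extends the removed token; fall back to naive decoding
        return generate_naive(prompt)
    return "".join(context) + max(allowed, key=allowed.get)


print(generate_naive("http:"))
print(generate_healed("http:"))
```

In this toy setup the naive path continues from the awkward ":" boundary, while the healed path lets the model choose the "://" token it would have preferred.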
Blog post: playing with Vicuna-13B, ChatGPT (3.5), MPT-7B-Chat on harder stuff https://t.co/u2YQEuP6rV
TL;DR: We think ChatGPT is still way ahead, but sometimes the extra control from open source models is worth it.
I never tweet, but here is a blog post I wrote for an intern; it may be useful for others too...
Part 1: https://t.co/DhE55I7ie7
Part 2: https://t.co/ncpJTb4otF