@Hi_Mrinal Hehe, the getting old part is a hard pill to swallow. You can definitely accelerate your growth, but there's a cap to that speed. You cannot outrun time.
For anyone who has coded even a little, it is obvious which parts of code are tedious and boilerplate and which ones are interesting/custom to the project.
With that understanding, it should be clear when to use AI and when not to. Why is there so much amnesia around this topic? Why do we have polar opposites?
AI adoption isn't even a thing worth contemplating. Just swallow it like any other developer tool or abstraction you have swallowed (and normalised) already!
Absolutely! Reports/papers on past failures and incidents are the best learning resources for any systems engineer. They are also very reassuring, since they show that even the big players make both trivial and catastrophic mistakes.
Twitter's 'Fan Out' architectural problem, the infamous Netflix ELB cascading failure, etc. teach far more than any book or lecture would.
TBH there is so much untapped learning hidden in the technical reports that tech companies release after a major downtime...
These three were my recent reads.
I am particularly curious about evals for startups at stages where they don't have traces at all yet - unlike examples where you can evaluate conversations already held by AI.
This could mean pre-production products, or apps where the nature of the response is completely different from chat.
What would an evals solution look like for a startup that is still deciding the model, params, prompt, and context?
Surely, different decisions here can yield radically different outputs.
The obvious solution that comes to my mind is to generate synthetic/manual representative cases and run a configuration tournament across model + params + context combinations, while accounting for things like position bias and similar statistical pitfalls.
Is there a better way to think about evals before real traces exist? Curious about how @HamelHusain and @sh_reya would think about this
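A rough sketch of what that configuration tournament could look like, just to make the idea concrete. Everything here is an assumption for illustration: the model names, the temperature and context values, and the `generate` and `judge` functions are hypothetical stand-ins, not real APIs. The only real point is the position-bias handling: every pair is judged in both orders and the votes are averaged.

```python
import itertools
import random
from collections import defaultdict

# Hypothetical configuration grid; all names and values are placeholders.
MODELS = ["model-a", "model-b"]
TEMPERATURES = [0.0, 0.7]
CONTEXTS = ["short", "long"]

def generate(config, case):
    """Stand-in for a real model call; returns a fake output string."""
    model, temp, ctx = config
    return f"{model}|{temp}|{ctx}|{case}"

def judge(case, first, second, rng):
    """Stand-in judge: returns 0 if it prefers `first`, 1 if `second`.
    Deliberately biased toward the first slot to mimic position bias."""
    return 0 if rng.random() < 0.6 else 1

def run_tournament(cases, seed=0):
    """Round-robin over all config pairs, judging each pair in both orders."""
    rng = random.Random(seed)
    configs = list(itertools.product(MODELS, TEMPERATURES, CONTEXTS))
    wins = defaultdict(float)
    for a, b in itertools.combinations(configs, 2):
        for case in cases:
            out_a, out_b = generate(a, case), generate(b, case)
            # Show each output in each position so a positional preference cancels out.
            first_pass = judge(case, out_a, out_b, rng)   # a shown first
            second_pass = judge(case, out_b, out_a, rng)  # b shown first
            a_votes = (first_pass == 0) + (second_pass == 1)
            wins[a] += a_votes / 2
            wins[b] += (2 - a_votes) / 2
    return configs, dict(wins)

if __name__ == "__main__":
    cases = ["synthetic case 1", "synthetic case 2", "manual case 3"]
    configs, wins = run_tournament(cases)
    for config in sorted(configs, key=lambda c: -wins[c]):
        print(config, wins[config])
```

With a real judge, a pair where the two orderings disagree is also a useful signal on its own: a high disagreement rate suggests the judge is reacting to position rather than content.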
@HamelHusain hmm. I'm going to test this properly: seed evals with synthetic/manual cases, then run a tournament across model + prompt + context + cost configs, with judge-bias checks and confidence intervals. Will share concrete numbers once I have them.
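For the confidence-interval part, one simple option is a percentile bootstrap over per-case outcomes. This is a minimal sketch under assumptions: `bootstrap_ci` is a hypothetical helper, the outcomes below are made-up placeholder data (1 = win, 0 = loss for one config against another), and a percentile bootstrap is just one of several reasonable choices.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case outcomes (1 = win, 0 = loss)."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample the cases with replacement and record the win rate.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

if __name__ == "__main__":
    # Placeholder data: 28 wins out of 40 synthetic cases.
    outcomes = [1] * 28 + [0] * 12
    lo, hi = bootstrap_ci(outcomes)
    print(f"win rate {sum(outcomes) / len(outcomes):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With only a few dozen synthetic cases the interval will be wide, which is itself useful: it shows when two configs are not actually distinguishable yet.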