Working on a side project, bit of a citizen science thing. Can’t reveal too much at this point but was gratified that my LLM assistant eventually asked “so to which local newspaper do you want to pitch this story?”
Had a conversation with the boss yesterday and I think I managed to convince him how pointless it is to set the random number generator seed in analysis notebooks.
Currently playing with more modern alternatives to traditional R functions. skimr::skim() instead of summary() and the gtsummary package instead of the venerable tableone.
Data catalogs like Alation are phenomenally useful for understanding your data landscape but should NOT be part of your analysis code. Your queries, possibly including cached results, should be part of your main analysis code.
Most (all?) statistics textbooks that deal with online experiments seem to assume that the experimental unit is the user, but it's hard to believe this scales well.
Still working my way through Jennison & Turnbull’s textbook on group sequential testing. Got to the point where they describe tests that can accept the null early, which is critically important for me.
My current understanding is that traditional sequential tests like the SPRT take one measurement at a time and allow for accepting the null early, whereas group sequential methods like Pocock’s take groups of measurements but don’t seem to have a mechanism for accepting the null early.
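To make the "accept the null early" part concrete, here is a minimal sketch of Wald's SPRT for a stream of 0/1 outcomes. The function name `sprt_bernoulli`, the hypothesised rates `p0` and `p1`, and the default error rates are illustrative assumptions, not anything taken from Jennison & Turnbull; the stopping boundaries are Wald's usual approximations.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: p = p0 versus H1: p = p1, with 0/1 observations.

    Hypothetical helper for illustration only. Returns a (decision, n) pair,
    where n is the number of observations consumed before stopping.
    """
    # Wald's approximate boundaries on the cumulative log-likelihood ratio.
    upper = math.log((1 - beta) / alpha)   # crossing it rejects the null
    lower = math.log(beta / (1 - alpha))   # crossing it accepts the null

    llr = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        # Add this observation's log-likelihood ratio, H1 relative to H0.
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))

        if llr >= upper:
            return "reject_null", n
        if llr <= lower:
            return "accept_null", n

    # Data ran out before either boundary was crossed.
    return "continue", n
```

Calling `sprt_bernoulli(stream, p0=0.1, p1=0.3)` on a long run of failures stops with "accept_null" well before the stream is exhausted, which is the early-acceptance behaviour described above.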
Over the past week I’ve been exploring ways to break my bad habit of doing data science in two steps: 1) prepare data in one tool, 2) work with data in another. I’m convinced it’s possible to do everything in one tool.
Precision alone can't give you a true estimate of prevalence, but recall and specificity can help adjust the predicted prevalence. Learn how to make this adjustment and get unbiased results from @dlindelof's article.
https://t.co/DlZR7MQMws
Even when a classifier isn’t perfect, we can still get useful prevalence estimates. By using recall and specificity, we can adjust the predicted prevalence without relying on precision, which can be biased by the actual prevalence. Check out how this works in @dlindelof's latest article.
https://t.co/DUsaBzhipE
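For the curious, a minimal sketch of this kind of adjustment, assuming the standard correction (often called the Rogan-Gladen estimator) and not necessarily the exact method in the linked article. The function name `adjusted_prevalence` and the numbers in the example are illustrative assumptions.

```python
def adjusted_prevalence(predicted_prevalence, recall, specificity):
    """Correct a classifier's predicted-positive rate for its known error rates.

    Hypothetical helper for illustration. It relies on the identity
        predicted = recall * true + (1 - specificity) * (1 - true)
    and solves it for the true prevalence.
    """
    denominator = recall + specificity - 1
    if denominator <= 0:
        # The classifier is no better than chance, so the correction is undefined.
        raise ValueError("recall + specificity must exceed 1")

    estimate = (predicted_prevalence + specificity - 1) / denominator
    # Sampling noise can push the raw estimate outside [0, 1]; clip it.
    return min(1.0, max(0.0, estimate))


# Made-up example: true prevalence 10%, recall 0.8, specificity 0.9.
# The classifier would then flag 0.1 * 0.8 + 0.9 * 0.1 = 17% of cases as positive.
estimate = adjusted_prevalence(0.17, recall=0.8, specificity=0.9)
print(round(estimate, 3))
```

Precision appears nowhere in the formula, which is the point: recall and specificity are properties of the classifier, whereas precision shifts with the prevalence you are trying to estimate.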