PIK-led research in @ClimateActionSN generates a map of research on #ClimatePolicy from 85,000 individual studies. The map identifies existing gaps in knowledge and makes it easier to summarise the state of knowledge for governments. #MachineLearning
https://t.co/io7sqADo6A
Incidentally, if you are interested in working with us on how we can responsibly use ML to assist evidence synthesis, we have an open position at PIK: https://t.co/8jAPvAQTtr Today is the last day the position is open, but please get in touch ASAP if you need extra time to apply.
To help users navigate this landscape, we argue that organisations like @cochranecollab and @CampbellReviews need to update their methodological guidance to help users distinguish between well-justified and ill-justified stopping criteria 12/N
There is a lot more to do on improving and evaluating stopping criteria. We set out a blueprint for some of this in the paper, but it requires lots of engagement and further work, some of which is already happening - e.g. in DESTINY https://t.co/rh0pmPJvIO @wellcometrust
We also argue that software providers (@Covidence @EPPIReviewer @asreviewlab @rayyanapp @evidencepartner @PICO_Portal) need to provide better guidance on how their ML-prioritisation tools can be used responsibly (i.e. with appropriate stopping criteria). 11/N
In the paper, we argue that we should prefer the former type of criteria to the latter. I don't think this should be controversial, but again and again when I have argued this, I have met resistance. If you disagree, please tell me why you think we don't need statistics here! 10/N
Some stopping criteria make transparent assumptions and use appropriate statistics to communicate the risk of missing relevant studies (like the one we developed 4 years ago: https://t.co/c9sAl5bXRX; other promising alternatives are available :)) 8/N
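To make "appropriate statistics" concrete, here is a minimal sketch of one generic way to quantify the risk of missing relevant studies. It is an illustration under stated assumptions, not the implementation from the linked paper: it assumes we draw a simple random sample from the records not yet screened, and it tests the null hypothesis that we have missed a recall target. The function names and the 0.95 default target are hypothetical choices made for this example.

```python
from math import comb, floor


def hypergeom_cdf(k, N, K, n):
    """P(X <= k) when drawing n of N records without replacement, K of them relevant."""
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k + 1)) / comb(N, n)


def p_value_missed_target(r_seen, n_remaining, n_sample, k_sample, target_recall=0.95):
    """p-value for H0: recall is still below target_recall.

    r_seen      -- relevant records found before sampling
    n_remaining -- unscreened records before sampling
    n_sample    -- size of the random sample drawn from the unscreened records (assumption)
    k_sample    -- relevant records found in that sample
    """
    total_found = r_seen + k_sample
    # Smallest number of relevant records among the unscreened ones that
    # would still leave recall strictly below the target.
    k_min = floor(total_found / target_recall - r_seen) + 1
    if k_min > n_remaining:
        return 0.0  # H0 is impossible: too few records remain
    # Probability of seeing k_sample or fewer relevant records if H0 held.
    return hypergeom_cdf(k_sample, n_remaining, k_min, n_sample)


if __name__ == "__main__":
    # 90 relevant found so far, 1,000 records unscreened, none relevant in the sample.
    for n_sample in (100, 500):
        p = p_value_missed_target(90, 1000, n_sample, 0)
        print(n_sample, round(p, 4))
```

A small p-value means that seeing so few relevant records in the sample would be unlikely if the target had been missed, so stopping can be reported with an explicit confidence level. A larger sample with no relevant records gives a smaller p-value.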
This is unrelated to how fancy our model is. Whenever we use an ML-generated prediction, we need ways to manage and communicate the uncertainty that comes with relying on that prediction. This is a *necessary condition* for the *responsible* use of AI/ML 7/N
Stopping criteria offer ways to *estimate* an appropriate time to stop screening, managing the risk of missing relevant studies while hopefully minimising the time spent screening irrelevant studies. We can only ever estimate this, because we don't have all the relevant info 6/N
This means we can stop screening before we have seen all the potentially relevant documents. But to do this, and actually save some work, we need to know when to stop. This is where stopping criteria come in 5/N
These products employ ML-prioritised screening (or AI-prioritised, if we want to sound fancy): we screen some records by hand, use them to train a model that predicts the relevance of the remaining records, screen those by hand in descending order of predicted relevance, and then retrain 4/N
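The screen, train, rank, retrain loop can be sketched in a few lines. This is a toy illustration, not how any of the products works: the word-count "model", the function names, and the `oracle` callback standing in for the human screener are all assumptions made for the example.

```python
from collections import Counter


def train(labelled):
    """Toy relevance model (stand-in for a real classifier): per-word vote counts."""
    weights = Counter()
    for text, relevant in labelled:
        for word in set(text.lower().split()):
            weights[word] += 1 if relevant else -1
    return weights


def score(weights, text):
    """Predicted relevance: sum of word weights (unknown words count as 0)."""
    return sum(weights[w] for w in set(text.lower().split()))


def prioritised_screening(records, oracle, seed_size=2, batch_size=2):
    """Return record indices in the order they were screened by hand.

    oracle(i) plays the human screener and returns True if record i is relevant.
    """
    unseen = list(range(len(records)))
    labelled, order = [], []
    batch = unseen[:seed_size]  # initial records screened by hand
    while batch:
        for i in batch:
            labelled.append((records[i], oracle(i)))
            order.append(i)
            unseen.remove(i)
        weights = train(labelled)  # retrain on everything screened so far
        # Rank the remaining records by descending predicted relevance.
        unseen.sort(key=lambda i: score(weights, records[i]), reverse=True)
        batch = unseen[:batch_size]
    return order


if __name__ == "__main__":
    records = [
        "climate policy carbon tax",
        "football match report",
        "carbon tax policy evaluation",
        "celebrity gossip column",
        "climate policy instrument review",
        "recipe for bread",
    ]
    print(prioritised_screening(records, lambda i: i in {0, 2, 4}))
```

Note that this loop only stops when nothing is left to screen, so on its own it saves no work.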
There has been so much work done on ML for screening in systematic reviews https://t.co/iOBVBcnJ5C, and the vast majority of this work applies fancier and fancier models (the latest example being LLMs: https://t.co/IjMmGbKXWy) to promise ever greater work savings 2/N
While I was away on parental leave, our paper was published on the urgent need for well-justified stopping criteria when using ML to speed up screening in systematic reviews: https://t.co/5aPtQ6zU29 1/N
@srtoolbox We have a nice rule that works reliably: https://t.co/UpqBkP7Bvs. It also comes with a handy calculator and R and Python packages: https://t.co/z8QpCNm07p
@srtoolbox Stopping after 50/100 consecutive irrelevant records is - without further analysis, and except in some limited senses - statistically incoherent. Just because we use machine learning, it doesn't mean we should abandon the high-quality methods we otherwise employ in SRs