@VMSuriyakumar Great work!
I have only given it a first pass, so apologies if this is obvious. If I understand correctly, the claim is that "worsenalization" is specific to model-dataset combinations? That is, this is not probabilistic when models are applied to randomly distributed variates?
Preparing to teach introductory statistics this coming term, and I am confronted with the usual dilemma of explaining the unbiased variance estimator...
Thoughts on simply dividing by 'n' until bias in estimation is introduced later on? #StatsTwitter
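For anyone weighing the same choice, here is a minimal simulation sketch of what dividing by 'n' costs. Everything in it is an assumption for illustration, not something from the course: the helper names `variance` and `average_estimates`, the sample size of 5, the standard normal population (true variance 1), and the seed.

```python
import random

def variance(xs, ddof):
    """Sum of squared deviations from the sample mean, divided by (n - ddof)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - ddof)

def average_estimates(n=5, trials=20000, seed=0):
    """Average both estimators over many samples drawn from N(0, 1)."""
    rng = random.Random(seed)  # assumed seed, only for reproducibility
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total_n += variance(xs, ddof=0)          # divide by n
        total_n_minus_1 += variance(xs, ddof=1)  # divide by n - 1
    return total_n / trials, total_n_minus_1 / trials

biased, unbiased = average_estimates()
print(f"divide by n:     {biased:.3f}")
print(f"divide by n - 1: {unbiased:.3f}")
```

With a true variance of 1 and n = 5, the divide-by-n average should land near (n - 1)/n = 0.8 and the divide-by-(n - 1) average near 1, which is one way to show students the bias once it comes up.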
@statacake I think that given the 98%+ missingness it is probably reasonable to assume that, as a matter of practice, race was not recorded for these data. This makes me feel like it is more a matter of how the data are being recorded, the policies there, etc.
@statacake If the record came from only the citation file, it appears almost certain that race would be missing (over 98% of these records). I don't see any real difference in whether race is recorded here.
@statacake Interesting. I wonder if it is a Chicago problem.
The Chicago readme states "Data includes warnings and arrests, but is missing warnings".
This appears to be a typo, but I wonder if the Chicago data contain "warnings" under "citations", which could be the bulk of the stops?
@statacake Is this looking across all locations? And how much does the data skew towards "no citation"?
The reported coverage rates (https://t.co/hDZaXDRGWH) and your results seem to suggest that something like 95% of the total records would have to be "no citation". Is it that dramatic?
@statacake I think that for those who have little to no R experience, the burden of learning ggplot2 (or tidyverse broadly) is not appreciably larger than the burden of learning base R graphics.
The big struggle is for those of us coming from years of experience with base R.