Senior Data Scientist at Aimsun
Classification problems with highly imbalanced datasets, such as incident detection, pose a trade-off between the true positive rate and the false positive rate. A key step in choosing the final trade-off is selecting the metric used to evaluate the performance of the system, and this decision must be made by putting oneself in the user's shoes and asking whether the chosen metric and the corresponding results are representative of the concept of usefulness.
For example, the following two tables are examples of confusion matrices for an imbalanced dataset with 9x more negative examples than positive examples, such as no-incident vs incident. In such a situation, accuracy is not a good metric, and not even balanced accuracy is. Recall might be a good choice if you must detect as many positives as possible, even at the risk of getting lots of false positives. Precision, on the other hand, is more oriented to answering whether we can trust the predicted positives. The F1-score is a trade-off between precision and recall. In incident detection, not only is it important to detect incidents, but also to avoid overwhelming the traffic operator with false incident detections. Therefore, the F1-score is the best choice from among these four metrics.
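As a minimal sketch of this argument, the snippet below computes the metrics discussed above from confusion-matrix counts on a hypothetical 9:1 dataset (900 no-incident vs 100 incident examples; the two detectors and their counts are illustrative, not from the tables above). Detector A never raises an alarm, yet scores 90% accuracy; detector B catches most incidents at the cost of some false alarms, and the F1-score ranks it correctly:

```python
def metrics(tp, fn, fp, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn) if tp + fn else 0.0       # do we detect the positives?
    precision = tp / (tp + fp) if tp + fp else 0.0    # can we trust predicted positives?
    specificity = tn / (tn + fp) if tn + fp else 0.0
    balanced_accuracy = (recall + specificity) / 2
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return dict(accuracy=accuracy, balanced_accuracy=balanced_accuracy,
                precision=precision, recall=recall, f1=f1)

# Hypothetical 9:1 dataset: 900 no-incident and 100 incident examples.
# Detector A never raises an alarm: high accuracy, completely useless.
a = metrics(tp=0, fn=100, fp=0, tn=900)
# Detector B catches 70 of 100 incidents at the cost of 90 false alarms.
b = metrics(tp=70, fn=30, fp=90, tn=810)
print(f"A: accuracy={a['accuracy']:.2f}  f1={a['f1']:.2f}")  # 0.90 vs 0.00
print(f"B: accuracy={b['accuracy']:.2f}  f1={b['f1']:.2f}")  # 0.88 vs 0.54
```

Note how accuracy prefers the useless detector A, while the F1-score, by balancing precision (not flooding the operator with false alarms) against recall (not missing incidents), prefers detector B.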