Is it enough to report performance on a single dataset? Can we trust the reported improvement of new model architectures? We created 144 experiments for hate speech detection to test the generalization performance of various models.
Zen and the Art of Generalisation