Fleiss' kappa is a statistical measure for assessing the reliability of agreement between a fixed number of raters when assigning categorical ratings to a set of items. It contrasts with other kappas such as Cohen's kappa, which only works when assessing the agreement between two raters. The measure quantifies the degree of agreement in classification over that which would be expected by chance; it takes values up to 1, where 1 indicates complete agreement and values at or below 0 indicate no agreement beyond chance. There is no generally agreed-upon measure of significance, although guidelines for interpreting the value have been given.
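For concreteness, the sketch below shows how Fleiss' kappa is typically computed from a table of rating counts (one row per item, one column per category, each cell holding the number of raters who chose that category for that item). It is a minimal illustration in plain Python; the function name, matrix layout, and toy data are assumptions made here for demonstration and are not taken from the works listed below.

```python
# Minimal sketch of Fleiss' kappa in pure Python (no external libraries).
# The function name, matrix layout, and toy data are illustrative assumptions.

def fleiss_kappa(ratings):
    """Fleiss' kappa for a ratings matrix.

    ratings[i][j] = number of raters who assigned item i to category j.
    Every row must sum to the same number of raters n (n >= 2).
    """
    N = len(ratings)        # number of items
    k = len(ratings[0])     # number of categories
    n = sum(ratings[0])     # raters per item (assumed constant across items)

    # p[j]: proportion of all assignments that fell into category j.
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]

    # P_i: extent of agreement among the raters on item i.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]

    P_bar = sum(P_i) / N                 # observed mean agreement
    P_e = sum(pj * pj for pj in p)       # agreement expected by chance

    return (P_bar - P_e) / (1 - P_e)


if __name__ == "__main__":
    # Toy data (assumed): 4 items, 3 raters, 2 categories; prints roughly 0.333.
    toy = [[3, 0], [2, 1], [0, 3], [1, 2]]
    print(round(fleiss_kappa(toy), 3))
```

In practice one would normally rely on an existing implementation (for example, the inter-rater utilities in statsmodels) rather than hand-rolling the formula; the sketch is only meant to make the agreement-versus-chance calculation explicit.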
D. Freitag, M. Blume, J. Byrnes, E. Chow, S. Kapadia, R. Rohwer, and Z. Wang. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 25–32. Association for Computational Linguistics, Stroudsburg, PA, USA, 2005.
K. Dellschaft and S. Staab. In Proceedings of the 5th International Conference on The Semantic Web, pages 228–241. Springer-Verlag, Berlin, Heidelberg, 2006.
M. Bhattacharyya, Y. Suhara, M. Rahman, and M. Krause. In Companion of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pages 147–150. ACM, New York, NY, USA, 2017.