NGramJ is a Java-based library containing two types of n-gram-based applications. Its major focus is to provide robust, state-of-the-art language recognition.
When I try to convert from VSS, some Cyrillic letters in directory names are not converted correctly.
For example, the Russian letter 'И' (0xC8 in the windows-1251 code page) is converted to a question mark ('?').
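The report does not say where in the conversion the letter is lost. As a minimal sketch of one common way this symptom arises, the Python snippet below decodes the byte 0xC8 as windows-1251 and then encodes the result into a charset that has no Cyrillic letters. The variable names are illustrative only, and the use of ASCII as the lossy target charset is an assumption, not something taken from the converter in question.

```python
# The single byte from the report: 0xC8.
raw = b"\xc8"

# Decoded with the correct code page, the byte is the Cyrillic letter И (U+0418).
letter = raw.decode("cp1251")

# Assumption: somewhere in the pipeline the text is re-encoded into a charset
# that cannot represent Cyrillic. With replacement enabled, the letter becomes '?'.
lossy = letter.encode("ascii", errors="replace")

# For contrast: decoding the same byte with the wrong code page does not give '?',
# it gives a different letter (È, U+00C8, in Latin-1).
mojibake = raw.decode("latin-1")

print(letter, lossy, mojibake)
```

A question mark in the output therefore tends to point at a lossy encode step rather than at a wrong decode step, which would usually produce a different character instead.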
A reintroduction to XML with an emphasis on character encoding...has things to say about encoding that you almost certainly either don't know at all, or haven't yet fully grasped.
D. Schmidt, A. Zehe, J. Lorenzen, L. Sergel, S. Düker, M. Krug, and F. Puppe. Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 49--56. Punta Cana, Dominican Republic (online), Association for Computational Linguistics, (November 2021)