A multilingual country like India needs language corpora for low-resource languages, not only to provide its citizens with the natural language processing (NLP) technologies that are readily available in other countries, but also to support its people in their educational and cultural needs. In this work, we focus on one such low-resource language, Odia, and build an Odia–English parallel corpus (OdiEnCorp) and an Odia monolingual corpus (OdiMonoCorp).
The parallel corpus is based on Odia–English parallel texts extracted from online resources and formally corrected by volunteers. We also preprocess the parallel corpus so that it can be used for machine translation research or training.