Vedavaapi: A Platform for Community-sourced Indic Knowledge Processing at Scale
Paper on Vedavaapi’s platform architecture.
The full paper is available here.
Indic heritage knowledge is embedded in millions of manuscripts at various stages of digitization and analysis. Numerous powerful tools and techniques have been developed for linguistic analysis of Samskrit and Indic language texts. However, the key challenge today is employing them together on large document collections and building higher-level end-user applications that make Indic knowledge texts intelligible. We believe the chief hurdle is the lack of an end-to-end, secure, decentralized system platform for (i) composing independently developed tools into higher-level tasks, and (ii) keeping human experts in the loop to work around the limitations of automated tools, so that content is always curated. Such a platform must define protocols and standards for interoperability and reusability of tools while enabling their autonomous evolution to spur innovation. This paper describes the architecture of Vedavaapi, an Internet platform for end-to-end Indic knowledge processing that addresses these challenges. At its core, Vedavaapi is a community-sourced, scalable, multi-layered annotated object network. It serves as an overlay on Indic documents stored anywhere online, providing textification, language analysis and discourse analysis as value-added services in a crowd-sourced manner. It offers federated deployment of tools as microservices, decentralized user/team management with access control across multiple organizational boundaries, social-media login, and an open architecture with extensible and evolving object schemas. As its first application, we have developed human-assisted text conversion of handwritten manuscripts, such as palm-leaf manuscripts, leveraging several standards-based open-source tools, including ones by IIIT Hyderabad, IIT Kanpur and University of Hyderabad. We demonstrate how our design choices enabled us to rapidly develop useful applications via extensive reuse of state-of-the-art analysis tools.
This paper offers an approach to standardization of linguistic analysis output, and lays out guidelines for Indic document metadata design and storage.
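To make the idea of a multi-layered annotated object network more concrete, here is a minimal Python sketch. It is not Vedavaapi's actual schema or API: the class names, field names, identifiers and the example URL are all hypothetical, chosen only to illustrate how annotations from tools and human experts might be layered over a document stored elsewhere online.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AnnotatedObject:
    """One node in the network. All field names are hypothetical."""
    id: str
    kind: str                      # e.g. "page", "region", "text", "analysis"
    target: Optional[str] = None   # id of the object this one annotates
    body: Dict[str, str] = field(default_factory=dict)
    creator: str = "unknown"       # a human expert or an automated tool


class ObjectNetwork:
    """In-memory stand-in for the object store (an assumption, not the real one)."""

    def __init__(self) -> None:
        self._objects: Dict[str, AnnotatedObject] = {}

    def add(self, obj: AnnotatedObject) -> AnnotatedObject:
        # An annotation may only point at an object that already exists.
        if obj.target is not None and obj.target not in self._objects:
            raise KeyError(f"unknown target: {obj.target}")
        self._objects[obj.id] = obj
        return obj

    def annotations_of(self, target_id: str) -> List[AnnotatedObject]:
        """Return the objects that directly annotate target_id."""
        return [o for o in self._objects.values() if o.target == target_id]

    def layers_above(self, root_id: str) -> List[str]:
        """Breadth-first walk over everything layered on top of root_id."""
        kinds: List[str] = []
        queue = deque([root_id])
        while queue:
            current = queue.popleft()
            for child in self.annotations_of(current):
                kinds.append(child.kind)
                queue.append(child.id)
        return kinds


net = ObjectNetwork()
# The page image lives elsewhere online; the network only overlays it.
net.add(AnnotatedObject("p1", "page", body={"source": "https://example.org/ms/1.jpg"}))
net.add(AnnotatedObject("r1", "region", target="p1", body={"bbox": "10,20,300,40"}))
# An automated tool proposes text; a human expert layers a correction on the same region.
net.add(AnnotatedObject("t1", "text", target="r1", body={"text": "draft"}, creator="ocr-tool"))
net.add(AnnotatedObject("t2", "text", target="r1", body={"text": "corrected"}, creator="expert"))
net.add(AnnotatedObject("a1", "analysis", target="t2", body={"note": "word split"}, creator="parser"))

print(net.layers_above("p1"))
```

In this sketch the tool output and the expert correction coexist as sibling annotations on the same region, so later layers (here, the linguistic analysis) can choose which one to build on. That is one plausible reading of the human-in-the-loop curation described in the abstract; the paper itself defines the actual object model.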