Documents Digitization Project
This project decentralizes document storage, extracts data from documents, and builds a data analysis engine and reporting dashboards on top of the extracted and extrapolated data. It depends on data extraction from various sources such as prints, labels, and documents. The extraction engine handles image conversion and extraction from these sources, image refinement, and automatic correction of image boundaries, using image processing algorithms custom-written to the business rules. The processed images, also known as partdata, are then fed to the extraction engine, which applies various OCR extraction and phonetic correction techniques to generate data and moves the partdata to the cloud for archiving. The extracted data is then stored in various data stores, depending on the type of data collected, and made available to the analysis engine, which prepares it for analysis. The system also comprises a core pattern recognition engine that looks for patterns in the data to derive deeper insights, which are delivered to end users as custom dashboard reports.
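The phonetic correction step mentioned above can be illustrated with a minimal sketch. The project's actual technique and code are not described, so everything here is an assumption: the sketch is in Python rather than the project's C# stack, it uses the classic Soundex code as the phonetic key, and the function names (soundex, correct_token) and the sample vocabulary are hypothetical.

```python
from difflib import SequenceMatcher

# Letter-to-digit table of the classic Soundex code.
_CODES = {}
for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex key of word ('' if it has no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return (encoded + "000")[:4]


def correct_token(token, vocabulary):
    """Map an OCR token to a known term that sounds the same, if one exists.

    vocabulary is a hypothetical list of terms expected on the document.
    Tokens with no phonetic match are returned unchanged.
    """
    known = {w.lower(): w for w in vocabulary}
    if token.lower() in known:
        return known[token.lower()]
    key = soundex(token)
    candidates = [w for w in vocabulary if key and soundex(w) == key]
    if not candidates:
        return token
    # Among same-sounding terms, prefer the one closest in spelling.
    return max(candidates,
               key=lambda w: SequenceMatcher(None, token.lower(), w.lower()).ratio())


if __name__ == "__main__":
    terms = ["Invoice", "Receipt", "Quantity"]
    for raw in ["Invoise", "Reciept", "Qty"]:
        print(raw, "->", correct_token(raw, terms))
```

In this sketch a token is only replaced when a vocabulary term shares its phonetic key, so unrecognized tokens such as abbreviations pass through untouched rather than being forced onto the nearest word.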
Technology Stack: C#.NET, MVC, WPF (for an internal standalone tool to update metadata when required), SQL Server for creating data templates and saving new templates created on the fly, MongoDB for storing document-based data, Hadoop for big data analytics, a Linux- and Windows-based server farm, Tesseract OCR, and Amazon Web Services (AWS) for on-demand computing power when analyzing big data.
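The split between stores (templates in SQL Server, document-based data in MongoDB) suggests a dispatch by record type. The following is only a sketch of that idea, written in Python rather than the project's C#: the two lists are in-memory stand-ins for the real databases, and the "kind" field, the ROUTES table, and the store function are hypothetical names introduced for illustration.

```python
# In-memory stand-ins (hypothetical) for the SQL Server and MongoDB stores.
sql_templates = []
mongo_documents = []

# Assumed routing table: record kind -> function that persists the record.
ROUTES = {
    "template": sql_templates.append,
    "document": mongo_documents.append,
}


def store(record):
    """Send a record to the store configured for its kind and return the kind."""
    kind = record.get("kind")
    if kind not in ROUTES:
        raise ValueError(f"no store configured for record kind {kind!r}")
    ROUTES[kind](record)
    return kind


if __name__ == "__main__":
    store({"kind": "template", "name": "invoice-v1"})
    store({"kind": "document", "fields": {"total": "120.00"}})
    print(len(sql_templates), len(mongo_documents))
```

Keeping the routing in one table means a new data type only needs a new entry, not a change to the code that calls store.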