dc.contributor.author | Makan, Rakshit
dc.date.accessioned | 2022-12-15T12:58:45Z
dc.date.available | 2022-12-15T12:58:45Z
dc.date.issued | 2022-12-14
dc.identifier.uri | http://hdl.handle.net/10222/82147
dc.description.abstract | Real estate administrative documents present a unique challenge: each condo corporation's documents come from multiple sources and in varying formats. Moreover, the records vary in size, making it very time-consuming to read and recognize these documents systematically. A condo buyer must understand these documents before proceeding with the purchase. Our industry partner is a company that gathers and summarizes a condo corporation's documents and highlights essential information that the buyer should consider. In this process, the most laborious task is categorizing the segmented documents, where the number of categories is unknown and changes based on the domain expert's knowledge. It is not possible to create an automatic classification system for such challenging data without the involvement of domain experts. We therefore propose a user-centric, scalable, and reliable multi-page document classification solution for our industry partner using a human-in-the-loop (HITL) framework. It is a web application with an interactive annotation system through which the user provides feedback at the page and document level to train a two-step classification system. The classification pipeline is a modular system that incrementally learns from the domain expert's page-level and document-level annotations independently. We evaluated the level of feedback required by the HITL system by comparing our proposed system quantitatively and qualitatively with document-level input via a user study. In the same study, we also assessed the need for the HITL system in real-world classification problems by comparing the automatic system with our proposed page-level annotation system. Unlike previous research, we explicitly used textual features and state-of-the-art (SoTA) language models to determine the predictive model suitable for real estate administrative documents that can be used in an online setting.
In our work, we compared NLP models for multi-class classification of multi-page scanned documents and found that frequency-driven models performed better than deep language models. We demonstrated that the HITL system with page-level annotation works better than both fully automated classification and document-level annotation for a real-world multi-page document classification problem. | en_US
dc.language.iso | en | en_US
dc.subject | Human-in-the-loop | en_US
dc.subject | Multi Page Document Classification | en_US
dc.subject | Document Classification | en_US
dc.subject | Deep Learning | en_US
dc.subject | Natural Language Processing | en_US
dc.subject | Domain Specific Data | en_US
dc.title | Human-in-the-loop Classification for Multi-page Administrative Documents | en_US
dc.type | Thesis | en_US
dc.date.defence | 2022-12-02
dc.contributor.department | Faculty of Computer Science | en_US
dc.contributor.degree | Master of Computer Science | en_US
dc.contributor.external-examiner | n/a | en_US
dc.contributor.graduate-coordinator | Michael McAllister | en_US
dc.contributor.thesis-reader | Enayat Rajabi | en_US
dc.contributor.thesis-reader | Malcolm Heywood | en_US
dc.contributor.thesis-supervisor | Evangelos Milios | en_US
dc.contributor.ethics-approval | Received | en_US
dc.contributor.manuscripts | No | en_US
dc.contributor.copyright-release | No | en_US