Augmentation of Financial Datasets and Evaluating Financial Text Generated By A.I.

Taylor, Stacey Dianne

dc.contributor.author	Taylor, Stacey Dianne
dc.date.accessioned	2024-06-07T18:00:17Z
dc.date.available	2024-06-07T18:00:17Z
dc.date.issued	2024-06-07
dc.identifier.uri	http://hdl.handle.net/10222/84277
dc.description.abstract	Information is fundamental to decision-making. Yet data in the financial domain is very sparse, even though it seems abundant in this era of big data. The work presented in this thesis addresses that scarcity through seven projects that investigate the creation of synthetic financial data, both quantitative and textual. In the first two projects, we examine methods for generating synthetic financial statement data, as well as the effects of synthetic data on a downstream classification task. The next four projects evaluate how well ChatGPT generates textual financial data for the notes to the financial statements and for selected parts of financial reports, as well as how it adapts its responses to the identified knowledge of its end users, ranging from a non-financial user to a financially sophisticated user. The remaining project, on authorship attribution, is of the utmost importance, particularly since company authorship attribution has not yet been studied, to the best of our knowledge. We have author profiles and a good understanding of identified authors such as William Shakespeare, Mary Shelley, or George Washington, but we do not yet have that depth of understanding and identifiability for corporate communication. This attribution task is non-trivial: lengthy corporate communication is often written collaboratively by many authors, many (or all) of whom are never identified, with further contributions from non-writing authors who, for example, vet, review, or sign off on the text. Because of this plethora of unidentified authors, we have to treat the text as the work of a single "figurehead" author, with the understanding that many (likely unidentified) authors, writing and non-writing, have contributed to it. In our experiments, the Common N-Gram Distance algorithm provided the best and most consistent results, achieving between 95% and 100% accuracy for character n-grams and 100% accuracy for word n-grams.
Tools like ChatGPT can be exploited to commit fraud. Given the potential for significant harm to the capital markets, tools that can quickly detect fraudulent corporate communication will be needed. Our research contributes to that effort.	en_US
dc.language.iso	en	en_US
dc.subject	Machine Learning	en_US
dc.subject	Natural Language Processing	en_US
dc.subject	Generative AI	en_US
dc.subject	Accounting	en_US
dc.subject	Finance	en_US
dc.title	Augmentation of Financial Datasets and Evaluating Financial Text Generated By A.I.	en_US
dc.date.defence	2024-05-02
dc.contributor.department	Faculty of Computer Science	en_US
dc.contributor.degree	Doctor of Philosophy	en_US
dc.contributor.external-examiner	Dr. Howard Hamilton	en_US
dc.contributor.thesis-reader	Dr. Evangelos Milios	en_US
dc.contributor.thesis-reader	Dr. Malcolm Heywood	en_US
dc.contributor.thesis-reader	Dr. Vladimir Lucic	en_US
dc.contributor.thesis-supervisor	Dr. Vlado Keselj	en_US
dc.contributor.ethics-approval	Not Applicable	en_US
dc.contributor.manuscripts	Yes	en_US
dc.contributor.copyright-release	Not Applicable	en_US

Files in this item

Name: StaceyTaylor2024.pdf
Size: 7.449 MB
Format: PDF

This item appears in the following Collection(s)

Faculty of Graduate Studies Online Theses
