Safety Desk: Extraction and analysis of textual information to build a reporting system

Ferreira, Bruno Carlos Luís

Please use this identifier to cite or link to this item: https://hdl.handle.net/10316/102975

DC Field	Value	Language
dc.contributor.advisor	Silva, Catarina Helena Branco Simões da	-
dc.contributor.advisor	Oliveira, Hugo Ricardo Gonçalo	-
dc.contributor.author	Ferreira, Bruno Carlos Luís	-
dc.date.accessioned	2022-10-17T22:02:28Z	-
dc.date.available	2022-10-17T22:02:28Z	-
dc.date.issued	2022-09-16	-
dc.date.submitted	2022-10-17	-
dc.identifier.uri	https://hdl.handle.net/10316/102975	-
dc.description	Master's dissertation in Informatics Engineering (Mestrado em Engenharia Informática) presented to the Faculdade de Ciências e Tecnologia	-
dc.description.abstract	As the amount of available data grows, working with large volumes of text has become more laborious and time-consuming. Companies and organizations therefore need intelligent algorithms that automate manual work in order to reduce human effort and expenses and to make processes less error-prone and more efficient. The Safety Desk project outlined in this dissertation, developed in collaboration with Instituto Pedro Nunes and Talent Ingredient, aims to optimize the current process by which Talent Ingredient generates reports on chemical substances, saving both human resources and time. This process is very important to the company, since the generated reports are the product sold in its business model, so integrating an automated system into the platform currently in use (Cosmedesk) is one of the company's objectives. This thesis discusses the importance of Information Extraction (IE) and Machine Reading Comprehension (MRC) for acquiring information from unstructured data, in this project PDF documents, and presents the work carried out to implement the pipeline proposed for Safety Desk. The proposed pipeline has five phases: (1) the Preprocessing Phase, in which the document is divided into sections that provide the right inputs to the Question Answering (QA) models; (2) the IE Process, which uses extractive QA models that, given a context (a section obtained in the first phase) and a question, extract the answer they predict to be correct; (3) the Data Verification Phase, in which the information extracted in the second phase is cleaned; (4) the Data-to-text (D2T) Phase, which generates a toxicological profile of the chemical substance; and (5) a RESTful API through which the Safety Desk service can be integrated, with endpoints created to establish communication between the current platform, Cosmedesk, and Safety Desk. In the evaluations performed, the work achieved solid results (0.74 F-Score, 0.78 Precision, 0.71 Recall and 0.77 Accuracy) on the documents used, although in terms of execution time Safety Desk analysed an average of 191 tokens/second, so an average document of 30,000 tokens takes roughly 2 min 30 s.	eng
dc.description.abstract	À medida que a quantidade de dados disponíveis cresce, trabalhar com grandes quantidades de dados de texto tornou-se agitado e mais demorado. Portanto, empresas e organizações precisam contar com técnicas e algoritmos para automatizar o trabalho manual com algoritmos inteligentes, a fim de reduzir o esforço humano, reduzir despesas e tornar o processo menos propenso a erros e mais eficiente. O projeto Safety Desk detalhado nesta dissertação, em colaboração com o Instituto Pedro Nunes e Talent Ingredient, visa otimizar o atual processo de geração de relatórios de substâncias químicas feito pela empresa Talent Ingredient, tanto em termos de economia de recursos humanos como em economia de tempo. Esse processo é muito importante para a empresa, pois os relatórios gerados são o produto de venda no modelo de negócios, portanto a integração de um sistema automatizado na plataforma atualmente utilizada (Cosmedesk) é um objetivo da empresa Talent Ingredient. Dito isso, esta dissertação discute a importância da Extração de Informação (IE) e da Compreensão de Leitura de Máquina (MRC) na aquisição de informações a partir de dados não estruturados, no caso deste projeto documentos PDF, e expõe o trabalho desenvolvido na implementação do pipeline proposto para o projeto Safety Desk. O pipeline proposto é composto por cinco fases: (1) a Fase de Pré-processamento onde o documento é dividido em seções para fornecer as entradas corretas para os modelos Questão Resposta (QA) utilizados. (2) O processo EI que usa modelos Extrativos QA que, dado um contexto, i.e., as seções obtidas da primeira fase do pipeline, e pergunta, extrai a resposta que prevê estar correta. (3) A Fase de Verificação de Dados é onde as informações extraídas da segunda fase são limpas e (4) a Fase Geração de Linguagem Natural gera um perfil toxicológico da substância química. Por fim, o serviço Safety Desk pode ser integrado através de uma (5) RESTful API implementada, onde foram criados endpoints para estabelecer a comunicação na plataforma, Cosmedesk, e o Safety Desk. Nas avaliações realizadas, o trabalho desenvolvido apresentou resultados sólidos (0.74 F-Score, 0.78 Precision, 0.71 Recall e 0.77 Accuracy) para os documentos utilizados, embora, em termos de execução, o Safety Desk processou em média 191 tokens/segundo, que numa média de 30.000 tokens por documento demora 2’30 minutos a processar.	por
dc.description.sponsorship	Universidade de Coimbra - This work was partially funded by: the project SafetyDesk: Smart Toxicological Analysis of Chemical Substances (CENTRO-01-0247-FEDER-113485), co-financed by the European Regional Development Fund (FEDER), through Portugal 2020 (PT2020), and by the Regional Operational Programme Centro 2020; and national funds through the FCT – Foundation for Science and Technology, I.P., within the scope of the project CISUC – UID/CEC/00326/2020 and by the European Social Fund, through the Regional Operational Program Centro 2020.	-
dc.language.iso	eng	-
dc.rights	openAccess	-
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	-
dc.subject	Extração de Informação	por
dc.subject	Compreensão de Leitura de Máquina	por
dc.subject	Resposta a Perguntas	por
dc.subject	Transformers	por
dc.subject	Geração de Dados para Texto	por
dc.subject	Information Extraction	eng
dc.subject	Machine Reading Comprehension	eng
dc.subject	Question Answering	eng
dc.subject	Transformers	eng
dc.subject	Data-to-text Generation	eng
dc.title	Safety Desk: Extraction and analysis of textual information to build a reporting system	eng
dc.title.alternative	Safety Desk: Recolha e análise de informação textual para construção de um sistema de relatórios	por
dc.type	masterThesis	-
degois.publication.location	DEI - FCTUC	-
degois.publication.title	Safety Desk: Extraction and analysis of textual information to build a reporting system	eng
dc.peerreviewed	yes	-
dc.identifier.tid	203077814	-
thesis.degree.discipline	Informática	-
thesis.degree.grantor	Universidade de Coimbra	-
thesis.degree.level	1	-
thesis.degree.name	Mestrado em Engenharia Informática	-
uc.degree.grantorUnit	Faculdade de Ciências e Tecnologia - Departamento de Engenharia Informática	-
uc.degree.grantorID	0500	-
uc.contributor.author	Ferreira, Bruno Carlos Luís::0000-0002-1148-8887	-
uc.degree.classification	18	-
uc.degree.presidentejuri	Barata, João Nuno Lopes	-
uc.degree.elementojuri	Silva, Catarina Helena Branco Simões da	-
uc.degree.elementojuri	Macedo, Luís Miguel Machado Lopes	-
uc.contributor.advisor	Silva, Catarina Helena Branco Simões da	-
uc.contributor.advisor	Oliveira, Hugo Ricardo Gonçalo	-
item.openairetype	masterThesis	-
item.fulltext	With full text	-
item.languageiso639-1	en	-
item.grantfulltext	open	-
item.cerifentitytype	Publications	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
Appears in Collections:	UC - Dissertações de Mestrado
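
The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is not taken from the dissertation: it assumes the reported F-Score is the standard F1 (harmonic mean of precision and recall) and that throughput is constant across a document. The function names are hypothetical. If the reported metrics were averaged per document, F1 computed from the averaged precision and recall need not match the averaged F1 exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumes F-Score means F1)."""
    return 2 * precision * recall / (precision + recall)


def processing_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to analyse a document, assuming constant throughput."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    # Values quoted in the abstract.
    f1 = f1_score(precision=0.78, recall=0.71)
    seconds = processing_seconds(tokens=30_000, tokens_per_second=191)
    minutes, rest = divmod(round(seconds), 60)
    print(f"F1 = {f1:.2f}")
    print(f"time = {minutes} min {rest} s")
```

Under these assumptions the F1 value agrees with the reported 0.74, and the estimated time of about 157 seconds is in line with the roughly two and a half minutes stated in the abstract.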

Files in This Item:

File	Description	Size	Format
TESE___Safety_Desk___Final_Super_Final_Report.pdf		4.96 MB	Adobe PDF

Page view(s): 99 (checked on Jul 16, 2024)

Download(s): 104 (checked on Jul 16, 2024)

This item is licensed under a Creative Commons License