Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study

Vaghela, Uddhav; Rabinowicz, Simon; Bratsos, Paris; Martin, Guy; Fritzilas, Epameinondas; Markar, Sheraz; Purkayastha, Sanjay; Stringer, Karl; Singh, Harshdeep; Llewellyn, Charlie; Dutta, Debabrata; Clarke, Jonathan M.; Howard, Matthew; Serban, Ovidiu; Kinross, James; Sá-Marta, Eduarda; et al.

doi:10.2196/25714

Please use this identifier to cite or link to this item: https://hdl.handle.net/10316/105182

DC Field	Value	Language
dc.contributor.author	Vaghela, Uddhav	-
dc.contributor.author	Rabinowicz, Simon	-
dc.contributor.author	Bratsos, Paris	-
dc.contributor.author	Martin, Guy	-
dc.contributor.author	Fritzilas, Epameinondas	-
dc.contributor.author	Markar, Sheraz	-
dc.contributor.author	Purkayastha, Sanjay	-
dc.contributor.author	Stringer, Karl	-
dc.contributor.author	Singh, Harshdeep	-
dc.contributor.author	Llewellyn, Charlie	-
dc.contributor.author	Dutta, Debabrata	-
dc.contributor.author	Clarke, Jonathan M.	-
dc.contributor.author	Howard, Matthew	-
dc.contributor.author	Serban, Ovidiu	-
dc.contributor.author	Kinross, James	-
dc.contributor.author	Sá-Marta, Eduarda	-
dc.contributor.author	et al.	-
dc.date.accessioned	2023-02-08T09:23:49Z	-
dc.date.available	2023-02-08T09:23:49Z	-
dc.date.issued	2021-05-06	-
dc.identifier.issn	1438-8871	pt
dc.identifier.uri	https://hdl.handle.net/10316/105182	-
dc.description.abstract	Background: The scale and quality of the global scientific response to the COVID-19 pandemic have unquestionably saved lives. However, the COVID-19 pandemic has also triggered an unprecedented "infodemic"; the velocity and volume of data production have overwhelmed many key stakeholders such as clinicians and policy makers, as they have been unable to process structured and unstructured data for evidence-based decision making. Solutions that aim to alleviate this data synthesis-related challenge are unable to capture heterogeneous web data in real time for the production of concomitant answers and are not based on the high-quality information in responses to a free-text query. Objective: The main objective of this project is to build a generic, real-time, continuously updating curation platform that can support the data synthesis and analysis of a scientific literature framework. Our secondary objective is to validate this platform and the curation methodology for COVID-19–related medical literature by expanding the COVID-19 Open Research Dataset via the addition of new, unstructured data. Methods: To create an infrastructure that addresses our objectives, the PanSurg Collaborative at Imperial College London has developed a unique data pipeline based on a web crawler extraction methodology. This data pipeline uses a novel curation methodology that adopts a human-in-the-loop approach for the characterization of quality, relevance, and key evidence across a range of scientific literature sources. Results: REDASA (Realtime Data Synthesis and Analysis) is now one of the world’s largest and most up-to-date sources of COVID-19–related evidence; it consists of 104,000 documents. By capturing curators’ critical appraisal methodologies through the discrete labeling and rating of information, REDASA rapidly developed a foundational, pooled, data science data set of over 1400 articles in under 2 weeks. These articles provide COVID-19–related information and represent around 10% of all papers about COVID-19. Conclusions: This data set can act as ground truth for the future implementation of a live, automated systematic review. The three benefits of REDASA’s design are as follows: (1) it adopts a user-friendly, human-in-the-loop methodology by embedding an efficient, user-friendly curation platform into a natural language processing search engine; (2) it provides a curated data set in the JavaScript Object Notation format for experienced academic reviewers’ critical appraisal choices and decision-making methodologies; and (3) due to the wide scope and depth of its web crawling method, REDASA has already captured one of the world’s largest COVID-19–related data corpora for searches and curation.	pt
dc.description.sponsorship	This work was supported by Defence and Security Accelerator (grant ACC2015551), the Digital Surgery Intelligent Operating Room Grant, the National Institute for Health Research Long-limb Gastric Bypass RCT Study, the Jon Moulton Charitable Trust Diabetes Bariatric Surgery Grant, the National Institute for Health Research (grant II-OL-1116-10027), the National Institutes of Health (grant R01-CA204403-01A1), Horizon H2020 (ITN GROWTH), and the Imperial Biomedical Research Centre.	-
dc.language.iso	eng	pt
dc.publisher	JMIR Publications Inc.	pt
dc.rights	openAccess	pt
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	pt
dc.subject	structured data synthesis	pt
dc.subject	data science	pt
dc.subject	critical analysis	pt
dc.subject	web crawl data	pt
dc.subject	pipeline	pt
dc.subject	database	pt
dc.subject	literature	pt
dc.subject	research	pt
dc.subject	COVID-19	pt
dc.subject	infodemic	pt
dc.subject	decision making	pt
dc.subject	data	pt
dc.subject	data synthesis	pt
dc.subject	misinformation	pt
dc.subject	infrastructure	pt
dc.subject	methodology	pt
dc.subject.mesh	COVID-19	pt
dc.subject.mesh	Data Interpretation, Statistical	pt
dc.subject.mesh	Datasets as Topic	pt
dc.subject.mesh	Humans	pt
dc.subject.mesh	Internet	pt
dc.subject.mesh	Longitudinal Studies	pt
dc.subject.mesh	SARS-CoV-2	pt
dc.subject.mesh	Search Engine	pt
dc.subject.mesh	Natural Language Processing	pt
dc.title	Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study	pt
dc.type	article	-
degois.publication.firstPage	e25714	pt
degois.publication.issue	5	pt
degois.publication.title	Journal of Medical Internet Research	pt
dc.peerreviewed	yes	pt
dc.identifier.doi	10.2196/25714	pt
degois.publication.volume	23	pt
dc.date.embargo	2021-05-06	*
uc.date.periodoEmbargo	0	pt
item.grantfulltext	open	-
item.cerifentitytype	Publications	-
item.languageiso639-1	en	-
item.openairetype	article	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.fulltext	With full text	-
Appears in Collections:	FMUC Medicina - Artigos em Revistas Internacionais

Files in This Item:

File	Description	Size	Format
Using-a-Secure-Continually-Updating-Web-Source-Processing-Pipeline-to-Support-the-RealTime-Data-Synthesis-and-Analysis-of-Scientific-Literature-Development-and-Validation-StudyJournal-of-Medical-Internet-Researc.pdf		1.59 MB	Adobe PDF

SCOPUS™ Citations: 1 (checked on Apr 29, 2024)
WEB OF SCIENCE™ Citations: 1 (checked on May 2, 2024)
Page view(s): 25 (checked on May 7, 2024)
Download(s): 29 (checked on May 7, 2024)
