Strategies for the analysis of large social media corpora: sampling and keyword extraction methods

dc.centro: Facultad de Filosofía y Letras
dc.contributor.author: Moreno-Ortiz, Antonio Jesús
dc.contributor.author: García Gámez, María
dc.date.accessioned: 2023-05-03T08:29:04Z
dc.date.available: 2023-05-03T08:29:04Z
dc.date.issued: 2023
dc.departamento: Filología Inglesa, Francesa y Alemana
dc.description.abstract: In the context of the COVID-19 pandemic, social media platforms such as Twitter have been of great importance for users to exchange news, ideas, and perceptions. Researchers from fields such as discourse analysis and the social sciences have resorted to this content to explore public opinion and stance on this topic, and they have tried to gather information through the compilation of large-scale corpora. However, the size of such corpora is both an advantage and a drawback, as simple text retrieval techniques and tools may prove to be impractical or altogether incapable of handling such masses of data. This study provides methodological and practical cues on how to manage the contents of a large-scale social media corpus such as Chen et al. (JMIR Public Health Surveill 6(2):e19273, 2020) COVID-19 corpus. We compare and evaluate, in terms of efficiency and efficacy, available methods to handle such a large corpus. First, we compare different sample sizes to assess whether it is possible to achieve similar results despite the size difference and evaluate sampling methods following a specific data management approach to storing the original corpus. Second, we examine two keyword extraction methodologies commonly used to obtain a compact representation of the main subject and topics of a text: the traditional method used in corpus linguistics, which compares word frequencies using a reference corpus, and graph-based techniques as developed in Natural Language Processing tasks. The methods and strategies discussed in this study enable valuable quantitative and qualitative analyses of an otherwise intractable mass of social media data.
dc.description.sponsorship: Funding for open access publishing: Universidad de Málaga/CBUA. This work was funded by the Spanish Ministry of Science and Innovation [Grant No. PID2020-115310RB-I00], the Regional Government of Andalusia [Grant No. UMA18-FEDERJA-158] and the Spanish Ministry of Education and Vocational Training [Grant No. FPU 19/04880].
dc.identifier.citation: Moreno-Ortiz, A., García-Gámez, M. Strategies for the Analysis of Large Social Media Corpora: Sampling and Keyword Extraction Methods. Corpus Pragmatics (2023). https://doi.org/10.1007/s41701-023-00143-0
dc.identifier.doi: 10.1007/s41701-023-00143-0
dc.identifier.uri: https://hdl.handle.net/10630/26442
dc.language.iso: eng
dc.publisher: Springer
dc.rights: Attribution 4.0 International
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: COVID-19 - Data processing
dc.subject: Social media
dc.subject: Databases - Management
dc.subject.other: COVID-19 language
dc.subject.other: Large-scale social media corpus
dc.subject.other: Sampling methods
dc.subject.other: Sampling sizes
dc.subject.other: Keyword extraction
dc.title: Strategies for the analysis of large social media corpora: sampling and keyword extraction methods
dc.type: journal article
dc.type.hasVersion: VoR
dspace.entity.type: Publication
relation.isAuthorOfPublication: 3233c4af-5a32-40f2-9c82-103bc48c43cd
relation.isAuthorOfPublication.latestForDiscovery: 3233c4af-5a32-40f2-9c82-103bc48c43cd

Files

Original bundle

Name: s41701-023-00143-0.pdf
Size: 1.31 MB
Format: Adobe Portable Document Format