The analysis reported in this Highlight section on the INCA project website is based on a dataset of media content related to the GAFAM companies from across European countries.
For fourteen countries (Austria, Belgium, Bulgaria, Czech Republic, Denmark, France, Germany, Ireland, Italy, Poland, Portugal, Spain, Switzerland, United Kingdom), we retrieved all news
content from the biggest daily newspapers (where available) published between 2007 and 2022 that was tagged as being about the GAFAM companies.
For Estonia, Finland and the Netherlands, we retrieved media content from the selected news outlets that mentioned the GAFAM companies in the title.
Country | Number of articles | Sources |
---|---|---|
Austria | 2783 | Kurier, Der Standard, Kronen Zeitung, Kleine Zeitung |
Belgium | 338 | Le Soir |
Bulgaria | 162 | Dnevnik, 24 Chasa |
Czech Republic | 218 | Pravo |
Denmark | 171 | Morgenavisen Jyllands-Posten |
Estonia | 8833 | Delfi, Postimees, Õhtuleht, ER |
Finland | 5066 | YLE |
France | 10557 | Le Figaro, Les Echos, Le Parisien, Capital Finance, Le Particulier Pratique, Capitalfinance.fr, Le Particulier |
Germany | 15422 | Süddeutsche Zeitung, Bild |
Ireland | 6749 | Irish Times, Irish Independent |
Italy | 4911 | Corriere della Sera, La Repubblica |
Netherlands | 4642 | de Volkskrant, De Telegraaf, AD/Algemeen Dagblad |
Poland | 2582 | Rzeczpospolita, Gazeta Wyborcza, Fakt |
Portugal | 3147 | Público, Correio da Manhã, Jornal de Notícias |
Spain | 5421 | El Mundo, El Pais |
Switzerland | 3857 | Tages Anzeiger, 20 Minuten |
United Kingdom | 17710 | The Times, The Daily Telegraph, Daily Mail, The Sun, Guardian Unlimited, The Sunday Times |
In order to analyse all these texts together with a single model, the following pre-processing steps were carried out. First, all texts that were not in English were machine translated into English using the “facebook/nllb-200-distilled-600M” model via the Python library “dl_translate” (Tang et al. 2020; Fan et al. 2022; NLLB Team 2022). It has been shown that machine translation is not detrimental to the results of the kinds of models that we use (Reber 2019). The English translations were then pre-processed using the “en_core_web_lg” model from the spaCy Python library: all words were lemmatized (converted to their base form) and only nouns, verbs and adjectives were retained. Additionally, we used named entity recognition to detect people’s names, and these names were also included in the following analyses. Finally, all terms that appeared in less than 0.0005 of the documents (about 67 documents; this threshold excludes very rare terms that usually contribute little to the overall analysis) were dropped before the topic modelling analysis.
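The pipeline just described can be sketched roughly as follows. This is an illustrative outline rather than the project's actual code: the function names, the representation of the corpus as (text, language) pairs, and the exact dl_translate call are our own assumptions and may need adjusting to the library version used.

```python
# Illustrative sketch of the pre-processing steps described above.
from collections import Counter

import dl_translate as dlt
import spacy

# 1) Machine-translate non-English articles into English with the NLLB model.
#    "nllb200" is dl_translate's shorthand for facebook/nllb-200-distilled-600M.
mt = dlt.TranslationModel("nllb200")

def to_english(text, source_lang):
    if source_lang == "English":
        return text
    return mt.translate(text, source=source_lang, target="English")

# 2) Lemmatize, keep only nouns, verbs and adjectives, and add person names from NER.
nlp = spacy.load("en_core_web_lg")

def tokenize(text):
    doc = nlp(text)
    tokens = [t.lemma_.lower() for t in doc if t.pos_ in {"NOUN", "VERB", "ADJ"}]
    persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    return tokens + persons

# 3) Drop terms that appear in less than 0.0005 of the documents
#    (about 67 documents in this corpus, as noted above).
def filter_rare_terms(tokenized_docs, min_share=0.0005):
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    min_docs = min_share * len(tokenized_docs)
    keep = {term for term, df in doc_freq.items() if df >= min_docs}
    return [[t for t in tokens if t in keep] for tokens in tokenized_docs]
```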
Topic modelling, most commonly in the form of Latent Dirichlet Allocation (LDA) introduced by Blei et al. (2003; see also Blei 2012 for a short overview), has gained considerable ground over the last two decades as a tool for exploring and classifying the content of large text corpora, and it is now one of the most common methods for analysing extensive text collections. Topic modelling is an unsupervised, inductive method that automatically detects topics in a corpus, although more complex versions allow the model to be guided towards specific discourses that can to some extent be pre-defined (see Eshima et al. 2023). The model outputs consist of probabilities for each text in the corpus to belong to any of the detected topics, and probabilities for each word in the corpus to belong to each of the topics.

Regarded as a statistical language model (DiMaggio et al. 2013), it estimates these probabilities by, in a way, reverse engineering how a text is produced in natural language. It assumes that texts can be made up of various topics and that the same words can be part of different topics. The model iteratively adjusts the probabilities of topics for texts and the probabilities of words for topics until these probabilities reproduce the original word frequencies in the texts as closely as possible. The model does not consider the sequence of words in a document, only their overall frequency - it is a so-called bag-of-words model. Despite this, it reflects the idea that meaning is relational (Mohr and Bogdanov 2013), because it groups words into topics in such a way that some words have a higher probability of occurring together in texts than others. It is therefore especially relevant for the analysis of concepts like framing, polysemy, and heteroglossia (DiMaggio et al. 2013). A topic, in the context of this method, is a probability distribution over the vocabulary of the corpus, which helps to identify words that are likely to co-occur in texts. These co-occurring words usually share a common theme or discourse, often interpreted as a frame that presents a specific viewpoint (DiMaggio et al. 2013; Heidenreich et al. 2019; Gilardi et al. 2020; Ylä-Anttila et al. 2021). The method's ability to capture the relationality of meaning also aligns it with various strands of discourse analysis, from critical discourse analysis to post-structuralist theories of discourse (Aranda et al. 2021; Jacobs and Tschötschel 2019).

There are various statistical implementations of topic models. In our analysis we use the “tomotopy” library (Lee 2022) in Python, because of its speed of estimation as well as its functionality. The package implements the basic LDA model (Blei et al. 2003) as well as various subsequent extensions of it. For the analysis reported here, we used the basic model because of the exploratory nature of the task: estimating a more complicated topic model can be computationally very demanding, especially for a large corpus of text, while fitting the basic model is relatively fast.
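As an illustration of what fitting such a model looks like in practice, the sketch below applies tomotopy's basic LDAModel to the pre-processed token lists. The number of topics, the random seed and the training settings are placeholders chosen for the example, not the values used for the results reported here.

```python
# Minimal sketch of fitting a basic LDA topic model with tomotopy.
import tomotopy as tp

def fit_lda(tokenized_docs, num_topics=50, iterations=1000):
    mdl = tp.LDAModel(k=num_topics, seed=42)
    for tokens in tokenized_docs:
        if tokens:  # skip documents left empty after pre-processing
            mdl.add_doc(tokens)

    # Train in chunks and monitor the per-word log-likelihood.
    for i in range(0, iterations, 100):
        mdl.train(100)
        print(f"Iteration {i + 100}: log-likelihood per word = {mdl.ll_per_word:.4f}")

    # Word probabilities per topic ...
    for k in range(mdl.k):
        top_words = ", ".join(word for word, prob in mdl.get_topic_words(k, top_n=10))
        print(f"Topic {k}: {top_words}")

    # ... and topic probabilities per document.
    doc_topic_dists = [doc.get_topic_dist() for doc in mdl.docs]
    return mdl, doc_topic_dists
```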
There is no gold standard for sentiment analysis - i.e. the detection of emotional content in textual data - and various approaches, both dictionary-based and based on machine learning and language models, have been suggested and validated in recent years. A dictionary-based approach uses a sentiment dictionary, a pre-defined set of, for example, positive and negative words, to count emotional words in a text; such counts then characterise the overall emotional content of the text. In recent years these approaches have been supplemented by language models trained to classify emotions on the basis of annotated texts whose level of emotionality is known. As a first step in our sentiment analysis of the combined GAFAM and media text corpus, we used a selection of these methods to determine the emotionality of the texts: