The analysis reported in this Highlight section on the INCA project website is based on a dataset of media content related to the GAFAM companies from across European countries.
For fourteen countries (Austria, Belgium, Bulgaria, Czech Republic, Denmark, France, Germany, Ireland, Italy, Poland, Portugal, Spain, Switzerland, United Kingdom), we retrieved all news
content from the biggest daily newspapers (where available) published between 2007 and 2022 that was tagged as being about the GAFAM companies.
For Estonia, Finland and the Netherlands, we retrieved media content from the selected news outlets that mentioned the GAFAM companies in the title.
Country | Number of articles | Sources |
---|---|---|
Austria | 2783 | Kurier, Der Standard, Kronen Zeitung, Kleine Zeitung |
Belgium | 338 | Le Soir |
Bulgaria | 162 | Dnevnik, 24 Chasa |
Czech Republic | 218 | Pravo |
Denmark | 171 | Morgenavisen Jyllands-Posten |
Estonia | 8833 | Delfi, Postimees, Õhtuleht, ER |
Finland | 5066 | YLE |
France | 10557 | Le Figaro, Les Echos, Le Parisien, Capital Finance, Le Particulier Pratique, Capitalfinance.fr, Le Particulier |
Germany | 15422 | Süddeutsche Zeitung, Bild |
Ireland | 6749 | Irish Times, Irish Independent |
Italy | 4911 | Corriere della Sera, La Repubblica |
Netherlands | 4642 | de Volkskrant, De Telegraaf, AD/Algemeen Dagblad |
Poland | 2582 | Rzeczpospolita, Gazeta Wyborcza, Fakt |
Portugal | 3147 | Público, Correio da Manhã, Jornal de Notícias |
Spain | 5421 | El Mundo, El Pais |
Switzerland | 3857 | Tages Anzeiger, 20 Minuten |
United Kingdom | 17710 | The Times, The Daily Telegraph, Daily Mail, The Sun, Guardian Unlimited, The Sunday Times |
In order to analyse all these texts together with a single model, the following pre-processing steps were carried out. First, all texts that were not in English were machine translated into English using the “facebook/nllb-200-distilled-600M” model via the Python library “dl_translate” (Tang et al. 2020; Fan et al. 2022; NLLB Team 2022). It has been shown that machine translation is not detrimental to the results of the kinds of models that we use (Reber 2019). The English translations were then pre-processed using the “en_core_web_lg” model from the spaCy Python library: all words were lemmatized (converted to their base form) and only nouns, verbs and adjectives were retained. Additionally, we used named entity recognition to detect people’s names, and these names were also included in the following analyses. Finally, all terms that appeared in less than 0.0005 of the documents (about 67 documents; this threshold excludes very rare terms that usually contribute little to the overall analysis) were dropped before the topic modelling analysis.
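The pipeline just described can be sketched roughly as follows. This is an illustrative outline rather than the project's actual code: the function names, the representation of the corpus as (text, language) pairs, and the exact dl_translate call are our own assumptions and may need adjusting to the library version used.

```python
# Illustrative sketch of the pre-processing steps described above.
from collections import Counter

import dl_translate as dlt
import spacy

# 1) Machine-translate non-English articles into English with the NLLB model.
#    "nllb200" is dl_translate's shorthand for facebook/nllb-200-distilled-600M.
mt = dlt.TranslationModel("nllb200")

def to_english(text, source_lang):
    if source_lang == "English":
        return text
    return mt.translate(text, source=source_lang, target="English")

# 2) Lemmatize, keep only nouns, verbs and adjectives, and add person names from NER.
nlp = spacy.load("en_core_web_lg")

def tokenize(text):
    doc = nlp(text)
    tokens = [t.lemma_.lower() for t in doc if t.pos_ in {"NOUN", "VERB", "ADJ"}]
    persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    return tokens + persons

# 3) Drop terms that appear in less than 0.0005 of the documents
#    (about 67 documents in this corpus, as noted above).
def filter_rare_terms(tokenized_docs, min_share=0.0005):
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    min_docs = min_share * len(tokenized_docs)
    keep = {term for term, df in doc_freq.items() if df >= min_docs}
    return [[t for t in tokens if t in keep] for tokens in tokenized_docs]
```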
Topic modelling, most commonly in the form of Latent Dirichlet Allocation (LDA) introduced by Blei et al. (2003; see also Blei 2012 for a short overview), has gained considerable ground over the last two decades as a tool for exploring and classifying the content of large text corpora, and it is now one of the most common methods for analysing extensive text collections. Topic modelling is an unsupervised, inductive method that automatically detects topics in a corpus, although more complex versions allow the model to be guided towards specific discourses that can to some extent be pre-defined (see Eshima et al. 2023). The model outputs consist of probabilities for each text in the corpus to belong to any of the detected topics, and probabilities for each word in the corpus to belong to each of the topics.

Regarded as a statistical language model (DiMaggio et al. 2013), it estimates these probabilities by, in a way, reverse engineering how a text is produced in natural language. It assumes that texts can be made up of various topics and that the same words can be part of different topics. The model iteratively adjusts the probabilities of topics for texts and the probabilities of words for topics until these probabilities reproduce the original word frequencies in the texts as closely as possible. The model does not consider the sequence of words in a document, only their overall frequency - it is a so-called bag-of-words model. Despite this, it reflects the idea that meaning is relational (Mohr and Bogdanov 2013), because it groups words into topics in such a way that some words have a higher probability of occurring together in texts than others. It is therefore especially relevant for the analysis of concepts like framing, polysemy, and heteroglossia (DiMaggio et al. 2013). A topic, in the context of this method, is a probability distribution over the vocabulary of the corpus, which helps to identify words that are likely to co-occur in texts. These co-occurring words usually share a common theme or discourse, often interpreted as a frame that presents a specific viewpoint (DiMaggio et al. 2013; Heidenreich et al. 2019; Gilardi et al. 2020; Ylä-Anttila et al. 2021). The method's ability to capture the relationality of meaning also aligns it with various strands of discourse analysis, from critical discourse analysis to post-structuralist theories of discourse (Aranda et al. 2021; Jacobs and Tschötschel 2019).

There are various statistical implementations of topic models. In our analysis we use the “tomotopy” library (Lee 2022) in Python, because of its speed of estimation as well as its functionality. The package implements the basic LDA model (Blei et al. 2003) as well as various subsequent extensions of it. For the analysis reported here, we used the basic model because of the exploratory nature of the task: estimating a more complicated topic model can be computationally very demanding, especially for a large corpus of text, while fitting the basic model is relatively fast.
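As an illustration of what fitting such a model looks like in practice, the sketch below applies tomotopy's basic LDAModel to the pre-processed token lists. The number of topics, the random seed and the training settings are placeholders chosen for the example, not the values used for the results reported here.

```python
# Minimal sketch of fitting a basic LDA topic model with tomotopy.
import tomotopy as tp

def fit_lda(tokenized_docs, num_topics=50, iterations=1000):
    mdl = tp.LDAModel(k=num_topics, seed=42)
    for tokens in tokenized_docs:
        if tokens:  # skip documents left empty after pre-processing
            mdl.add_doc(tokens)

    # Train in chunks and monitor the per-word log-likelihood.
    for i in range(0, iterations, 100):
        mdl.train(100)
        print(f"Iteration {i + 100}: log-likelihood per word = {mdl.ll_per_word:.4f}")

    # Word probabilities per topic ...
    for k in range(mdl.k):
        top_words = ", ".join(word for word, prob in mdl.get_topic_words(k, top_n=10))
        print(f"Topic {k}: {top_words}")

    # ... and topic probabilities per document.
    doc_topic_dists = [doc.get_topic_dist() for doc in mdl.docs]
    return mdl, doc_topic_dists
```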
There is no gold standard for sentiment analysis - i.e. the detection of emotional content in textual data - and various approaches, both dictionary-based and based on machine learning and language models, have been suggested and validated in recent years. A dictionary-based approach uses a sentiment dictionary, a pre-defined set of, for example, positive and negative words, to count emotional words in a text; such counts then characterise the overall emotional content of the text. In recent years these approaches have been supplemented by language models trained to classify emotions on the basis of annotated texts whose level of emotionality is known. As a first step in our sentiment analysis of the combined GAFAM and media text corpus, we used a selection of these methods to determine the emotionality of the texts: