by Anna Vancsó

The aim of the first phase of our research project is to analyse the online public discourse on youth activism in the #ClimateStrike movement using content and discourse analysis methods. In this blog post I will present the difficulties we are currently facing in our data gathering process.

Online data analyses have proliferated recently, and this trend is not only due to the exponential growth in the amount of content on the internet. With recent technologies for data gathering and analysis, collecting data from online media seems easy and manageable. But this image could not be further from reality. Gathering and analysing online media data requires planning and implementation as rigorous as any other methodology in the social sciences. Here are some of the challenges we are trying to overcome in order to create a thorough database of online content that will reflect the online representation of young people’s civic activism:

Dependency on online media monitoring tools

The first and biggest difficulty is our dependence on tools offered by online media monitoring companies. As scientific tools are not yet common in this field, we need to understand how these companies’ methodologies differ from those commonly used in academic research. Another difficulty is that these companies need to protect their exact data gathering processes and sentiment analysis methods for business reasons. These differences can make our research process and the interpretation of the results less reliable and comparable, and thus vulnerable. Therefore, the first step must be to understand the limits of the software and then work around them in order to gather robust data.

Different languages and comparability

Since we are an international team aiming to compare the image of young climate activists in the online sphere, we need a similar data gathering process in each country, and this process is extremely sensitive to language. For example, in the Czech case Greta Thunberg (the famous climate activist) is mostly mentioned simply as Greta, and the majority of the data gathered using this keyword (“Greta”) refers to her. In the Hungarian case, however, most of the results for the keyword “Greta” are out of context; her full name must be used to obtain relevant results. This necessarily leads to significant differences in the keyword structure of the two languages, which makes comparison challenging.
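To make the keyword difference described above more concrete, here is a minimal sketch of how language-specific keyword rules might be written down. It is not the query syntax of any monitoring tool we use; the language codes, the patterns, the function name `is_relevant` and the sample sentences are all invented for illustration, and the patterns only partly cover inflected word forms.

```python
import re

# Hypothetical keyword rules per language (illustrative only).
# Czech: the first name alone is usually enough.
# Hungarian: the full name is needed to get relevant results.
KEYWORD_PATTERNS = {
    "cs": [re.compile(r"\bGreta\b")],
    "hu": [re.compile(r"\bGreta\s+Thunberg")],
}


def is_relevant(text: str, lang: str) -> bool:
    """Return True if the text matches any keyword pattern for the language."""
    return any(pattern.search(text) for pattern in KEYWORD_PATTERNS[lang])


# Invented sample sentences, not taken from our data.
print(is_relevant("Greta promluvila na demonstraci.", "cs"))
print(is_relevant("Greta nyerte a versenyt.", "hu"))
print(is_relevant("Greta Thunberg beszédet mondott.", "hu"))
```

Keeping the rules side by side like this at least makes the asymmetry between the two languages explicit, so it can be documented when the results are compared.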

“Piled Higher and Deeper” by Jorge Cham

Type of content, type of data

In the online sphere there are multiple types of data: websites, blogs, forums, comments, not to mention social media content. Each of these can, in its own way, be of key significance in answering the research questions. But the openness of the web makes it extremely difficult to create a consistent data gathering process. Not only do the different data types require different gathering processes from the search engines, but similar types of content, such as websites and their comment sections, can be built with various architectures (e.g. the way comments are embedded on websites), which makes access and thus automated data gathering even more complicated. We experienced this problem in our data collection process when some websites’ comment sections were available to download, while others were restricted. Such inconsistencies can cause serious bias in the database even when working with a single language. With two languages in play, the situation is even more difficult.
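One simple way to keep an eye on this kind of bias is to record, for every source, whether its comment section could be downloaded, and then compare the share per language. The sketch below assumes a hypothetical list of source records; the field names, the function name `comment_coverage` and the example sites are placeholders, not our actual data.

```python
from collections import Counter


def comment_coverage(sources):
    """Return, per language, the share of sources whose comments were retrievable."""
    total = Counter()
    available = Counter()
    for source in sources:
        total[source["lang"]] += 1
        if source["comments_available"]:
            available[source["lang"]] += 1
    return {lang: available[lang] / total[lang] for lang in total}


# Placeholder records for illustration only.
sources = [
    {"site": "site-a.example", "lang": "cs", "comments_available": True},
    {"site": "site-b.example", "lang": "cs", "comments_available": False},
    {"site": "site-c.example", "lang": "hu", "comments_available": True},
]

coverage = comment_coverage(sources)
print(coverage)
```

A large gap between languages in such a table would be a warning sign that comment data from the two countries should not be compared directly.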

Social media

The level of constraints researchers have to face when building a database of social media content depends on the platform itself. While tweets on Twitter are fairly easily accessible, resulting in thousands of open source databases on various topics, gathering data from Facebook is increasingly complicated. This is not only due to GDPR, but also because the platform protects its own database. We could then choose to use only Twitter data, but here the bias across languages and countries comes in again. While Facebook is the most used social media platform in both Hungary and the Czech Republic, in other countries it can be Twitter or WhatsApp. If we are looking for key actors and core discussions on a certain topic, those differences can lead to misinterpretations.

If one reads journal articles or listens to academic presentations, it might seem that doing research is a smooth process. The aim of this blog post was to demystify some of the issues one faces when trying to gather robust data. In the future, we will return to this topic and present some of the decisions we took in order to overcome these difficulties.  

The first challenges – How to create a good online media database?