Category Archives: Methodology

News and articles about social research methodology, digital research, computational social science, R coding, data and text mining

When Pearson’s r Fools You: Why Caution is Necessary When Working with Time Series

The Pearson’s r coefficient is the most popular metric of correlation. In the social sciences, researchers not attuned to the nuances of statistics may be tempted to use it every time they need to compute the correlation between variables, including time series. However, several aspects should be considered when applying this method to time series data.

I present three scenarios where the correlation coefficient can be misleading without deeper data analysis:

  1. Series with a common long-term trend but divergent short-term behavior.
  2. Series with common seasonality or cyclical fluctuations.
  3. Time series with regime-switching characteristics, exhibiting different behaviors at different stages.

These examples illustrate why calculating correlations between time series requires careful consideration and a thorough understanding of the underlying processes.

Case 1: Common Trend

Two series may have a common trend, for example, increasing or decreasing together. Such behavior yields a very high Pearson's correlation coefficient. In the case of the series below, the correlation coefficient is 0.99, indicative of an almost perfect positive correlation.

Time series sharing a common trend.

However, if we look at how the series evolve point by point, we may find antithetical behavior (one goes up while the other goes down) or no relationship at all. In this example, if we subtract the trend from each series and recalculate the correlation coefficient, we find a value indicative of an inverse rather than a positive correlation: r = -0.38.

The two series above, after being detrended.
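This pattern is easy to reproduce in R. The sketch below is a minimal simulation (not the exact series shown in the figures): two series share a linear trend, but their short-term fluctuations move in opposite directions. The raw correlation is dominated by the trend, while the correlation of the detrended residuals reveals the inverse relationship.

```r
set.seed(1)
n <- 100
t <- seq_len(n)
trend <- seq(0, 20, length.out = n)          # shared long-term trend
noise <- rnorm(n)
x <- trend + noise
y <- trend - noise + rnorm(n, sd = 0.3)      # short-term moves opposite to x

cor(x, y)                                    # very high: driven by the trend

# remove a linear trend from each series and correlate the residuals
x_detr <- residuals(lm(x ~ t))
y_detr <- residuals(lm(y ~ t))
cor(x_detr, y_detr)                          # strongly negative
```

Here a simple linear regression is used for detrending; with nonlinear trends, differencing (`diff()`) or more flexible smoothers serve the same purpose.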

Trending series are rather frequent. For example, measuring the correlation between scientific publications in two completely different and unrelated subject areas can return a high correlation coefficient simply because global scientific productivity increases over time due to the 'publish or perish' culture, without any other notable relationship between the two specific series.

Case 2: Common Seasonality

Two series that follow a common seasonal pattern also produce a high correlation coefficient. As with a common trend, correlation due to seasonality may mask an opposite correlation or no correlation at all. The series represented below show a remarkable Pearson's correlation of r = 0.70.

Series sharing common seasonality.

However, when we subtract the seasonal component and recalculate the correlation coefficient, we find that the series are substantially uncorrelated (r = -0.04).

The two series above, after having seasonality removed.
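The same effect can be sketched in R with simulated daily data (an illustrative example, not the series in the figures): two otherwise independent series share a weekend bump, which alone produces a substantial correlation. Subtracting each series' day-of-week means removes the seasonal component.

```r
set.seed(1)
n_weeks <- 52
weekday <- rep(1:7, n_weeks)                     # day-of-week index
seasonal <- ifelse(weekday %in% c(6, 7), 5, 0)   # shared weekend bump

x <- seasonal + rnorm(length(weekday))
y <- seasonal + rnorm(length(weekday))           # independent apart from seasonality

cor(x, y)                                        # high, driven by the weekly cycle

# subtract each series' day-of-week means to remove the seasonal component
x_deseas <- x - ave(x, weekday)
y_deseas <- y - ave(y, weekday)
cor(x_deseas, y_deseas)                          # near zero
```

Subtracting group means is the simplest form of seasonal adjustment; for real data, `decompose()` or `stl()` on a `ts` object are the standard base-R options.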

Seasonality is common in social science data. For example, the daily frequency of social conversations on two topics may manifest a seasonal component with more conversations on weekends than on weekdays, producing a high correlation coefficient despite the fact that the two conversation topics are not correlated.

Case 3: Regime-Switching Time Series

In the third case, the two time series consist of a sequence of different processes. In the example, they first share a common positive trend, then remain flat with negligible variability, and in the third phase follow opposite trends. Such series can be called regime-switching time series because their parameters take different values in each of a series of regimes or phases.

Calculating the correlation coefficient between the two series, we find an almost complete lack of correlation (r = -0.07), even though the human eye may suspect the presence of common determinant factors. In fact, if we break the series into their three phases and calculate the correlation coefficient for each of them, we find r = 0.99 in the first phase, r = 0.00 in the second phase, and r = -0.99 in the third phase.
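A minimal R simulation of this scenario (again, not the exact series in the example) builds three phases explicitly and compares the overall correlation with the phase-wise correlations:

```r
set.seed(1)
n <- 100                                       # length of each phase
common <- seq(0, 10, length.out = n)

phase1_x <- common + rnorm(n, sd = 0.3)        # phase 1: shared upward trend
phase1_y <- common + rnorm(n, sd = 0.3)
phase2_x <- rnorm(n, mean = 10, sd = 0.3)      # phase 2: flat, negligible variability
phase2_y <- rnorm(n, mean = 10, sd = 0.3)
phase3_x <- 10 + common + rnorm(n, sd = 0.3)   # phase 3: opposite trends
phase3_y <- 10 - common + rnorm(n, sd = 0.3)

x <- c(phase1_x, phase2_x, phase3_x)
y <- c(phase1_y, phase2_y, phase3_y)

cor(x, y)                                      # overall: close to zero

# correlation within each phase tells a very different story
phase <- rep(1:3, each = n)
phase_r <- sapply(split(seq_along(x), phase),
                  function(i) cor(x[i], y[i]))
phase_r                                        # strong positive, ~zero, strong negative
```

In practice the phase boundaries are unknown, which is where changepoint detection methods come in, as in the Navalny study mentioned below.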

In the paper Protest and repression on social media: Pro-Navalny and pro-government mobilization dynamics and coordination patterns on Russian Twitter [1], we used a preliminary changepoint analysis to differentiate between different phases of social media mobilization for and against Alexey Navalny before proceeding to analyze them individually.

Conclusions

Time series are data with a special nature and therefore require specific statistical tools. It often happens that common seasonal or trend patterns produce high correlation coefficients, even though such patterns are not determined by the processes the analyst intends to measure. Time series can also have varying behaviors over time: new factors may come into play, representing different processes. This happens all the more easily the longer the time frame covered by the series, and especially in the case of communicative and social processes.

Misleading time series correlations are what is commonly referred to in the field of time series analysis as spurious correlations, an unclear term that I hope I have helped, in part, to clarify. The topic is indeed complex and multifaceted. To delve a bit deeper into the main concepts of classic time series analysis, a primer can be found in the online handbook on the topic that I wrote for my Master's students in Communication Science at the University of Vienna [2].

References

  1. Kulichkina, A., Righetti, N., & Waldherr, A. (2024). Protest and repression on social media: Pro-Navalny and pro-government mobilization dynamics and coordination patterns on Russian Twitter. New Media & Society, https://doi.org/10.1177/14614448241254126.
  2. Righetti, N. (2022). Time Series Analysis With R. https://nicolarighetti.github.io/Time-Series-Analysis-With-R/.

CooRTweet: an R package to detect coordinated behavior on Twitter

Update: this post refers to the first versions of the CooRTweet package, which is now a multi-platform package for coordinated behavior analysis. Read more here.

I have just released the beta version of CooRTweet, an R package that I developed to help detect coordinated networks on Twitter.

The CooRTweet package builds on the existing literature on coordinated behavior and the experience of previous software, particularly CooRnet, to provide R users with an easy-to-use tool for coordinated action detection.

Coordinated behavior is a relevant social media strategy employed for political astroturfing (Keller et al., 2020), the spread of inappropriate content online (Giglietto et al., 2020), and activism. In the last few years, software has been developed for academic research and investigative journalism to detect coordinated behavior: for example, the CooRnet R package (Giglietto, Righetti, & Rossi, 2020), which detects Coordinated Link Sharing Behavior (CLSB) and Coordinated Image Sharing on Facebook and Instagram (CooRnet website), and the Coordination Network Toolkit by Timothy Graham (Graham, QUT Digital Observatory, 2020), a command-line tool for studying coordination networks in Twitter and other social media data. CooRTweet adds to this set of tools with an easy-to-use package for R users.

Further details and instructions for installing and using the package are available on GitHub: https://github.com/nicolarighetti/CooRTweet

An online handbook to learn R and Time Series Analysis

I have recently started to teach a course in data analysis with R at the University of Vienna, and I am creating a free online book where I explain fundamental R functions and data analysis operations, with a specific focus on time series analysis.

I’ll update the online book as the course goes on, but some chapters are already online. You can read the book at this link: Time Series Analysis With R

CooRnet: an R package for detecting coordinated behavior on social media

Today we have released CooRnet, an R package developed for detecting coordinated link sharing behavior on Facebook and Instagram.

Given a list of URLs, the package queries the CrowdTangle API link endpoint and retrieves the Facebook shares performed by public pages, groups and verified accounts, identifying the networks involved in coordinated activity.

The basic functions of CooRnet are augmented with other useful functions that create, for instance, the graph of the coordinated networks (to perform additional network analysis) or the dataset of the most shared URLs.

CooRnet implements the methods we applied and detailed in our research on coordinated link sharing behavior on Facebook, as described in the report Understanding Coordinated and Inauthentic Link Sharing Behavior on Facebook in the Run-up to 2018 General Election and 2019 European Election in Italy – where we found, for instance, that URLs shared in a coordinated way gained more engagement than those shared in a non-coordinated way – and in the paper It takes a village to manipulate the media: coordinated link sharing behavior during 2018 and 2019 Italian elections.

Engagement by coordinated and non-coordinated activity on Facebook. Understanding Coordinated and Inauthentic Link Sharing Behavior on Facebook in the Run-up to 2018 General Election and 2019 European Election in Italy

In these works we found that networks involved in coordinated link sharing behavior are consistently associated with the spread of misinformation on Facebook. In the two figures below you can see the proportion of blacklisted domains shared by coordinated and non-coordinated entities, and the proportion of problematic entities (signaled by Avaaz) included in the coordinated and non-coordinated entities, as emerged in our studies on the Italian elections.

A more detailed description of CooRnet can be found on the dedicated website.

Proportion of problematic domains shared by coordinated and non-coordinated entities. The panel on the right displays the risk ratio (RR) values, all statistically significant. (It takes a village to manipulate the media)
Proportion of problematic entities included in the coordinated and non-coordinated entities. The panel on the right displays the risk ratio (RR) values, all statistically significant. (It takes a village to manipulate the media)

Using Twitter Data to Estimate Partisan Attention in a Multi-Party Media System

The paper "Multi-Party Media Partisanship Attention Score. Estimating Partisan Attention of News Media Sources Using Twitter Data in the Lead-up to 2018 Italian Election" has just been published.

Extending the computational method first introduced by Benkler, Faris, Roberts and others (see here and here), the paper makes use of Twitter data to measure partisan attention to news media sources in a multi-party political system.

To validate the method we compared our results with those obtained through a survey (ITANES), finding remarkable similarity (see figure below).

Furthermore, we analyzed the degree of polarization of the Italian online news media system in the lead-up to the 2018 Italian election, finding a moderate level of polarization.

We also found that populist parties' online communities relied on news sources characterized by a higher level of insularity (i.e., mainly shared on Twitter by their own partisan community) than non-populist ones.

Replication data and R code used in the study can be found here, while the paper can be read here.