Table of Contents

  1. Introduction
  2. Data Wrangling
  3. Named-entity recognition (NER) and Topic Modelling
  4. Cluster Analysis

Introduction

Language used: Python

Goal: Analyse the frequency, type and related topics of English words in the Spanish-speaking Latin American subreddits of the social media platform Reddit for the period 2016-2023. The analysis then considers economic relations with the United States and the number of US tourists (more than 1,000,000) to see whether either, both or neither has an impact.

Data Source: Spanish-speaking Latin American subreddits (communities in a forum) of the social media platform Reddit (https://www.reddit.com/).

Libraries used

Library            Description
pandas             Data manipulation and analysis for structured data
numpy              Numerical operations on large, multi-dimensional arrays and matrices
requests           Makes HTTP requests to access web APIs and fetch data from the internet
praw               Wrapper for interacting with the Reddit API
nltk               Natural Language Toolkit; text processing and analysis, including tokenization, stemming and stopword removal
spacy              NLP tasks such as tokenization, part-of-speech tagging and named-entity recognition
matplotlib.pyplot  Plotting library for creating visualizations
seaborn            Statistical data visualization
scikit-learn       Machine learning library, used in this project for clustering
gensim             Topic modelling and document similarity analysis

Subreddits analysed by country

Country               Subreddit
Argentina             republica_argentina; alianzaargentina; buenosaires; argentina; cordoba; bahiablanca; bariloche; chubut; corrientes; lapampa; mendoza; rosario; salta; tucuman; neuquen
Bolivia               bolivia
Chile                 republicadechile; chile; santiago; yo_ctm; noeslalegal; anormaldayinchile; chilefit; clubdelecturachile; chileambiental; chileorganico
Colombia              colombia; bogota; medellin; barranquilla; cali; bucaramanga; manizales; pereira; santamarta
Costa Rica            ticos
Cuba                  cuba
Ecuador               ecuador
El Salvador           el salvador
Guatemala             guatemala
Honduras              honduras
México                mexico; monterrey; guadalajara; mexicocity; tijuana; puebla; videojuegosmx; memexico; mexicofinanciero; derechomexicano; somosmexico
Panamá                panama
Paraguay              paraguay
Perú                  peru; cusco; machupicchu; arequipa; cumbiaperuana; pokemongoperu
Puerto Rico           puerto rico
República Dominicana  dominicanos
Uruguay               uruguay; uruguay marketplace; burises; uruguay libre; uruguay crypto; uruguay circle jerk; uruguay verde; charruadevs
Venezuela             venezuela; vzla

Data Wrangling

The data fetched comprises the title and body of posts and the comments left by users, and is saved in dataframes. Before the data wrangling starts, all dataframes are consolidated into one with the following columns:

  • A column “text” where all the text is stored, regardless of its type;
  • A column “Created On” recording when the text was created;
  • A column “year” holding the creation year;
  • A column “df_name” with the name of the original dataframe (that is, the country subreddit the text comes from).

table
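The consolidation step can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the two input frames, their contents and the helper name `consolidate` are hypothetical, and only the four column names come from the description above.

```python
import pandas as pd

# Hypothetical per-country frames; the real ones are built from Reddit data.
frames = {
    "argentina": pd.DataFrame({
        "text": ["Che, alguien vio el post?"],
        "Created On": ["2021-03-14 10:22:00"],
    }),
    "chile": pd.DataFrame({
        "text": ["Buen thread, gracias"],
        "Created On": ["2019-11-02 18:05:00"],
    }),
}


def consolidate(frames):
    """Stack per-country frames and add the 'year' and 'df_name' columns."""
    parts = []
    for name, df in frames.items():
        part = df[["text", "Created On"]].copy()
        part["Created On"] = pd.to_datetime(part["Created On"])
        part["year"] = part["Created On"].dt.year  # creation year
        part["df_name"] = name                     # originating dataframe
        parts.append(part)
    return pd.concat(parts, ignore_index=True)


combined = consolidate(frames)
print(combined)
```

Keeping the origin in “df_name” makes it possible to group by country later without keeping the separate frames around.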

In the text processing pipeline, all words are converted to lowercase and commas are removed. Accented characters such as “é”, “á” and “í” are replaced with their unaccented forms; the exception is “ñ”, which is kept as is. This preprocessing standardizes the text so that each word has a single representation.
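One way to implement this normalization uses only the standard library. The sketch below is an assumption about how the step could be written; the function name `normalize` is hypothetical.

```python
import unicodedata


def normalize(text):
    """Lowercase, drop commas, and strip accents while keeping 'ñ'."""
    # NFC first, so that 'ñ' is always a single code point.
    text = unicodedata.normalize("NFC", text.lower()).replace(",", "")
    out = []
    for ch in text:
        if ch == "ñ":
            out.append(ch)  # retained in its original form
            continue
        # NFD splits 'é' into 'e' plus a combining accent, which is dropped.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(normalize("Mañana, la Canción ÚLTIMA"))
```

Handling “ñ” before the decomposition matters: NFD would otherwise split it into “n” plus a combining tilde and the tilde would be removed with the other accents.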

Next, common stopwords are removed from the text. Stopwords are frequently used words that contribute little meaning, so filtering them out keeps the focus on the substantive content.
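A minimal sketch of stopword removal is shown below. The small `STOPWORDS` set is illustrative only; a fuller list, such as the ones returned by `nltk.corpus.stopwords.words("spanish")` and `nltk.corpus.stopwords.words("english")`, would presumably be used in practice, and which languages were covered is an assumption.

```python
# Tiny illustrative stopword set (assumption); a real run would likely load
# a complete list, for example from nltk.corpus.stopwords.
STOPWORDS = {"el", "la", "de", "que", "y", "en", "the", "of", "and"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords from whitespace-separated, already normalized text."""
    return " ".join(word for word in text.split() if word not in stopwords)


print(remove_stopwords("el post de la semana"))
```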

The text is then split into tokens, and a new column containing only the English tokens is added to the dataframe. This step matters because the project focuses on the English words present in the retrieved data.
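The source does not state how English tokens are identified. The sketch below assumes a simple vocabulary lookup; the miniature `ENGLISH_VOCAB` set and both function names are hypothetical.

```python
import re

# Hypothetical miniature English vocabulary; the actual word list or
# language detector used in the project is not specified.
ENGLISH_VOCAB = {"post", "thread", "like", "people", "government", "edit"}


def tokenize(text):
    """Split text into lowercase word tokens (letters only, 'ñ' included)."""
    return re.findall(r"[a-zñ]+", text.lower())


def english_tokens(text, vocab=ENGLISH_VOCAB):
    """Return only the tokens that appear in the English vocabulary."""
    return [tok for tok in tokenize(text) if tok in vocab]


print(english_tokens("vi el thread y le di like al post"))
```

With pandas, the new column could then be filled with something like `df["text"].apply(english_tokens)`. A plain lookup will also match words spelled the same in both languages, so a real vocabulary would need to exclude those.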

Descriptive Analysis

Top 10 most used English words, regardless of country and year:

graph

Most frequent terms: people, like; words related to social media and the interaction among users.
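A ranking like this can be produced with `collections.Counter`. The sketch assumes each row holds a list of English tokens; the sample rows and the function name `top_words` are made up for illustration.

```python
from collections import Counter


def top_words(token_lists, n=10):
    """Count tokens across all rows and return the n most frequent."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)


# Made-up rows, each a list of English tokens from one text.
rows = [["people", "like"], ["like", "post"], ["like", "people"]]
print(top_words(rows, n=2))
```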

Most frequent English words by subreddit:

graph_2

Most frequent terms: post (Argentina), thread (Dominican Republic). The terms relate to the digital world (thread, twitter, post, edit, sub, like, format) and society (government, people, city).

Country         English word  Frequency  Total  Percentage (%)
Argentina       post          180        12994  1.385255
Bolivia         format        44         1205   3.651452
Chile           format        100        7096   1.409245
Colombia        government    89         5259   1.692337
Costa Rica      sub           28         1193   2.347024
Cuba            time          13         1067   1.218369
Dominican Rep.  thread        143        1323   10.808768
Ecuador         would         12         1135   1.057269
Guatemala       link          18         1186   1.517707
Honduras        watch         22         1141   1.928133
Mexico          post          113        14010  0.806567
Nicaragua       like          18         1229   1.464605
Panama          like          25         1526   1.638270
Paraguay        get           10         1270   0.787402
Peru            city          22         2061   1.067443
Puerto Rico     would         20         1092   1.831502
El Salvador     like          24         1355   1.771218
Uruguay         edit          49         5194   0.943396
Venezuela       us            41         2323   1.764959
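The Percentage column is consistent with Frequency divided by Total, times 100. A short check against two rows of the table (the function name `percentage` is only illustrative):

```python
def percentage(frequency, total):
    """Share of the top word among all English tokens, in percent."""
    return frequency / total * 100


# Rows taken from the table: Argentina (post) and Dominican Rep. (thread).
print(round(percentage(180, 12994), 6))
print(round(percentage(143, 1323), 6))
```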

Subreddits that use the most English words, based on a random sample of 1,000 tokens each:

graph_3

According to this analysis, the Nicaraguan subreddit has the highest number of anglicisms (865 out of 1,000), followed by the Cuban subreddit (658 out of 1,000) and the Salvadoran subreddit (635 out of 1,000). The Paraguayan subreddit has the lowest number (336 out of 1,000). The gap between the first two positions (207) is substantial, whereas the differences in the other cases are much smaller.
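Sampling a fixed number of tokens per subreddit keeps large communities from dominating the comparison. The sketch below assumes the sample is drawn from all tokens of a subreddit and that English tokens are recognised by vocabulary lookup; the function name, the seed and the toy data are hypothetical.

```python
import random


def anglicism_count(tokens, english_vocab, sample_size=1000, seed=0):
    """Count English tokens in a fixed-size random sample of tokens."""
    rng = random.Random(seed)             # fixed seed for reproducibility
    k = min(sample_size, len(tokens))     # guard against small subreddits
    sample = rng.sample(tokens, k)
    return sum(tok in english_vocab for tok in sample)


# Toy data: 300 English and 1,700 Spanish tokens.
tokens = ["post"] * 300 + ["hola"] * 1700
print(anglicism_count(tokens, {"post"}))
```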

Frequency of English words by year (2016-2023)

graph_4

Use of English words by subreddit and year, based on a random sample of 1,000 tokens each:

graph_5

Across the period, the Nicaraguan subreddit most often has the highest number of anglicisms, most clearly in 2018. In 2020-2021 the Cuban subreddit takes the lead, with the Venezuelan subreddit also standing out in 2020 and the Dominican Republic subreddit in 2023.

Named-entity recognition (NER) and Topic Modelling

NER:

graph_6

The analysis shows that the most frequent entity type is date, with 1,904 cases, followed by cardinal numbers (1,150) and persons (1,097).

Topic Modelling:

graph_7

  • Topic 0: Politics and Leadership, containing primarily the words: would (modal verb), president, get, know, people, like, time, us, country.
  • Topic 1: Current Events and International Affairs, containing primarily the words: Russia, Covid, width, preview, day, format, next, even, area, government.
  • Topic 2: Technology and Media, containing primarily the words: RAM (random-access memory), video, edit, world, extra, opinion, war, super, watch, post.
  • Topic 3: Social Media and Personal Life, containing primarily the words: sub (short for subreddit), status, Twitter, see, live, food, due, place, also, family.

Cluster Analysis

Given growing global interconnection, it is worth examining in detail the influence that the United States exerts on different countries through both trade and tourism. The main objective of this part of the study is to assess that influence using data from official sources. Two sources have been used: the official website of the United States Trade Representative (USTR) and data provided by the World Tourism Organization (UNWTO).

For this study, an approach based on the ordinal classification available in the country profiles of the USTR has been adopted. This classification, which ranks countries according to the total volume of products traded with the United States, provides a structured view of the trade relationship between the United States and each respective country.

Regarding tourist influence, the analysis relies on data compiled by the UNWTO, a globally recognized source of tourism statistics. A threshold of one million US tourists is used to identify the most relevant tourist destinations. Countries surpassing this threshold are labeled 1; all others are labeled 2.
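The labelling rule can be written as a one-line function. The function name and the visitor figures below are hypothetical; only the threshold and the labels 1 and 2 come from the text.

```python
def tourism_label(us_tourists, threshold=1_000_000):
    """Return 1 if US tourist arrivals surpass the threshold, else 2."""
    return 1 if us_tourists > threshold else 2


# Hypothetical arrival figures, for illustration only.
print(tourism_label(1_500_000))
print(tourism_label(250_000))
```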

Country             Tourism  Commerce
Argentina           2        33
Bolivia             2        96
Chile               2        20
Colombia            2        22
Costa Rica          1        38
Cuba                2        127
Dominican Republic  1        30
Ecuador             2        42
El Salvador         2        50
Guatemala           2        36
Honduras            2        44
Nicaragua           2        67
México              1        2
Paraguay            2        109
Perú                2        29
Panamá              2        35
Puerto Rico         1        0
Uruguay             2        91
Venezuela           2        72

graph_8 graph_9

Number of clusters: both the silhouette method and the elbow method recommend 3 clusters.
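Both diagnostics can be computed with scikit-learn on the Tourism and Commerce values listed in the table. This is a sketch under assumptions: standardizing the features, the range of k values tried, `n_init` and `random_state` are not stated in the source.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# (Tourism, Commerce) pairs copied from the table, in table order.
data = np.array([
    [2, 33], [2, 96], [2, 20], [2, 22], [1, 38], [2, 127], [1, 30],
    [2, 42], [2, 50], [2, 36], [2, 44], [2, 67], [1, 2], [2, 109],
    [2, 29], [2, 35], [1, 0], [2, 91], [2, 72],
], dtype=float)

# Assumption: features are standardized so Commerce does not dominate.
X = StandardScaler().fit_transform(data)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ feeds the elbow plot; the silhouette score feeds the other.
    scores[k] = (model.inertia_, silhouette_score(X, model.labels_))

for k, (inertia, sil) in scores.items():
    print(f"k={k}  inertia={inertia:.2f}  silhouette={sil:.3f}")
```

For the elbow method, the inertia values are plotted against k and the bend in the curve is read off; for the silhouette method, the k with the highest score is preferred.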

K-Means

graph_10

  • Cluster 0: Costa Rica, Dominican Republic, Mexico, Puerto Rico.

This group includes countries that receive more than 1 million tourists from the United States per year (value 1 in the “Tourism” column). Regarding trade, not all countries in this group are significant trading partners: only some have a value of 30 or less in the “Commerce” column, where a lower value means a larger trade volume.

  • Cluster 1: El Salvador, Honduras, Ecuador, Guatemala, Panama, Argentina, Peru, Colombia, Chile.

This cluster comprises countries that receive no more than 1 million tourists from the United States per year (value 2 in the “Tourism” column) but are significant trading partners, with values between 20 and 50 in the “Commerce” column.

  • Cluster 2: Cuba, Paraguay, Bolivia, Uruguay, Venezuela, Nicaragua.

This cluster comprises countries that receive a low number of tourists from the United States. Regarding trade, they are minor trading partners of the United States, with values between 67 and 127 in the “Commerce” column.
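A grouping of this kind can be obtained with K-Means on the Tourism and Commerce values from the table. The sketch assumes standardized features and arbitrary `n_init` and `random_state` settings, none of which are given in the source; the cluster numbers that K-Means assigns are arbitrary and need not match the numbering used above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Values copied from the Tourism and Commerce table.
table = pd.DataFrame({
    "Country": [
        "Argentina", "Bolivia", "Chile", "Colombia", "Costa Rica", "Cuba",
        "Dominican Republic", "Ecuador", "El Salvador", "Guatemala",
        "Honduras", "Nicaragua", "México", "Paraguay", "Perú", "Panamá",
        "Puerto Rico", "Uruguay", "Venezuela",
    ],
    "Tourism": [2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2],
    "Commerce": [33, 96, 20, 22, 38, 127, 30, 42, 50, 36, 44, 67, 2, 109,
                 29, 35, 0, 91, 72],
})

# Assumption: standardize both features before clustering.
X = StandardScaler().fit_transform(table[["Tourism", "Commerce"]])
table["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for label, group in table.groupby("Cluster"):
    print(label, ", ".join(group["Country"]))
```

Without scaling, the Commerce column (range 0 to 127) would outweigh the Tourism column (1 or 2), so the tourism split would barely affect the result.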

Influence of US by Year

(using the dataframe of 1,000-token samples to avoid skewed data)

graph_11

From 2020 onwards, the subreddits of countries in cluster 1 show a marked increase in the frequency of anglicisms compared with those of other countries. In 2018 and 2019, countries in cluster 2 led in anglicism frequency. These findings suggest that the quantity of anglicisms is influenced by trade relationships with the United States.