Using R to Analyze the Semantic and Emotional Content of Clinton, Sanders, and Trump Facebook Posts

I recently stumbled across a super cool Python script that will scrape Facebook group content, and in this post I show what I did with the text that this script helped me gather. I do so because, as R slowly consumes the lunches of the other data science languages and software out there, I often find myself confused as to which library is becoming the ‘standard’ for this or that practice. As such, this post will be useful to some simply because minimaxir’s functions are awesome, and to others because I show some code to do very simple text analysis. (Disclaimer: I am not the biggest fan of some of the text analysis literature, and I do not use it much in my own work, but it is very interesting where appropriate. Also, users will need a Facebook Developer ID to fully reproduce this post, but not to follow along with the first part.)

The text that I use for the purposes of this illustration comes from the Facebook pages of the three remaining U.S. Presidential candidates. (Visitors from the future will note that, as of today, those three are Hillary Clinton, Bernie Sanders, and Donald Trump.) I first use the Facebook Post Scraper functions designed for Facebook pages to get .csvs containing the posted statuses and some metadata about them for the pages of each candidate. For this example, we’ll work with Trump’s data, which I’ve made available here.

First, let’s take a look at all the data we’ve got to work with. Load the data into your R session using read.csv and then we’ll take a look using head. Note that each row here represents one post Donald Trump has made to Facebook.

dt = read.csv('~/Google Drive/Python/facebook-page-post-scraper/DonaldTrump_facebook_statuses.csv', header = T, stringsAsFactors = F)
head(dt)
                       status_id
1 153080620724_10157177470730725
2 153080620724_10157176983270725
3 153080620724_10157175643325725
4 153080620724_10157176048285725
5 153080620724_10157175711945725
6 153080620724_10157174547880725
status_message
1 WOW! THANK YOU DALLAS, TEXAS! A great evening - I am grateful for all of your support. It is time to make America SAFE and GREAT again - ASAP! Together, we can accomplish that goal.
2 THROWBACK THURSDAY:\nDeparting New Hampshire with my amazing family, after a landslide victory. Will never forget it. I love my family, and am forever grateful for their support over the past 365 days of my candidacy.
3 Crooked Hillary Clinton rakes in millions from nations that fund terrorism, oppress women, and spread hatred.
4 STATEMENT ON AFL-CIO ENDORSEMENT OF HILLARY CLINTON\n\nSadly with this endorsement of Hillary Clinton - who is totally owned by Wall Street - the leadership of the AFL-CIO has made clear that it no longer represents American workers. Instead they have become part of the rigged system in Washington, D.C. that benefits only the insiders.\n\nI believe their members will be voting for me in much larger numbers than for her.\n\nHillary Clinton and her husband have made hundreds of millions of dollars doing favors and selling access to Wall Street, special interests and oppressive foreign regimes. As Bernie Sanders said, "Why, over her political career, has Wall Street been the major campaign contributor to Hillary Clinton?" They own Hillary Clinton and she will do whatever they tell her to.\n\nBernie Sanders is also 100% correct when he says that Hillary Clinton "vote[d] for virtually every trade agreement that has cost the workers of this country millions of jobs." Hillary supported NAFTA and she supported the trade deal with China, Vietnam, South Korea – and if elected will implement the TPP she loves so much – guaranteed.\n\nWhile Secretary of State,'Hillary Clinton racked up a $1 trillion trade deficit with China – all while China was funneling a small fortune to Hillary via speaking fees paid to Bill.\n\nOn immigration, Hillary Clinton sides with Wall Street too. 
Bernie Sanders correctly warned that open borders "would substantially lower wages in this country," and yet Hillary has put forward a plan that would completely open America's borders in her first 100 days in office.\n\nOn energy, Hillary Clinton wants to shut down the coal mines, block the Keystone pipeline, and destroy millions of good union jobs through executive action.\n\nHer massive proposed increase in taxes and regulations will also send millions of jobs overseas.\n\nHillary Clinton will economically destroy poor communities, African-American and Hispanic workers on trade, immigration, crime, energy, taxes, regulation and everything else.\n\nFinally, union workers have long believed in having an open and free society. Yet Hillary takes money from regimes that support the murder of gays and the enslavement of women while pushing to bring people into America who want to do the exact same thing to our people. I only want to bring people into our country who will love and support everyone.\n\nHillary Clinton is the enemy of working people and is the best friend Wall Street ever had. I will fight harder for American workers than anyone ever has, and I will fight for their right to elect leaders who will do the same. I will be a president for ALL Americans.


                                                                                         link_name
1                                                                                  Timeline Photos
2                                                                                  Timeline Photos
3                        Nations Clinton Bashes For Terrorism Gave BIG BUCKS To Clinton Foundation
4
5 441 Syrian Refugees Admitted to the U.S. Since the Orlando Attack, Dozens to Florida - Breitbart
6                                                             Funding terrorism — and the Clintons
  status_type
1       photo
2       photo
3        link
4      status
5        link
6        link
     status_published num_reactions num_comments num_shares num_likes num_loves
1 2016-06-16 21:13:19         28452         1630       1701     26589      1705
2 2016-06-16 18:36:08         26751         1971        938     24868      1736
3 2016-06-16 16:00:24         36104         4115      11421     29366       250
4 2016-06-16 15:30:53         39610         3845       8714     35981      1494
5 2016-06-16 14:30:18         52198         5045      10605     41346      1211
6 2016-06-16 11:00:39         25448         3481       6167     21112       197
  num_wows num_hahas num_sads num_angrys
1       37        57        3         65
2       84        28        1         36
3      817       282      198       5192
4      163        85      138       1749
5      533       111      244       8754
6      554       162      111       3312

Note the final few columns: num_reactions, num_comments, num_shares, num_likes, num_loves, num_wows, num_hahas, num_sads, and num_angrys. These correspond to the various types of reactions that users may offer to posts made to pages that they follow. Because those reactions take the form of standard emoji characters, I spent some time fiddling with a few libraries that users have written to make emoji easier to use in R, without success, so I leave the inability to easily plot series as giant pillars of emoji as a challenge for other R users to tackle. There has been some early work on this, such as emoGG, that I hope to see more of, but that is not the focus of this post.

When I got this data I was immediately curious to know how users’ responses have changed over time. I consider emotional responses to text to be a promising new frontier for public opinion measurement, and sites like Facebook elicit billions of such responses per day. To inspect these responses, we can produce a plot that shows us how users’ responses to the posts of the candidates they follow have changed over time. Note, first, that if we want to plot change over time, we must make sure our X-axis variable is of class Date. It currently is not:

> str(dt)
'data.frame':	3052 obs. of  15 variables:
 $ status_id       : chr  "153080620724_10157177470730725" "153080620724_10157176983270725" "153080620724_10157175643325725" "153080620724_10157176048285725" ...
 $ status_message  : chr  "WOW! THANK YOU DALLAS, TEXAS! A great evening - I am grateful for all of your support. It is time to make America SAFE and GREA"| __truncated__ "THROWBACK THURSDAY:\nDeparting New Hampshire with my amazing family, after a landslide victory. Will never forget it. I love my"| __truncated__ "Crooked Hillary Clinton rakes in millions from nations that fund terrorism, oppress women, and spread hatred." "STATEMENT ON AFL-CIO ENDORSEMENT OF HILLARY CLINTON\n\nSadly with this endorsement of Hillary Clinton - who is totally owned by"| __truncated__ ...
 $ link_name       : chr  "Timeline Photos" "Timeline Photos" "Nations Clinton Bashes For Terrorism Gave BIG BUCKS To Clinton Foundation" "" ...
 $ status_type     : chr  "photo" "photo" "link" "status" ...
 $ status_link     : chr  "" "" "" "" ...
 $ status_published: chr  "2016-06-16 21:13:19" "2016-06-16 18:36:08" "2016-06-16 16:00:24" "2016-06-16 15:30:53" ...
 $ num_reactions   : int  28452 26751 36104 39610 52198 25448 56087 24365 157160 63407 ...
 $ num_comments    : int  1630 1971 4115 3845 5045 3481 2718 1602 8252 3310 ...
 $ num_shares      : int  1701 938 11421 8714 10605 6167 4913 3 12447 5226 ...
 $ num_likes       : int  26589 24868 29366 35981 41346 21112 52126 22669 146754 59276 ...
 $ num_loves       : int  1705 1736 250 1494 1211 197 3537 1471 7667 3727 ...
 $ num_wows        : int  37 84 817 163 533 554 260 128 173 204 ...
 $ num_hahas       : int  57 28 282 85 111 162 62 54 233 70 ...
 $ num_sads        : int  3 1 198 138 244 111 10 6 98 14 ...
 $ num_angrys      : int  65 36 5192 1749 8754 3312 92 39 2238 115 ...

The date is very well-formatted, however, and we can simply convert it to a date like so:

dt$status_published = as.Date(dt$status_published)
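To see what this conversion does to a timestamp string, here is a minimal sketch using two of the values shown above. It assumes nothing beyond base R, and the vector name stamps is just for illustration.

```r
# Two timestamps copied from the status_published column above
stamps <- c("2016-06-16 21:13:19", "2016-06-16 11:00:39")

# as.Date() parses the leading year-month-day and drops the time of day
dates <- as.Date(stamps)

class(dates)          # "Date"
format(dates)         # both become "2016-06-16"
dates[1] == dates[2]  # TRUE: posts from the same day now share one date
```

Dropping the time of day is what we want here, since the plot is built from daily totals.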

I find it easier to work with ggplot when my data is in the most minimalist format possible, that is, when I have no extraneous data in the data frame I’ll be plotting. As far as I can tell, the only data I want is the date (for the X-axis) and then the counts of the various emotional responses available to each post. If we check, we see that the date is in column six, and the emotional responses of interest are in columns eleven through fifteen, and so I next subset the data down to just those. I also subset the data to only include dates after February 23, 2016, which appears to be when Facebook rolled out its emoji response system to 100% of its users, or something.

dt = dt[dt$status_published > '2016-02-23',c(6,11:15)]
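The same kind of subset can be sketched on a toy data frame. This is only an illustration: the data frame toy and its values are made up, and the columns are selected by name rather than by position, which is a small variation on the line above that keeps working if the column order of the .csv ever changes.

```r
# Toy stand-in for the scraped data (column names follow the str() output above)
toy <- data.frame(
  status_id        = c("a", "b", "c"),
  status_published = as.Date(c("2016-02-20", "2016-02-24", "2016-03-01")),
  num_loves        = c(0L, 12L, 30L),
  num_angrys       = c(0L, 4L, 9L),
  stringsAsFactors = FALSE
)

# Rows: posts after the cutoff. Columns: picked by name instead of position.
keep_cols <- c("status_published", "num_loves", "num_angrys")
recent <- toy[toy$status_published > as.Date("2016-02-23"), keep_cols]

nrow(recent)   # 2
names(recent)  # the three columns listed in keep_cols
```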

ggplot will take as its first argument a data frame, and I have opted to melt our data so that it is just three variables: a variable for date, a variable for the type of response users offered, and a variable for the count of those responses. As such, every row is a date-type-response sum. This long format is standard practice with ggplot because a single data source and a single aesthetic mapping (here, fill) are then enough to distinguish the response types.

To do so I first ‘melt’ the data by publication date, which requires the reshape library. Doing so transforms the data from having a column for every response type to having a row for every combination of post and response type. I next aggregate each response type by date so that dates are not duplicated. This is useful for the days when candidates post multiple times per day, and it turns out Donald Trump is wont to do so. For aggregation I use the somewhat befuddling but extremely useful data.table library, whose function setDT converts the data frame to a data.table in place, which I like, and which uses the same by syntax that pretty much every other library uses to specify which column we will be grouping by. Finally, even though it makes little difference, I do not intend for any of the results to be a factor, and so I make sure the text variable is set to type character.


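library(data.table)
library(reshape)  # loaded after data.table so that reshape's melt() is the one called
dt = melt(dt, id.vars = 'status_published')
# Sum the counts for each date and response type
dt = setDT(dt)[, .(value = sum(value)), by = .(status_published, variable)]
dt$variable = as.character(dt$variable)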

So what have we got here?

> head(dt)
   status_published  variable value
1:       2016-06-16 num_loves 22995
2:       2016-06-15 num_loves 15190
3:       2016-06-14 num_loves 39145
4:       2016-06-13 num_loves 42424
5:       2016-06-12 num_loves 73870
6:       2016-06-11 num_loves 11680
> str(dt)
Classes ‘data.table’ and 'data.frame':	570 obs. of  3 variables:
 $ status_published: Date, format: "2016-06-16" "2016-06-15" "2016-06-14" "2016-06-13" ...
 $ variable        : chr  "num_loves" "num_loves" "num_loves" "num_loves" ...
 $ value           : int  22995 15190 39145 42424 73870 11680 11813 34023 29408 31571 ...
 - attr(*, ".internal.selfref")=<externalptr> 
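For readers who want to see the reshape-then-sum step without installing anything, here is a rough base R equivalent on made-up numbers. It swaps in stack() and aggregate() for melt() and data.table, so it is a sketch of the same idea rather than the code used for this post; the names wide, long, and daily are just for illustration.

```r
# Toy wide data: two posts on June 16, one on June 15 (values are made up)
wide <- data.frame(
  status_published = as.Date(c("2016-06-16", "2016-06-16", "2016-06-15")),
  num_loves  = c(10L, 5L, 7L),
  num_angrys = c(1L, 2L, 3L)
)

# Wide to long: stack() puts all counts in one column and records
# the source column name alongside each count
measure_cols <- c("num_loves", "num_angrys")
stacked <- stack(wide[, measure_cols])
long <- data.frame(
  status_published = rep(wide$status_published, times = length(measure_cols)),
  variable = as.character(stacked$ind),
  value    = stacked$values,
  stringsAsFactors = FALSE
)

# One row per date and response type, summing over same-day posts
daily <- aggregate(value ~ status_published + variable, data = long, FUN = sum)
daily
```

The two June 16 posts collapse into a single row per response type, which is exactly what the by = .(status_published, variable) grouping does above.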

ggplot will take status_published as the x-axis, value (the sum of responses by type, by date!) as the y-axis, and will color the bars by the cleverly-named variable variable. (These names sound confusing, but they are the default output of the reshape functions, so this is a naming convention you will want to become familiar with.) We also want ggplot to stack the bars so that they read as proportions, and so I pass stat="identity" and position="fill" to the geom_bar() function: the first tells ggplot to use the numbers in value as the bar heights rather than counting rows, and the second stacks the bars and rescales each stack to sum to one. The “fill” aesthetic (which, for us, will be “variable!”) tells ggplot which variable defines the segments within each bar. Straightforward, right?

library(ggplot2)
ggplot(dt, aes(x = status_published, y = value, fill = variable)) +
  geom_bar(position = "fill", stat = "identity") +
  theme_classic() +
  xlab("") +
  ylab("") +
  ggtitle("Reactions to Donald Trump Facebook Posts") +
  scale_fill_manual(labels = c("Angry","Funny","Love","Sad","Wow"), values = c("Red","Yellow","skyblue","darkgray","Green"), name = "Emoji response")

User reactions to Donald Trump's Facebook posts
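If it is unclear what position = "fill" is doing to the numbers, the rescaling can be reproduced by hand. The sketch below uses made-up counts and base R only; it illustrates the arithmetic behind the stacked proportions and is not something ggplot needs you to run.

```r
# Toy long-format data: two response types on each of two days
long <- data.frame(
  status_published = as.Date(c("2016-06-15", "2016-06-15", "2016-06-16", "2016-06-16")),
  variable = c("num_angrys", "num_loves", "num_angrys", "num_loves"),
  value    = c(3, 7, 3, 15),
  stringsAsFactors = FALSE
)

# Divide each count by that day's total, so every day's shares sum to 1
day <- as.character(long$status_published)
long$share <- long$value / ave(long$value, day, FUN = sum)

long
tapply(long$share, day, sum)  # each day sums to 1
```

Each bar in the chart is one day, and the height of each colored segment is the corresponding share.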

Not bad! Going back to the start and doing the same for Clinton and Sanders you can plot reactions to their Facebook posts too, and here’s what you’d get if you did:

User responses to Hillary Clinton's Facebook posts

User responses to Bernie Sanders' Facebook posts

Not much leaps out at me except for how angry Trump supporters are, and how sadness seems to be creeping in at the edge of the X-axis for Sanders supporters… on to some readability analysis!

To my mind, the above work constitutes some simple “sentiment analysis.” In a way, it is sentiment in its purest form – a snap judgment about the emotional valence of content viewed by millions of users over time. But what are these candidates posting? Is it smart or silly? I think we all have our priors about this question (spoiler alert: our priors about the intellectual quality of content posted by Clinton, Sanders, and Trump will probably be about right), but I’d like to investigate.

I mostly want to do so by way of highlighting the quanteda library which makes text analysis very easy. Specifically, I am interested in knowing how “smart” such content is, which I proxy by the grade level at which each post was written. This requires its own particular munging, but not much. To start, I reload the Trump data and format the date as above. The only change that I really make is at the end where I compute the “readability” of each post.

To do so, you must first convert each text string to a corpus, which is what quanteda’s readability function scores. I convert each post to its own corpus; the difference between doing so for each entry and doing so for the whole variable is not material here, except that doing so for the whole variable would produce errors for very short entries. I do so using a loop, and my friends will probably remind me that I should’ve used apply instead, but that’s a matter for another post. We will start by loading quanteda, which I use to perform the readability analysis (because analyzing text’s readability is way easier than, say, reading it, after all).


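library(quanteda)
# Note: in newer quanteda releases this function is textstat_readability(), in the quanteda.textstats package
dt = read.csv('~DonaldTrump_facebook_statuses.csv', header = T, stringsAsFactors = F)
# Keep only plain text statuses posted after January 1, 2016
dt = dt[dt$status_type == 'status',]
dt$status_published = as.Date(dt$status_published)
dt = dt[dt$status_published > '2016-01-01',]
# Score each post separately (one corpus per post)
dt$readability = rep(NA, nrow(dt))
for(i in 1:nrow(dt)){
  dt$readability[i] <- readability(corpus(dt$status_message[i]), measure = "Coleman.Liau.grade")
}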
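For the friends who would have preferred apply: here is a small sketch of how a loop of this shape maps onto vapply. To keep it runnable without quanteda, it uses a hypothetical stand-in scorer, score_text, that just computes average word length; the posts vector is made up as well.

```r
# Hypothetical stand-in scorer: average word length in characters
score_text <- function(txt) {
  words <- strsplit(txt, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  mean(nchar(words))
}

posts <- c("Make America great again", "Thank you Dallas")

# Loop version, mirroring the structure used in this post
loop_scores <- rep(NA_real_, length(posts))
for (i in seq_along(posts)) {
  loop_scores[i] <- score_text(posts[i])
}

# vapply version: no preallocation, and the return type is checked for us
apply_scores <- vapply(posts, score_text, numeric(1), USE.NAMES = FALSE)

identical(loop_scores, apply_scores)  # TRUE
```

Swapping score_text for the readability call should give the same pattern, though I have not benchmarked the two.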


The result shows us that Trump tends to post at about a 9th grade level.

Coleman-Liau readability grade for Donald Trump Facebook posts
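For readers wondering what goes into a Coleman-Liau grade, the commonly cited formula is 0.0588 * L - 0.296 * S - 15.8, where L is the number of letters per 100 words and S is the number of sentences per 100 words. The sketch below implements that formula in base R with a deliberately crude tokenizer. The function name coleman_liau is my own, and quanteda's tokenization and exact variant of the measure differ, so do not expect the numbers to match its output.

```r
# Rough Coleman-Liau grade for a single string (crude regex tokenization)
coleman_liau <- function(text) {
  # Words: runs of letters and apostrophes
  words <- unlist(strsplit(text, "[^A-Za-z']+"))
  words <- words[nzchar(words)]
  n_words   <- length(words)
  n_letters <- sum(nchar(gsub("[^A-Za-z]", "", words)))

  # Sentences: runs of terminal punctuation, with a floor of one
  marks <- gregexpr("[.!?]+", text)[[1]]
  n_sentences <- if (marks[1] == -1) 1 else length(marks)

  L <- n_letters / n_words * 100    # letters per 100 words
  S <- n_sentences / n_words * 100  # sentences per 100 words
  0.0588 * L - 0.296 * S - 15.8
}

coleman_liau("The cat sat. The dog ran.")
coleman_liau("Extraordinary circumstances necessitate deliberation.")
```

Longer words push the grade up and shorter sentences pull it down, which is also why very short posts can produce odd scores.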

Why do I use the Coleman-Liau measure? Google told me to. And if we download the rest of the data for Clinton and Sanders and precisely repeat the above code, and then mold the three into one data frame, we see a possibly somewhat predictable difference:

Readability scores for all three candidates

With code below. The only somewhat icky part is adding in a new variable for each candidate name to pass to ggplot.

dt = read.csv('~/Google Drive/Python/facebook-page-post-scraper/DonaldTrump_facebook_statuses.csv', header = T, stringsAsFactors = F)
dt = dt[dt$status_type == 'status',]
dt$status_published = as.Date(dt$status_published)
dt = dt[dt$status_published > '2016-01-01',]
dt$readability = rep(NA, nrow(dt))
for(i in 1:nrow(dt)){
  dt$readability[i] <- readability(corpus(dt$status_message[i]), measure = "Coleman.Liau.grade")
}

bs = read.csv('~/Google Drive/Python/facebook-page-post-scraper/berniesanders_facebook_statuses.csv', header = T, stringsAsFactors = F)
bs = bs[bs$status_type == 'status',]
bs$status_published = as.Date(bs$status_published)
bs = bs[bs$status_published > '2016-01-01',]
bs$readability = rep(NA, nrow(bs))
for(i in 1:nrow(bs)){
  bs$readability[i] <- readability(corpus(bs$status_message[i]), measure = "Coleman.Liau.grade")
}

hrc = read.csv('~/Google Drive/Python/facebook-page-post-scraper/hillaryclinton_facebook_statuses.csv', header = T, stringsAsFactors = F)
hrc = hrc[hrc$status_type == 'status',]
hrc$status_published = as.Date(hrc$status_published)
hrc = hrc[hrc$status_published > '2016-01-01',]
hrc$readability = rep(NA, nrow(hrc))
for(i in 1:nrow(hrc)){
  hrc$readability[i] <- readability(corpus(hrc$status_message[i]), measure = "Coleman.Liau.grade")
}

# Stack the three data frames in the same order as the candidate labels below
ggdat = rbind(hrc, dt, bs)

ggdat$variable = c(rep('Clinton',nrow(hrc)), rep('Trump',nrow(dt)), rep('Sanders',nrow(bs)))

ggplot(ggdat, aes(x = status_published, y = readability, group = variable, color = variable, fill = variable)) +
  geom_smooth(se = FALSE) +
  geom_point() +
  ylim(0,16) +
  xlab("") +
  ylab("Coleman Liau reading grade") +
  theme_classic() +
  ggtitle("Readability scores by candidate") +
  scale_color_manual(values = c('blue','green','red'))
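One way to make the candidate-name step a little less icky is to attach the label to each data frame before combining them, so the labels cannot drift out of line with the rows. The sketch below uses tiny made-up data frames (hrc_toy, dt_toy, bs_toy) as stand-ins for the real ones; it is an alternative arrangement, not what I ran for the plot above.

```r
# Toy stand-ins for the three per-candidate data frames (values are made up)
hrc_toy <- data.frame(status_published = as.Date("2016-03-01"), readability = 9.5)
dt_toy  <- data.frame(status_published = as.Date(c("2016-03-02", "2016-03-03")),
                      readability = c(8.0, 7.5))
bs_toy  <- data.frame(status_published = as.Date("2016-03-04"), readability = 10.0)

# Attach the label to each frame first, so it travels with its rows
hrc_toy$variable <- "Clinton"
dt_toy$variable  <- "Trump"
bs_toy$variable  <- "Sanders"

# The order of the arguments to rbind() no longer matters for the labels
combined <- rbind(hrc_toy, dt_toy, bs_toy)
table(combined$variable)
```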


And there you have it! Emotional valence and simplicity of legibility in just a few simple lines. I really dig the new libraries I found this week, quanteda among them, and so I use this post to pass them along to you.