My previous election candidate twitterbot took a very simple approach: it generated random tweets from the tweets of a given set of candidates. Unsurprisingly, given the naive randomness of that approach, the tweets it produced lacked structure, sense and meaning.
To overcome this limitation without digging too deep into linguistic analysis and generation techniques, I designed a simple but quite effective process to generate more realistic tweets:
- Obtain the morpho-syntactic structure of the input tweets using a natural language processing tool
- For each morpho-syntactic tag, collect all the words in the input tweets for the given tag
- Randomly select one of the morpho-syntactic structures obtained in step 1
- Generate a tweet by randomly selecting a word from step 2 for each tag in the selected structure
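The four steps above can be sketched in a few lines of Python. This is only an illustration, not the bot's actual code: the input is assumed to be already tagged, the tags are simplified placeholders rather than real tagger output, and the names `build_model` and `generate` are made up for the example.

```python
import random
from collections import defaultdict

# Hypothetical pre-tagged input: each tweet is a list of (word, PoS tag) pairs.
# The tags are simplified placeholders, not real tagger output.
tagged_tweets = [
    [("Mañana", "ADV"), ("en", "PREP"), ("Córdoba", "NOUN")],
    [("Hoy", "ADV"), ("con", "PREP"), ("profesionales", "NOUN")],
]

def build_model(tagged_tweets):
    """Steps 1 and 2: one tag sequence per tweet, plus the words seen for each tag."""
    structures = []
    words_by_tag = defaultdict(list)
    for tweet in tagged_tweets:
        structures.append([tag for _, tag in tweet])
        for word, tag in tweet:
            words_by_tag[tag].append(word)
    return structures, words_by_tag

def generate(structures, words_by_tag, rng=random):
    """Steps 3 and 4: pick a structure, then a random word for each of its tags."""
    structure = rng.choice(structures)
    return " ".join(rng.choice(words_by_tag[tag]) for tag in structure)

structures, words_by_tag = build_model(tagged_tweets)
print(generate(structures, words_by_tag))
```

Because every word in the output comes from a slot with the same tag, the result keeps the grammatical shape of a real tweet even though the words are mixed.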
As in the previous approach, the input data is the set of the last twenty tweets of each candidate. The candidates considered are the main candidates of the national parties in the Spanish general election of 20 December 2015, in alphabetical order:
Once again, I obtain the tweets using tweepy and store them in a text file. I had to do some preprocessing to make the linguistic parsing easier: removing extra characters and incomplete words, and keeping URLs away from the natural language tool, as they weren’t being parsed properly.
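A minimal sketch of this kind of preprocessing follows. The regular expressions and the `preprocess` name are assumptions made for illustration; in particular, treating a word that ends in "…" as an "incomplete word" is a guess about what truncated tweets look like, not a description of the bot's real rules.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def preprocess(tweet):
    """Return (clean_text, urls): URLs are set aside and whitespace is normalised."""
    # Keep complete URLs for later use; ignore ones cut off by an ellipsis.
    urls = [u for u in URL_RE.findall(tweet) if not u.endswith("…")]
    text = URL_RE.sub("", tweet)
    # Assumption: an incomplete word is one that ends with the ellipsis character.
    text = re.sub(r"\S*…", "", text)
    # Collapse the runs of whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text).strip()
    return text, urls
```

Returning the URLs separately keeps them out of the text sent to the analyzer while leaving them available for reuse.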
Once the tweets are preprocessed and ready to be fed into the natural language tool, it’s time to make it work. The natural language tool I’ve used is FreeLing, and I’ve chosen it for two main reasons:
- The tweets I intended to work with are mainly in Spanish, and this library provides extensive resources for Spanish language processing.
- It was developed at the university where I studied, the Technical University of Catalonia, by its Natural Language research group.
Even though FreeLing has an unofficial Python API, I preferred using it as a standalone command-line tool, for convenience and because the API is for Python 2.x. If the analysis fails, the bot tweets me to say there has been an error.
With the configuration I use, FreeLing reads and tokenizes each tweet and performs a morpho-syntactic analysis. For each analyzed token it provides the lemma, the part of speech (PoS) and the probability of the tagging. A token can sometimes receive several candidate PoS tags, in which case a probability should be given for each.
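Calling a command-line analyzer from Python and reporting failures could look roughly like this. The helper name `run_analyzer` and the `notify` callback are hypothetical, and the command shown in the closing comment is an assumption about a local FreeLing installation rather than something taken from the bot.

```python
import subprocess

def run_analyzer(command, text, notify):
    """Pipe text through an external analyzer; call notify(message) on failure."""
    try:
        result = subprocess.run(
            command,
            input=text,
            capture_output=True,
            encoding="utf-8",
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        # OSError covers a missing executable; notify stands in for
        # whatever the bot uses to tweet its owner about the error.
        notify(f"Analysis failed: {exc}")
        return None
    return result.stdout

# Assumed usage; the command and config file name depend on the installation:
# output = run_analyzer(["analyze", "-f", "es.cfg"], tweets_text, send_alert)
```

Running the tool as a subprocess keeps the bot independent of the Python version the unofficial API targets.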
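Reading such an analysis back into Python might be sketched as below. It assumes one token per line with whitespace-separated fields in the order form, lemma, tag, probability; that layout is an assumption about the configured output, and the `Token` and `parse_line` names are invented for the example.

```python
from typing import NamedTuple, Optional

class Token(NamedTuple):
    form: str    # the word as it appears in the tweet
    lemma: str
    tag: str     # part-of-speech tag
    prob: float  # probability of the tagging

def parse_line(line: str) -> Optional[Token]:
    """Parse one 'form lemma tag prob' line; return None for blank or malformed lines."""
    fields = line.split()
    if len(fields) < 4:
        return None
    form, lemma, tag, prob = fields[:4]
    try:
        return Token(form, lemma, tag, float(prob))
    except ValueError:
        # The fourth field was not a number, so the line is not a token line.
        return None
```

Only the first analysis on a line is read here; any further candidate tags on the same line are ignored.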
For a tweet like
Mañana en Córdoba, en los desayunos de el Diario de Córdoba y con profesionales sanitarios investigadores y docentes.
The output of FreeLing would be:
For Twitter entities like users and hashtags, the analysis splits the symbol and the word into two lines, like:
The analysis is not Twitter-friendly: a Twitter-friendly one would detect Twitter entities as single words and tag them appropriately. To work around this, I keep track of the previous line while processing the analysis output line by line, so that these special cases can be handled.
Once all the words from the input tweets are grouped by their PoS tag and we have the morpho-syntactic structure of each tweet, it’s time to generate a new tweet. It’s done in two phases:
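One way to keep track of the previous line is sketched here. It works on (form, tag) pairs in analysis order; the `merge_entities` name and the `MENTION` and `HASHTAG` labels are made up for the example and are not tags produced by FreeLing.

```python
def merge_entities(tokens):
    """Join a lone '@' or '#' with the token that follows it.

    tokens is a list of (form, tag) pairs in analysis order. A merged
    token gets an invented tag, 'MENTION' or 'HASHTAG'.
    """
    merged = []
    pending = None  # the '@' or '#' token seen on the previous line, if any
    for form, tag in tokens:
        if pending is not None:
            symbol = pending[0]
            label = "MENTION" if symbol == "@" else "HASHTAG"
            merged.append((symbol + form, label))
            pending = None
        elif form in ("@", "#"):
            pending = (form, tag)
        else:
            merged.append((form, tag))
    if pending is not None:
        # A symbol at the very end has nothing to attach to; keep it as it was.
        merged.append(pending)
    return merged
```

Giving mentions and hashtags their own tags means a generated tweet only places them where the original tweets had one.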
- first randomly select a syntactic structure,
- then randomly select a word for each of the PoS tags in the selected structure.
At the end, if the generated tweet is shorter than the 140-character limit, we try to add a URL from those I collected while preprocessing the tweets. Adding a URL makes the generated tweets more attractive, as they will include pictures or external links.
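The URL step might be sketched like this. The `add_url` helper is hypothetical, and it simply counts raw characters against the limit, which is a simplifying assumption.

```python
import random

TWEET_LIMIT = 140

def add_url(tweet, urls, rng=random, limit=TWEET_LIMIT):
    """Append a randomly chosen URL if the result still fits within the limit."""
    # Only consider URLs that fit after the tweet plus a separating space.
    candidates = [u for u in urls if len(tweet) + 1 + len(u) <= limit]
    if not candidates:
        return tweet
    return f"{tweet} {rng.choice(candidates)}"
```

Filtering the candidates first means the tweet is returned unchanged whenever no collected URL fits.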
All in all, this twitterbot generates tweets like these:
Otra con nuestras leyes democráticas comenzará la mesa de los ingresos para los miembros Preparados con la concentración .
Candidato Aleatorio (@bot_candidato), January 14, 2016
I’ve implemented a couple more features that would fall into the growth hacking category, but I’ll leave them for the next post.
P.S. When installing FreeLing on Ubuntu, I found these two links to be useful: