A few months ago I started learning Python, and my first project was a scraping script for the earthquake database in Romania.
After that I completed the Dataquest.io course “Python for Data Science” and learned a lot of new things. Alongside it, I also started learning SSIS (SQL Server Integration Services) on Pluralsight.com, because I wanted a deeper understanding of an ETL tool.
First project
This project started from the question “What is the word people use the most when they post on social media?” …and that was it.
I did some research on this and found out that I needed to know how to use at least 3 tools:
- Python for text analysis
- SSIS for making the output of Python more usable in a BI tool
- Tableau to get the insight in a visual form
To make the flow easier to understand, I have also posted a diagram of it:
Python
The core data processed with Python was downloaded from data.world. It contains 20,000 tweets, each with a username and gender.
Starting from this, the script breaks each tweet into words and then counts them to get the frequency of each word.
After a few hours of work, I had a script telling me that the most used word on Twitter in those days was “the” (by males) and “and” (by females).
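The split-and-count step can be sketched roughly as below. This is not the original script, only a minimal illustration: the two sample rows and the `tokenize` helper are assumptions I made up, and the real input is the data.world file.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample rows standing in for the real data: (gender, tweet text).
tweets = [
    ("male", "The game and the goal"),
    ("female", "Coffee and cake and the sun"),
]

def tokenize(text):
    """Lowercase the text and return its runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

# One Counter per gender, filled tweet by tweet.
counts = defaultdict(Counter)
for gender, text in tweets:
    counts[gender].update(tokenize(text))

# Show the single most frequent word for each gender.
for gender, counter in counts.items():
    print(gender, counter.most_common(1))
```

A `Counter` per gender keeps the two vocabularies apart from the start, which makes it easy to export them separately later.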
SSIS (SQL Server Integration Services)
The SSIS package is quite robust for this project. In fact, the most annoying part was the Code page setting of the CSV connection, which caused me a lot of problems in the flow.
With Python, I wrote the genders to separate files in order to keep a clear track of them. Even so, we are talking about 60k lines in the end.
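One way to reduce code page surprises is to write the per-gender files with an explicit encoding, so the Code page of the SSIS connection can be set to match (UTF-8 is code page 65001 on Windows). The sketch below is an assumption about how the export could look, not the original script: the file names, column names and sample counts are all hypothetical.

```python
import csv
import os
import tempfile
from collections import Counter

# Hypothetical per-gender word counts; the real ones come from the tweet data.
counts = {
    "male": Counter({"the": 2, "game": 1}),
    "female": Counter({"and": 2, "coffee": 1}),
}

out_dir = tempfile.mkdtemp()  # assumed output folder, for illustration only
paths = {}
for gender, counter in counts.items():
    path = os.path.join(out_dir, f"words_{gender}.csv")
    # newline="" is what the csv module expects; the encoding is explicit
    # so the consumer of the file knows which code page to use.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["gender", "word", "count"])
        for word, n in counter.most_common():
            writer.writerow([gender, word, n])
    paths[gender] = path
```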
Tableau comes into action
Because this is an atypical data set with multiple anomalies, it was difficult to find one good visualization to represent everything.
That’s why I split the analysis in 2:
- Pronouns analysis. I was curious about who we are talking about (me, you, they…)
- and the Men vs Women war of words
Pronouns analysis in Tableau
This is how I discovered that half of the time we are talking about ourselves.
It’s not the best viz…but I wanted to represent the whole in one piece.
And “We” is almost in the last position.
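The same kind of pronoun share can also be checked in Python before the data reaches Tableau. This is only a sketch under assumptions: the pronoun list is my own guess at what was tracked, and the counts are made-up numbers, not results from the tweet data.

```python
from collections import Counter

# Hypothetical word counts; the real ones come from the tweet data.
word_counts = Counter({"i": 5, "you": 3, "the": 9, "we": 1, "they": 1})

# Assumed pronoun list; the post does not say which pronouns were tracked.
PRONOUNS = {"i", "me", "you", "he", "she", "we", "they"}

# Keep only the pronouns, then turn each count into a share of all pronouns.
pronoun_counts = {w: n for w, n in word_counts.items() if w in PRONOUNS}
total = sum(pronoun_counts.values())
shares = {w: n / total for w, n in pronoun_counts.items()}

# Print the pronouns from most to least used.
for word, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {share:.0%}")
```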
Men vs Women war of words
In order to have a full view of this, I used a horizontal line chart where the women are on the left side and the men are on the right side.
We can see that most of the words are used on both sides, with a few exceptions, but nothing really significant.
As you can see, women use more words than men, and their vocabulary is also more diverse.