Tracking a Website Block Post Comment Relationships

parkview
Guru Maker
Posts: 603
Joined: Tue Jun 24, 2014 8:25 pm
Location: Busselton
Contact:

Post by parkview » Tue Oct 04, 2016 7:35 pm

I enjoy the RSS feed from the website called The Conversation: https://theconversation.com/factcheck-have-eight-of-australias-12-most-emission-intensive-power-stations-closed-in-the-last-five-years-65036 I love some of the comments and people's interactions with the article and with other comments. This led me to ponder who is interacting with whom. So yesterday I set out to chart the comments.

I have used a Raspberry Pi loaded with Graphviz (http://www.graphviz.org/). I feed my Python script a URL and it spits back a GIF and an SVG graph of the comment relationships like this:
conversation-comments_5818.gif


The number of borders denotes the size of the comment, e.g. 1 border is less than 500 characters, 2 borders is 500-1500 characters and 3 borders is over 1500 characters long. These thresholds were set arbitrarily. The gray nodes are comments that have become orphaned because their parent posts were deleted by the moderator. I could look for the next parent up the tree and re-attach them there, but I think they kind of look cool as they are.
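The border rule above can be sketched in a few lines of Python. This is only an illustration, not the script used for the graphs: the function names are made up here, and the fill colours are taken from the examples in this thread (khaki for normal comments, gray for orphans).

```python
def peripheries_for(comment_text):
    """Return the number of node borders for a comment, based on its length."""
    length = len(comment_text)
    if length < 500:
        return 1      # short comment: single border
    if length <= 1500:
        return 2      # medium comment: double border
    return 3          # long comment: triple border


def fill_colour_for(has_parent):
    """Orphaned comments (parent deleted) are drawn gray, the rest khaki."""
    return "khaki" if has_parent else "gray"
```

The returned number would go straight into the Graphviz peripheries attribute of the node.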

Graphviz is a fantastic bit of free software that can be used to display relationships between objects. In the past I have used it to auto-generate an HR org chart of a 300+ employee organisation (data pulled from Active Directory using OpenLDAP) and to track and monitor network objects with data pulled from an NMAP scan of the network.

Here are a few more interesting graphs:
conversation-comments_45945.gif


In the comment tree below, I like how one person's comment sparked a LOT of further comments:
conversation-comments_36504.gif


All the nodes are hyperlinked, so if you open an SVG file in a modern browser, you can click on any of the nodes to bring up the actual article or comment. Unfortunately phpBB won't allow SVG files to be attached, so I can't demo that directly here.

The graphs have been shrunk down to fit into the forum webpage, but they can be rather large. At least an SVG file can be zoomed in without becoming pixelated, as it's a vector graphics file, as opposed to a GIF file which is a raster image.

How does a Graphviz data structure work?

It's all very simple. You have one line per node to describe how the node looks. Here is an example:

Code: Select all

567820 [color=black fillcolor=khaki peripheries=1 URL="https://theconversation.com/factcheck-did-carbon-emissions-fall-faster-before-the-carbon-price-36504#comment_567820" label="Casey Jones"];


Above, the first number is kind of a node ID. It can be anything, but has to be unique. Here I have used the comment ID number. The rest of the line is the description of the node and URL etc.

Then you have to show the relationship between the nodes in a separate line like this:

Code: Select all

567820 -> 36504


This asks Graphviz to automatically draw a line from node ID 567820 to node ID 36504. In this case, node ID 36504 is the webpage article ID number.

All my Python script does is parse the comment tree to find the relationships between the comments and write out the lines to a file, which is fed into Graphviz at the end.
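As a rough sketch of that last step, assuming the comments have already been parsed into a list of dictionaries: the dictionary keys, function names and the lightblue article node are invented for this example, and style=filled is added so that fillcolor takes effect. The real script may be structured quite differently.

```python
import subprocess


def quote(text):
    """Escape double quotes so the text is safe inside a DOT quoted string."""
    return str(text).replace('"', '\\"')


def build_dot(article_id, article_url, comments):
    """Return DOT source with one node line per comment and one edge per reply.

    Each comment is a dict with the hypothetical keys: id, author,
    parent_id (None when the parent was deleted) and peripheries.
    """
    lines = ["digraph comments {", "  node [style=filled];"]
    lines.append(
        f'  {article_id} [fillcolor=lightblue URL="{quote(article_url)}" label="Article"];'
    )
    for c in comments:
        orphan = c["parent_id"] is None
        fill = "gray" if orphan else "khaki"
        url = f'{article_url}#comment_{c["id"]}'
        lines.append(
            f'  {c["id"]} [color=black fillcolor={fill} '
            f'peripheries={c["peripheries"]} URL="{quote(url)}" '
            f'label="{quote(c["author"])}"];'
        )
        if not orphan:
            # Edge from the comment to whatever it replied to.
            lines.append(f'  {c["id"]} -> {c["parent_id"]};')
    lines.append("}")
    return "\n".join(lines)


def render(dot_source, basename):
    """Write the DOT file and ask the Graphviz dot command for SVG and GIF output."""
    with open(f"{basename}.dot", "w", encoding="utf-8") as fh:
        fh.write(dot_source)
    for fmt in ("svg", "gif"):
        subprocess.run(
            ["dot", f"-T{fmt}", f"{basename}.dot", "-o", f"{basename}.{fmt}"],
            check=True,
        )
```

A top-level comment would carry the article ID as its parent_id, so it gets an edge straight to the article node.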

That's all I really set out to do, but as I progressed and pondered the project, I started wondering about a few extra enhancements:
1) Attempt to analyse the comment text to see whether the commenter was for or against the article, or just fed up with another commenter. I have attempted text analysis before, and it's not an easy thing to do (for me), so this will be a low priority.
2) Crawl over the website and superficially analyse, say, 2000 articles. For each article I could record:
a) the article URL and comment file URL
b) the number of comments
c) the keywords or tags

From this I could have a go at working out the relationship between the number of comments and the article type, e.g. Energy, Immigration, Politics, Arts, Food etc. All very interesting :)
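One possible way to look at comments vs. article type, once the crawl data exists, is to average the comment counts per tag. This is a sketch only: the record layout (a dict with tags and comment_count keys) is an assumption, as nothing has been crawled yet.

```python
from collections import defaultdict


def average_comments_per_tag(articles):
    """Return {tag: average number of comments} over all crawled articles."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [comment total, article count]
    for article in articles:
        for tag in article["tags"]:
            totals[tag][0] += article["comment_count"]
            totals[tag][1] += 1
    return {tag: total / count for tag, (total, count) in totals.items()}
```

An article with several tags counts towards each of them, so popular tag combinations will influence each other's averages.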

Re: Tracking a Website Block Post Comment Relationships

Post by parkview » Wed Jun 12, 2019 8:58 pm

While I was travelling in Asia, one evening a group of us were chatting and one of the fellow HTTA tour ladies, Jenn, mentioned that she would like to have a go at analysing the sentiment of a large volume of text. That got me re-thinking about my Comment Relationship project.

It would be interesting to be able to determine whether each blog post comment is negative, neutral or positive. I could then have the Python program colourise the graph bubble appropriately. I could also see if some people are generally more negative or positive in their comment writing. Does a comment spark off a chain of negative or positive comments?
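A minimal sketch of how that colouring and the per-author comparison might work, assuming each comment has already been given a single sentiment label. The five label names follow the CoreNLP sentiment classes; the numeric scores, the colour choices and the function names are arbitrary picks for this example.

```python
from statistics import mean

# Label -> score, so that moods can be averaged.
SENTIMENT_SCORE = {
    "very negative": -2,
    "negative": -1,
    "neutral": 0,
    "positive": 1,
    "very positive": 2,
}

# Score -> Graphviz fill colour (X11 colour names).
FILL_COLOUR = {-2: "red", -1: "lightsalmon", 0: "lightgray", 1: "palegreen", 2: "green"}


def fill_colour(label):
    """Return the Graphviz fillcolor for a sentiment label."""
    return FILL_COLOUR[SENTIMENT_SCORE[label.strip().lower()]]


def author_mood(labelled_comments):
    """Average the sentiment score per author from (author, label) pairs."""
    scores = {}
    for author, label in labelled_comments:
        scores.setdefault(author, []).append(SENTIMENT_SCORE[label.strip().lower()])
    return {author: mean(values) for author, values in scores.items()}
```

A real comment usually contains several sentences, each with its own label, so some way of combining them into one label per comment would still be needed.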

So, here is where I am at tonight. Jenn found this NLP - Natural Language Processing software: https://stanfordnlp.github.io/CoreNLP/ I have installed Java 1.8 onto a FreeBSD 11.2 based server. The server has one 2.5GHz core and 8GB of RAM. I also downloaded both English data modules.

Initially I did try running it on a 4GB Server, but there wasn't enough free RAM to really run the software properly. Not really being that knowledgeable with Java products, I struggled to get it running properly, as it kept saying it couldn't find the correct class. I eventually found that if I ran the command from within the CoreNLP directory, it all worked well.

I entered my text into a file, and ran this command: time java -cp "*" -Xmx6g edu.stanford.nlp.sentiment.SentimentPipeline -file text3h.txt
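The classpath wildcard in -cp "*" is resolved relative to the current directory, which would explain why the command only worked from within the CoreNLP directory. A sketch of calling the same command from Python is shown here; it assumes the output arrives on standard output as a sentence line followed by a label line, as in the results listed in this post. The function names and the corenlp_dir parameter are made up for this example.

```python
import subprocess

LABELS = {"Very negative", "Negative", "Neutral", "Positive", "Very positive"}


def parse_sentiment_output(output):
    """Pair each sentence line with the label line that follows it."""
    pairs = []
    sentence = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in LABELS and sentence is not None:
            pairs.append((sentence, line))
            sentence = None
        else:
            sentence = line
    return pairs


def run_sentiment(corenlp_dir, text_file, heap="6g"):
    """Run the SentimentPipeline from inside the CoreNLP directory."""
    result = subprocess.run(
        ["java", "-cp", "*", f"-Xmx{heap}",
         "edu.stanford.nlp.sentiment.SentimentPipeline", "-file", text_file],
        cwd=corenlp_dir,        # so the "*" classpath finds the CoreNLP jars
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sentiment_output(result.stdout)
```

Because the working directory is changed, text_file should be an absolute path or a path relative to the CoreNLP directory.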

Here are some responses:

1) took 5 seconds to process some sample text I quickly typed up.

She looked very happy with the project.
Very positive
She was bemused by the project.
Neutral
She was unhappy with the project.
Negative
She was very unhappy with the project.
Negative
She really hated the project.
Negative
She hated the project.
Negative
She looked across the room to the project.
Neutral
She didnt careless about the project.
Negative
The cat jumped over the project.
Neutral
She loved the project.
Very positive
She looked forward to working on the project.
Positive
She threw the project across the room.
Neutral
She smashed the project.
Positive
She broke the project into many parts.
Positive
She stomped on the project.
Positive
She took the project apart.
Positive
Come on, can you at least give one sentence a very negative outcome?
Negative

The rest of the examples below are real world comments people have left on a public blog post.

2) Time wasn't recorded
In the next three years, we already know, beyond any shadow of a doubt that we will not end up in recession.
Negative
The Libs are in charge and they TOLD us that it will not happen - not with them and their steady hand on the till.
Negative
There word of honour stands as a witness to the truth of all they say.
Positive
How can we doubt them?
Neutral

3) 4.63 seconds to process.
The idea would be that cutting tax will increase consumption and make investments more productive.
Negative
However people might just save the money because they seem themselves needing it in the near future.
Neutral

4) 12.8 seconds, but this is just 1/3 of the comment.
These can be easily distinguished from distribution by their larger size, but there are a few other indicators that you are looking at a transmission line rather than a distribution line.
Positive
Transmission lines are always built with sets of three conductors with an optional small wire or two at the top of the structure to serve as lightning protection.
Negative
While a typical residential service may only include a single phase, the electric grid itself is a three-phase system and the transmission lines are meticulously balanced so that an equal amount of current flows on each of the three phases.
Negative
Too much savings, not enough productive investments to put it into, increasing inequality.
Negative
How is tax cuts going to help that?
Neutral
It sounds like the hoarders of money are ruining everything.
Negative

I wonder what a 'Very Negative' sentence might look like? ;)

Anyway, that's enough for the moment. Off to work on some other projects.

Re: Tracking a Website Block Post Comment Relationships

Post by parkview » Tue Jul 02, 2019 8:20 pm

In the past I have been manually running a Python script to parse each blog post that takes my interest. I then have to access a Samba share to view the SVG graph. I wasn't recording the blog topic tags either and the whole process was kind of rudimentary. It was time to at least automate the viewing process and start collecting the blog tags.

I had never had an opportunity to play with Flask (http://flask.pocoo.org/), the Python-based web framework. This mini project add-on was the perfect time to investigate how it all works. I have set it up on an internal server with a small HTML template file and some static files: a CSS style sheet and a JavaScript file, https://www.kryogenix.org/code/browser/sorttable/ (super easy to use), to handle the HTML table column sorting for me.

All the links are hyperlinked back to the various parts of the originating blog post. You can click through on the right-hand side SVG graph to open up a much larger zoomable SVG comment relationship graph. This shows which author responded to which. A table at the top displays the total number of comments per author. The webpage also lists the corresponding topic tags that have been assigned to each blog post. These are hyperlinked back to the site's list of all the blog posts associated with that topic.
The-Conversation_webpage.jpg
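The comments-per-author table could be built along these lines. This sketch leaves Flask out and only shows the counting and the table markup; the function names and the comment dict layout are assumptions. The class="sortable" attribute is what sorttable.js looks for when making a table sortable.

```python
from collections import Counter
from html import escape


def comments_per_author(comments):
    """Return (author, count) pairs, busiest author first."""
    return Counter(c["author"] for c in comments).most_common()


def author_table_html(counts):
    """Render the counts as an HTML table that sorttable.js can sort."""
    rows = "\n".join(
        f"<tr><td>{escape(author)}</td><td>{count}</td></tr>"
        for author, count in counts
    )
    return (
        '<table class="sortable">\n'
        "<tr><th>Author</th><th>Comments</th></tr>\n"
        f"{rows}\n"
        "</table>"
    )
```

In a Flask setup the same counts would more likely be passed to the HTML template and the table rows generated there.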

At the moment, I still have to manually add in each blog post URL, but I do have plans to allow bulk processing of blog URLs and eventually maybe even fully automate the process to parse all blog entries as they are posted.

There is a lot more interesting data analysis and data display that can be done with this project, so watch this space. Python is sooo much fun!

seaton
Master Maker
Posts: 222
Joined: Tue Jun 24, 2014 11:41 am
Location: Bunbury, WA
Contact:

Re: Tracking a Website Block Post Comment Relationships

Post by seaton » Tue Jul 16, 2019 2:44 pm

ooo lots of fun crunching the data, well done

Re: Tracking a Website Block Post Comment Relationships

Post by parkview » Sat Aug 17, 2019 5:17 pm

So, I have now collected and parsed a lot of 'The Conversation' blog posts and their comments. I thought it might be interesting to view a word cloud of all the blog post tags, or categories, that have been listed against each blog post. Blog posts can have anywhere from 1 to 13 tags registered at the bottom of the post (I did find one with 0, but I think that was an error).

I wrote a Python script to dump all the tag categories into a text file. Then, at the Unix command line on the RPi, I sorted the list, looked for and counted the unique words, and numerically sorted that list. Finally I used the CLI function of the Python wordcloud module (https://www.datacamp.com/community/tuto ... oud-python) to generate a nice looking image:

wordcloud_cli --text mytext.txt --imagefile wordcloud.png

This is the result:
wordcloud4.png
I could have done all of the above inside one Python script, but for me it was faster doing some of it at a UNIX CLI. The data is skewed by various interests and subjects that I follow ;)
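For comparison, an all-Python version might look something like this. It is a sketch only: the function names are made up, tags are counted as whole phrases rather than single words, and the second function needs the third-party wordcloud package installed.

```python
from collections import Counter


def tag_frequencies(tag_lists):
    """Count how often each tag appears across all blog posts."""
    counts = Counter()
    for tags in tag_lists:
        counts.update(tag.strip() for tag in tags if tag.strip())
    return counts


def write_wordcloud(frequencies, image_path):
    """Draw the word cloud image from a {tag: count} mapping."""
    from wordcloud import WordCloud  # third-party: pip install wordcloud

    cloud = WordCloud(width=800, height=400)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(image_path)
```

Counter.most_common() gives the same numerically sorted list that the sort and count steps at the command line produce.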
