
Board index » Website Things » Suggestions/Requests




 Post Posted: Mon Dec 26, 2016 7:17 am 
Joined: Fri Sep 16, 2016 10:05 pm
Posts: 497
Website: http://lateralbreakdown.blogspot.com.au/
This is a bit of an odd request, but is there a way to get a plain text/database dump containing the contents of a forum thread?

I've got an idea for converting threads into Thought Bubbles, but I need the formatting to test it.

If Erfworld uses a freely available forum package, knowing that might be a viable alternative.

_________________
Save money, Pop an Heir! - My first ever fanfic attempt.

My blog

     Post Posted: Mon Dec 26, 2016 9:29 pm 
    Joined: Sun Jun 23, 2013 6:54 pm
    Posts: 319
    It looks to be phpBB, which is even open source.

     Post Posted: Wed Dec 28, 2016 8:04 am 
    Joined: Fri Sep 16, 2016 10:05 pm
    Posts: 497
    Website: http://lateralbreakdown.blogspot.com.au/
    Jaxad0127 wrote:
    It looks to be phpBB, which is even open source.


    That's both good and bad.

    Good: it's probably very easy to find its architecture.

    Bad: it might not match the defaults.

    But it's enough for me to start playing with.

     Post Posted: Fri Jan 06, 2017 9:21 am 
    Joined: Fri Sep 16, 2016 10:05 pm
    Posts: 497
    Website: http://lateralbreakdown.blogspot.com.au/
    Unless I can find a better way to extract the code, I'm going to have to write this off as beyond my capabilities. However, that's not to say that it's beyond the capabilities of someone else. Here was my plan and how I would have enacted it:

    1) Extract source for all pages of a particular thread to text.

    2) Extract all text contained within the <div class="postbody"> tag. If it contains a quotecontent element, include its text as a separate column. Do this for the first quote only.

    3) Import into Access, with a primary key column. Posts should now be numbered in the order they were posted, with outermost quotes listed.

    4) Create query using this table with the columns primary key, postbody and a new column called origin.

    5) If postbody does not contain <div class="quote* then make origin equal to one.

    6) If it does, make origin equal to the primary key of the first row whose postbody contains the contents of the quote.

    7) Export the query result, and we now have the structural information that we would need to create a bulk import specification for a mind mapping tool.

    8) Import this information into a mind mapping tool, and we have Thought Bubbles!
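
Steps 4 to 6 could be sketched in Python roughly as follows. This is only an illustration of the plan above, not a tested tool: the function name `assign_origins` and the (postbody, first_quote) pair layout are assumptions, and the matching is a plain substring test against earlier posts only.

```python
from typing import List, Optional, Tuple


def assign_origins(posts: List[Tuple[str, Optional[str]]]) -> List[int]:
    """Return, for each post, the 1-based number of the post it replies to.

    posts is a list of (postbody, first_quote) pairs in posting order.
    A post with no quote, or whose quote matches no earlier post, gets
    origin 1 (the opening post), mirroring step 5 of the plan.
    """
    origins = []
    for index, (_, quote) in enumerate(posts):
        origin = 1
        if quote:
            # Look only at earlier posts; the first match wins (step 6).
            for earlier in range(index):
                if quote in posts[earlier][0]:
                    origin = earlier + 1
                    break
        origins.append(origin)
    return origins


thread = [
    ("Is there a way to get a dump of a thread?", None),
    ("It looks to be phpBB, which is even open source.", None),
    ("That's both good and bad.", "It looks to be phpBB"),
]
print(assign_origins(thread))  # [1, 1, 2]
```

Restricting the search to earlier posts keeps a post from matching itself, since its own body also contains the quoted text.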


    The main parts that are tripping me up are:

    *getting the forum posts - The tool I found to grab the pages grabs everything within that directory and its subdirectories, so it would be a server hog and pull lots of unnecessary data. While I can do it manually, I was hoping to semi-automate the process, since it would be nice to have one for each reaction thread at the least.

    *extracting the text - I think I could do it with a regex, something like /$s/<div class="postbody">(.*((quotecontent.*)?)</div>\t/(1)\,(2), which should say "grab everything after post body until the next tab", but my regexes aren't the best. It would also need cleaning afterwards to remove the quotecontent part. I dunno how I can make that part optional without stopping it from doing what I want.

    *finding mind mapping software that accepts bulk import from CSV - haven't looked too hard into this yet.
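
On the first point, one way to avoid a recursive crawl is to request only the pages of the one thread. phpBB pages a topic through the `start` parameter of `viewtopic.php`, so the list of pages can be generated up front. The sketch below assumes that scheme; the board address, the topic id and the posts-per-page value of 15 are made-up placeholders that would need to be read off the real forum.

```python
from typing import List
from urllib.parse import urlencode


def thread_page_urls(base_url: str, topic_id: int, total_posts: int,
                     posts_per_page: int = 15) -> List[str]:
    """List the viewtopic URLs that cover one thread, and nothing else."""
    urls = []
    for start in range(0, total_posts, posts_per_page):
        query = urlencode({"t": topic_id, "start": start})
        urls.append(f"{base_url}/viewtopic.php?{query}")
    return urls


# Hypothetical board address and topic id, purely for illustration.
urls = thread_page_urls("https://forum.example.com", 1234, 40)
for url in urls:
    print(url)
```

Fetching those few URLs one at a time, with a pause between requests, is far lighter on the server than mirroring the whole directory.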

    Anyone got ideas or suggestions that could streamline it a bit and make it more feasible?

     Post Posted: Wed Jan 11, 2017 6:14 pm 
    Joined: Wed Jan 04, 2017 3:24 pm
    Posts: 200
    You have a bunch of hidden assumptions here:

    Knavigator wrote:
    <we can scrape the post content from the page html>

    I strongly recommend against that: it's awkward, unreliable, and prone to breaking if the site admins tweak the page styles even a little bit.

    The best representation of a post to go on is its actual bbcode. The best way to read that would be to have access to the forum database, but that's not likely to happen. In theory the admins could set up some API access to post data, but you'd have to give a bloody good reason for them to go to all that hassle.

    You could, I suppose, use the forum quote function itself to see the bbcode of somebody's post. However, any large-scale scraping of all threads on the forum is going to create unnecessary load on the servers, so don't be surprised if you get told off.

    Knavigator wrote:
    <each post quotes zero or one other posts>

    Not true: I often quote several people at once if I only have a short response to each of them. I find it tidier than posting multiple replies.
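
If the bbcode of a post is available, all of its outermost quotes can be collected rather than just the first. The following is a minimal sketch that assumes the usual `[quote="name"]...[/quote]` form; the function name `top_level_quotes` and the sample users Alice and Bob are invented for the example, and other bbcode tags are left untouched.

```python
import re
from typing import List, Optional, Tuple

# Matches an opening [quote] or [quote="name"] tag, or a closing [/quote].
QUOTE_TAG = re.compile(r'\[quote(?:="([^"\]]*)")?\]|\[/quote\]', re.IGNORECASE)


def top_level_quotes(bbcode: str) -> List[Tuple[Optional[str], str]]:
    """Return (author, quoted_text) for every outermost quote in a post."""
    quotes = []
    depth = 0
    author = None
    start = 0
    for match in QUOTE_TAG.finditer(bbcode):
        is_closing = match.group(0).lower() == "[/quote]"
        if not is_closing:
            if depth == 0:
                # Remember who is quoted and where the quoted text begins.
                author = match.group(1)
                start = match.end()
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                quotes.append((author, bbcode[start:match.start()].strip()))
    return quotes


post = (
    '[quote="Alice"]First point[/quote]\nAgreed.\n'
    '[quote="Bob"][quote="Alice"]First point[/quote]Not sure.[/quote]\n'
    'Fair enough.'
)
for who, text in top_level_quotes(post):
    print(who, "->", text)
```

A depth counter is used because quotes nest, so a single non-greedy pattern would stop at the first inner `[/quote]`.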

    Knavigator wrote:
    <quotes can be traced back to the original post>

    As I'm demonstrating right now, quotes can be paraphrased, or even invented out of whole cloth. :-)
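
An exact text match would miss a quote that has been trimmed or reworded. A rough word-overlap score can still catch lightly edited quotes, while an invented quote simply finds no source. The sketch below is one possible heuristic, not an established method: `word_overlap`, `best_source` and the 0.6 threshold are all assumptions chosen for illustration.

```python
import re
from typing import List, Optional


def word_overlap(quote: str, body: str) -> float:
    """Fraction of the quote's distinct words that also occur in body."""
    quote_words = set(re.findall(r"[a-z0-9']+", quote.lower()))
    body_words = set(re.findall(r"[a-z0-9']+", body.lower()))
    if not quote_words:
        return 0.0
    return len(quote_words & body_words) / len(quote_words)


def best_source(quote: str, earlier_posts: List[str],
                threshold: float = 0.6) -> Optional[int]:
    """Index of the earlier post that best matches quote, or None."""
    best_index, best_score = None, 0.0
    for index, body in enumerate(earlier_posts):
        score = word_overlap(quote, body)
        if score >= threshold and score > best_score:
            best_index, best_score = index, score
    return best_index


earlier = [
    "Is there a way to get a plain text dump of a forum thread?",
    "It looks to be phpBB, which is even open source.",
]
print(best_source("looks like phpBB, which is open source", earlier))   # 1
print(best_source("each post quotes zero or one other posts", earlier))  # None
```

Short quotes made of common words will produce false matches, so any result from a score like this would still need checking by eye.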

    Knavigator wrote:
    Import this information into a mind mapping tool, and we have Thought Bubbles!

    Okay, but... now what? You've essentially created a threaded-discussion view of a forum topic (instead of phpBB's usual linear view), but it's not clear what this is useful for.

    Knavigator wrote:
    The tool I found to grab the pages grabs everything within that directory and its subdirectories, so it would be a server hog and pull lots of unnecessary data.

    Running large numbers of wget -r queries against the forum server is probably a good way to piss off the admins, yeah.

    Knavigator wrote:
    <parsing HTML with regular expressions>

    You can't do that.
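
The usual alternative is a real HTML parser, which copes with nested tags where a regular expression does not. Here is a minimal sketch using Python's standard `html.parser`. The `postbody` and `quotecontent` class names come from the plan quoted above; the class `PostBodyExtractor` and the sample markup are made up for the example and will not match the real page exactly.

```python
from html.parser import HTMLParser
from typing import List


class PostBodyExtractor(HTMLParser):
    """Collect the text of every <div class="postbody"> in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.posts: List[str] = []
        self._depth = 0                  # <div> nesting inside a postbody
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1             # a nested div, e.g. a quote box
        elif "postbody" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the text pieces and collapse runs of whitespace.
                self.posts.append(" ".join(" ".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


# Invented sample markup; a real page would be fed in the same way.
page = (
    '<div class="postbody"><div class="quotecontent">It looks to be phpBB'
    "</div>That's both good and bad.</div>"
    '<div class="postbody">Thanks for the feedback.</div>'
)
extractor = PostBodyExtractor()
extractor.feed(page)
extractor.close()
print(extractor.posts)
```

This still depends on the page styles staying as they are, so it shares the fragility of any scraping approach; it only removes the regex problem.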

     Post Posted: Mon Feb 13, 2017 12:57 am 
    Joined: Fri Sep 16, 2016 10:05 pm
    Posts: 497
    Website: http://lateralbreakdown.blogspot.com.au/
    Thanks for the feedback.

    If you copy the forum part of a page into Excel, it puts it into two columns, with post content on the right and user name and badges on the left.

    Every line break splits a post into a new cell, but each post starts the cell after one containing "Post Subject" as a hyperlink and ends just before a horizontal rule.

    A user's join date and post statistics are in an extended cell to the left of the post content.

    Quotes are given their own cell after one that says "[User] wrote:" and the quoted text is in a different font to the main body.

    That might be enough info to clean it up for an SQL database. I'm still going to work on it, but it's not really a priority.
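
As a rough idea of what the SQL side could look like, here is a sketch using Python's built-in `sqlite3` with an in-memory database. The table and column names are assumptions, and the quote links sit in their own table so that one post can point at several earlier posts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id     INTEGER PRIMARY KEY,   -- posting order within the thread
        author TEXT NOT NULL,
        body   TEXT NOT NULL
    );
    CREATE TABLE quotes (
        post_id   INTEGER NOT NULL REFERENCES posts(id),
        origin_id INTEGER NOT NULL REFERENCES posts(id)
    );
""")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", [
    (1, "Knavigator", "Is there a way to get a dump of a thread?"),
    (2, "Jaxad0127", "It looks to be phpBB, which is even open source."),
    (3, "Knavigator", "That's both good and bad."),
])
conn.execute("INSERT INTO quotes VALUES (3, 2)")  # post 3 quotes post 2

# Each row is one reply link: (quoted post, replying post, replying author).
edges = conn.execute("""
    SELECT q.origin_id, q.post_id, p.author
    FROM quotes AS q JOIN posts AS p ON p.id = q.post_id
    ORDER BY q.post_id
""").fetchall()
print(edges)  # [(2, 3, 'Knavigator')]
```

The `edges` rows are already in a parent/child shape that could be written out as CSV for a mind mapping tool.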
