Mirroring a PHPBB forum

Jan Land
05/20/2009 08:47 pm
Hello!

I want to extract all links that have the format:

http://www.megaupload.com/*

from the site: http://hd-bb.org/viewforum.php?f=60

username: gorgonzola
password: qwerty

These are my steps:

I started a new project named:
http://hd-bb.org/viewforum.php?f=60

Then I set the limit to 1 and checked ONLY "Text" as the File Filter.
In "Content Filter" I added http://www.megaupload.com and left everything else there at its default.
In Advanced/Passwords, I added the username/password above.
Offline Explorer parses the threads, but it downloads 0 files.
How can I fix this and get what I want?

Jan
Oleg Chernavin
05/21/2009 06:35 am
I think you should do it differently: do not use the Passwords section in the Project Properties dialog. Instead, log on to the site in the Internal browser. This will be enough.

Remove the keyword from Contents Filter. Use URL Filters - Server and add two keywords to the Included list:

www.megaupload.com
hd-bb.org

I think this should work.

Best regards,
Oleg Chernavin
MP Staff
Jan Land
05/21/2009 08:57 am
Hi!

This is what I did:

I logged in with the internal browser.
Then created a new project with the wizard and checked under File Filters only "Text".
Then under "URL Filters" I added what you told me in the Server Tab.
Then I started the project, but I noticed that the queue gets bigger and bigger.
Therefore I added at URL Omissions, the following:
http://hd-bb.org/memberlist.php?*
http://hd-bb.org/posting.php?*
http://hd-bb.org/report.php?*
http://hd-bb.org/search.php?*
http://hd-bb.org/ucp.php?*

The idea is that only the topics are parsed. But even so, it takes a lot of time.
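
A minimal Python sketch of what such an omission list does, for illustration only. It assumes that `*` is the only wildcard and that `?` is matched literally; how Offline Explorer itself interprets these patterns is not confirmed here, and the function name `is_omitted` is hypothetical.

```python
import re

# Patterns copied from the URL Omissions list above.
OMISSIONS = [
    "http://hd-bb.org/memberlist.php?*",
    "http://hd-bb.org/posting.php?*",
    "http://hd-bb.org/report.php?*",
    "http://hd-bb.org/search.php?*",
    "http://hd-bb.org/ucp.php?*",
]


def wildcard_to_regex(pattern):
    """Turn a pattern into a regex where only '*' acts as a wildcard (assumption)."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


COMPILED = [wildcard_to_regex(p) for p in OMISSIONS]


def is_omitted(url):
    """Return True if the URL matches any omission pattern in full."""
    return any(rx.fullmatch(url) for rx in COMPILED)
```

With these patterns, a topic URL such as http://hd-bb.org/viewtopic.php?f=14&t=15024 is kept, while anything under ucp.php, search.php and the other listed scripts is skipped.
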
So, I would ask the following:

How can I extract the links of the format http://www.megaupload.com/* from the first post of each thread? In a timely fashion of course. How would you do it? I am just interested in a list containing just the links.

Thank you for your time and great support!

Jan
Oleg Chernavin
05/21/2009 10:24 am
The problem is that the links to the megaupload site were not made as real links on the pages, just as plain text. Offline Explorer was not designed to pick up links from plain text. I think this is the reason why it doesn't download them.
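
To illustrate the difference, here is a small Python sketch. The HTML in `SAMPLE` is a made-up fragment, not taken from the forum, and the class name `AnchorCollector` is hypothetical. A collector that only looks at `<a href>` attributes, roughly what a link-following downloader relies on, misses a URL that appears as plain text, while a regular expression run over the raw page finds it.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href values from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Hypothetical post body: one real anchor, one URL written as plain text.
SAMPLE = (
    '<div class="postbody">'
    '<a href="viewtopic.php?t=1">Topic</a><br>'
    "http://www.megaupload.com/?d=EXAMPLE1"
    "</div>"
)

# Matches the URL up to the next whitespace, quote or angle bracket.
TEXT_URL = re.compile(r"http://www\.megaupload\.com/[^\s<>\"']+")

collector = AnchorCollector()
collector.feed(SAMPLE)

anchor_links = collector.hrefs          # only the real <a href> link
text_links = TEXT_URL.findall(SAMPLE)   # the plain-text megaupload URL
```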

Oleg.
Jan Land
05/21/2009 10:44 am
So, while being logged in with the internal browser and after downloading all topics on the server, will I be able to do a search for "http://www.megaupload.com/*", so that it outputs a list with links?

Sorry if this is a double post, but my last message wasn't posted by the forum.

Jan
Oleg Chernavin
05/21/2009 02:14 pm
Yes, you can run the search, but as I understand it, what you need is not a search but an extract of all such text. So far there is no such feature in Offline Explorer, so you will have to use some other text extraction software.

I think TextPipe Pro will work for this task. Please use the Tools - DataMining button in Offline Explorer Pro. This software is not easy to understand, but it is quite powerful.

I think its trial mode will still allow you to make the extraction.
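
As an alternative to a dedicated tool, a short script can do the same kind of extraction. The following Python sketch is only an illustration under assumptions: it assumes the downloaded pages are available as ordinary files in a folder on disk, and the function name `extract_links` and the exact URL pattern are hypothetical choices, not anything provided by Offline Explorer or TextPipe.

```python
import re
from pathlib import Path

# Matches a megaupload URL up to the next whitespace, quote or angle bracket.
LINK_RE = re.compile(r"http://www\.megaupload\.com/[^\s<>\"']+")


def extract_links(root):
    """Scan every file under root and return the unique links in first-seen order."""
    seen = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Saved pages may use mixed encodings; undecodable bytes are skipped.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for link in LINK_RE.findall(text):
            seen.setdefault(link, None)
    return list(seen)
```

Calling `extract_links` with the folder that holds the saved pages returns a plain list of links, which can then be written to a text file one per line.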

Oleg.
Jan Land
05/21/2009 02:41 pm
Hi Oleg!

There is one problem:

There were 57000 files extracted, but none of them contains the data you get after logging in to the forum. When I open a file in a browser, I am asked to log in. The files on my hard disk contain no links at all. How can I download the posts that I see after I log in to the forum?
Oleg Chernavin
05/21/2009 03:08 pm
Please log on to the forum in the Internal browser and then start the download. This should be enough.

Oleg.
Jan Land
05/22/2009 07:55 pm
Hi!

I am still having problems. Do you know what works?
If I go on a thread in the internal browser, like:
http://hd-bb.org/viewtopic.php?f=14&t=15024

log in, and then right-click and choose the menu item "Offline Explorer: Download the current page", everything works as it should.
Now, how do I get the same result for ALL threads in the section:
http://hd-bb.org/viewforum.php?f=14

If I do the following, I don't get any data, only files which require me to log in to read the post:

1) I am going to http://hd-bb.org/viewforum.php?f=14 and log in with gorgonzola/qwerty
2) I click "New Project", add http://hd-bb.org/viewforum.php?f=14 as URL
3) I choose only text as File Filters
4) I keep everything else default and start the program

As a result, I get for instance files of the form viewtopic.php@f=14&t=*
But when I open these files in a browser, I am requested to log in to the forum. I am not getting the topic itself, but a page which requires me to log in. This happens with every topic!

What do I need to fix in order to get the same result I get when I save an individual topic with Offline Explorer?

I am sorry if I am annoying you, Oleg, but I think that Offline Explorer can really do what I want.

Jan
Oleg Chernavin
05/23/2009 09:10 am
Maybe you are using an older version of Offline Explorer? It may follow logout links. This will not happen in version 5.5.

Oleg.
Jan Land
05/23/2009 12:08 pm
I am using the latest version. Could you try a test run? I am pretty sure you will get the same results as I do.
Oleg Chernavin
05/25/2009 05:42 am
I did the following: created a new Project with URL:
http://hd-bb.org/viewforum.php?f=60
Level=1
Unchecked all File Filters categories except Text, as in your setup.
URL Filters - Server, Directory - "Load only from the starting...".
URL Filters - Filename - added the following to the Included list:

viewtopic

File Filters - Text - Ignore Logout Links box is checked. This worked OK for me. I downloaded 209 pages, all of them in the logged-on state.
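
For readers who want to see the effect of these filters spelled out, here is a rough Python sketch of the decision they describe for links found on the starting page. It is an approximation under assumptions, not Offline Explorer's actual logic: the function name `should_download` is hypothetical, the Directory filter is modelled as "same folder as the starting URL", and the Ignore Logout Links option is not modelled at all.

```python
from urllib.parse import urlsplit

START = "http://hd-bb.org/viewforum.php?f=60"


def split_dir_and_name(path):
    """Split a URL path into (directory, filename)."""
    directory, _, filename = path.rpartition("/")
    return directory, filename


def should_download(url, start=START):
    """Approximate the Server, Directory and Filename filters from the setup above."""
    s, u = urlsplit(start), urlsplit(url)
    # Server filter: stay on the starting server.
    if u.netloc != s.netloc:
        return False
    s_dir, _ = split_dir_and_name(s.path)
    u_dir, u_name = split_dir_and_name(u.path)
    # Directory filter: stay in the starting directory (assumed meaning).
    if u_dir != s_dir:
        return False
    # Filename filter: the Included list contains the keyword "viewtopic".
    return "viewtopic" in u_name
```

Under this sketch only the topic pages linked from the forum index pass, which matches the intent of downloading the threads and nothing else.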

In the Ribbon - Internet tab it is important to have checked the following:

Use MS Internet Explorer cookies
Use alternative connection method

Oleg.