Downloading whole website
|Simon||07/28/2016 12:16 pm|
|This is my first time trying to download a website, and I'm hoping someone can help with a few questions.
The website consists of mostly text and mp4 videos.
This is the process I use (please correct me if wrong):
Project Wizard -> Download regular website (Download regular site vs. download and convert: should I convert before, after, or not at all? Which format should I convert to: CHM, compressed EXE, MAFF, or MHT?)
Turn off Level Limit (I want to download the whole site)
Load only within the starting URL vs. Load only from the starting server (I've read the definitions in the online help but still don't understand which one to pick)
Generate site map (How can I use this to estimate website size before downloading?)
Final question: How can I exclude certain video files? For example, given tutorial01_hd.mp4 and tutorial01_sd.mp4, I only want the HD file.
Thanks in advance.
|Oleg Chernavin||07/28/2016 05:27 pm|
|You don't have to choose the conversion method now. You can convert the downloaded site to any format later using the Export Files button.
The difference between the Starting URL and Starting Server options is shown on the corresponding Wizard screen: the URL part is displayed, and the download stays within that part.
This matters when you download from a subdirectory, like http://www.server.com/dir/. But if your starting address is just the server with no subdirectory, the two options are equivalent.
Unfortunately, generating a site map is a very lengthy process, only a bit shorter than the real site download. And you would have to download the site again after the size estimate. I would do the site download right away.
To exclude _sd files, choose to set up advanced Project Properties on the last Wizard step. In the Project Properties dialog, select File Filters - Video and make sure its Location box is set to "Load using URL Filters". Then go to the URL Filters - Filename section and add to the Excluded filename keywords list:
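For illustration only, a keyword-based filename exclusion can be thought of as a substring test on the file name part of each URL. The sketch below is not Offline Explorer's code: the function name `is_excluded`, the case-insensitive matching, and the `_sd` keyword are assumptions based on the file names in the question.

```python
import posixpath
from urllib.parse import urlsplit


def is_excluded(url, excluded_keywords):
    """Return True if the file name part of url contains any excluded keyword.

    Hypothetical helper: matching is a case-insensitive substring test,
    which is an assumption, not documented Offline Explorer behavior.
    """
    filename = posixpath.basename(urlsplit(url).path).lower()
    return any(keyword.lower() in filename for keyword in excluded_keywords)


# File names taken from the question; the server and directory are placeholders.
urls = [
    "http://www.server.com/dir/tutorial01_hd.mp4",
    "http://www.server.com/dir/tutorial01_sd.mp4",
]

# Keep only the links whose file name does not contain the assumed keyword.
kept = [u for u in urls if not is_excluded(u, ["_sd"])]
print(kept)
```

With the assumed `_sd` keyword, only the `_hd` link remains in `kept`.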
|Simon||07/28/2016 06:10 pm|
|Thanks for the information. Unfortunately, I'm still a bit confused. As per the online help:
1) If you want to download web pages only within the starting URL, select "Load only within the starting URL."
2) If you do not want to load from other servers linked to the starting URL, select "Load only from the starting server."
Could you please explain this with an example?
Do both options download the whole website?
What are "other servers"? Is that another website linked from the original one, or another part of the same website?
|Oleg Chernavin||07/28/2016 06:35 pm|
|Let's take the starting URL http://www.apple.com/iphone/ as an example.
The Starting URL option will load only the iPhone section, and only from the Apple web site.
The following addresses will not be downloaded:
and others. The following links will be loaded:
and so on.
But if you choose the Starting Server option, all links on the Apple site are allowed, because these addresses start with http://www.apple.com/...
Other servers are skipped, like the link to https://support.apple.com/.
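As a rough sketch of the two rules described above (not how Offline Explorer implements them), the scope check can be modeled as a host comparison plus, for the Starting URL option, a directory prefix test. The function names and the sample links other than the starting address and the support link are hypothetical, and the sketch assumes the scheme (http vs. https) is ignored.

```python
from urllib.parse import urlsplit


def _directory(path):
    """Directory part of a URL path, always ending with '/'."""
    if not path:
        return "/"
    return path if path.endswith("/") else path.rsplit("/", 1)[0] + "/"


def within_starting_server(start, link):
    """Starting Server rule: the link must be on the same host as the start."""
    return urlsplit(link).hostname == urlsplit(start).hostname


def within_starting_url(start, link):
    """Starting URL rule: same host, and the path lies under the start directory."""
    s, l = urlsplit(start), urlsplit(link)
    return s.hostname == l.hostname and (l.path or "/").startswith(_directory(s.path))


start = "http://www.apple.com/iphone/"
links = [
    "http://www.apple.com/iphone/compare/",  # hypothetical page inside the section
    "http://www.apple.com/mac/",             # hypothetical page elsewhere on the site
    "https://support.apple.com/",            # other server
]

for link in links:
    print(link, within_starting_url(start, link), within_starting_server(start, link))
```

In this model, a starting address with no directory, such as http://www.apple.com, has the prefix `/`, so both checks accept the same links.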
|Simon||07/28/2016 06:48 pm|
|Thanks for the explanation, I think I get it now.
So if the starting URL is http://www.apple.com,
then the Starting URL option and the Starting Server option would download the same files.
|Oleg Chernavin||07/28/2016 07:06 pm|
|To avoid this confusion, I have just improved the code to hide the Starting URL option when the address has no directories.