File Filters and Duplicates in OE 6
|Sean||11/09/2011 09:24 am|
I am trying to download specific content (jpg's and mov's) from the members area of amateurallure.com. I have configured the project as best I can but I am receiving unwanted files and duplicates.
I am trying to download only *.mov and *.jpg files, but NOT certain .mov and .jpg files (via excluded entries).
Example: I want video01.mov and video01.jpg but NOT videosm01.mov or video_th.jpg or any other video or image type.
I am also getting 2 of every file downloaded, which wastes bandwidth and space.
Lastly, files that aren't video or image are being downloaded - I don't know what these are but they are the same size as video files and have the video file name in them. I have tried to exclude them but they are being downloaded anyway.
Example: This is one of many downloaded files that are neither .mov nor .jpg - http://www.amateurallure.com/members/_girls/maelynn/_video/maelynn02.mov?PSSO=ei9Vb2RraGgramVtUUlxTEpjc1Q4aDRiVXlITDI3d3E1Mm84S0tJeUpsTG5iK2E5cTNpTmJiY1ZmODVNVVVWZQpyTUN5dS94VlQ3MzM2SXlra2MwZzZjQzlLdWYvZVlmdAo*
My project settings are... (I am providing only settings that are selected and/or not blank)
Project > Addresses: http://www.amateurallure.com/members/
Level Limit: unchecked
Do Not download existing files: selected
File Filters > Images (jpeg and jpg) and Videos (mov) checked
Load Using URL filter settings selected for both
URL Filters > Directory: Load files only within the starting directory and below selected
FileName: Load files only with the starting filename is NOT selected
Excluded Keywords: *_th.jpg *sm0*.mov *.mov@psso* *.mov?psso*
Content Filters > When keywords are found in a page - save these pages is checked
All settings beneath advanced are the defaults - only thing I have added is the download location and userid password.
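For reference, here is a rough Python sketch of what the Excluded Keywords above would be expected to reject. It is not how Offline Explorer itself is implemented: the thread does not show the program's matching rules, so case-insensitive wildcard matching against the whole URL is an assumption. The function name is_excluded, the example.com URLs, and the shortened PSSO token are hypothetical.

```python
from fnmatch import fnmatchcase

# Exclusion keywords copied from the project settings above.
EXCLUDED = ["*_th.jpg", "*sm0*.mov", "*.mov@psso*", "*.mov?psso*"]


def is_excluded(url, patterns=EXCLUDED):
    """Return True if the URL matches any exclusion pattern.

    Assumption: matching ignores case and applies to the full URL.
    Note that fnmatch treats '?' as "any single character", which
    also matches a literal '?' in a query string.
    """
    lowered = url.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in patterns)


if __name__ == "__main__":
    base = "http://example.com/members/"  # hypothetical host
    for name in ["video01.mov", "video01.jpg", "videosm01.mov",
                 "video_th.jpg", "video01.mov?PSSO=abc"]:
        print(name, "excluded" if is_excluded(base + name) else "kept")
```

Under these assumptions, the wanted files (video01.mov, video01.jpg) are kept and the thumbnail, small-video, and ?PSSO= variants are rejected, which is the behaviour I am after.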
The directory structure off of the members folder is...
I think some of these have the same content beneath them but in different folder names and locations (hence the duplicates).
Any help you could offer would be greatly appreciated. I realize there is a lot of information here but I figure the more you know the better you'll be equipped to assist.
|Oleg Chernavin||11/09/2011 09:29 am|
|Everything looks correct from your description. Can you post the exact settings here? Select the Project, press Ctrl+C and paste into a forum message.
|Oleg Chernavin||11/09/2011 09:30 am|
|Regarding duplicates - please try to uncheck the "Check files integrity" box in the Properties - Parsing section. Would this help?
|Sean||11/10/2011 06:09 pm|
|Below find project settings gathered via Project > Properties > CTRL+C
LastStarted=11/8/2011 7:04:53 AM
LastEnded=11/8/2011 7:38:13 AM
|Oleg Chernavin||11/10/2011 06:21 pm|
|The settings are correct. There could be a possible bug with the unchecked File Filters sections. Can you please give me access to the site via email@example.com ? I will make the download with your settings and see if this reproduces.
|Oleg Chernavin||11/11/2011 04:03 pm|
|Yes, I got the e-mail and am trying to reproduce this.
|Oleg Chernavin||11/11/2011 04:22 pm|
|I made a download (not complete) and enabled logging to see the rejected URLs.
The log clearly shows that the _tn.jpg files and the video files with .mov?psso=.... are not allowed for download by the parser. It works correctly.
Regarding the duplicates - did you try to uncheck the "Check files integrity" box? Does it help? If not, please describe what kind of duplicates you have and what their URLs are. Also, on which pages can I find these links?
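One possible way to collect that list of duplicates is sketched below in Python. It is independent of Offline Explorer and simply groups files in the download folder by content hash. The function name find_duplicates and the folder path in the usage note are hypothetical, and the sketch assumes "duplicate" means byte-for-byte identical content saved under different paths.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root):
    """Group files under root by SHA-256 digest.

    Returns a list of groups; each group is a list of paths whose
    contents are byte-for-byte identical (only groups of 2+ files).
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large video files are not loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        groups[digest.hexdigest()].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling find_duplicates on the project's download folder (for example a hypothetical C:\Download\members) and printing each group would show which folder pairs hold the same file, which may help trace the duplicates back to the pages that link to them.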