Project not obeying "Skip existing files on levels higher than" option

Offline Explorer Pro

The GermRod

12/29/2015 07:24 pm

Hi Oleg:

I use OE a lot for news. I put news sites in the URL field and use a level of 1 with a timer of about 10 minutes to download newly published articles. The website below does not obey the "Skip existing files on levels higher than 1" option and re-downloads articles that have already been downloaded. When I search for "Recently loaded files only" in the "Find Contents" dialog, old news articles are mixed in with the new results. Please help.

[Object]
OEVersion=Pro 7.0.4408
Type=0
IID=8212
Caption=DNAInfo
URL=http://www.dnainfo.com/new-york/index/all
http://www.dnainfo.com/new-york/
Additional=DisableScripts;DisableJava;SkipIFrames;donotparseexistingfiles
Channels=1
MVer=5
Lev=1
When=5
Minute=10
Weekday=257
FMGroup=3
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwfwebp
FTVideo.Exts=mpgavianimpegmovflvfliflcvivrmramrvasfasxwmvm1vm2vvobsmilmp4m4v
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaapeoggm4a
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdftgzexe
FTUDef.Exts=jsaxdcssssivbsdtdxslswfclassent
FTText.B=ooxooo
FTImages.B=xoxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0,0,0,0,0,0,0,0
NotIgnoreLogout=False
RPathIn=/201 x
RProt=255
LastStart=135:152:135:123:220:175:228:64:
LastEnd=44:37:163:123:220:175:228:64:
PrjStart=211:163:51:212:97:147:228:64:
LastStarted=12/28/2015 9:21:42 PM
LastEnded=12/28/2015 9:21:43 PM
S200=4
S304=42
SPar=4
SSav=4
SLast=304
SSiz=361845
SMdf=4
SHTML=3
SSuccDowns=708
LFiles=46
LSize=105534
Flags=1
ImgDim=0,0,0,0
PrevURL=http://www.dnainfo.com/new-york/index/all
ConvertWWW=False

Oleg Chernavin

12/29/2015 07:26 pm

You have "Skip existing files on levels higher than 0" selected in this Project setup. I tested its download several times and every time it loaded just 2 starting URLs. Nothing more.

One idea is to duplicate this Project - select it, press Ctrl+C, then Ctrl+V and delete the first copy of it. Would this make a change in its download behavior?

Best regards,
Oleg Chernavin
MP Staff

The GermRod

01/14/2016 10:07 pm

Hi Oleg.

I solved the problem: I used URL Substitutes to rename downloaded filenames to *.htm locally.

My problem was that some existing files were being downloaded again despite the "Skip above 0" option being set. (I wrote "Skip above 1" by mistake in the original post, but it is set correctly in the project.)

Since the pages have no extension (.htm or similar), OE was creating a directory with the name of the page and putting a "default.htm" file in that directory. OE would then not recognize that the page had already been downloaded and would download it again. This only happened to one or two pages, for some reason (a bug?).
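To illustrate the behavior described above, here is a minimal Python sketch of how an extensionless page can end up stored as "default.htm" inside a directory. This is not OE's actual code. The function name local_path, the c:\download root, and the rule "a page is stored as a directory when another URL lives beneath it" are assumptions inferred from the paths reported in this thread.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse


def local_path(url, known_urls, root=r"c:\download"):
    """Hypothetical URL-to-file mapping (an assumption, not OE's real logic)."""
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    parts = [p for p in path.split("/") if p]

    # The page has "children" if some other known URL sits below its path,
    # for example .../article-name/slideshow/683873.
    has_children = False
    for other in known_urls:
        child = urlparse(other).path
        if child.startswith(path + "/") and len(child) > len(path) + 1:
            has_children = True
            break

    target = PureWindowsPath(root, parsed.netloc, *parts)
    # A name cannot be both a file and a directory on disk, so a page with
    # children (or a URL ending in "/") is saved as <dir>\default.htm.
    if parsed.path.endswith("/") or has_children:
        target = target / "default.htm"
    return str(target)


if __name__ == "__main__":
    article = ("http://www.dnainfo.com/new-york/20160114/park-slope/"
               "open-house-agenda-3-top-floor-apartments-see-this-weekend")
    urls = [article, article + "/slideshow/683873"]
    for u in urls:
        print(local_path(u, urls))
```

If the "already downloaded" check looked for the plain article path instead of the default.htm path, it would miss the saved copy. That would explain the repeated downloads, but it is only a guess.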

Thanks.

Oleg Chernavin

01/15/2016 07:24 pm

Can you give me more details on how to reproduce this? Could it be because of URL Substitutes?

Oleg.

The GermRod

01/17/2016 07:15 pm

I just copied and pasted the original project into OE and downloaded it once. (Note that it is set to run every 10 minutes). The original project has no URL substitutes. This is what was downloaded:

c:\download\www.dnainfo.com\new-york\20151027\greenwich-village\watch-harlem-globetrotters-stomp-team-up-on-greenwich-village-courts
c:\download\www.dnainfo.com\new-york\20151103\richmond-hill\gunman-threatens-chinese-food-employee-over-chicken-wing-combo-worker-says
c:\download\www.dnainfo.com\new-york\20151112\lower-east-side\man-wanted-for-rape-attempted-assaults-manhattan-police-say
c:\download\www.dnainfo.com\new-york\20160114\central-harlem\13-things-do-your-manhattan-neighborhood-this-weekend
c:\download\www.dnainfo.com\new-york\20160114\park-slope\open-house-agenda-3-top-floor-apartments-see-this-weekend\default.htm
c:\download\www.dnainfo.com\new-york\20160114\park-slope\open-house-agenda-3-top-floor-apartments-see-this-weekend\slideshow\683873
c:\download\www.dnainfo.com\new-york\20160114\tompkinsville\5-things-for-you-do-staten-islands-neighborhoods-this-weekend
c:\download\www.dnainfo.com\new-york\20160114\upper-west-side\what-its-like-be-black-civil-war-re-enactor\default.htm
c:\download\www.dnainfo.com\new-york\20160114\upper-west-side\what-its-like-be-black-civil-war-re-enactor\slideshow\683972
c:\download\www.dnainfo.com\new-york\20160114\west-harlem\8-ways-commemorate-martin-luther-king-jr-day-city
c:\download\www.dnainfo.com\new-york\20160115\bed-stuy\condo-prices-fall-18-percent-bed-stuy-bushwick-crown-heights-report
c:\download\www.dnainfo.com\new-york\20160115\bed-stuy\decrease-bed-stuy-shootings-for-2015-is-unprecedented-nypd-chief-says
c:\download\www.dnainfo.com\new-york\20160115\brooklyn-heights\hidden-cocktail-bar-inspired-by-marie-antoinette-opens-brooklyn\default.htm
c:\download\www.dnainfo.com\new-york\20160115\brooklyn-heights\hidden-cocktail-bar-inspired-by-marie-antoinette-opens-brooklyn\slideshow\684298
c:\download\www.dnainfo.com\new-york\20160115\brownsville\4-of-5-suspects-released-brownsville-playground-rape-case
c:\download\www.dnainfo.com\new-york\20160115\bushwick\6-new-cafs-bars-gyms-check-out-greenpoint-wburg-bushwick
c:\download\www.dnainfo.com\new-york\20160115\central-harlem\charles-rangel-not-impressed-with-candidates-running-replace-him
c:\download\www.dnainfo.com\new-york\20160115\central-harlem\meet-candidates-running-replace-charles-rangel-congress
c:\download\www.dnainfo.com\new-york\20160115\downtown-brooklyn\1066-foot-tall-skyscraper-could-rise-downtown-brooklyn
c:\download\www.dnainfo.com\new-york\20160115\midtown\14-subway-lines-slated-for-service-changes-this-weekend
c:\download\www.dnainfo.com\new-york\20160115\midtown\man-steals-entire-essie-nail-polish-collection-from-duane-reade-police-say
c:\download\www.dnainfo.com\new-york\20160115\park-slope\raccoon--rat-infested-trash-pile-defeated-by-persistent-park-slopers
c:\download\www.dnainfo.com\new-york\20160115\upper-east-side\cast-your-vote-on-what-central-park-statue-should-be-sculpted-out-of-ice
c:\download\www.dnainfo.com\new-york\default.htm

All the files ending in "default.htm" will be downloaded again every time the project runs even though I don't want them to be downloaded again (I only want newly published news articles). It looks like it has something to do with the subdirectory "slideshow" being created.

The fix is a URL Substitutes rule:
Apply to: Filename
URL: *2016*
Replace: *
With: *.htm
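For readers unfamiliar with URL Substitutes, the effect of this rule can be sketched in Python as follows. This is only an approximation: the function name apply_substitute is hypothetical, and OE's real wildcard matching may differ from Python's fnmatch. The sketch assumes that when the URL matches the mask, the whole local filename gets ".htm" appended.

```python
from fnmatch import fnmatchcase


def apply_substitute(url, filename, url_mask="*2016*", suffix=".htm"):
    """Approximate the rule: for URLs matching url_mask, replace * with *.htm."""
    if fnmatchcase(url, url_mask):
        # The pattern "*" matches the whole filename, so "*.htm" appends ".htm".
        return filename + suffix
    # URLs that do not match the mask keep their original filename.
    return filename


if __name__ == "__main__":
    url = ("http://www.dnainfo.com/new-york/20160114/park-slope/"
           "open-house-agenda-3-top-floor-apartments-see-this-weekend")
    print(apply_substitute(url, url.rsplit("/", 1)[-1]))
```

With the ".htm" suffix, the article is saved as a file whose name differs from the directory that holds its "slideshow" pages, so the two no longer compete for the same name. Note that the mask *2016* only covers URLs containing "2016".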

Hope this helps.

Oleg Chernavin

01/17/2016 07:27 pm

I cannot reproduce this. I used your settings and downloaded the project several times; only the two starting URLs were loaded, and nothing more on subsequent downloads. I used a slightly newer version. Can you please try with it?

http://www.metaproducts.com/download/betas/opsetup.exe

Oleg.
