Project not obeying "Skip existing files on levels higher than" option

The GermRod 12/29/2015 07:24 pm
Hi Oleg:

I use OE a lot for news. I put news sites in the URL field and use a level of 1 with a timer of about 10 minutes to download newly published articles. The project below does not obey the "Skip existing files on levels higher than 1" option and re-downloads articles that have already been downloaded. When I search for "Recently loaded files only" in the "Find Contents" dialog, old news articles are mixed in with the new results. Please help.

[Object]
OEVersion=Pro 7.0.4408
Type=0
IID=8212
Caption=DNAInfo
URL=http://www.dnainfo.com/new-york/index/all http://www.dnainfo.com/new-york/
Additional=DisableScripts;DisableJava;SkipIFrames;donotparseexistingfiles
Channels=1
MVer=5
Lev=1
When=5
Minute=10
Weekday=257
FMGroup=3
FTText.Exts=htmlhtmaspaspxjspstmstmlidcshtmlhtxtxttextxspxmlrxmlcfmwmlphpphp3
FTImages.Exts=gifjpgjpegtiftiffxbmfifbmppngipxjp2j2cj2kwbmplwfwebp
FTVideo.Exts=mpgavianimpegmovflvfliflcvivrmramrvasfasxwmvm1vm2vvobsmilmp4m4v
FTAudio.Exts=wavriffmp3midmp2m3uravocwmaapeoggm4a
FTArchive.Exts=ziparcgzzarjlhalayleirarcabtarpakacejarpdftgzexe
FTUDef.Exts=jsaxdcssssivbsdtdxslswfclassent
FTText.B=ooxooo
FTImages.B=xoxooo
FTVideo.B=xoxooo
FTAudio.B=xoxooo
FTArchive.B=xoxooo
FTUDef.B=xoxooo
FTOther.B=ooxooo
FTSizes=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,0,3,0,0,0,0,0,0,0,0
NotIgnoreLogout=False
RPathIn=/201 x
RProt=255
LastStart=135:152:135:123:220:175:228:64:
LastEnd=44:37:163:123:220:175:228:64:
PrjStart=211:163:51:212:97:147:228:64:
LastStarted=12/28/2015 9:21:42 PM
LastEnded=12/28/2015 9:21:43 PM
S200=4
S304=42
SPar=4
SSav=4
SLast=304
SSiz=361845
SMdf=4
SHTML=3
SSuccDowns=708
LFiles=46
LSize=105534
Flags=1
ImgDim=0,0,0,0
PrevURL=http://www.dnainfo.com/new-york/index/all
ConvertWWW=False
Oleg Chernavin 12/29/2015 07:26 pm
You have "Skip existing files on levels higher than 0" selected in this Project setup. I tested its download several times and every time it loaded just 2 starting URLs. Nothing more.

One idea is to duplicate this Project - select it, press Ctrl+C, then Ctrl+V and delete the first copy of it. Would this make a change in its download behavior?

Best regards,
Oleg Chernavin
MP Staff
The GermRod 01/14/2016 10:07 pm
Hi Oleg.

I solved the problem: I used URL Substitutes to rename downloaded filenames to *.htm locally.

My problem was that some existing files were being downloaded again despite the "Skip above 0" option being set. (I wrote "Skip Above 1" by mistake in the original post, but it is correct in the project.)

Since the pages have no extension (.htm or similar), OE was creating a directory with the name of the page and putting a "default.htm" file in that directory. OE would then not recognize that the page had already been downloaded and would download it again. This only happened to one or two pages, for some reason. (Bug?)
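
To picture what I mean, here is a rough Python sketch. It is only my guess at the naming behavior, not OE's actual logic: `local_paths` is a made-up helper that assumes an extensionless page is stored as a plain file unless another downloaded URL sits beneath it, in which case the page becomes a directory and its content goes into "default.htm".

```python
from posixpath import splitext

def local_paths(url_paths):
    """Map URL paths to local paths (hypothetical model, not OE's real code)."""
    paths = set(url_paths)
    result = {}
    for p in paths:
        prefix = p.rstrip("/") + "/"
        # Assumption: a page needs a directory if another URL is nested under it.
        has_children = any(q.startswith(prefix) for q in paths if q != p)
        if has_children and not splitext(p)[1]:
            result[p] = prefix + "default.htm"
        else:
            result[p] = p
    return result

# URL paths of articles from this site; the first one has a slideshow beneath it.
pages = [
    "/new-york/20160114/park-slope/open-house-agenda-3-top-floor-apartments-see-this-weekend",
    "/new-york/20160114/park-slope/open-house-agenda-3-top-floor-apartments-see-this-weekend/slideshow/683873",
    "/new-york/20160115/midtown/14-subway-lines-slated-for-service-changes-this-weekend",
]
for url, path in sorted(local_paths(pages).items()):
    print(url, "->", path)
```

Under this model, only the page that has something stored beneath it ends up as "default.htm"; the others keep their extensionless names.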

Thanks.
Oleg Chernavin 01/15/2016 07:24 pm
Can you give me more details on how to reproduce this? Could it be because of URL Substitutes?

Oleg.
The GermRod 01/17/2016 07:15 pm
I just copied and pasted the original project into OE and downloaded it once. (Note that it is set to run every 10 minutes). The original project has no URL substitutes. This is what was downloaded:

c:\download\www.dnainfo.com\new-york\20151027\greenwich-village\watch-harlem-globetrotters-stomp-team-up-on-greenwich-village-courts
c:\download\www.dnainfo.com\new-york\20151103\richmond-hill\gunman-threatens-chinese-food-employee-over-chicken-wing-combo-worker-says
c:\download\www.dnainfo.com\new-york\20151112\lower-east-side\man-wanted-for-rape-attempted-assaults-manhattan-police-say
c:\download\www.dnainfo.com\new-york\20160114\central-harlem\13-things-do-your-manhattan-neighborhood-this-weekend
c:\download\www.dnainfo.com\new-york\20160114\park-slope\open-house-agenda-3-top-floor-apartments-see-this-weekend\default.htm
c:\download\www.dnainfo.com\new-york\20160114\park-slope\open-house-agenda-3-top-floor-apartments-see-this-weekend\slideshow\683873
c:\download\www.dnainfo.com\new-york\20160114\tompkinsville\5-things-for-you-do-staten-islands-neighborhoods-this-weekend
c:\download\www.dnainfo.com\new-york\20160114\upper-west-side\what-its-like-be-black-civil-war-re-enactor\default.htm
c:\download\www.dnainfo.com\new-york\20160114\upper-west-side\what-its-like-be-black-civil-war-re-enactor\slideshow\683972
c:\download\www.dnainfo.com\new-york\20160114\west-harlem\8-ways-commemorate-martin-luther-king-jr-day-city
c:\download\www.dnainfo.com\new-york\20160115\bed-stuy\condo-prices-fall-18-percent-bed-stuy-bushwick-crown-heights-report
c:\download\www.dnainfo.com\new-york\20160115\bed-stuy\decrease-bed-stuy-shootings-for-2015-is-unprecedented-nypd-chief-says
c:\download\www.dnainfo.com\new-york\20160115\brooklyn-heights\hidden-cocktail-bar-inspired-by-marie-antoinette-opens-brooklyn\default.htm
c:\download\www.dnainfo.com\new-york\20160115\brooklyn-heights\hidden-cocktail-bar-inspired-by-marie-antoinette-opens-brooklyn\slideshow\684298
c:\download\www.dnainfo.com\new-york\20160115\brownsville\4-of-5-suspects-released-brownsville-playground-rape-case
c:\download\www.dnainfo.com\new-york\20160115\bushwick\6-new-cafs-bars-gyms-check-out-greenpoint-wburg-bushwick
c:\download\www.dnainfo.com\new-york\20160115\central-harlem\charles-rangel-not-impressed-with-candidates-running-replace-him
c:\download\www.dnainfo.com\new-york\20160115\central-harlem\meet-candidates-running-replace-charles-rangel-congress
c:\download\www.dnainfo.com\new-york\20160115\downtown-brooklyn\1066-foot-tall-skyscraper-could-rise-downtown-brooklyn
c:\download\www.dnainfo.com\new-york\20160115\midtown\14-subway-lines-slated-for-service-changes-this-weekend
c:\download\www.dnainfo.com\new-york\20160115\midtown\man-steals-entire-essie-nail-polish-collection-from-duane-reade-police-say
c:\download\www.dnainfo.com\new-york\20160115\park-slope\raccoon--rat-infested-trash-pile-defeated-by-persistent-park-slopers
c:\download\www.dnainfo.com\new-york\20160115\upper-east-side\cast-your-vote-on-what-central-park-statue-should-be-sculpted-out-of-ice
c:\download\www.dnainfo.com\new-york\default.htm

All the files ending in "default.htm" will be downloaded again every time the project runs even though I don't want them to be downloaded again (I only want newly published news articles). It looks like it has something to do with the subdirectory "slideshow" being created.
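
A quick way to check that guess against a file listing is sketched below in Python. `pages_with_default` is a hypothetical helper of my own, not part of OE; it just lists, for each page saved as "default.htm", whatever else was saved inside the same directory.

```python
from pathlib import PureWindowsPath

def pages_with_default(listing):
    """Return {page directory: [entries stored beneath it]} for default.htm pages."""
    paths = [PureWindowsPath(p) for p in listing]
    found = {}
    for p in paths:
        if p.name.lower() == "default.htm":
            page_dir = p.parent
            # Collect every other listed entry that lives under the page directory.
            children = [str(q.relative_to(page_dir)) for q in paths
                        if q != p and page_dir in q.parents]
            found[str(page_dir)] = children
    return found

# Three entries copied from the listing above.
listing = [
    r"c:\download\www.dnainfo.com\new-york\20160114\upper-west-side\what-its-like-be-black-civil-war-re-enactor\default.htm",
    r"c:\download\www.dnainfo.com\new-york\20160114\upper-west-side\what-its-like-be-black-civil-war-re-enactor\slideshow\683972",
    r"c:\download\www.dnainfo.com\new-york\20160114\west-harlem\8-ways-commemorate-martin-luther-king-jr-day-city",
]
for page, children in pages_with_default(listing).items():
    print(page, children)
```

If every "default.htm" page turns out to have a "slideshow" entry beneath it and no other page does, that would support the idea that the subdirectory is what triggers the repeated download.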

The fix is a URL Substitutes rule:
Apply to: Filename
URL: *2016*
Replace: *
With: *.htm
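
In rough Python terms, the effect I expect from this rule looks like the sketch below. `apply_substitute` is a hypothetical stand-in, not OE's implementation, and it assumes the URL mask behaves like a simple wildcard match and that "Replace * with *.htm" appends ".htm" to the whole filename.

```python
from fnmatch import fnmatchcase

def apply_substitute(url, filename, url_mask="*2016*", suffix=".htm"):
    """Append suffix to filename when url matches url_mask (assumed semantics)."""
    if fnmatchcase(url, url_mask):
        return filename + suffix
    return filename

new_url = ("http://www.dnainfo.com/new-york/20160114/upper-west-side/"
           "what-its-like-be-black-civil-war-re-enactor")
old_url = ("http://www.dnainfo.com/new-york/20151112/lower-east-side/"
           "man-wanted-for-rape-attempted-assaults-manhattan-police-say")

# The local filename is taken here to be the last URL segment.
print(apply_substitute(new_url, new_url.rsplit("/", 1)[1]))
print(apply_substitute(old_url, old_url.rsplit("/", 1)[1]))
```

Note that the *2016* mask only matches URLs containing "2016", so articles dated 2015 would keep their extensionless names under this reading.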

Hope this helps.
Oleg Chernavin 01/17/2016 07:27 pm
I cannot reproduce this. I used your settings and downloaded the project several times; only the two starting URLs were loaded, and nothing more on subsequent downloads. I used a slightly newer version. Can you please try it?

http://www.metaproducts.com/download/betas/opsetup.exe

Oleg.