Can the OE Internal Browser parse JavaScript?
|George Robinson||06/18/2016 08:00 pm|
|Can the OE Internal Browser parse the JavaScript on the page
in order to download all of the cache.nxp.com/documents/data_sheet/*.PDF files linked from that page?
|George Robinson||06/18/2016 08:06 pm|
My post above was supposed to go into the Offline Explorer category.
|Oleg Chernavin||06/18/2016 08:33 pm|
|Yes, it is possible with 7.2 version. Please install the latest Offline Explorer Pro:
Then import the Project settings:
Use the Import - Project Settings - Load from Text File on the main toolbar. Then start downloading the Project.
|George Robinson||06/19/2016 05:21 am|
|> Oleg wrote:
> Yes, it is possible with 7.2 version. Please install the latest Offline Explorer Pro:
Do you mean that earlier versions could not do it but v7.2 can?
If "yes", could you elaborate on what makes v7.2 so special?
I am just curious - I am a programmer myself so I will understand any explanation.
> Then import the Project settings:
I had no problem importing the Project settings and the project appears to run, but it saves only 28 files from the directory cache.nxp.com/documents/data_sheet/ ...but there should be hundreds of them.
FYI: I am getting a lot of errors in the log:
I was trying to understand how Offline Explorer works using this help link:
Unfortunately I was not able to grasp the difference in Project settings between:
1) The mapping of a website (enumerating the links)
2) Saving the files to a local disk.
Specifically, I am confused about which OE Project settings refer to the mapping/traversing/spidering and which settings refer to the actual downloading of the files to the local disk.
For example I noticed, that all of the PDF files that interest me are in the directory:
...but the contents of this directory cannot be listed so OE needs to get the links to these PDF files from:
So, apparently OE needs to limit the mapping/traversing/spidering process to the above directory and below.
...but OE needs to save the PDF files to the local disk only from the directory:
Unfortunately cache.nxp.com IS NOT BELOW www.nxp.com !!!!
In the Project settings, I cannot see the separation between the directory used for mapping/traversing/spidering and the directory for downloading the PDF files that are to be saved locally.
I have the same conceptual problem in the branch "The Filters" of the Project settings tree.
I do not know whether these filters refer to the files that will be spidered or to the files that will be retrieved and saved locally.
For example: If I only enable the PDF file type in "The Filters" branch, will OE still traverse and parse the .html and .js files, that are the sources of links to these PDF files ?
I wish all of that was explained in the help section at:
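The distinction asked about above (which pages get parsed for links versus which files get written to disk) can be sketched as two independent predicates in a generic crawler. This is only an illustration of the concept, not Offline Explorer's actual implementation: the function names, the in-memory `FAKE_WEB` link graph, the `http://` scheme, and the example file names are all hypothetical.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": page URL -> links found on that page.
FAKE_WEB = {
    "http://www.nxp.com/products/discretes-and-logic/logic/": [
        "http://www.nxp.com/products/discretes-and-logic/logic/hct",
        "http://www.nxp.com/about/",
    ],
    "http://www.nxp.com/products/discretes-and-logic/logic/hct": [
        "http://cache.nxp.com/documents/data_sheet/EXAMPLE_PART.pdf",
        "http://cache.nxp.com/documents/application_note/EXAMPLE_NOTE.pdf",
    ],
}


def should_follow(url):
    """Crawl scope: only parse pages under www.nxp.com/products/."""
    parts = urlparse(url)
    return parts.netloc == "www.nxp.com" and parts.path.startswith("/products/")


def should_save(url):
    """Save scope: only keep PDFs from /documents/data_sheet/ on cache.nxp.com."""
    parts = urlparse(url)
    return (
        parts.netloc == "cache.nxp.com"
        and parts.path.startswith("/documents/data_sheet/")
        and parts.path.lower().endswith(".pdf")
    )


def crawl(start_url):
    """Breadth-first walk: follow in-scope pages, collect in-scope files."""
    saved, seen, queue = [], {start_url}, deque([start_url])
    while queue:
        page = queue.popleft()
        for link in FAKE_WEB.get(page, []):
            if link in seen:
                continue
            seen.add(link)
            if should_save(link):
                saved.append(link)   # would be written to the local disk
            elif should_follow(link):
                queue.append(link)   # parsed for more links, not kept
    return saved


print(crawl("http://www.nxp.com/products/discretes-and-logic/logic/"))
```

In this sketch the save scope sits on a different host than the crawl scope, which is the situation described above with cache.nxp.com and www.nxp.com.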
|Oleg Chernavin||06/19/2016 08:53 pm|
|That's strange - my download lasted 2 hours and I then counted 227 files in the cache.nxp.com\documents folder.
I also checked a number of pages from the Saved Pages tab - I used the Documentation link to click through the PDF links - they were all openable offline.
The Products/Parts column lists 175 items, so 227 PDF files looks OK.
Version 7.2 was necessary because it introduces the ability to autoscroll pages X times before saving them. As you can see, the Products/Parts list gets populated as you scroll the page down.
I fixed the errors in the log; however, they had no effect on the download at all. Anyway, here is the updated version:
I also noticed some links that lead to non-English languages. If they are unnecessary, you may get rid of them easily. Select the Project, click the Properties button, go to the URL Filters - Filename section and add:
lang_cd=
to the Excluded list.
As for how I filtered the links: I allowed all servers and directories to be downloaded, but the URL Filters - Filename - Included keywords list allows only the PDF links themselves and the files that lead to them.
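The keyword filtering described here can be sketched roughly as follows. This is a minimal sketch under assumptions, not how Offline Explorer evaluates its filters: the function name is hypothetical, the case-insensitive substring matching and the rule that exclusions win over inclusions are guesses, and the keyword lists are pieced together from keywords mentioned elsewhere in this thread rather than copied from the actual Project settings.

```python
def passes_filename_filter(url, included, excluded):
    """Keep a URL if it contains an included keyword and no excluded keyword.

    Assumption: plain case-insensitive substring matching, and exclusions
    take precedence over inclusions.
    """
    lowered = url.lower()
    if any(keyword.lower() in lowered for keyword in excluded):
        return False
    return any(keyword.lower() in lowered for keyword in included)


# Keywords taken from this thread; the real Project lists may differ.
INCLUDED = ["tab=Buy_Parametric_Tab", ".pdf"]
EXCLUDED = ["lang_cd="]

# Hypothetical URLs, for illustration only.
samples = [
    "http://www.nxp.com/products/example?tab=Buy_Parametric_Tab",
    "http://cache.nxp.com/documents/data_sheet/EXAMPLE.pdf",
    "http://cache.nxp.com/documents/data_sheet/EXAMPLE.pdf?lang_cd=zh",
    "http://www.nxp.com/about/",
]
for url in samples:
    print(passes_filename_filter(url, INCLUDED, EXCLUDED), url)
```

With no server or directory restriction, the keyword lists alone decide what is loaded, which is why pages on www.nxp.com and PDFs on cache.nxp.com can both pass.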
|George Robinson||06/20/2016 12:40 pm|
...seems to work as well for limiting the PDF downloads to the Datasheets only.
|George Robinson||06/20/2016 12:49 pm|
|I wrote a detailed reply with links to project properties but it did not appear here.
When I was posting it, I got a message that a moderator must approve it :/
|George Robinson||06/22/2016 06:38 pm|
|> Oleg wrote:
> That's strange - my download lasted 2 hours
Yes, Internet Explorer is terribly slow. It borders on unusable.
I thought that its API would be faster...but apparently not so.
> Oleg wrote:
> and I then counted 227 files in the cache.nxp.com\documents folder.
That is only because you downloaded more than just the datasheets for these products, so each product has several PDFs downloaded (e.g. a Datasheet and an ApplicationNote).
Try to limit downloading the PDFs to files only from the directory /documents/data_sheet in order to download only datasheets.
Also, note that you have the same files saved under slightly different filenames.
If you do, I found that setting the URL Filter to include filenames with FileExt=.pdf limits the PDF duplicates.
> The Products/Parts column lists 175 items, so 227 PDF files looks OK.
Note, that many products did not have their datasheets downloaded, such as:
... I think this is because they do not have the keyword "tab=Buy_Parametric_Tab" in their corresponding filenames appearing on the starting page:
However, I noticed that OE's Internal Download Code does better than IE: it can download datasheets for these missing products without the Buy_Parametric_Tab keyword, and it does so much faster.
See my project properties at:
However I found out that even with the Internal Download Code, the datasheets for the following products did not get downloaded:
Also, notice that the other project named "NXP_All_files_1st_Level" did not find any files referring to the two products above.
Do you know why the internal code missed these products but not the others?
It would be nice if the internal code could treat the return value from the ng-click event as just another .js or .html file to be parsed for links.
<a class="ng-scope ng-binding" ng-click="showProdInfoPop($event, rowData.prodCodeObj) "
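A minimal sketch of why such anchors are hard for a link-based parser: if the anchor carries only an `ng-click` handler and no `href` (an assumption, since the snippet above is truncated), there is no URL to follow without actually running the script. The class name `AnchorAudit` and the completed sample tag are hypothetical.

```python
from html.parser import HTMLParser


class AnchorAudit(HTMLParser):
    """Collect href targets and ng-click handlers from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.click_handlers = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        if attr_map.get("href"):
            self.hrefs.append(attr_map["href"])
        if attr_map.get("ng-click"):
            self.click_handlers.append(attr_map["ng-click"].strip())


# The tag is closed here only for illustration; the original is truncated.
snippet = (
    '<a class="ng-scope ng-binding" '
    'ng-click="showProdInfoPop($event, rowData.prodCodeObj) ">part</a>'
)

audit = AnchorAudit()
audit.feed(snippet)
print("hrefs:", audit.hrefs)
print("ng-click handlers:", audit.click_handlers)
```

A static parser sees only the handler expression as text; the URL it eventually requests exists only after the script runs in a browser.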
> I also noticed some links that lead to non-English languages. If they are unnecessary, you may get rid of them easily. Select the Project, click the Properties button, go to the URL Filters - Filename section and add: lang_cd=
to the Excluded list.
Thanks for noticing that. Indeed, datasheets in Chinese should not be mingled with the other ones ;)
> As for how I filtered the links - I allowed all servers and directories to be downloaded. But the URL Filters - Filename - Included keywords list allows only the files that lead to PDF links and PDF links as well.
It is still unclear how OE's Download Process works. For example, in my project named "NXP_Internal_Most_Datasheets", I still don't understand what causes the files in the
www.nxp.com/products/discretes-and-logic/logic/hct remote directory to be parsed but not downloaded to my local directory?
|Oleg Chernavin||06/22/2016 06:44 pm|
|Yes, I just approved it. I tried to download the link with Project settings, but the server asks for a password.
You can easily post Project settings here - select it, press Ctrl+C on keyboard and paste to the forum message.
Could it be that the setting to prevent duplicates skipped loading the PDFs for 74HC4049 and 74HC4050?
Yes, the internal code is faster, because via the browser Offline Explorer has to wait for a page to open, then a few more seconds to let all scripts finish their initialization and work, then scroll down to let the page and scripts load more data, and so on.
Regarding the directory for PDFs. You may change the .pdf Included filename keyword to :
|George Robinson||06/22/2016 06:59 pm|
>Yes, I just approved it. I tried to download the link with Project settings, but the server asks for a password.
O o! It does not happen to me. This server does not require any passwords.
|George Robinson||06/22/2016 07:44 pm|
|Oleg Chernavin||06/23/2016 07:36 pm|
|Yes, your Project downloads lots of files. Regarding these two parts that didn't download - use the Find Contents dialog to find web pages that contain these parts. And see if they contain the links to download PDFs or not.
|George Robinson||06/29/2016 10:15 am|
|Yes, however with such a wide scope, OE finds all of the relevant pdf files.
I have another challenge for you:
How to download all of the pdf datasheets from the following URL?:
|Oleg Chernavin||06/30/2016 06:28 pm|
|What about a Project with Level=1 and File Filters - Archive - Download from any site, run in the "Open pages in the browser and save" mode?
|George Robinson||07/01/2016 08:42 pm|
|Nope, such a project downloads only several pdf files from
The problem is that OE never goes into the 2nd or 3rd of the product documentation pages.
For example the documentation URL for just this 1 example product at:
...has 3 pages of documentation. The product's pdf datasheet is usually on the 2nd documentation page but the OE never gets there :(
|Oleg Chernavin||07/05/2016 07:18 pm|
|Yes, this is tough! Maybe only manually - open this page in the Internal Browser, select show 50 items per page and then use the Save Page button - Save page with its links. It is on the Internal Browser toolbar.