Downloading all pages from a query

Author Message
Rebecca Bryant 02/02/2007 05:45 pm
On the USPTO site I would like to download only the top-level pages from a search, i.e., just the page for each patent, not all the referenced patents. A search displays 50 hits at a time, but I want to get all the pages in one download, and only the level immediately below the list page. How do I do this?

Thank you
Oleg Chernavin 02/02/2007 06:13 pm
I haven't visited the site, but I think you can use the URL Filters - Filename section to allow only the filenames you need.

Best regards,
Oleg Chernavin
MP Staff
Rebecca Bryant 02/02/2007 06:56 pm
I tried that, but it didn't work or I wasn't doing it properly. The file names are not terribly differentiated. For example, the URL below is the first page -- listing 1 - 50 hits -- of a search that I need to do. There are 6187 hits produced by this search and I need to get all of them, but just one level down:

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=0&p=1&f=S&l=50&Query=an%2Farmy&d=PTXT

For example I want:
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&p=1&f=G&l=50&d=PTXT&S1=army.ASNM.&OS=an/army&RS=AN/army
AND
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=2&p=1&f=G&l=50&d=PTXT&S1=army.ASNM.&OS=an/army&RS=AN/army

all the way to 6,187.

BUT I don't want any of the referenced patents such as

http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN%2FD278038

As you can see the URLs are almost indistinguishable. I think the pages that I want will always have "htm&r=1&p" or "htm&r=2&p" (the "2&p" for example, identifies that it is the second hit in the search) but (1) is there a way to enter part of a filename; and (2) will OE be able to use this to download only what I want?

Thank you very much for your help.

Oleg Chernavin 02/03/2007 08:33 am
Yes, you can use URL Filters - Filename, select Custom Configuration and add the following keywords:

htm&r=[0-9]&p
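
To see what a keyword like this would and would not catch, here is a minimal Python sketch. It assumes the keyword behaves like a regular expression, which the thread does not confirm; Offline Explorer's own matching rules may differ. The `r=12` URL is a hypothetical one built on the same pattern as the URLs quoted above.

```python
import re

# Assumption: the keyword is interpreted like a regular expression.
# This uses Python's regex engine, not Offline Explorer's own matcher.
KEYWORD = re.compile(r"htm&r=[0-9]&p")
# Hypothetical variant that also accepts hit numbers with several digits.
KEYWORD_MULTI = re.compile(r"htm&r=[0-9]+&p")

BASE = ("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF"
        "&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm")

# Second hit of the search (a wanted page, quoted earlier in the thread).
hit_2 = BASE + "&r=2&p=1&f=G&l=50&d=PTXT&S1=army.ASNM.&OS=an/army&RS=AN/army"
# Hypothetical twelfth hit, built on the same pattern.
hit_12 = BASE + "&r=12&p=1&f=G&l=50&d=PTXT&S1=army.ASNM.&OS=an/army&RS=AN/army"
# Referenced patent (an unwanted page, quoted earlier in the thread).
referenced = ("http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF"
              "&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&d=PALL"
              "&RefSrch=yes&Query=PN%2FD278038")


def matches(url: str, pattern: re.Pattern = KEYWORD) -> bool:
    """Return True if the pattern occurs anywhere in the URL."""
    return pattern.search(url) is not None


for name, url in [("hit_2", hit_2), ("hit_12", hit_12), ("referenced", referenced)]:
    print(name, matches(url), matches(url, KEYWORD_MULTI))
```

Under regex semantics, `[0-9]` stands for exactly one digit, so a two-digit hit number would only match a pattern such as `[0-9]+`. The referenced-patent URL is rejected because it contains "html&r=" rather than "htm&r=".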

Oleg.
Rebecca Bryant 02/03/2007 01:12 pm
OK, I'm making some progress; I finally got OE to download the pages that I want and leave out the pages that I don't want EXCEPT I only got the first 50 patents in the first subset. It didn't continue onto the next page (in this particular query 6187 hits are returned). I finally figured out the distinguishing URL characteristics of the files that I want and the files that I don't need so I am using the following filters:

Included files keywords:
asnm.&os
Excluded files keywords:
=next&os
=prev&os
&query
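
As a rough illustration of how these include/exclude keywords separate the URLs in this thread, here is a small Python sketch. It assumes plain case-insensitive substring matching with excluded keywords taking precedence; that is a guess at the behaviour, not a description of how Offline Explorer implements its filters. The function name `allowed` is made up for the example.

```python
# Keywords copied from the post above.
INCLUDE = ["asnm.&os"]
EXCLUDE = ["=next&os", "=prev&os", "&query"]


def allowed(url: str) -> bool:
    """Assumed rule: reject on any excluded keyword, then require an included one."""
    lowered = url.lower()
    if any(keyword in lowered for keyword in EXCLUDE):
        return False
    return any(keyword in lowered for keyword in INCLUDE)


BASE = ("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF"
        "&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm")

# Full page for the first hit (wanted).
full_page = BASE + "&r=1&p=1&f=G&l=50&d=PTXT&S1=army.ASNM.&OS=an/army&RS=AN/army"
# First hit-list page of the search (has "&Query").
hit_list = BASE + "&r=0&p=1&f=S&l=50&Query=an%2Farmy&d=PTXT"
# Referenced patent (unwanted, has "&Query").
referenced = ("http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF"
              "&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&d=PALL"
              "&RefSrch=yes&Query=PN%2FD278038")

for name, url in [("full_page", full_page), ("hit_list", hit_list),
                  ("referenced", referenced)]:
    print(name, allowed(url))
```

Note that under this assumed rule the "&query" exclusion also rejects the hit-list page itself, which may be relevant to why the download did not move on to the next list of 50.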

What do I need to do to get OE to continue downloading each subset of 50 until it reaches 6187?

Thank you again very much for all your help!


Oleg Chernavin 02/05/2007 07:57 am
What about adding to the URLs field the following link:

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=0&f=S&l=50&d=PTXT&OS=an%2Farmy&RS=AN%2Farmy&Query=an%2Farmy&TD=6187&Srch1=army.ASNM.&NextList{:2..125}=Next+50+Hits
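
The `{:2..125}` part of this link appears to be a numeric range macro that stands for one URL per number. A minimal Python sketch of that idea follows, assuming the range is inclusive at both ends and that several macros in one URL would be combined in every possible way; both points are assumptions about the macro, not documented behaviour. The `expand` function and the shortened template are made up for the example.

```python
import re
from itertools import product

# Matches a macro of the form {:START..END}.
MACRO = re.compile(r"\{:(\d+)\.\.(\d+)\}")


def expand(template: str) -> list[str]:
    """Expand every {:a..b} macro into concrete URLs (assumed inclusive range)."""
    ranges = [range(int(a), int(b) + 1) for a, b in MACRO.findall(template)]
    if not ranges:
        return [template]
    urls = []
    for combo in product(*ranges):
        values = iter(combo)
        # Replace the macros left to right with this combination's values.
        urls.append(MACRO.sub(lambda m: str(next(values)), template))
    return urls


# Shortened stand-in for the long link above; only the macro part matters here.
template = "http://example.invalid/nph-Parser?TD=6187&NextList{:2..125}=Next+50+Hits"
urls = expand(template)
print(len(urls))
print(urls[0])
print(urls[-1])
```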

Oleg.
Rebecca Bryant 02/05/2007 10:56 am
This gave me a set of files with names I haven't seen before. It looks like I got all the hit lists, that is each listing of 50; these downloaded first but after that again I only got the full page for the first subset of 50. I see what you're doing with the URL... where did you get the base URL to start with? (When I download the [NEXT_LIST] link it does not look like the URL you gave me.) Also, I tried using the following URL to get the full pages for all 6187 hits but it is not correct as OE choked on it. Had to end task.
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r={:51..6187}&p={:2..125}&f=S&l=50&d=PTXT&S1=army.ASNM.&Page=Next&OS=an_2Farmy&RS=AN_2Farmy

As you can see I tried to do something similar to what you did with the link you provided me. I took the filename of one of the full-page files and modified it to try to make a URL out of it and then added the page/list ranges but it is not a proper URL.
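
One possible reason the program choked on this URL is its sheer size. If the two range macros are expanded independently and then combined in every possible way (an assumption, not something confirmed in this thread), the single line stands for a very large number of requests. A quick Python check of the arithmetic:

```python
# Ranges taken from the URL above; both ends are assumed to be inclusive.
r_values = range(51, 6187 + 1)   # r={:51..6187}
p_values = range(2, 125 + 1)     # p={:2..125}

# Assumed behaviour: every r value is paired with every p value.
combined = len(r_values) * len(p_values)
# A URL with only the p macro would expand to this many requests.
p_only = len(p_values)

print(len(r_values), len(p_values), combined, p_only)
```

Under that assumption the URL expands to 6137 x 124 = 760,988 requests, compared with 124 for a URL that uses only the page range.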

Am I on the right track here? Do you know what the URL should be?

Once again, thank you very much for your help. It is very much appreciated.

Oleg Chernavin 02/05/2007 12:24 pm
Yes, it is tricky. What about enabling HTML Forms processing in Properties - Advanced? Would that pick up such links?

Oleg.
Rebecca Bryant 02/05/2007 02:15 pm
Do you mean the "Explore HTML Forms" checkbox? I tried selecting that but got the same result. It downloaded 225 files. It looks like 124 of those are the listings of patents for each subset of 50, and the rest are the full pages for each patent in the first subset of 50. (For each full page there are two files which seem to be exactly the same, except the second copy of each has "&RS=AN_2Farmy" at the end.)

Best regards

Oleg Chernavin 02/05/2007 03:19 pm
I tested and the following link is enough to load all pages:

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=51&p={:2..125}&f=S&l=50&d=PTXT&S1=army.ASNM.&Page=Next&OS=an_2Farmy&RS=AN_2Farmy

You will have to access these pages using the Project Map, but at least this is a way to load them all.
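
For reference, the page range can be checked against the numbers in this thread (6187 hits, 50 per list page) with a little arithmetic. The sketch below assumes that page p starts at hit (p - 1) * 50 + 1, which fits r=51 for p=2 but is otherwise a guess; the link above keeps r=51 fixed for every page, so the server may not use r this way at all.

```python
import math

TOTAL_HITS = 6187   # from the search in this thread
PER_PAGE = 50       # hits shown per list page

# Number of list pages needed to show every hit.
pages = math.ceil(TOTAL_HITS / PER_PAGE)
# Hits left over for the final page.
last_page_hits = TOTAL_HITS - (pages - 1) * PER_PAGE


def first_hit(page: int) -> int:
    """Assumed mapping from a list page number to its first hit number."""
    return (page - 1) * PER_PAGE + 1


print(pages, last_page_hits, first_hit(2), first_hit(pages))
```

This gives 124 list pages, the last one holding 37 hits. The range 2..125 also covers 124 values, so together with the first page it may request one page more than the search strictly needs.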

Oleg.
Rebecca Bryant 02/06/2007 08:53 am
Hi Oleg,

I haven't used this feature yet. Will give it a try. Thank you!


Rebecca
