A data mining tool would be extremely useful. Currently if you want to mine data, you have to first download all the pages and then run them through TextPipe. It's a two-step manual process and it wastes tons of hard drive space.
The user should be able to define filters which specify the pages he wants to extract data from. Then, for each page filter, he should be able to enter one or more regular expressions to specify the actual data he wants to extract. The result of each regular expression should be assigned to a variable, and the user should then be able to specify format strings that control the actual output to one or more files. Each format string could contain whatever literal characters you want plus, of course, the variables.
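To make the idea concrete, here is a rough sketch in Python of what a page filter with its regular expressions might look like. Everything in it is hypothetical: the `PAGE_FILTERS` layout, the wildcard URL pattern, the variable names `a` and `b`, and the `extract` helper are my own illustration, not anything OE or TextPipe provides today.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical configuration: each page filter has a wildcard URL pattern
# and one regular expression per variable (first capture group is the value).
PAGE_FILTERS = [
    {
        "url_pattern": "*/products/*.html",
        "variables": {
            "a": re.compile(r"<h1>(.*?)</h1>", re.S),
            "b": re.compile(r'class="price">\s*([\d.]+)'),
        },
    },
]

def extract(url, html, page_filters=PAGE_FILTERS):
    """Return {variable: value} for the first filter matching url, else None."""
    for rule in page_filters:
        if fnmatchcase(url, rule["url_pattern"]):
            values = {}
            for name, pattern in rule["variables"].items():
                match = pattern.search(html)
                # An expression that finds nothing yields an empty string.
                values[name] = match.group(1).strip() if match else ""
            return values
    return None  # no page filter applies, so nothing is extracted
```

Calling `extract` on a page whose URL matches no filter returns `None`, which is how a crawler could decide to skip that page entirely.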
So as an example, let's say that you created three regular expressions and assigned them to variables $a, $b and $c. Now you want to output the result as a comma-delimited file. You would specify your format string as:
Or if you wanted three separate files, you would create three format strings and each string would include one of the three variables.
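As a small illustration of how such format strings could be expanded, here is a sketch using Python's standard `string.Template`, which happens to use the same `$name` placeholder style. The `render` helper, the sample values, and the particular format strings are assumptions on my part, just one way the feature might behave.

```python
from string import Template

def render(format_string, values):
    """Fill $name placeholders from values; unknown placeholders are left as-is."""
    return Template(format_string).safe_substitute(values)

# Hypothetical values captured by three regular expressions.
values = {"a": "Widget", "b": "9.99", "c": "in stock"}

# One comma-delimited line containing all three variables.
csv_line = render("$a,$b,$c", values)

# Three separate outputs, each format string using a single variable.
separate = [render(fmt, values) for fmt in ("$a", "$b", "$c")]
```

Any other characters in the format string (tabs, quotes, fixed labels) pass through unchanged, so the same mechanism would cover tab-separated or labelled output.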
Now when OE runs, it would follow the links as normal, but instead of saving entire pages, it would save only the data specified in the format strings, writing it to each of the user-defined output files. Of course, you could still have an option to save the raw data as well.
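Here is a minimal sketch of that run loop in Python, assuming the crawler hands over each page as a `(url, html)` pair. The `mine` function, its parameters, and the file names in the usage note are all hypothetical; this only shows the behaviour I have in mind, not how OE works internally.

```python
import re
from pathlib import Path
from string import Template

def mine(pages, url_regex, variables, outputs, out_dir="."):
    """Append extracted data to the output files instead of saving pages.

    pages:     iterable of (url, html) pairs, standing in for the crawler
    url_regex: page filter, matched against each URL
    variables: {name: regex with one capture group}
    outputs:   {filename: format string using $name placeholders}
    Returns the number of pages that matched the filter.
    """
    url_re = re.compile(url_regex)
    compiled = {name: re.compile(rx, re.S) for name, rx in variables.items()}
    out_dir = Path(out_dir)
    matched = 0
    for url, html in pages:
        if not url_re.search(url):
            continue  # page is outside the filter; nothing is written
        values = {}
        for name, rx in compiled.items():
            m = rx.search(html)
            values[name] = m.group(1) if m else ""
        # Each format string goes to its own user-defined output file.
        for filename, fmt in outputs.items():
            with open(out_dir / filename, "a", encoding="utf-8") as fh:
                fh.write(Template(fmt).safe_substitute(values) + "\n")
        matched += 1
    return matched
```

For example, passing `outputs={"all.csv": "$a,$b", "names.txt": "$a"}` would build a comma-delimited file and a names-only file in a single pass, with no full pages stored on disk.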
So, to summarize, my ideal scenario would include:
* Multiple page filters for selecting which pages to extract data from
* Within each page filter, multiple regular expressions to select the actual data to be extracted
* Allow the user to create multiple format strings, with each string assigned to an output file