Any plans to include TextPipe Engine in OE?
|Defenestration||10/07/2004 04:52 am|
|I was checking out the TextPipe website and saw that they license the TextPipe engine for inclusion in other products for a one-off fee of US$500. Do you have any plans to integrate this into OE Pro?|
|Oleg Chernavin||10/07/2004 10:17 am|
|Yes, we have plans to make a kind of Web data mining tool, but we haven't determined the way we will do it yet. Do you need data extraction or various data conversion things as well?
|Brad Konia||11/22/2004 02:53 am|
|> Yes, we have plans to make a kind of Web data mining tool, but we haven't determined the way we will do it yet. Do you need data extraction or various data conversion things as well?
A data mining tool would be extremely useful. Currently if you want to mine data, you have to first download all the pages and then run them through TextPipe. It's a two-step manual process and it wastes tons of hard drive space.
The user should be able to define filters which specify the pages he wants to extract data from. Then for each of the page filters, he should be able to enter one or more regular expressions to specify the actual data he wants to extract. The result of each regular expression should be assigned to a variable and then the user should be able to specify format strings for the actual output to one or more files. Within each format string, you could use whatever characters you want and of course the variables.
So as an example, let's say that you created three regular expressions and assigned them to variables $a, $b and $c. Now you want to output the result as a comma-delimited file. You would specify your format string as:
Or if you wanted three separate files, you would create three format strings and each string would include one of the three variables.
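To make the format-string idea concrete, here is a rough sketch in Python (used purely for illustration; it is not how OE or TextPipe works). It assumes a `$name` placeholder syntax, which Python's standard `string.Template` happens to support, and the captured values and file names are made up.

```python
from string import Template

# Hypothetical values captured by the three regular expressions.
values = {"a": "Widget", "b": "19.99", "c": "In stock"}

# One comma-delimited format string (assumed syntax, not taken from any real product).
csv_line = Template("$a,$b,$c").substitute(values)

# Three separate format strings, each tied to its own (hypothetical) output file.
formats = {"a.txt": "$a", "b.txt": "$b", "c.txt": "$c"}
per_file = {name: Template(fmt).substitute(values) for name, fmt in formats.items()}

print(csv_line)
print(per_file)
```

Any other characters in the format string (tabs, quotes, fixed text) would pass through unchanged, so the same mechanism covers other delimited layouts.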
Now when OE runs, it would follow the links like normal, but instead of saving the entire pages, it would only save the data as specified in the format strings and that data would be saved to each of the user-defined output files. Of course, you could still have an option to save the raw data as well.
So, to summarize, my ideal scenarios would be:
* Multiple page filters for selecting which pages to extract data from
* Within each page filter, multiple regular expressions to select the actual data to be extracted
* Allow the user to create multiple format strings, with each string assigned to an output file
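The three points above could fit together roughly like this. This is only a sketch under my own assumptions: the rule layout, the URL pattern, the regular expressions, the `extract_page` function and the output file name are all hypothetical, and Python is used just to show the flow.

```python
import re
from string import Template

# Hypothetical rule set: one page filter, its regular expressions, and its outputs.
RULES = [
    {
        # Page filter: only pages whose URL matches are mined.
        "page_filter": re.compile(r"/product/\d+"),
        # Each regular expression's first group is assigned to a variable.
        "extract": {
            "a": re.compile(r"<h1>(.*?)</h1>"),
            "b": re.compile(r'class="price">([^<]+)<'),
        },
        # Each format string is assigned to an output file.
        "outputs": {"products.csv": Template("$a,$b")},
    },
]

def extract_page(url, html, rules=RULES):
    """Return {output_file: [lines]} for one page; the page itself is not kept."""
    results = {}
    for rule in rules:
        if not rule["page_filter"].search(url):
            continue
        values = {}
        for name, rx in rule["extract"].items():
            match = rx.search(html)
            # A variable with no match becomes an empty string.
            values[name] = match.group(1).strip() if match else ""
        for filename, fmt in rule["outputs"].items():
            results.setdefault(filename, []).append(fmt.substitute(values))
    return results

page = '<h1>Widget</h1><span class="price">19.99</span>'
print(extract_page("http://example.com/product/42", page))
```

The crawler would call something like this for every page it follows and append the returned lines to the named files, so only the extracted data ends up on disk.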
|Oleg Chernavin||11/22/2004 12:24 pm|
|Thank you! I will think about implementing the above filters. However, this will take some time.