Unstructured scraper plugin for Grouper Evolution

Grouper - Documentation

Unstructured Scraper Plugin

The Unstructured plugin finds all of the links on a webpage and uses them and the text that follows them to construct an RSS feed. The plugin can be configured to skip links that you don't want included in the feed. The result tends to be a fairly "quick and dirty" feed unless the webpage is well suited to the way the plugin works and/or appropriate configuration is done. Still, for some purposes, the simplicity of this plugin is useful.

Installation:
To use the Unstructured plugin, unstructured.php must be located in the "plugins" folder inside the folder containing grouper.php. This is the default location when Grouper Evolution is installed.

Use:
The following code will generate an RSS feed from a webpage:

<?php
require_once '/YOUR/PATH/TO/grouper/grouper.php';
GrouperLoadPlugin('unstructured.php');
GrouperSourceURL('http://example.com/foo/');
// additional configuration usually needed here, as described below
GrouperShow('','CACHE-FILE-NAME');
?>

Configuration:
The Unstructured plugin provides the function UnstructuredGrouperAddSkip, which specifies parts of the document to omit from the feed. The function takes two arguments, $type and $data. $type indicates how the value of $data is to be used, and can have the following values:

in-link: Any link whose URL contains the specified text is omitted from the feed.

Example:
UnstructuredGrouperAddSkip('in-link','http://example.com/');

in-title: Any link whose title contains the specified text is omitted from the feed.

Example:
UnstructuredGrouperAddSkip('in-title','comments');

text: The value of $data is an array with two members. Everything between the values in the first and second members of the array is removed from the document before links and other data are extracted. If the first member is blank, everything from the beginning of the document up to and including the first occurrence of the second member is removed. If the second member is blank, everything from the last occurrence of the first member to the end of the document is removed. If neither member is blank, the text between each occurrence of the first member and the most closely following occurrence of the second member is removed.

Example:
UnstructuredGrouperAddSkip('text',array('','<body')); // remove everything before the "body" tag
UnstructuredGrouperAddSkip('text',array('',''));

regex: This value is intended for use only by persons familiar with Perl-style regular expressions. The value of $data is an array with two members. The first is a regular expression; all data in the document matching the pattern is removed before links and other data are extracted. The second member contains any pattern modifiers you want applied to the regular expression. The "m" and "s" modifiers are always applied and should not be specified here. To make the pattern matching case insensitive, include an "i" in this member.

Example:
UnstructuredGrouperAddSkip('regex',array('<a.*?class="readmore".*?<\\/a>','i'));
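
To see what a 'regex' skip is likely to remove before adding it to a feed script, the pattern can be tried out with PHP's own preg_replace. The following is only a sketch, not the plugin's actual code: the function name sketchApplyRegexSkip and the sample HTML are made up for illustration, and it assumes the plugin wraps the pattern in "/" delimiters (which the escaped "\\/" in the example above suggests) and adds the "m" and "s" modifiers ahead of your own.

```php
<?php
// Illustrative sketch only; this is NOT the Unstructured plugin's code.
// Assumptions: "/" is the pattern delimiter, and the "m" and "s" modifiers
// are always added in front of the modifiers supplied in $data[1].
function sketchApplyRegexSkip($document, $data)
{
    list($pattern, $modifiers) = $data;
    $regex = '/' . $pattern . '/ms' . $modifiers;
    $result = preg_replace($regex, '', $document);
    // preg_replace returns null on error; keep the document unchanged then.
    return $result === null ? $document : $result;
}

// Hypothetical sample page: a "read more" link split across two lines,
// written in upper case, followed by an ordinary link.
$html = "<p>Story one</p>\n<A href=\"/1\"\n   class=\"readmore\">Read more</A>\n<a href=\"/2\">Story two</a>";

$clean = sketchApplyRegexSkip($html, array('<a.*?class="readmore".*?<\\/a>', 'i'));
echo $clean;
```

Under these assumptions, the "s" modifier lets "." match line breaks, so the link that spans two lines is removed, and "i" lets the pattern match the upper-case tag. For the same reason, ".*?" can run from an earlier link all the way to a later class="readmore", removing both; using "[^>]*?" in place of the first ".*?" keeps the match inside a single tag.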

You may set the remainder of the configuration options for the Unstructured plugin using the function GrouperSourceConf, as follows:

GrouperSourceConf('OptionName','new value');

The Unstructured plugin has the following options:

maxdesc: The maximum number of characters to include in the item description element.
channeltitle: [Grouper < 1.6] The title for your RSS channel.
channeldescription: [Grouper < 1.6] The description for your RSS channel.
encoding: [Grouper <= 1.6.1] The character encoding of the page (and thus of the newsfeed). You can usually leave this as it is.
searchdomain: [Grouper < 1.6] The domain name of the blog or other page you wish to scrape (for example, 'www.geckotribe.com'). Use the function GrouperSourceURL to set this option and querystart at the same time.
querystart: [Grouper < 1.6] The path to the page you wish to scrape. This value MUST begin with '/'. If the path is to a directory and the document contains relative links, it must end with '/' for the links to be processed correctly. Note that this applies only to link fields, not to links in the description text (which are not altered by this plugin). Use the function GrouperSourceURL to set this option and searchdomain at the same time.
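
Putting the pieces together, a complete feed script might look like the sketch below. It uses only the functions described on this page; the skip values and the maxdesc value of 300 are arbitrary choices for illustration, and placing the configuration calls between GrouperSourceURL and GrouperShow follows the example under "Use" rather than any stated requirement.

```php
<?php
require_once '/YOUR/PATH/TO/grouper/grouper.php';
GrouperLoadPlugin('unstructured.php');
GrouperSourceURL('http://example.com/foo/');

// Drop everything ahead of the page body before links are extracted.
UnstructuredGrouperAddSkip('text',array('','<body'));
// Omit links whose title contains "comments" (illustrative value).
UnstructuredGrouperAddSkip('in-title','comments');
// Remove "read more" links, matching case-insensitively.
UnstructuredGrouperAddSkip('regex',array('<a.*?class="readmore".*?<\\/a>','i'));

// Limit item descriptions to 300 characters (arbitrary example value).
GrouperSourceConf('maxdesc',300);

GrouperShow('','CACHE-FILE-NAME');
?>
```

Replace the path, URL, and cache file name with your own values, as in the basic example.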