Grouper - Documentation
Getting Started: Free Download | Purchase | Install
Reference: Functions | Plugins | Themes
Etc.: Configure | Affiliates
"Regular Expression" Based Scraper Plugin
The Regex plugin uses Perl-style regular expression matching to parse regularly structured web pages and convert them to RSS feeds. This plugin is intended for use by persons familiar with regular expression matching. Due to the complexity of analyzing the structure of web pages and constructing regular expressions to extract data from them, we are unable to provide support for configuring this plugin for particular webpages.
Installation:
To use the Regex plugin, regex.php must be located in the "plugins" folder inside the folder containing grouper.php. This is the default location when Grouper Evolution is installed.
Use:
The following code will generate an RSS feed from a webpage:
<?php
require_once '/YOUR/PATH/TO/grouper/grouper.php';
GrouperLoadPlugin('regex.php');
GrouperSourceURL('http://example.com/foo/');
// additional configuration usually needed here, as described below
GrouperShow('','CACHE-FILE-NAME');
?>
Configuration:
You may configure the behavior of the Regex plugin using the function GrouperSourceConf, as follows:
GrouperSourceConf('OptionName','new value');
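As an illustration, several options can be set one after another before calling GrouperShow. The sketch below uses option names from the list in this section; the marker strings and element names are hypothetical values for an imaginary page, not defaults.

```php
// Hypothetical values for an imaginary page; adjust to the page being scraped.
GrouperSourceConf('tossbefore', '<ul id="news">');   // discard everything through this marker
GrouperSourceConf('tossafter', '</ul>');             // discard this marker and everything after it
GrouperSourceConf('extractionorder', 'link|title');  // names for the 1st and 2nd parenthesized groups
// extractionpattern is set the same way; its value depends on the page's markup.
```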
The Regex plugin has the following options:
- tossbefore: All data up to and including the first occurrence of the specified text will be discarded before applying the regular expression.
- tossafter: All data including and following the first occurrence of the specified text will be discarded before applying the regular expression.
- extractionpattern: The regular expression to use to parse the portion of the webpage that remains after application of the "tossbefore" and "tossafter" settings.
- extractionorder: The names of the item child elements that correspond to the parenthesized portions of the regular expression in "extractionpattern". Separate multiple values with a pipe character (|). Enter the element names exactly as you want them to appear in the RSS feed.
- encoding: [Grouper <= 1.6.1] The character encoding of the page (and thus of the newsfeed). You can usually leave this as it is.
- channeltitle: [Grouper < 1.6] The title for your RSS channel.
- channeldescription: [Grouper < 1.6] The description for your RSS channel.
- searchdomain: [Grouper < 1.6] The domain name of the blog or other page you wish to scrape (for example, 'www.geckotribe.com'). Use the function GrouperSourceURL to set this option and querystart at the same time.
- querystart: [Grouper < 1.6] The path to the page you wish to scrape. This value MUST begin with '/'. If the path is to a directory and the document contains relative links, it must end with '/' for the links to be processed correctly. Note that this applies only to link fields, not to links in the description text (which are not altered by this plugin). Use the function GrouperSourceURL to set this option and searchdomain at the same time.
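
To see how tossbefore, tossafter, extractionpattern and extractionorder interact, it can help to reproduce the described behavior in plain PHP. The following is a minimal sketch under stated assumptions, not the plugin's actual code: the function sketch_extract_items, the sample page, and the pattern (including its '#' delimiters) are all hypothetical, and the function simply follows the option descriptions above using PHP's preg_match_all.

```php
<?php
// Hypothetical helper, NOT part of Grouper: mimics the documented option
// semantics with plain PHP so the effect of each setting can be seen.
function sketch_extract_items(string $html, array $conf): array
{
    // tossbefore: discard everything up to and including the first occurrence.
    if ($conf['tossbefore'] !== '') {
        $pos = strpos($html, $conf['tossbefore']);
        if ($pos !== false) {
            $html = substr($html, $pos + strlen($conf['tossbefore']));
        }
    }

    // tossafter: discard the first occurrence and everything following it.
    if ($conf['tossafter'] !== '') {
        $pos = strpos($html, $conf['tossafter']);
        if ($pos !== false) {
            $html = substr($html, 0, $pos);
        }
    }

    // extractionorder: pipe-separated element names, one per parenthesized group.
    $names = explode('|', $conf['extractionorder']);

    // One match per item; PREG_SET_ORDER groups the captures by match.
    preg_match_all($conf['extractionpattern'], $html, $matches, PREG_SET_ORDER);

    $items = [];
    foreach ($matches as $m) {
        $item = [];
        foreach ($names as $i => $name) {
            // $m[0] is the whole match, so group N sits at index N.
            $item[$name] = $m[$i + 1] ?? '';
        }
        $items[] = $item;
    }
    return $items;
}

// Hypothetical page content and settings, for illustration only.
$page = '<html><h1>News</h1><ul id="news">'
      . '<li><a href="/a.html">First story</a></li>'
      . '<li><a href="/b.html">Second story</a></li>'
      . '</ul><div id="footer"><a href="/about.html">About</a></div></html>';

$items = sketch_extract_items($page, [
    'tossbefore'        => '<ul id="news">',
    'tossafter'         => '</ul>',
    'extractionpattern' => '#<a href="([^"]+)">([^<]+)</a>#',
    'extractionorder'   => 'link|title',
]);

foreach ($items as $item) {
    echo $item['title'], ' => ', $item['link'], "\n";
}
```

With the sample page, the sketch prints "First story => /a.html" and "Second story => /b.html". The "About" link in the footer is not matched because tossafter removed it before the pattern was applied, which is the main reason to narrow the page with tossbefore and tossafter rather than writing a more elaborate pattern.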