GT::URI - Internet resource request broker
    use GT::URI;
    my $doc = GT::URI->get( 'http://www.gossamer-threads.com' );
GT::URI makes requests and retrieves resources from internet servers.
Just want a few items? Call GT::URI->get and all the magic will be done for you.
    use GT::URI;
    my $docs = GT::URI->get(
        "http://www.gossamer-threads.com/",
        "http://www.google.com/",
        "http://www.somethingelse.com"
    );
If options need to be set, include a hashref containing the settings you want.
    use GT::URI;
    my $conf = { max_down => 2000 };
    my $docs = GT::URI->get(
        $conf,
        "http://www.gossamer-threads.com/",
        "http://www.google.com/",
        "http://www.somethingelse.com"
    );
The simple method blocks while acquiring the data, meaning your script is frozen until all the data is downloaded. GT::URI can also handle things in a non-blocking fashion, so while you wait for the documents to download, you can do something else.
A very simple example follows.
    use GT::URI;
    use GT::Dumper;

    $uri = new GT::URI();

    # queue up the URIs wanted
    $uri->rack_uri(
        "http://www.gossamer-threads.com/",
        "http://www.google.com/",
        "http://www.somethingelse.com"
    );

    # loop through until there are no more requests left to finish
    while ( $uri->requests() ) {
        $docs = $uri->do_iteration();

        # do something here
        print '.';
    }

    # output all the data
    print Dumper($docs);
But this can quickly get much more complex. Since the downloads are asynchronous, the code can be changed to handle each request as it comes in.
    use GT::URI;
    use GT::Dumper;

    $uri = new GT::URI();

    # queue up the URIs wanted
    $uri->rack_uri(
        "http://www.gossamer-threads.com/",
        "http://www.google.com/",
        "http://www.somethingelse.com"
    );

    # loop through until there are no more requests left to finish
    while ( $uri->requests() ) {
        $uri->do_iteration();

        # if there are any completed requests, handle them
        if ( my $number_completed = $uri->completed_requests() ) {
            print "Completed $number_completed request(s):\n";
            my $completed = $uri->completed();
            print Dumper( $completed );

            # IMPORTANT: the object caches downloaded requests, once the
            # data wanted has been pulled out of the object, clear the object's
            # cache. Otherwise, the resource will appear again in the next
            # $uri->completed() call
            $uri->clear_completed();
        }

        # do something here
        print '.';
    }
It is possible to queue more links safely with $uri->rack_uri() from within the loop, though a separate accounting system must be designed to prevent infinite loops.
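One possible accounting scheme is a hash of URIs that have already been queued, so a link discovered twice is never racked a second time. The sketch below is a minimal, self-contained illustration; the filter_unseen helper and the %seen hash are hypothetical and not part of GT::URI.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of GT::URI: remembers every URI it has
# been shown and returns only the ones it has not seen before.
my %seen;

sub filter_unseen {
    my @uris = @_;
    return grep { !$seen{$_}++ } @uris;
}

# First batch: everything is new.
my @first = filter_unseen(
    "http://www.gossamer-threads.com/",
    "http://www.google.com/",
);

# Second batch: one repeat, one new link.
my @second = filter_unseen(
    "http://www.google.com/",
    "http://www.somethingelse.com",
);

print "queue: $_\n" for @first, @second;
```

Inside the download loop, only the list returned by the helper would be passed on to $uri->rack_uri().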
GT::URI has only a few options to control its behaviour. There's not much it does that can be configured!
    $opts = {
        # maximum number of bytes to download for a single resource
        'max_down' => 0,

        # maximum number of simultaneous downloads
        'max_simultaneous' => 10,

        # configuration settings for individual protocols, look in
        # any GT::URI::xxxx protocol module to find out related
        # configuration options
        'protocol_opts' => {
            'protocol_name' => { setting => value, ... },

            # eg
            'HTTP' => { 'agent_name' => 'example agent name option value' }
        }
    }
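As an illustration of filling in this structure, the sketch below layers caller-supplied settings over the two default values listed above. The merge_opts helper is hypothetical, and GT::URI may well do its own merging internally; only the key names and default values are taken from the listing.

```perl
use strict;
use warnings;

# Defaults as listed in the options structure above.
my %defaults = (
    max_down         => 0,
    max_simultaneous => 10,
    protocol_opts    => {},
);

# Hypothetical helper: shallow merge in which the caller's values win.
sub merge_opts {
    my ($user) = @_;
    return { %defaults, %{ $user || {} } };
}

my $opts = merge_opts({
    max_down      => 2000,
    protocol_opts => {
        HTTP => { agent_name => 'example agent name option value' },
    },
});

printf "max_down=%d max_simultaneous=%d\n",
    $opts->{max_down}, $opts->{max_simultaneous};
```

The merge is shallow, so a caller-supplied protocol_opts replaces the default one wholesale rather than being combined key by key.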
The data structure that GT::URI produces to house all the resource information is mildly complex.
    $docs = {
        'uri requested' => {
            'buffer'           => 'resource data',
            'resource_attribs' => { 'resource_key' => 'value' },
            'extra info'       => ....
        }
    }
The 'buffer' will contain the raw HTTP data; 'resource_attribs' will contain extra information related to the resource.
Depending on the service requested, more information could be added. Currently no protocol requires an extra key.
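Walking this structure is a matter of iterating over the requested URIs. The sketch below uses a hand-built stand-in for $docs; the buffer contents and attribute names are the placeholder values from the layout above, not real output.

```perl
use strict;
use warnings;

# Stand-in for what GT::URI returns; the values are placeholders.
my $docs = {
    'http://www.gossamer-threads.com/' => {
        buffer           => 'resource data',
        resource_attribs => { resource_key => 'value' },
    },
};

my @report;
for my $uri ( sort keys %$docs ) {
    my $doc = $docs->{$uri};
    push @report, sprintf( "%s: %d bytes", $uri, length( $doc->{buffer} ) );

    # resource_attribs is a plain hash of key/value pairs
    my $attribs = $doc->{resource_attribs} || {};
    push @report, "  $_ = $attribs->{$_}" for sort keys %$attribs;
}

print "$_\n" for @report;
```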
Socket Handling

    sub do_iteration()        Basic looping function that downloads resources in the background
    sub pending()             Returns true if data is awaiting download

Acquisition

    sub completed()           Returns a hash of all the completed requests
    sub completed_requests()  The number of requests completed
    sub clear_completed()     Cleans the completed request cache
    sub get()                 Simple resource acquisition function
    sub rack_uri()            Add a URI to be downloaded
    sub requests()            Returns number of active requests
    sub vec()                 Sets file bits suitable for a select call
completed() returns a data structure with the cached completed documents.
completed_requests() returns the current number of completed requests in the cache.
clear_completed() clears the completed document cache.
do_iteration() handles the bulk of the non-blocking work.
get() is the simplest way of acquiring a number of pages. Call it and it will return the GT::URI data structure.
The configuration hashref can appear anywhere in the argument list. The function iterates through its parameters and treats any hashref as an option parameter and any scalar as a URI.
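The argument handling described here can be sketched as follows. This is not GT::URI's actual code; split_get_args is a hypothetical helper showing one way to separate option hashrefs from URI scalars regardless of their position in the list.

```perl
use strict;
use warnings;

# Hypothetical helper: any hashref is treated as options, any plain
# scalar as a URI. If several hashrefs are given, later keys win
# (an assumption, not documented behaviour).
sub split_get_args {
    my ( %opts, @uris );
    for my $arg (@_) {
        if ( ref($arg) eq 'HASH' ) {
            %opts = ( %opts, %$arg );
        }
        elsif ( !ref($arg) ) {
            push @uris, $arg;
        }
    }
    return ( \%opts, \@uris );
}

my ( $opts, $uris ) = split_get_args(
    "http://www.gossamer-threads.com/",
    { max_down => 2000 },
    "http://www.google.com/",
);

print "options: ", join( ", ", map { "$_=$opts->{$_}" } sort keys %$opts ), "\n";
print "uris:    @$uris\n";
```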
pending() returns '1' if there is data pending to be downloaded for any of the requests, and '0' otherwise.
rack_uri() takes a list of URLs and queues them for download.
requests() returns the number of requests pending action in the download queue. Usually this is followed by a call to $uri->do_iteration().
vec() returns a bit mask that can be used in a call to select. If you want to use an already existing bit mask, pass it into the function and the appropriate bits from the requests will be set in addition.
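For readers unfamiliar with select bit masks, the sketch below shows how such a mask is built with Perl's built-in vec and fileno functions, including adding bits to an existing mask. It uses the standard filehandles as stand-ins for sockets; add_to_mask is a hypothetical helper and does not reproduce GT::URI's own vec method.

```perl
use strict;
use warnings;

# Hypothetical helper: set the bit for each filehandle's descriptor in
# the given mask (or a fresh one) and return the result.
sub add_to_mask {
    my ( $mask, @handles ) = @_;
    $mask = '' unless defined $mask;
    vec( $mask, fileno($_), 1 ) = 1 for @handles;
    return $mask;
}

# Start with a mask for STDIN, then add STDOUT to the existing mask.
my $rin = add_to_mask( undef, \*STDIN );
$rin = add_to_mask( $rin, \*STDOUT );

# The mask could now be handed to the four-argument select, e.g.
#   my $nfound = select( my $rout = $rin, undef, undef, 0.25 );
printf "bits set: %s\n", unpack( "b*", $rin );
```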
forthcoming
Copyright (c) 2000 Gossamer Threads Inc. All Rights Reserved. http://www.gossamer-threads.com/
Revision: $Id: URI.pm,v 1.24 2002/04/07 03:35:35 jagerman Exp $