NAME

GT::URI - Internet resource request broker


SYNOPSIS

  use GT::URI;
  my $doc = GT::URI->get( 'http://www.gossamer-threads.com' );


DESCRIPTION

GT::URI makes requests to and retrieves resources from internet servers.


BASICS

Getting a resource, the simple way

Want just a few items? Call GT::URI->get and all the magic will be done for you.

  use GT::URI;
  my $docs = GT::URI->get( "http://www.gossamer-threads.com/", "http://www.google.com/", "http://www.somethingelse.com" );

If options need to be set, include a hashref containing the settings you want.

  use GT::URI;
  my $conf = { max_down => 2000  };
  my $docs = GT::URI->get( $conf, "http://www.gossamer-threads.com/", "http://www.google.com/", "http://www.somethingelse.com" );

When you've got better things to do than wait

The simple method blocks while acquiring the data: your script is frozen until everything has been downloaded. GT::URI can also handle requests in a non-blocking fashion, so you can do something else while the documents download.

A very simple example follows.

  use GT::URI;
  use GT::Dumper;
  my $uri = GT::URI->new();
  my $docs;
  # queue up the URIs wanted
  $uri->rack_uri( "http://www.gossamer-threads.com/", "http://www.google.com/", "http://www.somethingelse.com" );
  # loop through until there are no more requests left to finish
  while ( $uri->requests() ) {
      $docs = $uri->do_iteration();
      # do something here
      print '.';
  }
  # output all the data
  print Dumper($docs);

But this can quickly get much more complex. Since the downloads are asynchronous, the code can be changed to handle each request as it comes in.

  use GT::URI;
  use GT::Dumper;
  my $uri = GT::URI->new();
  # queue up the URIs wanted
  $uri->rack_uri( "http://www.gossamer-threads.com/", "http://www.google.com/", "http://www.somethingelse.com" );
  # loop through until there are no more requests left to finish
  while ( $uri->requests() ) {
      $uri->do_iteration();
      # if there are any completed requests, handle them
      if ( my $number_completed = $uri->completed_requests() ) {
          print "Completed $number_completed request(s):\n";
          my $completed = $uri->completed();
          print Dumper( $completed );
          # IMPORTANT: the object caches downloaded requests. Once the
          # data wanted has been pulled out of the object, clear the
          # object's cache. Otherwise, the resource will appear again
          # in the next $uri->completed() call
          $uri->clear_completed();
      }
      # do something here
      print '.';
  }
  # every completed request has already been printed inside the loop

It is safe to queue more links with $uri->rack_uri() inside the loop, though a separate accounting system must be designed to prevent infinite loops.
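One possible accounting scheme is a hash of every URI queued so far plus a hard cap on the total. The sketch below is only an illustration under those assumptions: new_uris, %seen and $max_total are hypothetical names, not part of GT::URI, and the rack_uri call is left as a comment so the code runs on its own.

```perl
use strict;
use warnings;

# Hypothetical bookkeeping, not part of GT::URI.
my %seen;             # every URI that has ever been queued
my $max_total = 5;    # hard cap on URIs queued over the whole run

# Return only the URIs that have not been queued before, respecting the cap.
sub new_uris {
    my @fresh;
    for my $candidate (@_) {
        last if scalar( keys %seen ) >= $max_total;
        next if $seen{$candidate}++;
        push @fresh, $candidate;
    }
    return @fresh;
}

my @first  = new_uris( 'http://a.example/', 'http://b.example/' );
my @second = new_uris( 'http://a.example/', 'http://c.example/' );

# Inside the download loop this would become something like:
#   $uri->rack_uri( @second ) if @second;
print "first: @first\n";
print "second: @second\n";
```

Because a URI is only ever returned once, links that point back at already-fetched pages cannot keep the loop alive, and the cap bounds the total work even when every page yields new links.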

Options to configure GT::URI

GT::URI has only a few options to control its behaviour; there's not much that can be configured.

  $opts = {
      # maximum number of bytes to download for a single resource
      'max_down'         => 0,
      # maximum number of simultaneous downloads
      'max_simultaneous' => 10,
      # configuration settings for individual protocols; look in
      # any GT::URI::xxxx protocol module to find the related
      # configuration options
      'protocol_opts'    => {
          'protocol_name' => {
              setting => value,
              ...
          },
          # e.g.
          'HTTP' => {
              'agent_name' => 'example agent name option value'
          }
      }
  }

The main data structure GT::URI creates

The data structure that GT::URI produces to house all the resource information is mildly complex.

  $docs = {
      'uri requested' => {
          'buffer'           => 'resource data',
          'resource_attribs' => {
              'resource_key' => 'value'
          },
          'extra info'       => ....
      }
  }

The 'buffer' key contains the raw HTTP data; 'resource_attribs' contains extra information related to the resource.

Depending on the service requested, more information could be added. Currently no protocol requires an extra key.
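As a rough illustration of reading this structure, the sketch below walks a hand-built hashref in the documented shape. The URI, buffer contents and attribute name are made up for the example; a real $docs would come from GT::URI->get or $uri->completed().

```perl
use strict;
use warnings;

# Mock data in the documented shape; the values are invented for illustration.
my $docs = {
    'http://www.example.com/' => {
        'buffer'           => "<html>hello</html>",
        'resource_attribs' => { 'example_key' => 'example value' },
    },
};

# Build one summary line per requested URI, then one line per attribute.
my @report;
for my $requested ( sort keys %$docs ) {
    my $doc = $docs->{$requested};
    push @report, sprintf( "%s: %d bytes", $requested, length( $doc->{buffer} ) );
    for my $key ( sort keys %{ $doc->{resource_attribs} } ) {
        push @report, "  $key = $doc->{resource_attribs}{$key}";
    }
}
print "$_\n" for @report;
```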


METHOD LIST

Socket Handling

  sub do_iteration()        Basic looping function that downloads resources in the background
  sub pending()             Returns true if data is waiting to be downloaded

Acquisition

  sub completed()           Returns a hash of all the completed requests
  sub completed_requests()  The number of requests completed
  sub clear_completed()     Cleans the completed request cache
  sub get()                 Simple resource acquisition function
  sub rack_uri()            Add a URI to be downloaded
  sub requests()            Returns number of active requests
  sub vec()                 Sets file bits suitable for a select call

completed () : completed_requests HASHREF

Returns a data structure containing the cached completed documents.

completed_requests () : num_requests INTEGER

Returns the current number of completed requests in the cache.

clear_completed ()

Clears the completed document cache.

do_iteration () : completed_requests HASHREF

The bulk of the non-blocking work is handled within this function.

GT::URI->get ( [ conf HASHREF, ] url STRING, url STRING, url STRING..., ) : completed_requests HASHREF

The simplest way of acquiring a number of pages. Call it and it will return the GT::URI data structure.

The configuration hashref can appear anywhere in the list. The function iterates through the get parameters and treats any hashref as an option parameter and any scalar as a URI.
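The argument handling described above can be sketched as follows. This is not the module's code: split_args is a hypothetical helper, and merging several hashrefs so that later ones override earlier ones is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of GT::URI: split a mixed argument list
# into one options hash and a list of URIs.
sub split_args {
    my ( %opts, @uris );
    for my $arg (@_) {
        if ( ref($arg) eq 'HASH' ) {
            # assumption: later hashrefs override earlier ones
            %opts = ( %opts, %$arg );
        }
        else {
            push @uris, $arg;
        }
    }
    return ( \%opts, \@uris );
}

my ( $opts, $uris ) = split_args(
    'http://www.gossamer-threads.com/',
    { max_down => 2000 },
    'http://www.google.com/',
);
print "max_down=$opts->{max_down}\n";
print "uris=@$uris\n";
```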

pending () : status BOOLEAN

Returns '1' if there is data pending to be downloaded for any of the requests, '0' otherwise.

rack_uri ( url1 STRING, [ url2 STRING ... ] )

Takes a list of URLs and queues them for download.

requests ( tics INTEGER ) : num_requests INTEGER

Returns the number of requests pending action in the download queue. Usually this is followed by a call to $uri->do_iteration().

vec ( [ bits STRING ] ) : bits STRING

Returns a bit mask that can be used in a call to select. If you want to use an already existing bit mask, pass it into the function and the appropriate bits for the requests will be set in addition.
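For readers unfamiliar with select bit masks, the sketch below shows the core Perl mechanics such a mask relies on. It does not call GT::URI: a pipe stands in for a download socket so the example runs on its own, and the mask is built by hand where a script would otherwise pass it through $uri->vec( $bits ).

```perl
use strict;
use warnings;

# A pipe stands in for a download socket; one byte is written so that
# the read end is immediately readable.
pipe( my $reader, my $writer ) or die "pipe failed: $!";
syswrite( $writer, "x" );

# Build a read mask by hand. An existing mask like this one is what
# would be passed to $uri->vec() to have the request bits added.
my $bits = '';
vec( $bits, fileno($reader), 1 ) = 1;

# Wait up to one second for any handle in the mask to become readable.
# select modifies its mask arguments, so it is given a copy in $rout.
my $ready = select( my $rout = $bits, undef, undef, 1 );
print "handles ready: $ready\n";
```

After select returns, vec( $rout, fileno($handle), 1 ) tells whether a particular handle is ready.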


BUILDING PROTOCOL HANDLERS

forthcoming


COPYRIGHT

Copyright (c) 2000 Gossamer Threads Inc. All Rights Reserved. http://www.gossamer-threads.com/


VERSION

Revision: $Id: URI.pm,v 1.24 2002/04/07 03:35:35 jagerman Exp $