LWP (Library for WWW in Perl)
If you want to automatically process web pages to extract data, you have a number of tools available. You can bring a web page down to your computer using "curl" or "wget":
curl http://aplawrence.com > mysite
If you don't really want the HTML, use "lynx --dump http://whatever.com > /yourstorage/whatever.txt" to get a text representation of the page. Check the man page for options you might want, like "--nolist", and also see lynx alternatives.
You can also easily be selective and pull only the data you want from a page with simple Perl scripts.
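#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
# Fetch the page; get() returns undef on failure.
my $url = 'http://aplawrence.com';
my $content = get $url;
die "Could not fetch $url\n" unless defined $content;
print $content;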
And then of course you'd process the $content as desired. It's only a little more complex if you are dealing with forms; see http://aplawrence.com/Words/2005_03_05.html for a small example of that.
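As a rough sketch of what "process the $content as desired" might look like, the script below pulls the page title and the link targets out of fetched HTML. The helper names extract_title and extract_links are made up for this illustration, and the regular expressions are a quick-and-dirty assumption that the markup is reasonably tidy; for anything serious you would probably want a real parser such as HTML::Parser or HTML::LinkExtor.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text of the first <title> element,
# or undef if the page has none.
sub extract_title {
    my ($html) = @_;
    return unless defined $html && $html =~ m{<title[^>]*>(.*?)</title>}is;
    my $title = $1;
    $title =~ s/\s+/ /g;         # collapse runs of whitespace
    $title =~ s/^\s+|\s+$//g;    # trim both ends
    return $title;
}

# Hypothetical helper: return the values of all quoted href attributes
# found in <a> tags, in page order. Unquoted hrefs are not handled.
sub extract_links {
    my ($html) = @_;
    return () unless defined $html;
    my @links;
    while ($html =~ m{<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1}gis) {
        push @links, $2;
    }
    return @links;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    require LWP::Simple;
    my $url     = shift @ARGV;
    my $content = LWP::Simple::get($url);
    die "Could not fetch $url\n" unless defined $content;

    my $title = extract_title($content);
    print "Title: ", (defined $title ? $title : '(none)'), "\n";
    print "Link:  $_\n" for extract_links($content);
}
```

Saved as, say, pagelinks.pl (a name picked only for this example), you would run it as "perl pagelinks.pl http://aplawrence.com". Keeping the fetch separate from the extraction means you can also feed the helpers HTML you already saved with curl or wget.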
A book that covers LWP is reviewed at http://aplawrence.com/Books/webc.html.
*Originally published at APLawrence.com
A.P. Lawrence provides SCO Unix and Linux consulting services http://www.pcunix.com