Crawling Other Sites [Archive] - Website Publisher Forums

Mike

03-10-2004, 09:13 AM

Hi all,

Would anyone be able to tell me the function to "crawl" another website? I know there isn't one specific function, but could you give me a basic idea of what I'd need to know?

Thanks,
Mike

chromate

03-10-2004, 09:58 AM

I've never done it, but I guess you would have to make an HTTP request and then read the results into a variable. Then use some string functions to find what you're looking for.

Pas mentioned a good book that discusses this in the 4fineart thread I think.

Chris

03-10-2004, 11:00 AM

Basically load the content of the page.

Parse out all html links.

Enter the URLs into an array (or db).

Cycle through the array (DB) pulling each page.

Parse out all html links....

So on and so forth.
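
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: crawl(), $fetch, $extractLinks and $maxPages are hypothetical names, not built-in PHP functions, and the caller supplies the fetching and link-parsing logic. The page limit is there so the loop cannot run forever on a large site.

```php
<?php
// Breadth-first crawl sketch. $fetch and $extractLinks are hypothetical
// callables supplied by the caller; they are not built-in PHP functions.
function crawl(string $startUrl, callable $fetch, callable $extractLinks, int $maxPages = 50): array
{
    $queue   = [$startUrl];          // URLs waiting to be fetched
    $visited = [];                   // URLs already fetched, used as a set

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;                // skip pages we have already pulled
        }
        $visited[$url] = true;

        $html = $fetch($url);        // load the content of the page
        if ($html === null) {
            continue;                // fetch failed; move on to the next URL
        }

        foreach ($extractLinks($html) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;    // enter newly found URLs into the array
            }
        }
    }

    return array_keys($visited);     // pages in the order they were crawled
}
```

Keeping the visited URLs as array keys makes the "have I seen this page?" check cheap, which matters once the list grows. A database table with a unique URL column would play the same role.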

Mike

03-10-2004, 11:12 AM

Would you make an HTTP request, as chromate said, to load the content of the page, Chris?

Thanks a lot,
Mike

Chris

03-10-2004, 11:57 AM

Yes.

PHP has a file() function that can fetch the contents of a remote file.
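
A minimal sketch of that, assuming allow_url_fopen is enabled in php.ini (remote URLs will not open otherwise). fetch_page() is a hypothetical helper name; file() itself is the built-in mentioned above.

```php
<?php
// Fetch a page as a single string, or null on failure.
// Remote URLs only work when allow_url_fopen is enabled in php.ini.
function fetch_page(string $url): ?string
{
    // file() returns an array of lines, or false on failure.
    // The @ silences the warning so failure is handled by the check below.
    $lines = @file($url);
    if ($lines === false) {
        return null;
    }
    return implode('', $lines);      // each line keeps its newline ending
}
```

If the line-by-line array is not needed, file_get_contents() returns the whole page as one string directly.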

incka

03-10-2004, 12:09 PM

If you want to fill up your server, do a complete data crawl of wikipedia.org.

Mike

03-10-2004, 12:17 PM

Originally posted by Chris
Yes.

PHP has a file() function that can fetch the contents of a remote file.

I may give it a go then :)

Is it alright to do it? Or could the site owners not like it because it's eating up their bandwidth?

Thanks very much,
Mike

r2d2

03-10-2004, 12:36 PM

They probably wouldn't like it, but what can they do?
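
One thing site owners can do is publish a robots.txt file saying which paths crawlers should stay out of, and a considerate crawler checks it before fetching. The sketch below is a deliberately simplified reading of that file: is_path_allowed() is a hypothetical helper, it only looks at the "User-agent: *" group, and it only does plain prefix matching on Disallow lines (no wildcards, no Allow lines, no grouping of several User-agent lines).

```php
<?php
// Return true if $path is allowed for all crawlers ("User-agent: *")
// according to the given robots.txt text. Simplified: handles only
// User-agent and Disallow lines, with plain prefix matching.
function is_path_allowed(string $robotsTxt, string $path): bool
{
    $appliesToUs = false;
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));   // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            $appliesToUs = ($value === '*');
        } elseif ($field === 'disallow' && $appliesToUs && $value !== '') {
            if (strpos($path, $value) === 0) {
                return false;        // path starts with a disallowed prefix
            }
        }
    }
    return true;
}
```

Pausing between requests (for example with sleep(1) inside the crawl loop) also helps with the bandwidth concern raised above.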

flyingpylon

03-10-2004, 02:43 PM

Why do you need to crawl another site?

If it's just to grab everything on the site and download it, there are products that do that already. Why reinvent the wheel? If you need to be able to search that other site, there are products that do that too.

However, if you really need to parse out specific data and put it in a database, then I suppose you'd need to write your own script.

Mike

03-10-2004, 03:08 PM

Originally posted by flyingpylon

However, if you really need to parse out specific data and put it in a database, then I suppose you'd need to write your own script.

Exactly :)

Will it be legal to crawl another site then? Or could some consider it as an attack?

Thanks,
Mike

incka

03-10-2004, 03:19 PM

Google do it, and I don't see them being sued for it...

r2d2

03-10-2004, 03:54 PM

You are just reading their website - they have put it there for all to read.

What exactly you are doing with it might be a problem - i.e. they can't stop you just fetching their website, but they would have a problem with you just republishing it! I'm sure that's not what you were going to do, it's just an example...

incka

03-10-2004, 03:59 PM

Yeah, only do it on sites that either allow it, like let's say an affiliate program, or a site that is open source, like Wikipedia.

Dan Morgan

03-10-2004, 04:18 PM

There are a couple of ready-written spiders available on SourceForge.net...

Mike

03-11-2004, 12:43 AM

Thanks for all the replies...

Going off topic a little here, but isn't what Google are doing against copyright laws? Like, they are displaying part of someone's website content, aren't they?

Mike

r2d2

03-11-2004, 02:04 AM

Unlikely to be sued though, because this is of benefit to the copyright holder...

It's a difficult one though... I guess it is a very small amount of any page, and they always show where it has actually come from.

Chris

03-11-2004, 07:44 AM

Not really. I believe someone tried suing them but failed. Basically because there are tools for someone to remove their site.

incka

03-11-2004, 09:18 AM

I bet the suer was google-watch.org

Mike

03-11-2004, 09:35 AM

Ok, thanks for all the responses guys :)

Mike

03-12-2004, 10:54 AM

I had a go at this last night, testing it with my site. I got to the link-parsing bit though, and really didn't know where to go. So for two hours today I've been searching around php.net, but haven't found anything that works for me. I think it's something to do with preg_grep, but I can't think how it would work.

Could anyone help?

Thanks very much,
Mike

Chris

03-12-2004, 11:41 AM

You definitely need to use regular expressions.
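
As a rough illustration of the regular-expression approach: preg_grep() only filters an array down to the elements that match a pattern, so it will not pull the URLs out. preg_match_all() with a capturing group is closer to what is needed. extract_links() below is a hypothetical helper, and the pattern is a sketch that only handles quoted href attributes on <a> tags.

```php
<?php
// Return the href values of all <a> tags in $html, without duplicates.
// A regex is a rough tool here: it handles quoted href attributes only
// and will miss unquoted or script-generated links.
function extract_links(string $html): array
{
    // (["\']) captures the opening quote, (.*?) the URL, and \1 requires
    // the same quote character to close the attribute.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);
    return array_values(array_unique($matches[2]));
}
```

The returned URLs are exactly as written in the page, so relative links such as /about.html still need to be resolved against the page's own address before they can be fetched.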

r2d2

03-12-2004, 12:11 PM

Finding links using Regular Expression Syntax (http://www.dotnetindex.com/read.asp?articleID=60)

Hope that helps.

This too: Regular Expressions in PHP (http://www.zend.com/zend/spotlight/code-gallery-wade5.php)

GCT13

03-12-2004, 12:36 PM

Thanks for those links.

Mike

03-12-2004, 12:57 PM

Yeah, thanks r2d2. I've just had a quick read, and the PHP one seems very useful :)

I made the following after reading it, but it's not working. Does anyone know what's wrong?

<?php
$file_lines = file("http://www.sitestem.com");
$page = htmlspecialchars($line);
foreach ($file_lines as $line) echo htmlspecialchars($line);

$match = ereg("href", $page);

if($match) {
echo "yes!";
}
?>

Thanks a lot,
Mike
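
A likely cause, going only by the script as posted: $page is built from $line on the second line, before the foreach loop has ever assigned $line, so $page is an empty string and the href test can never succeed. On current PHP there is a second problem, since ereg() was removed in PHP 7. A minimal corrected sketch, where page_has_links() is a hypothetical helper name:

```php
<?php
// Report whether the fetched lines contain at least one href attribute.
function page_has_links(array $fileLines): bool
{
    // Build $page from every line; the original read $line before the
    // foreach loop had defined it, so $page was always empty.
    $page = implode('', $fileLines);

    // preg_match() returns 1 on a match and 0 otherwise. It stands in
    // for ereg(), which was removed in PHP 7.
    return preg_match('/href/i', $page) === 1;
}

// Usage, mirroring the original script (needs allow_url_fopen):
// $file_lines = file("http://www.sitestem.com");
// if ($file_lines !== false && page_has_links($file_lines)) {
//     echo "yes!";
// }
```

This only answers "is there a link at all?", as the original did. Collecting the actual URLs needs a capturing pattern with preg_match_all().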

r2d2

04-15-2004, 03:00 PM

I'm just starting to use Snoopy, which I think came from PEAR (PHP function library type thing). I'm using it to make a POST request to a search page. It's pretty cool stuff. You should check it out if you're still doing this stuff.
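
Snoopy's own API is not shown in this thread, so the sketch below does not use it. It is an assumption-laden illustration of the same idea (a form-encoded POST to a search page) using only PHP's built-in HTTP stream wrapper. build_post_options() and post_form() are hypothetical helper names, and allow_url_fopen must be enabled.

```php
<?php
// Build stream-context options for a form-encoded POST request.
function build_post_options(array $fields): array
{
    $body = http_build_query($fields);   // e.g. q=web+crawler&page=2
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                       . 'Content-Length: ' . strlen($body) . "\r\n",
            'content' => $body,
        ],
    ];
}

// Send the POST and return the response body, or null on failure.
function post_form(string $url, array $fields): ?string
{
    $context = stream_context_create(build_post_options($fields));
    $result  = @file_get_contents($url, false, $context);
    return $result === false ? null : $result;
}
```

The field names passed in have to match the names of the inputs in the target site's search form, which can be read from the form's HTML.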

mobilebadboy

04-15-2004, 03:28 PM

http://phpdig.net, unless you're just absolutely wanting to build something yourself (which I can assume you are). If it already exists, I'd rather utilize my time elsewhere. ;)