Python Crawler
Script to download files matching a specific pattern from an Apache directory listing
Sunday 16 April 2006 at 4:43 pm. Used tags: python, web_crawler, web_pattern_fetcher.py
This, IMO, is one of the dirtiest ways I have used so far to accomplish a requirement. But it does the job I want. :D
DISCLAIMER: I'm a learner. There must be better, smarter and easier ways to accomplish the same task.
rrs@learner:~/My_Documents/My Books $ cat /home/rrs/devel/eclipse/PythonFun/web_pattern_fetcher.py
#!/usr/bin/env python
"""
This tiny little script crawls Apache-generated directory listings
and downloads the entries that match a specific pattern.
I'm using it to download anything that Apache shows as TXT or IMG.
I'm sure others will be able to extend it further.
"""
import urllib, urllib2, string

url = "http://www.wuppy.net.ru/Fun/"
req = urllib2.Request(url)
handle = urllib2.urlopen(req)
x = 1
data = ''
while x:
    data = ''
    line = handle.readline()
    if "[TXT]" in line or "[IMG]" in line:
        word_list = line.split(' ')
        word = word_list[4:5]
        req_word = str(word)
        # Break and take out the relevant data uri
        begin_num = req_word.find(">")
        end_num = req_word.find("</A")
        req_word = list(req_word)
        while begin_num < end_num - 1:
            final_word = string.lstrip(string.rstrip(str(req_word[begin_num+1:begin_num+2]), "']"), "['")
            data += final_word
            begin_num += 1
            #data.append(req_word[begin_num+1:begin_num+2])
        real_url = url + data
        urllib.urlretrieve(real_url, data)
    if line == '':
        x = 0
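
The script above is Python 2 and picks the file name out of each line by splitting on spaces and slicing characters, which depends on the exact layout of the listing. For comparison, here is a minimal sketch of the same idea in Python 3, using the standard library's html.parser to pair each icon's alt text ([TXT], [IMG]) with the link that follows it. The names ListingParser, matching_links and fetch_matching are made up for this sketch, and it assumes the listing puts an <img alt="..."> icon before each entry's <a href="...">, which is how Apache listings usually look but is not guaranteed. Treat it as one possible approach, not a drop-in replacement.

```python
import os
from html.parser import HTMLParser
from urllib.parse import unquote, urljoin, urlsplit
from urllib.request import urlopen, urlretrieve


class ListingParser(HTMLParser):
    """Collect hrefs of entries whose icon alt text is in `wanted` (hypothetical helper)."""

    def __init__(self, wanted=("[TXT]", "[IMG]")):
        super().__init__()
        self.wanted = set(wanted)
        self.last_alt = None  # alt text of the most recent <img> tag
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> works too.
        attrs = dict(attrs)
        if tag == "img":
            self.last_alt = attrs.get("alt")
        elif tag == "a" and self.last_alt in self.wanted:
            href = attrs.get("href")
            if href:
                self.links.append(href)
            self.last_alt = None  # assumption: one icon pairs with one link


def matching_links(html, base_url, wanted=("[TXT]", "[IMG]")):
    """Return absolute URLs of the listing entries marked with a wanted icon."""
    parser = ListingParser(wanted)
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def fetch_matching(listing_url, dest_dir="."):
    """Download every matching entry of the listing into dest_dir."""
    with urlopen(listing_url) as handle:
        html = handle.read().decode("utf-8", errors="replace")
    for link in matching_links(html, listing_url):
        name = os.path.basename(unquote(urlsplit(link).path))
        if name:  # skip links that do not end in a file name
            urlretrieve(link, os.path.join(dest_dir, name))
```

Calling fetch_matching(url) with the same url as in the script above should fetch the same kind of files, as long as the server's listing follows the assumed layout. matching_links can be tried on a saved copy of the listing first, without touching the network.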