Nutch on WinXP

kevin 2006-07-05 10:01
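Here is an article:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows
It explains how to install Nutch on Windows. I tried it yesterday: none of the Nutch 0.7.* or later releases worked, each failing with an error in the IndexWriter class. Today I tried version 0.6 instead and had it running quickly.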

Here is my configuration, for reference:
1. Tomcat 4.1.30
2. JDK 1.4.2
3. Nutch 0.6
4. cygwin (with gcc installed)

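Installation:
1. Install the JDK (in C:\jdk) and set the environment variable NUTCH_JAVA_HOME = C:\jdk
2. Install Tomcat (in C:\Tomcat)
3. Install cygwin
4. Under your home directory in cygwin, create a directory to hold Nutch (mine is D:\cygwin\home\iwind\nutch) and extract the Nutch 0.6 archive into it.
5. Start Tomcat and, at http://localhost:8080/manager/html, upload the nutch-0.6.war file found in the Nutch root directory.
6. Stop Tomcat. In C:\tomcat\webapps, rename the ROOT directory to something else, such as ROOT1, and rename the directory created for Nutch (nutch-0.6 or similar) to ROOT. Visiting http://localhost:8080 then takes you straight to the Nutch search page.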
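Before running bin/nutch it can help to confirm that NUTCH_JAVA_HOME really points at a JDK. The following is only a minimal sketch for the cygwin shell: check_java_home is a hypothetical helper, not something shipped with Nutch, and it assumes the JDK keeps its launcher under bin/.

```shell
# Sketch: verify that NUTCH_JAVA_HOME (or the path given as $1) contains a JVM.
# check_java_home is a hypothetical helper, not part of Nutch.
check_java_home() {
  home="${1:-${NUTCH_JAVA_HOME:-}}"
  if [ -z "$home" ]; then
    echo "NUTCH_JAVA_HOME is not set"
    return 1
  fi
  # Accept either a Unix-style launcher or a Windows java.exe.
  if [ -x "$home/bin/java" ] || [ -x "$home/bin/java.exe" ]; then
    echo "ok: $home"
    return 0
  fi
  echo "no java found under $home/bin"
  return 1
}
```

Calling check_java_home with no argument checks the current value of NUTCH_JAVA_HOME; passing a path checks that path instead.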

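With the steps above done, you can start crawling pages.

In the Nutch root directory, create a file named urls (other names should work too) and enter the start page for the crawl, for example http://www.ideagrace.com/

In crawl-urlfilter.txt under nutch/conf you can configure which URL patterns get crawled. For example, under "# accept anything else", comment out "+." and replace it with your own rule: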
+^http://([a-z0-9]*\.)*ideagrace.com/

Save the file and you are done. Note that the entries in the urls file must match this rule, otherwise the crawl cannot start.
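Since a seed URL that does not match the filter stops the crawl from starting, it may be worth checking the seeds against the rules first. The sketch below approximates the filter with grep -E. filter_url is a hypothetical helper; the real filter uses Java regular expressions, so only simple patterns like the one above carry over, and treating "no rule matched" as a rejection is an assumption here.

```shell
# Sketch: approximate crawl-urlfilter.txt matching with grep -E.
# filter_url is a hypothetical helper, not part of Nutch.
# Usage: filter_url RULES_FILE URL   -> prints "accept" or "reject"
filter_url() {
  rules="$1"
  url="$2"
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      '+'*) sign=accept ;;
      '-'*) sign=reject ;;
      *) continue ;;             # skip comments and blank lines
    esac
    pattern="${line#?}"          # drop the leading + or -
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      echo "$sign"               # the first matching rule decides
      return 0
    fi
  done < "$rules"
  echo reject                    # assumption: no matching rule means reject
}
```

For example, `filter_url conf/crawl-urlfilter.txt http://www.ideagrace.com/` should print accept once the rule above is in place.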

Open cygwin, cd to the nutch directory, and run
bin/nutch crawl urls -dir crawl.demo -depth 2 -threads 4
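The crawl starts right away. Here depth is the crawl depth, threads is the number of threads, urls is the file we just created, and dir is the directory where the crawled content is stored.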

After a while, once all the pages have been fetched, the crawl in cygwin stops. Then, in
C:\tomcat\webapps\ROOT\WEB-INF\classes\nutch-site.xml
replace
<nutch-conf>
</nutch-conf>
with
<nutch-conf>
<property>
  <name>searcher.dir</name>
  <value>D:\cygwin\home\Administrator\nutch\crawl.demo</value>
</property>
</nutch-conf>

Start Tomcat again and open http://localhost:8080. Our own search engine can now finally run searches.

kevin 2006-07-05 10:12
GettingNutchRunningWithWindows

Since Nutch is written in Java, it should be possible to get Nutch working in a Windows environment, provided that the correct software is installed.

The following documents how I got it working on Windows XP Pro running Tomcat 5.28.

Java

You will need to have Java 1.4.2 or Java 1.5 installed.

Cygwin

You'll need cygwin to run the shell commands since there are no separate scripts for NT cmd (the NT cmd shell does not nest environments recursively). MKS ksh does not work correctly with the scripts.

Tomcat

You'll need Tomcat 4.* or higher running on your machine.

Setup

Download the release and extract it anywhere on your hard disk, e.g. C:\nutch-0.7.1

Create an empty text file in your nutch directory (e.g. "urls") and add the urls of the sites you want to crawl, as shown in the tutorial.

Add your urls to the crawl-urlfilter.txt (e.g. C:\nutch-0.7.1\conf\crawl-urlfilter.txt). An entry could look like this: +^http://([a-z0-9]*\.)*apache.org/

Load up cygwin and navigate to your nutch directory. When cygwin launches you'll usually find yourself in your user folder (e.g. C:\Documents and Settings\username).

If your workstation needs to go through a Windows authentication proxy to get to the internet, you can use an application such as the NTLM Authorization Proxy Server (http://www.geocities.com/rozmanov/ntlm/) to get through it. You'll then need to edit the nutch-site.xml file to point to the port opened by the app.

Intranet Crawling

Follow the tutorial instructions to begin the crawl by entering commands in cygwin. Depending on the commands you enter Nutch should create a crawl directory and a log file.

For example, if you enter the following command:

bin/nutch crawl urls -dir crawled -depth 3 >& crawl.log

then a folder called crawled is created in your nutch directory, along with the crawl.log file. Use this log file to debug any errors you might have. From my experience you'll need to delete the crawled directory before starting the crawl off again.
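The "delete, then crawl again" routine described above could be wrapped in a small shell function. This is a sketch only: recrawl and the NUTCH_BIN variable are hypothetical names introduced here, and the crawl options are just the ones from the command above.

```shell
# Sketch: remove the previous crawl directory, then start a fresh crawl.
# recrawl and NUTCH_BIN are hypothetical names, not part of Nutch.
# Usage: recrawl URLS_FILE CRAWL_DIR DEPTH
recrawl() {
  if [ $# -ne 3 ] || [ -z "$2" ]; then
    echo "usage: recrawl URLS_FILE CRAWL_DIR DEPTH" >&2
    return 2
  fi
  urls="$1"     # seed URL file, e.g. urls
  dir="$2"      # crawl output directory, e.g. crawled
  depth="$3"    # link depth to follow
  nutch="${NUTCH_BIN:-bin/nutch}"
  # Delete the old crawl output so the new crawl starts clean.
  rm -rf -- "$dir"
  # Same command as above; stdout and stderr both go to crawl.log.
  "$nutch" crawl "$urls" -dir "$dir" -depth "$depth" > crawl.log 2>&1
}
```

Run from the nutch directory, `recrawl urls crawled 3` would repeat the example crawl. Because the function deletes the old directory without asking, it is only suitable when the previous crawl data is no longer needed.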

Serving

In your Environment Variables settings, add a new environment variable named NUTCH_JAVA_HOME whose value is the location of your JVM (e.g. C:\j2sdk1.4.2_09).

Open up a web browser and navigate to the Tomcat webapps manager (e.g. http://localhost:8080/manager/html) and upload the WAR file to the context.

If a root context already exists, undeploy it.

You now need to create a context fragment file so that the root url points to your nutch webapp. In [tomcat_home]/conf/Catalina/localhost/, create a new xml file (name it the same as the webapp?), e.g. nutch-0.7.1.xml, and add something like the following line to it:

<Context path="" debug="5" privileged="true" docBase="nutch-0.7.1"/>

Next, navigate to your nutch webapp folder then WEB-INF/classes. Edit the nutch-site.xml file and add the following to it (make sure you don't have two <nutch-conf></nutch-conf> tags!):

<nutch-conf>
<property>
  <name>searcher.dir</name>
  <value>your_crawled_folder_here</value>
</property>
</nutch-conf>

For example, if your nutch directory resides at C:\nutch-0.7.1 and you specified crawled as the directory after the -dir option, then enter C:\nutch-0.7.1\crawled\ instead of your_crawled_folder_here.
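If you prefer to generate the file from cygwin rather than edit it by hand, something like the following might do. It is a sketch under assumptions: write_site_conf is a hypothetical helper, it writes only the searcher.dir property shown above, and it does not escape XML special characters in the path.

```shell
# Sketch: write a nutch-site.xml that points searcher.dir at a crawl directory.
# write_site_conf is a hypothetical helper, not part of Nutch. It overwrites
# the target file, so other properties in an existing file would be lost.
# Usage: write_site_conf CRAWL_DIR OUTPUT_FILE
write_site_conf() {
  crawl_dir="$1"   # e.g. 'C:\nutch-0.7.1\crawled\'
  out="$2"         # e.g. the nutch-site.xml under WEB-INF/classes
  printf '%s\n' \
    '<nutch-conf>' \
    '<property>' \
    '  <name>searcher.dir</name>' \
    "  <value>$crawl_dir</value>" \
    '</property>' \
    '</nutch-conf>' > "$out"
}
```

Quote the Windows path with single quotes when calling it, so that the shell leaves the backslashes alone.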

Restart Tomcat using the Windows Services tool, open up a browser and enter the url http://localhost:8080. The nutch search page should appear. As long as you've defined the correct location of your nutch index directory as shown above, clicking search should yield results.

