Nutch Search Engine Tutorial - IdeaGrace - http://www.ideagrace.com/
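Nutch is a free search engine that also ships with a front-end search page, and it really is quite good: after a few hours of tinkering I had a rough blog search site running. While my Nutch crawl is running, I'll translate the Nutch tutorial. How is Nutch pronounced, by the way? Nut-tch?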
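Nutch Tutorial
Requirements:
1. Java 1.4.x. Linux is recommended (Windows works too, since Nutch is written in Java). Set the environment variable NUTCH_JAVA_HOME to the root directory of your JVM installation.
3. On Win32 you need to install Cygwin, because Nutch's scripts are all shell scripts.
4. Free up a gigabyte or more of disk space, and have a fast network connection and time to spare: the crawler is slow and fetches a lot of data, which is why I am typing this up while I wait.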
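A minimal sketch of sanity-checking NUTCH_JAVA_HOME before running Nutch. The helper check_java_home is hypothetical (it is not part of Nutch), and the path in the usage comment is a placeholder for your own JVM root.

```shell
# check_java_home DIR: succeed only if DIR looks like a JVM installation root,
# i.e. it contains an executable bin/java.
# Hypothetical helper for illustration; not part of Nutch.
check_java_home() {
    if [ -x "$1/bin/java" ]; then
        echo "ok: $1"
    else
        echo "no bin/java under '$1'" >&2
        return 1
    fi
}

# Usage (placeholder path; substitute your own JVM root):
#   export NUTCH_JAVA_HOME=/usr/lib/jvm/java
#   check_java_home "$NUTCH_JAVA_HOME" && bin/nutch
```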
Getting started:
Try the following command:
bin/nutch
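If it prints its usage message, everything is in order and Nutch is ready to use.
There are two ways to crawl:
1. Intranet crawling, using the crawl command.
2. Whole-web crawling, which needs finer-grained control, using the inject, generate, fetch, and updatedb commands.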
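Intranet crawling:
Intranet: Configuration
Intranet: Running the Crawl
Whole-web crawling
Whole-web crawling is meant for very large crawls that run across multiple machines and may take weeks to complete.
Whole-web: Concepts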
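Nutch data is of two types:
1. The web database, which contains every known page and the links between pages.
2. A set of segments. Each segment is a collection of pages that are fetched and indexed together. A segment consists of:
fetchlist: the set of pages to be fetched
fetcher output: the set of files containing the fetched pages
index: a Lucene-format index of the fetcher output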
Whole-web: Bootstrapping the Web Database
Create a new, empty database:
bin/nutch admin db -create
The injector adds URLs to the database. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+ MB file, so this will take a few minutes.)
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
Next we inject a random subset of these pages into the web database. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million URLs. We inject one out of every 3000, so that we end up with around 1000 URLs:
bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000
This also takes a few minutes, as it must parse the full file.
Now we have a web database with around 1000 as-yet unfetched URLs in it.
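The arithmetic behind -subset 3000 can be checked in the shell. This is only a sketch: expected_urls is a hypothetical helper, and the three-million figure is the tutorial's own rough estimate, so the real count will vary.

```shell
# expected_urls TOTAL FACTOR: roughly how many URLs remain when one out of
# every FACTOR is kept (integer division, so the result is approximate).
# Hypothetical helper for illustration; not part of Nutch.
expected_urls() {
    echo $(( $1 / $2 ))
}

expected_urls 3000000 3000   # prints 1000
```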
Whole-web: Fetching
First, generate a fetchlist from the database:
bin/nutch generate db segments
This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable s1:
s1=`ls -d segments/2* | tail -1`
echo $s1
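Since segment directories are named by creation time, a lexical sort of the names is also a chronological sort, which is why `ls -d segments/2* | tail -1` picks the newest segment. The sketch below shows this in a scratch directory; the timestamp-style names are made up for illustration and the exact name format Nutch uses is an assumption.

```shell
# Build a scratch segments directory with made-up, timestamp-style names.
demo=$(mktemp -d)
mkdir -p "$demo/segments/20050101093000" \
         "$demo/segments/20050102101500" \
         "$demo/segments/20050103120000"

# ls sorts names lexically; with fixed-width timestamps that is also
# chronological order, so the last entry is the most recent segment.
newest=$(ls -d "$demo"/segments/2* | tail -1)
basename "$newest"   # prints 20050103120000
```

Note that the 2* glob only matches names beginning with the digit 2.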
Now we run the fetcher on this segment with:
bin/nutch fetch $s1
When this is complete, we update the database with the results of the fetch:
bin/nutch updatedb db $s1
Now the database has entries for all of the pages referenced by the initial set.
Now we fetch a new segment with the top-scoring 1000 pages:
bin/nutch generate db segments -topN 1000
s2=`ls -d segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb db $s2
Let's fetch one more round:
bin/nutch generate db segments -topN 1000
s3=`ls -d segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb db $s3
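The rounds above repeat the same generate, fetch, updatedb sequence, so they can be wrapped in a small function. This is a sketch, not part of Nutch: crawl_round, latest_segment, and the NUTCH variable are hypothetical names, and only the bin/nutch subcommands and the -topN flag come from the tutorial itself.

```shell
# NUTCH names the launcher script (assumed variable; defaults to bin/nutch).
NUTCH=${NUTCH:-bin/nutch}

# latest_segment SEGMENTS: print the newest segment directory under SEGMENTS.
# Relies on segment names sorting chronologically.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | tail -1
}

# crawl_round DB SEGMENTS TOPN: run one generate/fetch/updatedb cycle.
# Stops at the first step that fails.
crawl_round() {
    "$NUTCH" generate "$1" "$2" -topN "$3" || return 1
    seg=$(latest_segment "$2")
    [ -n "$seg" ] || { echo "no segment found in $2" >&2; return 1; }
    echo "fetching $seg"
    "$NUTCH" fetch "$seg" || return 1
    "$NUTCH" updatedb "$1" "$seg"
}

# Usage, equivalent to the second and third rounds above:
#   for i in 1 2; do crawl_round db segments 1000 || break; done
```

Stopping on the first failure avoids updating the database from a segment whose fetch did not complete.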
By this point we've fetched a few thousand pages. Let's index them!
Whole-web: Indexing
To index each segment we use the index command, as follows:
bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3
Then, before we can search a set of segments, we need to delete duplicate pages. This is done with:
bin/nutch dedup segments dedup.tmp
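With more than a handful of segments, the index and dedup steps can be scripted instead of typed per segment. The sketch below assumes none of the segments has been indexed yet; index_and_dedup and the NUTCH variable are hypothetical names, while the index and dedup subcommands are the ones shown above.

```shell
# NUTCH names the launcher script (assumed variable; defaults to bin/nutch).
NUTCH=${NUTCH:-bin/nutch}

# index_and_dedup SEGMENTS TMPDIR: index every segment under SEGMENTS,
# then delete duplicate pages across all of them.
index_and_dedup() {
    for seg in "$1"/2*; do
        [ -d "$seg" ] || continue      # skip if the glob matched nothing
        "$NUTCH" index "$seg" || return 1
    done
    "$NUTCH" dedup "$1" "$2"
}

# Usage, equivalent to the commands above:
#   index_and_dedup segments dedup.tmp
```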
Now we're ready to search!
Searching
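First, put the Nutch war file into your servlet container. (If you did not download a release build, you must compile the war file first.)
Assuming Tomcat is installed in ~/local/tomcat, the Nutch war file can be installed with: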
rm -rf ~/local/tomcat/webapps/ROOT*
cp nutch*.war ~/local/tomcat/webapps/ROOT.war
The webapp looks for its indexes in ./segments. If you did an intranet crawl, edit the nutch-site.xml file inside the ROOT webapp as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<nutch-conf>
<property>
<name>searcher.dir</name>
<value>/usr/local/nutch/lystudio.test</value>
<description>My path to nutch's searcher dir.</description>
</property>
</nutch-conf>
If you did a whole-web crawl, there is no need to change the directory. Start Tomcat with:
~/local/tomcat/bin/catalina.sh start
You can now visit http://localhost:8080/