Two cores, one Nutch


Solr can do a lot inside TYPO3 CMS, but once in a while you need to index external sites too. Apache Nutch can come to the rescue, but not without some help. That help comes in the shape of plugins and patches created by Ingo Renner and Phuong Doan from dkd Internet Service GmbH. These add the fields and values needed to make the data in the index compatible with the TYPO3 extension “solr”.

It works great if you only have one external indexing job for a single core. If you have multiple cores (for different languages or for different TYPO3 sites), things need some tweaking. This small tutorial describes a way to keep multiple configurations in a single Nutch installation.

Installing Nutch

Just follow the instructions in the GitHub README, or download the ready-made tar.gz.

If your Solr server was installed by the scripts that come with the solr extension, unpack the files in the archive into /opt/solr-tomcat/apache-nutch-for-typo3.

Preparing multiple configurations

Create a new directory “configurations” and, for each core you want to use, create a subdirectory in it (naming it after the core makes it easier to tell them apart). Also create a matching subdirectory for each core in the “crawls” and “urls” directories.

Copy the entire contents of the “conf” directory to each of the configurations subdirectories. Now it's possible to modify each nutch-site.xml and regex-urlfilter.txt file separately. The same is true for the list of URLs to start crawling in each “urls” subdirectory.
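The directory setup described above can be sketched as a small shell function. This is only an illustration: the function name prepare_core is made up for this example, and the core names in the usage comment are placeholders for your own cores.

```shell
#!/bin/sh
# Sketch: create the per-core directories and give the core its own
# copy of the stock "conf" directory.
# prepare_core is a hypothetical helper name, not part of Nutch.
prepare_core() {
    nutch_home=$1   # the Nutch installation directory
    core=$2         # name of the Solr core, used as the subdirectory name

    # One subdirectory per core in configurations, crawls and urls.
    mkdir -p "$nutch_home/configurations/$core" \
             "$nutch_home/crawls/$core" \
             "$nutch_home/urls/$core" || return 1

    # Copy the entire contents of conf, so nutch-site.xml and
    # regex-urlfilter.txt can then be edited per core.
    cp -R "$nutch_home/conf/." "$nutch_home/configurations/$core/"
}

# Example usage (core_one and core_two are placeholder names):
# for core in core_one core_two; do
#     prepare_core /opt/solr-tomcat/apache-nutch-for-typo3 "$core"
# done
```

After running it, edit the copied files in each configurations subdirectory and put the list of start URLs for each core into its “urls” subdirectory.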

Running Nutch

The instructions on GitHub are almost correct, except for a few details. If you create a shell script for each job, it's easy to run them as cron jobs. Make sure that no two jobs run at the same time, as they use environment variables to configure Nutch.

cd /opt/solr-tomcat/apache-nutch-for-typo3/
export JAVA_HOME=/usr/lib/jvm/jre
export NUTCH_CONF_DIR=/opt/solr-tomcat/apache-nutch-for-typo3/configurations/core_one
bin/crawl urls/core_one crawls/core_one 127.0.0.1/solr-4.8.1/core_one 3 >/dev/null 2>&1

Setting the NUTCH_CONF_DIR environment variable makes it possible to use a specific configuration for a crawl job. The “crawls” subdirectories separate the temporary data of each configuration.

The JAVA_HOME environment variable must be adapted to the location of the Java Runtime Environment.
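One way to make sure jobs never overlap is to run all cores one after another from a single script, so that a single cron entry is enough. The following is a minimal sketch under that assumption, not the method from the GitHub instructions. The function names crawl_core and crawl_all and the variables NUTCH_HOME, SOLR_BASE and CRAWL_BIN are made up for this example; the paths and the Solr address are the ones used in the commands above.

```shell
#!/bin/sh
# Assumed defaults taken from the example above; adapt to your system.
NUTCH_HOME="${NUTCH_HOME:-/opt/solr-tomcat/apache-nutch-for-typo3}"
SOLR_BASE="${SOLR_BASE:-127.0.0.1/solr-4.8.1}"
CRAWL_BIN="${CRAWL_BIN:-bin/crawl}"
JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/jre}"
export JAVA_HOME

# Crawl a single core with its own configuration directory.
crawl_core() {
    core=$1
    rounds=$2
    (
        # The subshell keeps the directory change and NUTCH_CONF_DIR
        # limited to this one crawl.
        cd "$NUTCH_HOME" || exit 1
        NUTCH_CONF_DIR="$NUTCH_HOME/configurations/$core"
        export NUTCH_CONF_DIR
        "$CRAWL_BIN" "urls/$core" "crawls/$core" "$SOLR_BASE/$core" "$rounds"
    )
}

# Crawl several cores strictly one after another.
crawl_all() {
    rounds=$1
    shift
    for core in "$@"; do
        crawl_core "$core" "$rounds" || echo "crawl failed for $core" >&2
    done
}

# Example usage (placeholder core names, 3 rounds each):
# crawl_all 3 core_one core_two >/dev/null 2>&1
```

Because each crawl only starts when the previous one has finished, the jobs cannot run at the same time as long as the script itself is started only once.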
