Saturday, May 31, 2014

Solr and Tomcat on Windows

It has been a long while since I touched Lucene. Lately, I have had a chance to look into the latest offering from the Apache Lucene project - Solr. It is pretty amazing. Instead of using the Lucene libraries directly, we get a nice web interface to leverage, and with the RESTful API it takes virtually no time to get a document indexed and integrate search capability into an application.

The tutorial from the Solr site is very straightforward; it took me less than 10 minutes to go through it and get started. To use Solr in a production environment, it needs to be installed in a Tomcat server. I will write a few posts to capture what I learned. First things first, here are the complete steps to set up Solr on Tomcat:

Download the needed packages: I am using Solr 4.8.1, Tomcat 7.0.53, and Java JDK 1.7.0_25 on Windows 7

  1. Unzip the solr package into C:\Software\solr-4.8.1
  2. Deploy the Solr application to Tomcat
    Copy C:\Software\solr-4.8.1\dist\solr-4.8.1.war
    to C:\Software\apache-tomcat-7.0.53\webapps.
    Rename solr-4.8.1.war to solr.war
  3. Add additional libraries to satisfy logging needs
    Copy all jar files from C:\Software\solr-4.8.1\dist\solrj-lib
    to C:\Software\apache-tomcat-7.0.53\lib.
    Failing to do so will yield logging-related "ClassNotFoundException" errors
  4. Setup solr home
    Copy C:\Software\solr-4.8.1\example\solr directory
    to a place you want to use as solr home, e.g. C:\Software\UserData\solrCollections
    note: 1. this directory contains two folders (bin and collection1) and a few other files
          2. this is also the directory that we set up for multi-core (see below)
  5. Make Tomcat know solr home
    Modify the catalina.bat file found in C:\Software\apache-tomcat-7.0.53\bin to add the following line referring to the solr home:
    set CATALINA_OPTS=-Dsolr.solr.home=C:/Software/UserData/solrCollections
  6. Set up logging
    a). Copy log4j.properties from C:\Software\solr-4.8.1\example\resources
    to a directory on the classpath. I am using: C:\Software\apache-tomcat-7.0.53\webapps\solr\WEB-INF\classes
    b). Set up the log folder by setting solr.log=../logs/ in the log4j.properties file. The default value (solr.log=logs/) would create the logs directory inside the tomcat\bin folder. I don't want that; with this setting, solr.log can be found at: C:\Software\apache-tomcat-7.0.53\logs
    note: add set CATALINA_OPTS=%CATALINA_OPTS% -Dlog4j.debug to catalina.bat to verify that the log4j.properties file is found on the classpath.
  7. Start Tomcat by running catalina.bat in C:\Software\apache-tomcat-7.0.53\bin
  8. Launch Solr at http://localhost:8080/solr; the Solr admin console should appear.
  9. Since I already did the tutorial, the indexed data already exists in /collection1/data, so I can continue to use it to verify my setup.
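One quick way to verify the setup is to run a match-all query against the existing collection1 index. A minimal sketch (plain Java; the host, port, and collection name match the setup above, and the helper method is my own, not part of Solr) of building such a query URL:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SolrQueryUrl {
    // Build a select URL such as http://localhost:8080/solr/collection1/select?q=...&wt=json
    static String buildSelectUrl(String base, String core, String query, String wt) {
        try {
            return base + "/" + core + "/select?q=" + URLEncoder.encode(query, "UTF-8") + "&wt=" + wt;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // *:* matches every document in the index; wt=json asks for a JSON response
        String url = buildSelectUrl("http://localhost:8080/solr", "collection1", "*:*", "json");
        System.out.println(url);  // → http://localhost:8080/solr/collection1/select?q=*%3A*&wt=json
    }
}
```

Opening that URL in a browser should return the tutorial documents, confirming the WAR deployed and the solr home was picked up.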

    This completes the simple setup of Solr with Tomcat. The following additional steps set up multi-core. From the Solr wiki: "Multiple cores let you have a single Solr instance with separate configurations and indexes, with their own config and schema for very different applications, but still have the convenience of unified administration. Individual indexes are still fairly isolated, but you can manage them as a single application, create new indexes on the fly by spinning up new SolrCores, and even make one SolrCore replace another SolrCore without ever restarting your Servlet Container." Depending on how we want to use multi-core, there are many options; while the steps below can also be done via the Solr command line, here I am using the admin console just to capture the basics.

  10. Create a new core - core2:
    Create C:\Software\UserData\solrCollections\core2 by replicating collection1.
    Empty the core2\data directory.
    Delete core.properties.
    note: Failing to do so will yield an error about being unable to find the solrconfig.xml file.
  11. In the Solr admin console, create a new core with the name core2.
     
  12. Refer to solrconfig.xml to make necessary changes to reflect your specific needs.
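Behind the scenes, the admin console's "Add Core" button calls Solr's CoreAdmin API. A small sketch (plain Java; the host, port, and core name match this setup, and the helper method is mine) of building the equivalent request URL:

```java
public class CoreAdminUrl {
    // Build a CoreAdmin CREATE request like the one the admin console issues
    static String buildCreateCoreUrl(String base, String coreName, String instanceDir) {
        return base + "/admin/cores?action=CREATE&name=" + coreName + "&instanceDir=" + instanceDir;
    }

    public static void main(String[] args) {
        // instanceDir is relative to the solr home set up earlier
        String url = buildCreateCoreUrl("http://localhost:8080/solr", "core2", "core2");
        System.out.println(url);
        // → http://localhost:8080/solr/admin/cores?action=CREATE&name=core2&instanceDir=core2
    }
}
```

Issuing this request (or clicking the button) is what writes the core.properties file back into the core2 directory, which is why it must not exist beforehand.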

DataImportHandler

I had a need to retrieve data from a database and then index it. For that, DataImportHandler is the way to go, and I am noting the process down here for future reference.

First and foremost, this involves three configuration files.

  • solrconfig.xml
  • data-config.xml
  • schema.xml

Step 1: Register data-config.xml in solrconfig.xml.

<requestHandler name="/dataimport"  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">data-config.xml</str>
    </lst>
</requestHandler>

Step 2: Add queries in data-config.xml

<dataConfig>
    <dataSource name="myTestDB" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://127.0.0.1/mydb" user="root"/>
    <document>
        <entity name="resource"
                dataSource="myTestDB"
                transformer="com.mysolr.plaground.MyBlobTransfomer"
                query="select id, name, base_path, target_url, content, author from solr_test">
            <field name="id" column="id" />
            <field name="basepath" column="base_path" />
            <field name="proxyname" column="name" />
            <field name="blobcontent" column="content" blob="true" srcColumn="content" />
            <field name="certified" column="certified" blob="true" element="lifecycle" srcColumn="content" />
            <field name="contacts" column="entry" blob="true" element="contacts" srcColumn="content" />
            <field name="platform" column="platform" blob="true" element="techstack" srcColumn="content" />
            <field name="contributinggroup" column="contributinggroup" element="general" blob="true" srcColumn="content" />
            <field name="creator" column="author" />
        </entity>
    </document>
</dataConfig>

Step 3: Add indexing fields to schema.xml. 

Take all the fields defined above (the value of the "name" attribute). Make sure the field type (e.g. text_general) is already defined in schema.xml. If not, replace text_general with a pre-defined type, e.g. text or some other name.
<field name="basepath" type="text_general" indexed="true" stored="true" />
<field name="proxyname" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="creator" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="certified" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="contacts" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="platform" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="contributinggroup" type="text_general" indexed="true" stored="true" multiValued="true" />

Step 4: Run http://localhost:8080/solr/dataimport?command=full-import to build the index
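The full-import call accepts a few other useful parameters, such as clean (wipe the index first) and commit (commit when done); command=delta-import runs an incremental update instead. A small sketch (plain Java; the helper method is mine, the handler URL matches Step 4) of assembling the request URL:

```java
public class DataImportUrl {
    // Assemble a DataImportHandler request URL from the handler base URL and a command
    static String buildImportUrl(String handlerBase, String command, boolean clean, boolean commit) {
        return handlerBase + "?command=" + command + "&clean=" + clean + "&commit=" + commit;
    }

    public static void main(String[] args) {
        String url = buildImportUrl("http://localhost:8080/solr/dataimport", "full-import", true, true);
        System.out.println(url);
        // → http://localhost:8080/solr/dataimport?command=full-import&clean=true&commit=true
    }
}
```

Hitting the same handler with command=status shows how many rows have been fetched and indexed while an import is running.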

In Action

The first DataImport I needed to do was to import and index a blob field from an Oracle database using a BlobTransformer. With the reference from Lucidworks, I was able to cast the returned object to the Oracle BLOB class (oracle.sql.BLOB).
import java.util.List;
import java.util.Map;

import oracle.sql.BLOB;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;
...
public class BlobTransformer extends Transformer {
    private static final Log LOGGER = LogFactory.getLog(BlobTransformer.class);

    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        List<Map<String, String>> fields = context.getAllEntityFields();
        JSONObject xmlJSONObject;

        for (Map<String, String> field : fields) {
            // check if this field has blob="true" specified in the
            // data-config.xml
            String blob = field.get("blob");

            if ("true".equals(blob)) {
                // srcColumn in data-config.xml names the column holding the blob
                Object value = row.get(field.get("srcColumn"));
                String propertyXml = "<empty />";

                if (value instanceof BLOB) {
                    BLOB blobValue = (BLOB) value;
                    try {
                        byte[] bdata = blobValue.getBytes(1, (int) blobValue.length());
                        propertyXml = new String(bdata);
                        ... ...
                    } catch (Exception e) {
                        ...
                    }
                }
            }
        }
        return row;
    }
}
Things worked out pretty well until recently, when I encountered a similar use case, but the backend database was MySQL. According to the MySQL documentation, the Blob datatype can be cast to java.sql.Blob. That might be true if I made the query call from within my own Java code, but it didn't fit my use case. I have a MySQL table solr_test, which has a blob field content with some XML content stored in it. In MySQL Workbench, this field shows BLOB as its content. At the command line, I could view its content in its original native XML format.

After making the necessary changes in my data-config file and schema.xml, as well as a new version of BlobTransformer, I hoped the transformer would work like the previous version. Not the case! The first error I got was that the value is not an instance of BLOB, so none of the logic in that if block got executed. After looking it up, the value has the type [B (a byte[]). To move forward, I changed the condition to if (value != null && value.getClass().getName().equals("[B")) {...
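The [B name is simply how the JVM reports the runtime class of a byte array, which is why the instanceof BLOB check never matched with MySQL. A tiny sketch (plain Java, sample value made up) illustrating this:

```java
public class ByteArrayClassName {
    public static void main(String[] args) {
        // A value read from a MySQL blob column arrives as a plain byte[]
        Object value = "<doc/>".getBytes();

        // The JVM's internal name for byte[] is "[B"
        System.out.println(value.getClass().getName());  // → [B

        // instanceof against byte[] is a cleaner way to test for it
        System.out.println(value instanceof byte[]);     // → true
    }
}
```

Comparing against the class-name string works, but `value instanceof byte[]` expresses the same check without magic strings.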

I encountered the second issue after the above change: [B cannot be cast to java.sql.Blob. Although the value object is of byte array type, given its declared object type I couldn't simply make a String out of it. I had to turn the byte array object into a real byte[]. I used the following serialization process (there may be a simpler way using the commons-lang package to serialize it, but I wanted to do it the hard way!).
try {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(out);
    oos.writeObject(value);
    ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(out.toByteArray())); // line needed
    byte[] bdata = (byte[]) ois.readObject();  // line needed
    propertyXml = new String(bdata);
    ... ...
} catch (IOException | ClassNotFoundException e) {
    ...
}
N.B. If I replaced the two "line needed" lines with byte[] bdata = out.toByteArray(), as quite a few posts on the internet suggest, I would end up with the original XML content with some unreadable characters or symbols added to the beginning - those are the Java serialization stream header and type markers, not part of the payload.
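Since the value is already a byte[] at runtime, a direct cast avoids the serialization round trip entirely. A simpler alternative sketch (not the approach I took above; the sample XML content here is made up):

```java
public class DirectCast {
    public static void main(String[] args) {
        // Simulate a MySQL blob column value: the JDBC driver hands back a byte[]
        Object value = "<entry><name>test</name></entry>".getBytes();

        String propertyXml = "<empty />";
        if (value instanceof byte[]) {
            // No ObjectOutputStream round trip needed: cast and decode directly
            propertyXml = new String((byte[]) value);
        }
        System.out.println(propertyXml);  // → <entry><name>test</name></entry>
    }
}
```

The cast succeeds because the runtime type really is byte[]; only a cast to java.sql.Blob fails.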
