Importing a MySQL database into Apache Solr Part I

This is a tutorial on getting started with Apache Solr enterprise search using PHP and MySQL on Unix/Linux or Mac OS X. Most of the information I have found is either incomplete or full of errors, so I wanted to share how I have successfully used Solr in a number of different applications. I’m going to break this down into three separate posts/sections.

  • Importing a MySQL database into Apache Solr
  • Using PHP with Apache Solr
  • Using the Suggester feature in the new Apache Solr 3.1

Apache Solr is an enterprise search application that was developed at CNET and is now part of the Apache Lucene project. It runs in a standalone Java servlet container such as Jetty or Tomcat. Since it runs as a web service, you can access it from any application, whether written in PHP, Rails, or anything else. Results can be returned in XML, JSON, or, more recently, CSV format.

There are several advantages to using Solr for the search function in your application. The biggest is speed: on a table of roughly 1 GB, Solr will return a result in under 0.2 seconds, versus MySQL full-text search, which can take 10 seconds or even longer depending on the length and complexity of the query. End users expect fast search results, and Solr can deliver them while allowing a level of control over search that you could not achieve using Google, e.g. searching for posts by user name or products by manufacturer.

Another big advantage of Solr is that it recognizes plurals and slight misspellings; for instance, a query for “Arc’teryx” will return results containing both the misspelling “Acrteryx” and “Arc’teryx”. A description of other search features, such as facets, can be found on the Solr homepage.

Where would you use Solr? Any site with a lot of content will benefit from enterprise search: large, popular blogs; e-commerce sites with a large product database; and forums/message boards.

Server requirements depend on the size of the database but are fairly modest while serving queries. I have been able to run Solr on a VPS with 300 MB of RAM and still get lightning-quick results. The database import, however, requires much more RAM: for a 1.5 GB database you need at least 2 GB, and preferably 4 GB. I don’t believe it’s necessary to run Solr on a separate server or VPS; just make sure you set up your import cron jobs to run in the evening when traffic is low.

Let’s get started!

First, all the examples in this tutorial will be run locally on OS X under the domain solr.local. If you don’t already develop locally on your machine, I encourage you to do so. On Mac OS X, both MAMP and VirtualHostX are great tools for setting up a local dev environment.

  1. Start by downloading Solr. The current version is 3.1. Place the directory in an appropriate location, e.g. ~/Sites/Solr31.
  2. In the terminal, change into the example directory.
  3. Start the Jetty server by executing “java -jar start.jar” in the terminal.
  4. Get familiar with the terminal output for Jetty as this will help you track down errors later in your configuration.
  5. In your browser go to http://solr.local:8983/solr/admin
  6. Go through the Solr tutorial to get yourself familiar with the feature set.
  7. Stop Jetty with Control-C.
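The steps above boil down to a short terminal session. The install path here is an assumption based on step 1; adjust it to wherever you placed the download:

```shell
# Change into the example directory of the unpacked Solr download
# (path is an assumption from step 1)
cd ~/Sites/Solr31/example

# Start the bundled Jetty servlet container (stop it later with Control-C)
java -jar start.jar

# Then visit http://solr.local:8983/solr/admin in your browser
```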

Great. Now we are ready to configure Solr to import your MySQL database. In this example I will configure Solr to import a generic message board table.

  1. Download the MySQL JDBC driver from http://www.mysql.com/downloads/connector/j/
  2. Copy the driver jar from the downloaded archive into example/lib.
  3. In example/solr/conf, create a file called data-config.xml. Change the SQL queries to reflect the database you are indexing.
    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/ccubbupgrade"
                  user="root"
                  password=""/>
      <document name="content">
        <entity name="id"
                query="SELECT p.post_id,
                              p.post_posted_time,
                              FROM_UNIXTIME(p.post_posted_time) AS post_posted_datetime,
                              p.user_id,
                              t.forum_id,
                              p.post_subject,
                              p.post_default_body
                       FROM POSTS AS p
                       JOIN TOPICS AS t ON p.topic_id = t.topic_id
                       WHERE p.POST_IS_APPROVED = '1'
                         AND t.TOPIC_STATUS != 'M'"
                deltaImportQuery="SELECT p.post_id,
                              p.post_posted_time,
                              FROM_UNIXTIME(p.post_posted_time) AS post_posted_datetime,
                              p.user_id,
                              t.forum_id,
                              p.post_subject,
                              p.post_default_body
                       FROM POSTS AS p
                       JOIN TOPICS AS t ON p.topic_id = t.topic_id
                       WHERE p.post_id = ${dataimporter.delta.post_id}"
                deltaQuery="SELECT post_id
                       FROM POSTS
                       WHERE post_posted_time > UNIX_TIMESTAMP('${dataimporter.last_index_time}')">
          <field column="POST_ID" name="post_id" />
          <field column="POST_POSTED_TIME" name="post_posted_time" />
          <field column="USER_ID" name="user_id" />
          <field column="FORUM_ID" name="forum_id" />
          <field column="POST_SUBJECT" name="post_subject" />
          <field column="POST_DEFAULT_BODY" name="post_default_body" />
        </entity>
      </document>
    </dataConfig>
  4. We need to register the DataImportHandler so that Solr can use the JDBC driver. Open the solrconfig.xml file in the same directory and insert the following code.
    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">data-config.xml</str>
      </lst>
    </requestHandler>
  5. Now we need to add some info to the schema.xml file, located in the same directory, to tell Solr more about our table columns and how to conduct a search.
    <!-- Field to use to determine and enforce document uniqueness.
         Unless this field is marked with required="false", it will be a required field -->
    <uniqueKey>post_id</uniqueKey>
    <field name="post_id" type="string" indexed="true" stored="true" required="true"/>
    <field name="post_posted_time" type="tint" indexed="true" stored="true"/>
    <field name="user_id" type="int" indexed="true" stored="true"/>
    <field name="forum_id" type="int" indexed="true" stored="true"/>
    <field name="post_subject" type="text" indexed="true" stored="true"/>
    <field name="post_default_body" type="text" indexed="true" stored="true"/>

    <!-- field for the QueryParser to use when an explicit fieldname is absent -->
    <defaultSearchField>fulltext</defaultSearchField>
  6. Go ahead and start Jetty up again with “java -jar start.jar” and check for any errors in the terminal.
  7. Now it’s time to do a full index. In your browser, go to http://solr.local:8983/solr/dataimport?command=full-import . This process will take a while depending on the size of your database. Check the status of the import at http://solr.local:8983/solr/dataimport . For the first little while the status response will simply report “idle”, along with the note “This response format is experimental. It is likely to change in the future.” Once the database has loaded, it will show the progress of the index.
  8. After the index process is done you should be able to search your data at http://solr.local:8983/solr/admin . If not, look at the errors popping up in the terminal. Results will be in XML format.
  9. To set up delta indexing (i.e. picking up changes every hour, or at an interval of your choosing), set up a cron job that requests the following URL: http://solr.local:8983/solr/dataimport?command=delta-import
  10. To run the Solr service in the background on your server, make sure you start it like this: “java -jar start.jar &”
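One gap worth flagging in step 5: `defaultSearchField` points at a field named fulltext that the field list never defines. A common way to provide it (my assumption here, not shown in the original configuration) is a catch-all field fed by copyField directives in schema.xml:

```xml
<!-- Hypothetical catch-all field; indexed but not stored keeps the index smaller.
     multiValued is needed because multiple source fields are copied into it. -->
<field name="fulltext" type="text" indexed="true" stored="false" multiValued="true"/>

<!-- Copy the searchable columns into the catch-all field -->
<copyField source="post_subject" dest="fulltext"/>
<copyField source="post_default_body" dest="fulltext"/>
```

With this in place, queries that don't name a field will search both the subject and the body.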
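For step 9, the crontab entry might look like the following; the hourly schedule and the use of curl are assumptions, so adjust both to your setup:

```
# min hour dom mon dow — run the delta import at the top of every hour
0 * * * * curl -s "http://solr.local:8983/solr/dataimport?command=delta-import" > /dev/null
```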

In another post I will cover locking down Solr so that it is only accessible by IP address or username/password combo.


Source: http://entropytc.com/importing-a-mysql-database-into-apache-solr/#codesyntax_1
