Importing a MySQL database into Apache Solr Part I

This is a tutorial for getting started with Apache Solr enterprise search with PHP and MySQL on Unix/Linux or Mac OS X. Most of the information I have found is rather incomplete or full of errors, so I wanted to share how I have successfully used Solr in a number of different applications. I’m going to break this down into three separate posts/sections.

  • Importing a MySQL database into Apache Solr
  • Using PHP with Apache Solr
  • Using the Suggester feature in the new Apache Solr 3.1

Apache Solr is an enterprise search application that was developed at CNET and is now part of the Apache Lucene project. It runs in a standalone Java servlet container like Jetty or Tomcat. Since it runs as a web service, you can access it from any application, whether written in PHP or Rails. Results can be returned in XML, JSON, or now CSV format.
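As a quick illustration of how easy these response formats are to consume, the sketch below parses a sample wt=csv response body with Python's standard csv module. The rows here are made up for this example; a real body would come back from Solr's select handler with wt=csv.

```python
import csv
import io

# A made-up body in the shape Solr returns for a query with wt=csv:
# a header row, then one row per matching document.
sample_csv_response = """post_id,post_subject,user_id
101,Arc'teryx jacket review,7
102,Best climbing shoes?,12
"""

def parse_csv_response(body):
    """Parse a wt=csv response body into a list of dicts, one per document."""
    return list(csv.DictReader(io.StringIO(body)))

docs = parse_csv_response(sample_csv_response)
```

The same two lines of parsing work regardless of which fields your schema returns, which is part of the appeal of the CSV writer for quick scripting.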

There are several advantages to using Solr for the search function in your application. The biggest is speed. On a ~1 GB table Solr will return a result in under 0.2 seconds, versus MySQL full-text search, which can take up to 10 seconds (or even longer) depending on the length and complexity of the query. End users expect fast search results, and Solr can deliver that while allowing a level of control over search that you could not achieve with Google, e.g. searching for posts by user name or products by manufacturer.

Another big advantage of Solr is that it recognizes plurals and slight misspellings; for instance, a query for “Arc’teryx” will return results containing both “Arcteryx” and “Arc’teryx”. A description of other search features such as facets can be found on the Solr homepage.

Where would you use Solr? Any site that has a lot of content will benefit from using enterprise search; large popular blogs, e-commerce sites with a large product database, and forums/message boards.

Server requirements depend on the size of the database but are fairly modest while serving queries. I have been able to run Solr on a VPS with 300 MB of RAM with laser-quick results. However, the database import requires much more RAM: a 1.5 GB database needs at least 2 GB, and preferably 4 GB. I don’t believe it’s necessary to run Solr on a separate server or VPS; just make sure that you set up your import cron jobs to run in the evening when traffic is low.

Let’s get started!

First, all the examples in this tutorial will be run locally on OS X under the domain solr.local. If you don’t already develop locally on your machine, I encourage you to do so. On Mac OS X both MAMP and VirtualHostX are great tools for setting up a local dev environment.

  1. Start by downloading Solr. The current version is 3.1. Place the directory in an appropriate location, e.g. User/Sites/Solr31.
  2. In the terminal, change into the example directory.
  3. Start the Jetty server by executing “java -jar start.jar” in the terminal.
  4. Get familiar with the terminal output for Jetty as this will help you track down errors later in your configuration.
  5. In your browser go to http://solr.local:8983/solr/admin
  6. Go through the Solr tutorial to get yourself familiar with the feature set.
  7. Stop Jetty with Control-C.

Great. Now we are ready to start getting Solr configured to import your MySQL database. In this example I will be configuring Solr to import a generic message board table.

  1. Download the MySQL JDBC driver from http://www.mysql.com/downloads/connector/j/
  2. Copy the driver jar from the downloaded archive into example/lib
  3. In example/solr/conf create a file called data-config.xml. Change the SQL queries to reflect the database you are indexing.

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/ccubbupgrade"
                  user="root"
                  password=""/>
      <document name="content">
        <entity name="id"
                query="SELECT
                         p.post_id,
                         p.post_posted_time,
                         FROM_UNIXTIME(p.post_posted_time) AS post_posted_datetime,
                         p.user_id,
                         t.forum_id,
                         p.post_subject,
                         p.post_default_body
                       FROM POSTS AS p
                       JOIN TOPICS AS t ON p.topic_id = t.topic_id
                       WHERE p.POST_IS_APPROVED = '1'
                         AND t.TOPIC_STATUS != 'M'"
                deltaImportQuery="SELECT
                         p.post_id,
                         p.post_posted_time,
                         FROM_UNIXTIME(p.post_posted_time) AS post_posted_datetime,
                         p.user_id,
                         t.forum_id,
                         p.post_subject,
                         p.post_default_body
                       FROM POSTS AS p
                       JOIN TOPICS AS t ON p.topic_id = t.topic_id
                       WHERE p.post_id = ${dataimporter.delta.post_id}"
                deltaQuery="SELECT post_id
                            FROM POSTS
                            WHERE post_posted_time > UNIX_TIMESTAMP('${dataimporter.last_index_time}')">
          <field column="POST_ID" name="post_id" />
          <field column="POST_POSTED_TIME" name="post_posted_time" />
          <field column="USER_ID" name="user_id" />
          <field column="FORUM_ID" name="forum_id" />
          <field column="POST_SUBJECT" name="post_subject" />
          <field column="POST_DEFAULT_BODY" name="post_default_body" />
        </entity>
      </document>
    </dataConfig>
  4. We need to tell Solr that we will be using the JDBC driver. Open the solrconfig.xml file in the same directory and insert the following code.

    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">data-config.xml</str>
      </lst>
    </requestHandler>
  5. Now we need to add some info into the schema.xml file, located in the same directory, to tell Solr more about our table columns and how to conduct a search. Note that the fulltext field referenced by defaultSearchField has to exist, so we define it and copy the subject and body into it.

    <!-- Field to use to determine and enforce document uniqueness.
         Unless this field is marked with required="false", it will be a required field
    -->
    <uniqueKey>post_id</uniqueKey>
    <field name="post_id" type="string" indexed="true" stored="true" required="true"/>
    <field name="post_posted_time" type="tint" indexed="true" stored="true"/>
    <field name="user_id" type="int" indexed="true" stored="true"/>
    <field name="forum_id" type="int" indexed="true" stored="true"/>
    <field name="post_subject" type="text" indexed="true" stored="true"/>
    <field name="post_default_body" type="text" indexed="true" stored="true"/>
    <!-- catch-all field that the subject and body are copied into -->
    <field name="fulltext" type="text" indexed="true" stored="false" multiValued="true"/>
    <copyField source="post_subject" dest="fulltext"/>
    <copyField source="post_default_body" dest="fulltext"/>
    <!-- field for the QueryParser to use when an explicit fieldname is absent -->
    <defaultSearchField>fulltext</defaultSearchField>
  6. Go ahead and start Jetty up again with “java -jar start.jar” and check for any errors in the terminal.
  7. Now it’s time to do a full index. In your browser go to http://solr.local:8983/solr/dataimport?command=full-import . This process will take a while depending on the size of your database. Check the status of the import with http://solr.local:8983/solr/dataimport . At first the status response will just report “idle” along with the config file name and a note that the response format is experimental; once the database has loaded, it will show the indexing progress.
  8. After the index process is done you should be able to search your data at http://solr.local:8983/solr/admin . If not, look at the errors popping up in the terminal. Results will be in XML format.
  9. To set up delta indexing (i.e. picking up changes every hour, or at an interval of your choosing), set up a cron job that requests the following URL: http://solr.local:8983/solr/dataimport?command=delta-import
  10. To run the Solr service in the background on your server, make sure you start it like this: “java -jar start.jar &”
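Once the index is live you can also check results from a script rather than the browser. The sketch below parses a Solr select response with Python's standard library; the XML here is a hand-written sample in Solr's standard response format, with hypothetical field values matching the forum schema above.

```python
import xml.etree.ElementTree as ET

# A hand-written sample in Solr's standard XML response format.
sample_response = """<?xml version="1.0" encoding="UTF-8"?>
<response>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="post_id">101</str>
      <str name="post_subject">Arc'teryx jacket review</str>
    </doc>
    <doc>
      <str name="post_id">102</str>
      <str name="post_subject">Best climbing shoes?</str>
    </doc>
  </result>
</response>"""

def parse_select_response(body):
    """Return (numFound, list of {field: value} dicts) from a select response."""
    root = ET.fromstring(body)
    result = root.find("result")
    num_found = int(result.get("numFound"))
    docs = [{f.get("name"): f.text for f in doc} for doc in result.findall("doc")]
    return num_found, docs

num_found, docs = parse_select_response(sample_response)
```

In a real script the body would come from an HTTP GET against http://solr.local:8983/solr/select?q=… instead of a literal string.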

In another post I will cover locking down Solr so that it is only accessible by IP address or username/password combo.

 

Source: http://entropytc.com/importing-a-mysql-database-into-apache-solr/#codesyntax_1

Solr

A service that makes Lucene usable through HTTP calls rather than the Java API.

  • Because of this, any language can use Lucene through Solr,
  • and multiple application servers can search and index against a single Solr instance (better scalability).
  • By default, every request and response goes over HTTP GET/POST as XML.
  • Pronounced “Solar”.
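As a sketch of what those HTTP/XML exchanges look like, the snippet below builds the XML body of a Solr add (update) request with Python's standard library. The field names are just examples taken from Solr's sample docs.

```python
import xml.etree.ElementTree as ET

def build_add_xml(doc_fields):
    """Build the XML body of a Solr <add> update request for one document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

body = build_add_xml({"id": "SOLR1000", "name": "Solr, the Enterprise Search Server"})
# POST this body to http://localhost:8983/solr/update with Content-Type: text/xml,
# then POST <commit/> to make the document searchable.
```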

Per-language responses

Responses can be received in each language's own data format, and client libraries are provided per language.

  • JSON
  • Python: as of late June 2007, the client is not compatible with Solr 1.2 (verify this; it works fine after a small fix).
  • Ruby
  • For Java, see the examples.
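For instance, with wt=json the response is plain JSON that any language can decode; the sketch below parses a sample response with Python's json module. The document values are made up, but the envelope (responseHeader/response/docs) is Solr's standard JSON shape.

```python
import json

# A made-up body in the shape Solr returns for wt=json.
sample_json = """{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {"numFound": 1, "start": 0,
               "docs": [{"id": "SOLR1000", "name": "Solr, the Enterprise Search Server"}]}
}"""

data = json.loads(sample_json)
docs = data["response"]["docs"]
```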

Korean handling

Korean search requires CJKTokenizer, but the current Solr 1.2 does not ship with a CJKTokenizerFactory.

  • CJKTokenizerFactory.java: compile the file below, put the class into Solr's WEB-INF/classes, and restart.
    package org.apache.solr.analysis;

    import java.io.Reader;
    import org.apache.lucene.analysis.cjk.CJKTokenizer;
    import org.apache.lucene.analysis.TokenStream;

    /**
     * CJKTokenizer for Solr
     * @see org.apache.lucene.analysis.cjk.CJKTokenizer
     * @author matsu
     */
    public class CJKTokenizerFactory extends BaseTokenizerFactory {

        /**
         * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
         */
        public TokenStream create(Reader input) {
            return new CJKTokenizer(input);
        }
    }
  • Add the tokenizer factory to every FieldType that needs indexing, and add StopFilterFactory and WordDelimiterFilterFactory as filters.
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.CJKTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.CJKTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      </analyzer>
    </fieldType>
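CJKTokenizer works by breaking runs of CJK characters into overlapping character bigrams. The sketch below emulates that idea so you can see what tokens a Korean string turns into; it is an illustration of the bigram scheme only, not the exact Lucene implementation.

```python
def cjk_bigrams(text):
    """Emulate CJK bigram tokenization: overlapping two-character tokens
    (a lone character stands by itself)."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

tokens = cjk_bigrams("한글검색")
```

Because both the index and query analyzers produce the same bigrams, a query term matches wherever the same two-character window appears in the indexed text.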

Source: http://kwon37xi.springnote.com/pages/335478

1) Create two directories:
C:\TOMCAT_HOME
C:\SOLR_HOME

2) Set up Tomcat first
Unpack Tomcat into C:\TOMCAT_HOME.
Create the directory C:\TOMCAT_HOME\conf\Catalina\localhost.
Create the file C:\TOMCAT_HOME\conf\Catalina\localhost\solr.xml with the following contents:

<Context docBase="C:\SOLR_HOME\solr-1.4.0.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="C:\SOLR_HOME\solr" override="true" />
</Context>

The index is usually created under C:\TOMCAT_HOME\bin\solr\data\index.

Start Tomcat with C:\TOMCAT_HOME\bin\startup.bat (and stop it with shutdown.bat), then check:

localhost:8080

3) Set up Solr
From the unpacked Solr distribution:

Copy the apache-solr-1.4.0\example\solr folder into C:\SOLR_HOME.

It holds the bin and conf folders; this is required, because Solr looks for conf under the folder given by Environment name="solr/home".

Also copy \apache-solr-1.4.0\dist\apache-solr-1.4.0.war into C:\SOLR_HOME.

4) Rebuilding the war

Whenever you add jars to WEB-INF\lib or change the source, you have to rebuild the war:

C:\SOLR_HOME\apache-solr-1.4.0>jar -cvf ../solr-1.4.0.war *

5) Posting test data

Let's post the files in example\exampledocs from the original Solr distribution:

C:\SOLR_HOME\apache-solr-1.4.0\example\exampledocs>java -Durl=http://localhost:8080/solr/update -jar post.jar *.xml

6) Korean support

First, the Tomcat configuration:

C:\TOMCAT_HOME\conf\Server.xml

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000" URIEncoding="UTF-8"
           redirectPort="8443" />
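URIEncoding="UTF-8" matters because query strings carrying Korean text arrive percent-encoded as UTF-8 bytes, and the container must decode them the same way. A quick sketch of the round trip with Python's standard library:

```python
from urllib.parse import quote, unquote

# How a Korean query term travels inside a URL: percent-encoded UTF-8 bytes.
encoded = quote("한글")
# What the servlet container has to do on arrival (this is what
# URIEncoding="UTF-8" tells Tomcat to use).
decoded = unquote(encoded, encoding="utf-8")
```

Without the connector setting, Tomcat decodes with its default (ISO-8859-1), and Korean query terms reach Solr as mojibake.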

Put the latest Korean analyzer, koreananalyzer-20100525.jar, into

C:\SOLR_HOME\apache-solr-1.4.0\WEB-INF\lib

Then create

C:\SOLR_HOME\apache-solr-1.4.0\WEB-INF\classes

and add the following two files:

A) KoreanFilterFactory.java

package org.apache.solr.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.kr.KoreanFilter;

public class KoreanFilterFactory extends BaseTokenFilterFactory {
    private boolean bigrammable = true;
    private boolean hasOrigin = true;

    public TokenStream create(TokenStream tokenstream) {
        return new KoreanFilter(tokenstream, bigrammable, hasOrigin);
    }

    public void setBigrammable(boolean bool) {
        this.bigrammable = bool;
    }

    public void setHasOrigin(boolean bool) {
        this.hasOrigin = bool;
    }
}

B) KoreanTokenizerFactory.java

package org.apache.solr.analysis;

import java.io.Reader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.kr.KoreanTokenizer;

public class KoreanTokenizerFactory extends BaseTokenizerFactory {

    public TokenStream create(Reader input) {
        return new KoreanTokenizer(input);
    }
}

To compile:

c:\SOLR_HOME\apache-solr-1.4.0\WEB-INF\classes>javac -cp ..\lib\lucene-analyzers
-2.9.1.jar;..\lib\lucene-core-2.9.4.jar;..\lib\apache-solr-core-1.4.0.jar;..\lib
\koreananalyzer-20100525.jar -d . *.java

Rebuild the war, restart Tomcat, and Korean works fine.

Posting test data:

c:\SOLR_HOME\exampledocs>java -Durl=http://localhost:8080/solr/update -jar post.jar hd.xml

By default post.jar targets port 8983; use -Durl to point it at a different port.

 

 

Source: http://ofwww.com/

[Getting started]
1. Download the latest version of Solr.

2. Move into the example directory and start the Solr server.
Jetty starts on port 8983.
$ cd example
$ java -jar start.jar
2012-04-16 14:35:38.691:INFO::Logging to STDERR via org.mortbay.log.StdErrLog
2012-04-16 14:35:38.803:INFO::jetty-6.1-SNAPSHOT
2012-04-16 14:35:38.830:INFO::Started SocketConnector@0.0.0.0:8983

3. Check that http://localhost:8983/solr/admin/ loads in a web browser.


[Indexing test]
1. Move into example\exampledocs and index documents using post.jar:
example\exampledocs$ java -jar post.jar solr.xml monitor.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solr.xml
SimplePostTool: POSTing file monitor.xml
SimplePostTool: COMMITting Solr index changes..

2. Check the indexing result with a search URL like:
http://localhost:8983/solr/select/?q=solr&start=0&rows=10&indent=on


3. View statistics on the indexed data:
http://localhost:8983/solr/admin/stats.jsp
This page shows how many documents are currently indexed (17 at this point in the walkthrough).


4. Deleting indexed data
You can remove a document from the index by referencing the field configured as uniqueKey in schema.xml:
example\exampledocs$ java -Ddata=args -Dcommit=no -jar post.jar "<delete><id>SP2514N</id></delete>"

This prints a message confirming the document will be deleted; then run

example\exampledocs$ java -jar post.jar

to commit, which makes the deletion take effect.
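Under the hood the delete request is just a small XML document of the form <delete><id>…</id></delete>. A sketch of building it safely (with escaping, in case an id contains XML special characters) using Python's standard library:

```python
from xml.sax.saxutils import escape

def build_delete_xml(doc_id):
    """Build the XML body of a Solr delete-by-id request."""
    return "<delete><id>%s</id></delete>" % escape(doc_id)

msg = build_delete_xml("SP2514N")
# POST msg to the /solr/update handler, then commit, to delete the document.
```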



[Search test]
1. Searching over HTTP GET with query parameters
Example query URL: http://localhost:8983/solr/select/?q=id:SOLR1000&fl=id+name
q=SOLR1000&fl=name,id (search the default field for ‘SOLR1000’, returning only name and id)
q=id:SOLR1000&fl=name,id,score (search the id field for ‘SOLR1000’, returning name, id, and score)
q=id:SOLR1000&fl=*,score (return every stored field plus score)
q=id:SOLR1000&sort=price desc&fl=name,id,price (return results sorted by price in descending order)
q=id:SOLR1000&wt=json (return results in JSON format)
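Since these are plain GET parameters, building the URL programmatically is a one-liner; a sketch with Python's urllib (the host and select path are the defaults from this walkthrough):

```python
from urllib.parse import urlencode

def build_select_url(base="http://localhost:8983/solr/select/", **params):
    """Build a Solr select URL from keyword query parameters."""
    return base + "?" + urlencode(params)

url = build_select_url(q="id:SOLR1000", fl="name,id,score", wt="json")
```

urlencode takes care of percent-encoding characters like ':' and ',', which is easy to get wrong when concatenating strings by hand.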

2. Highlighting search results
As with a normal search, add the ‘hl’ parameter to the query parameters to wrap matches in emphasis tags.
q=id:SOLR1000&fl=id,name&hl=true&hl.fl=name
--> highlights matches in the search results, using the name field.
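What the highlighter does to a field value can be emulated in a few lines; this is a rough sketch only, since the real highlighter works on analyzed tokens rather than raw substrings:

```python
import re

def highlight(text, term, tag="em"):
    """Wrap case-insensitive matches of term in <em>...</em> tags,
    preserving the original casing of each match."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: "<%s>%s</%s>" % (tag, m.group(0), tag), text)

snippet = highlight("Solr, the Enterprise Search Server", "solr")
```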


3. Faceted search (a group-by style search)
This feature behaves much like a SQL GROUP BY.
Example 1) q=*:*&facet=true&facet.field=popularity (facet all results on the popularity field)

Example 2) q=*:*&facet=true&facet.field=popularity&facet.field=price

Example 3) q=*:*&facet=true&facet.query=price:[50 TO 300]&facet.query=popularity:[7 TO *]

Example 4) q=*:*&facet=true&facet.date=manufacturedate_dt&facet.date.start=2004-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=+1YEAR

This facets over a date range, covering results from 2004 through 2010.
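Conceptually, facet.field just returns a value-to-count table over the matching documents, much like GROUP BY with COUNT(*). A tiny emulation over made-up documents:

```python
from collections import Counter

# Made-up documents standing in for a result set.
docs = [
    {"id": "1", "popularity": 7},
    {"id": "2", "popularity": 10},
    {"id": "3", "popularity": 7},
]

def facet_field(docs, field):
    """Emulate facet.field: count documents per distinct field value."""
    return Counter(d[field] for d in docs if field in d)

facets = facet_field(docs, "popularity")
```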

 

Source: http://ncue.tistory.com/
