Adobe CQ/Adobe AEM: How to reduce Lucene Index size in CQ / WEM

Thursday, January 12, 2012

Use Case:
1) The index is huge and CQ takes a long time to start.
2) The index is huge and takes a long time to rebuild (also see http://www.wemblog.com/2011/09/how-to-reindex-large-repository.html for this).
3) You use a different search service, such as Solr or FAST, for indexing.
4) You don't use full-text search of documents on the site.

Solution:

Step 1: Find the tika-config.xml file

For CQ 5.4 and earlier

If you do not need full-text indexing of documents such as PDF or Excel files, you can disable it in tika-config.xml (crx-quickstart/server/runtime/0/_crx/WEB-INF/classes/org/apache/jackrabbit/core/query/lucene/tika-config.xml). If this folder structure is not present, you have to create it. The original tika-config.xml can be found by unzipping crx-quickstart/server/runtime/0/_crx/WEB-INF/lib/jackrabbit-core-*.jar (copy it to another location, rename it to .zip, and unzip it) and then going to org/apache/jackrabbit/core/query/lucene.
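The copy, rename, and unzip procedure can also be scripted. The following is a minimal Python sketch, not part of the original instructions: it relies on a jar being an ordinary zip archive and copies only the tika-config.xml entry into a classes directory. The function name `extract_tika_config` is hypothetical, and the jar and target paths are whatever applies to your installation.

```python
import zipfile
from pathlib import Path

# Location of the Tika config inside the jackrabbit-core jar (from the text above).
TIKA_ENTRY = "org/apache/jackrabbit/core/query/lucene/tika-config.xml"


def extract_tika_config(jar_path, classes_dir):
    """Copy tika-config.xml out of the jar into classes_dir, creating
    the org/apache/... folder structure if it does not exist yet."""
    target = Path(classes_dir) / TIKA_ENTRY
    target.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(jar_path) as jar:
        # Raises KeyError if the jar does not contain the entry.
        target.write_bytes(jar.read(TIKA_ENTRY))
    return target
```

Called with the jackrabbit-core jar and the WEB-INF/classes directory mentioned above, this leaves the jar untouched and places an editable copy at the override location.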

For CQ 5.5

1) Find the jackrabbit-core jar file and extract the tika config:

find ./crx-quickstart/launchpad/felix -name "jackrabbit-core*.jar" | xargs -I {} jar -xvf {} org/apache/jackrabbit/core/query/lucene/tika-config.xml

2) Edit the extracted org/apache/jackrabbit/core/query/lucene/tika-config.xml file as described in Step 2.

3) Update the jackrabbit-core jar file with the edited tika-config.xml file:

find ./crx-quickstart/launchpad/felix -name "jackrabbit-core*.jar" | xargs -I {} jar -uvf {} org/apache/jackrabbit/core/query/lucene/tika-config.xml
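After the jar has been updated, it can be worth confirming that it really contains the edited file. A minimal Python sketch of such a check follows; it assumes the jar and the edited file are both readable from where the script runs, and the function name `jar_has_updated_config` is hypothetical.

```python
import zipfile
from pathlib import Path

TIKA_ENTRY = "org/apache/jackrabbit/core/query/lucene/tika-config.xml"


def jar_has_updated_config(jar_path, edited_file, entry=TIKA_ENTRY):
    """Return True if the entry stored in the jar is byte-for-byte
    identical to the locally edited tika-config.xml."""
    with zipfile.ZipFile(jar_path) as jar:
        if entry not in jar.namelist():
            return False
        return jar.read(entry) == Path(edited_file).read_bytes()
```

A False result suggests the update step did not reach that jar, for example because the find pattern matched a different file.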

Step 2: Modify the file

To stop a document type from being parsed, set org.apache.tika.parser.EmptyParser as the parser class for its MIME type.

For example, to skip indexing of Excel spreadsheets:

<parser class="org.apache.tika.parser.EmptyParser">
<mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
</parser>

To disable PDF parsing, remove this entry:
<parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
<mime>application/pdf</mime>
</parser>

This method also reduces the size of the Lucene index in CQ.
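When several MIME types have to be switched off, editing the file by hand becomes error-prone. The Python sketch below combines both techniques from this step: it removes a MIME type from whichever parser currently claims it and registers EmptyParser for it instead. It assumes the parser entries sit under a <parsers> element inside the root element, which you should verify against your own tika-config.xml; the function name `disable_mime_type` is hypothetical.

```python
import xml.etree.ElementTree as ET

EMPTY_PARSER = "org.apache.tika.parser.EmptyParser"


def disable_mime_type(config_xml, mime):
    """Return tika-config XML in which `mime` is handled by EmptyParser.

    Assumption: <parser> elements are children of a <parsers> element
    directly below the root. Check this against your own file.
    """
    root = ET.fromstring(config_xml)
    parsers = root.find("parsers")
    if parsers is None:
        raise ValueError("no <parsers> element found")
    for parser in list(parsers):
        removed = False
        for m in parser.findall("mime"):
            if (m.text or "").strip() == mime:
                parser.remove(m)
                removed = True
        # Drop a parser entry only if we just emptied it.
        if removed and parser.find("mime") is None:
            parsers.remove(parser)
    # Register EmptyParser for the MIME type.
    entry = ET.SubElement(parsers, "parser", {"class": EMPTY_PARSER})
    ET.SubElement(entry, "mime").text = mime
    return ET.tostring(root, encoding="unicode")
```

ElementTree drops XML comments and the XML declaration when it parses and re-serializes a file, so compare the result with the original before deploying it.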

Note: To reduce the Lucene index size, you can also add the following to workspace.xml:

<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
<param name="path" value="${wsp.home}/index"/>

<param name="supportHighlighting" value="false"/>
</SearchIndex>
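If the same parameter has to be set in several workspace.xml files, the change can be scripted. The Python sketch below sets one param on the SearchIndex element and adds it if it is missing. It is an illustration under assumptions, not a supported tool: the function name `set_search_index_param` is hypothetical, and ElementTree discards comments and any DOCTYPE, so review the output before replacing a real file.

```python
import xml.etree.ElementTree as ET


def set_search_index_param(workspace_xml, name, value):
    """Return workspace.xml content with <param name=... value=...> set
    on the <SearchIndex> element, adding the param if it is missing."""
    root = ET.fromstring(workspace_xml)
    # iter() searches the whole tree, so the nesting depth does not matter.
    search_index = next(root.iter("SearchIndex"), None)
    if search_index is None:
        raise ValueError("no <SearchIndex> element found")
    for param in search_index.findall("param"):
        if param.get("name") == name:
            param.set("value", value)
            break
    else:
        # No existing param with that name: append a new one.
        ET.SubElement(search_index, "param", {"name": name, "value": value})
    return ET.tostring(root, encoding="unicode")
```

For the setting shown above, the call would be `set_search_index_param(xml_text, "supportHighlighting", "false")`.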

As an example, please refer to this tika-config.xml file.

A useful link for further tuning of your search index:

http://wiki.apache.org/jackrabbit/Search
Try tuning "resultFetchSize" and the other parameters described there.

Step 3: Disable indexing using the indexing_config.xml file

See the instructions at the link below for how to do that. You can add your own node types to reduce the index size further. You can use the attached indexing_config file.

http://dev.day.com/content/kb/home/cq5/CQ5SystemAdministration/SearchIndexingConfig.html

Other useful links for reducing the Lucene index:

http://dev.day.com/content/kb/home/cq5/CQ5SystemAdministration/how-to-optimize-lucene-index-to-gain-efficiency.html

http://wiki.apache.org/jackrabbit/IndexingConfiguration

http://dev.day.com/content/kb/home/cq5/CQ5SystemAdministration/BoostInSearch.html

http://dev.day.com/content/kb/home/cq5/CQ5Troubleshooting/performancetuningtips.html#TIP05

There is also an open issue about speeding up startup when the index is large: https://issues.apache.org/jira/browse/JCR-3107

Important: You have to rebuild the index after making the above changes.
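One common way to trigger a rebuild in a Jackrabbit-based repository is to stop the instance, move the workspace index directory out of the way, and start the instance again so that the index is recreated from the content. The Python sketch below renames the directories instead of deleting them, so the old index can be restored. The workspaces/<name>/index layout is an assumption derived from path=${wsp.home}/index in the workspace.xml fragment above, and the function name is hypothetical; the reindexing post linked under Use Case remains the reference procedure.

```python
import time
from pathlib import Path


def set_aside_workspace_indexes(repository_home):
    """Rename every workspaces/<name>/index directory so that the
    repository finds no index there on its next start.

    Run only while the instance is stopped. The directory layout is an
    assumption; adjust the glob pattern to your installation.
    """
    suffix = time.strftime("%Y%m%d-%H%M%S")
    moved = []
    for index_dir in sorted(Path(repository_home).glob("workspaces/*/index")):
        if index_dir.is_dir():
            backup = index_dir.with_name(f"index.bak-{suffix}")
            index_dir.rename(backup)
            moved.append(backup)
    return moved
```

The function returns the backup locations, which can be deleted once the rebuilt index has been verified.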

Special thanks to Andrew Khoury and other members of Adobe for sharing this information.

11 comments:

Anonymous, January 21, 2013 at 12:05 AM
How can I disable the built-in Lucene indexing and integrate CQ 5.5 with Apache Solr?

Anonymous, July 24, 2013 at 9:12 AM
Hi Yogesh,

Any updates on the above question?

Anonymous, August 3, 2013 at 10:07 PM
Hi Yogesh,
What is the process for integrating CQ 5.6 with Solr search?
Please share the Solr + CQ demo package.

Kostya, August 8, 2013 at 3:30 AM
Yes, it would be interesting to hear about the Solr integration. Thank you :)

Anonymous, March 7, 2014 at 2:24 AM
Hi Yogesh,
There is some confusion regarding the indexing configuration.
The Jackrabbit documentation at http://wiki.apache.org/jackrabbit/IndexingConfiguration talks about including the indexing configuration in both repository.xml and workspace.xml.

Line taken from the Jackrabbit indexing configuration page:
If you wish to configure the indexing behaviour you need to add a parameter to the SearchIndex element in your workspace.xml and repository.xml file.

indexing_config.xml:

<param name="path" value="${wsp.home}/index"/>
<param name="resultFetchSize" value="50"/>
<param name="indexingConfiguration" value="${wsp.home}/indexing_config.xml"/>
<param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>

But you mention including it only in workspace.xml. Can you please comment on this?

RHT, July 21, 2014 at 10:43 PM
Hi Yogesh,

We have learned that AEM 6.0 has an embedded Apache Solr engine running on CRX3 (Oak). Is there any concrete documentation on optimizing the index according to business requirements?

Once a custom indexing schema is set, will queries using JQL, SQL2, XPath, etc. honor our custom index schema?

Thanks,
Rohit