Use Case:
1) The index is huge and CQ takes a long time to start.
2) The index is huge and rebuilding it takes a long time (also see http://www.wemblog.com/2011/09/how-to-reindex-large-repository.html for this).
3) You use a different search service, such as Solr or FAST, for indexing.
4) You don't use full-text search of documents on your site.
Solution:
Step 1: Find the tika-config.xml file
For CQ5.4 and earlier
If you do not need full-text indexing of these documents, you can disable it in tika-config.xml (crx-quickstart/server/runtime/0/_crx/WEB-INF/classes/org/apache/jackrabbit/core/query/lucene/tika-config.xml). If this folder structure does not exist, you have to create it. The original tika-config.xml can be found by unzipping crx-quickstart/server/runtime/0/_crx/WEB-INF/libs/jackrabbit-core-*.jar (copy it to another location, rename it to .zip, and unzip it) and then going to org/apache/jackrabbit/core/query/lucene.
For CQ5.5
1) Find the jackrabbit-core jar file and extract the Tika config: find ./crx-quickstart/launchpad/felix -name "jackrabbit-core*.jar" | xargs -I {} jar -xvf {} org/apache/jackrabbit/core/query/lucene/tika-config.xml
2) Edit the extracted org/apache/jackrabbit/core/query/lucene/tika-config.xml file as described in Step 2.
3) Update the jackrabbit-core jar file with the modified tika-config.xml file: find ./crx-quickstart/launchpad/felix -name "jackrabbit-core*.jar" | xargs -I {} jar -uvf {} org/apache/jackrabbit/core/query/lucene/tika-config.xml
Step 2: Modify file
You can assign org.apache.tika.parser.EmptyParser as the parser class for any document type you do not want parsed.
For example, to stop indexing Excel spreadsheets:
<parser class="org.apache.tika.parser.EmptyParser">
<mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
</parser>
To disable PDF parsing, remove this entry:
<parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
<mime>application/pdf</mime>
</parser>
This approach also reduces the size of the Lucene index in CQ.
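To show how the two changes fit together, here is a minimal sketch of a modified tika-config.xml. The <properties> and <parsers> wrapper elements are an assumption about the layout of the file shipped in jackrabbit-core, so check them against the file you extracted in Step 1. It is also worth checking whether the spreadsheet MIME type is already listed under another parser entry in your original file.

```xml
<!-- Sketch only: the wrapper elements are assumed, compare with your extracted tika-config.xml -->
<properties>
  <parsers>
    <!-- All other parser entries from the original file stay unchanged. -->

    <!-- Excel (.xlsx) documents are handed to EmptyParser, so no text is extracted from them. -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
    </parser>

    <!-- The parse-pdf entry (org.apache.tika.parser.pdf.PDFParser) has been removed from this list. -->
  </parsers>
</properties>
```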
Note: To reduce the Lucene index size further, you can also add the following parameter in workspace.xml:
<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
<param name="path" value="${wsp.home}/index"/>
<!-- add below param -->
<param name="supportHighlighting" value="false"/>
</SearchIndex>
For an example, please refer to this tika-config.xml file.
Some useful links for tuning your search index, in case you can make the changes above:
http://wiki.apache.org/jackrabbit/Search
Try tuning "resultFetchSize" and the other parameters described there.
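As a rough illustration, "resultFetchSize" is set as one more param on the same SearchIndex element in workspace.xml. The value 50 below is illustrative only and is not a recommendation; the right number depends on how many results your queries typically read.

```xml
<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
  <param name="path" value="${wsp.home}/index"/>
  <param name="supportHighlighting" value="false"/>
  <!-- Illustrative value: how many results the query handler fetches initially for a query -->
  <param name="resultFetchSize" value="50"/>
</SearchIndex>
```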
Step 3: Disable indexing using the indexing_config.xml file
See the instructions at the link below for how to do that. You can add your own node types to reduce the index size further. You can use the attached indexing_config file.
http://dev.day.com/content/kb/home/cq5/CQ5SystemAdministration/SearchIndexingConfig.html
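For orientation, an index rule in indexing_config.xml might look like the sketch below, assuming the format described on the Jackrabbit IndexingConfiguration wiki page. The node type and property name are placeholders. In a real CQ instance you would presumably add a rule of this shape to the existing indexing_config.xml rather than replace the file.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <!-- Placeholder rule: for nt:unstructured nodes, index only the property named "Text" -->
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>
```

The file is referenced from the SearchIndex element through a param named "indexingConfiguration", for example with the value ${wsp.home}/indexing_config.xml.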
Other useful links for reducing the Lucene index:
http://dev.day.com/content/kb/home/cq5/CQ5SystemAdministration/how-to-optimize-lucene-index-to-gain-efficiency.html
http://wiki.apache.org/jackrabbit/IndexingConfiguration
http://dev.day.com/content/kb/home/cq5/CQ5SystemAdministration/BoostInSearch.html
http://dev.day.com/content/kb/home/cq5/CQ5Troubleshooting/performancetuningtips.html#TIP05
There is also an ongoing issue to speed up startup time when the index is large: https://issues.apache.org/jira/browse/JCR-3107
Important: You have to rebuild the index after making the changes above.
Special thanks to Andrew Khoury and other members from Adobe for sharing this information.
How to disable built-in Lucene indexing and integrate CQ5.5 with Apache Solr?
Hello,
Unfortunately, index files are required for core CQ functionality to work, so you cannot disable Lucene completely. However, you can reduce the size of the Lucene index as described above.
Yogesh
Hi Yogesh,
Any updates on the above question?
Hi Yogesh,
What is the process for integrating CQ5.6 with Solr search?
Please share the Solr + CQ demo package.
Hello,
Unfortunately, I do not have a demo package, but this integration is very much possible. I think CQ6.0 has this OOTB.
Yogesh
Yes, it would be interesting to hear about the Solr integration. Thank you :)
Can you please check http://www.gastongonzalez.com/tech-blog/2013/9/13/integrating-apache-solr-with-adobe-cq-aem.html for CQ integration with Solr?
Hi Yogesh,
There is some confusion regarding the indexing configuration.
The jackrabbit documentation http://wiki.apache.org/jackrabbit/IndexingConfiguration talks of including the indexing configuration in both repository.xml and workspace.xml.
Line taken from Jackrabbit indexing config:
If you wish to configure the indexing behaviour you need to add a parameter to the SearchIndex element in your workspace.xml and repository.xml file.
indexing_config.xml:
<param name="path" value="${wsp.home}/index"/>
<param name="resultFetchSize" value="50"/>
<param name="indexingConfiguration" value="${wsp.home}/indexing_config.xml"/>
<param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>
But you mention including it only in workspace.xml. Can you please comment on this?
Hello Bala,
If you are not using multiple workspaces, including the indexing config in workspace.xml should work.
Yogesh
Hi Yogesh,
We have learned that AEM 6.0 has an embedded Apache Solr engine running on CRX3 (Oak). Is there any concrete documentation on optimizing the index as per business requirements?
Once a custom indexing schema is set, will queries using JQL, SQL2, XPath, etc. honor our custom index schema?
Thanks,
Rohit
Hello Rohit,
Unfortunately, I don't have a step-by-step guide for AEM 6, but you can refer to http://jackrabbit.apache.org/oak/docs/query.html for how to set up Solr with Oak.
Yogesh