
Jackrabbit Made Simple (7): Text Extraction, Part 2

2009-09-17 00:00:00  Source: WEB开发网


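From the logic above we can see that once text extraction from a binary takes longer than 100 milliseconds (the default, which can be changed with <param name="extractorTimeout" value="100" />), the document is added to the consumer queue, which means a consumer will come back later to deal with it.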
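The timeout-then-defer behaviour described above can be sketched as follows. This is a minimal illustration, not Jackrabbit's actual code: the class `DeferredExtractionSketch`, the method `extractOrDefer`, and the `PENDING` map are hypothetical names, and the queue is assumed to be a plain in-memory map rather than Jackrabbit's indexingQueue.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class DeferredExtractionSketch {
    // Hypothetical stand-in for the indexing queue: uuid -> extraction still running.
    static final Map<String, Future<String>> PENDING = new ConcurrentHashMap<>();

    // Daemon threads so a slow extraction never keeps the JVM alive.
    static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "extractor");
        t.setDaemon(true);
        return t;
    });

    // Returns the extracted text if it arrives within the timeout; otherwise
    // queues the job for a later consumer and returns an empty fulltext.
    static String extractOrDefer(String uuid, Callable<String> extractor, long timeoutMillis) {
        Future<String> job = POOL.submit(extractor);
        try {
            return job.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            PENDING.put(uuid, job); // a consumer will pick this up later
            return "";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while extracting " + uuid, e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed for " + uuid, e);
        }
    }

    public static void main(String[] args) {
        String fast = extractOrDefer("fast-doc", () -> "hello", 1000);
        String slow = extractOrDefer("slow-doc", () -> {
            Thread.sleep(500); // simulate a slow binary extraction
            return "late text";
        }, 100);
        System.out.println("fast fulltext: '" + fast + "'");
        System.out.println("slow fulltext: '" + slow + "', queued: " + PENDING.containsKey("slow-doc"));
    }
}
```

The point of the design is that the indexing thread never blocks for longer than the timeout: a slow document is indexed with an empty fulltext first, and the real text is filled in on a later pass.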

2. The consumer

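Where do we find the consumer? We only need to look at where indexingQueue is used. After a few Ctrl+Shift+G reference searches, we find the following logic in the constructor of MultiIndex.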

Java code

    public MultiIndex() {
        flushTask = new Timer();
        flushTask.schedule(new TimerTask() {
            public void run() {
                // check if there are any indexing jobs finished
                /* The English comment is clear enough: this checks whether any
                   extraction jobs have finished. The thread behind this timer is
                   clearly a consumer dedicated to processing the data in
                   indexingQueue. */
                checkIndexingQueue();
                // check if volatile index should be flushed
                checkFlush();
            }
        }, 0, 1000);
    }

As the method above shows, the main logic lives in checkIndexingQueue, so let's step into that method next.

Java code

    private synchronized void checkIndexingQueue() {
        /* Collect the documents whose extraction has finished. What if an
           extraction has not finished yet? There is no waiting: new
           StringReader("") is returned right away. That logic is in
           TextExtractorReader#isExtractorFinished. */
        Document[] docs = indexingQueue.getFinishedDocuments();
        Map finished = new HashMap();
        for (int i = 0; i < docs.length; i++) {
            String uuid = docs[i].get(FieldNames.UUID);
            finished.put(UUID.fromString(uuid), docs[i]);
        }

        // now update index with the remaining ones if there are any
        if (!finished.isEmpty()) {
            log.debug("updating index with {} nodes from indexing queue.",
                    new Long(finished.size()));

            // remove documents from the queue
            for (Iterator it = finished.keySet().iterator(); it.hasNext(); ) {
                try {
                    indexingQueue.removeDocument(it.next().toString());
                } catch (IOException e) {
                    log.error("Failed to remove node from indexing queue", e);
                }
            }
            /* update is called again here. Earlier articles analyzed in detail
               the important operations that update performs: deleteNode,
               addNode, and flush. */
            try {
                update(finished.keySet().iterator(),
                        finished.values().iterator());
            } catch (IOException e) {
                // update failed
                log.warn("Failed to update index with deferred text extraction", e);
            }
        }
    }

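So a document may well pass through indexing twice because its extraction took too long. On the second pass, two operations are performed for it: a delete and an add. The delete is needed because a copy of the document was already put into the index earlier with its fulltext field set to "", so that copy must be removed first, and only then is the fully extracted document added to the index.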
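The two-pass behaviour can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `ReindexSketch`, its `index` and `queue` maps, and the `ops` log are hypothetical stand-ins, not Jackrabbit or Lucene classes, and the real update method also performs a flush that is left out here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReindexSketch {
    // Hypothetical index: uuid -> fulltext field.
    final Map<String, String> index = new HashMap<>();
    // Hypothetical queue: uuid -> extracted text, null while extraction is still running.
    final Map<String, String> queue = new HashMap<>();
    // Records the index operations in order, for inspection.
    final List<String> ops = new ArrayList<>();

    // First pass: extraction timed out, so index a copy with empty fulltext and queue the document.
    void addDeferred(String uuid) {
        index.put(uuid, "");
        queue.put(uuid, null);
        ops.add("add " + uuid);
    }

    // Called when the background extraction for a queued document completes.
    void finishExtraction(String uuid, String text) {
        queue.put(uuid, text);
    }

    // Second pass: for every finished document, delete the empty copy, then add the full one.
    void checkIndexingQueue() {
        Map<String, String> finished = new HashMap<>();
        for (Map.Entry<String, String> e : queue.entrySet()) {
            if (e.getValue() != null) {
                finished.put(e.getKey(), e.getValue());
            }
        }
        for (String uuid : finished.keySet()) {
            queue.remove(uuid); // remove documents from the queue
        }
        for (Map.Entry<String, String> e : finished.entrySet()) {
            index.remove(e.getKey());           // delete the copy whose fulltext is ""
            ops.add("delete " + e.getKey());
            index.put(e.getKey(), e.getValue()); // add the fully extracted document
            ops.add("add " + e.getKey());
        }
    }

    public static void main(String[] args) {
        ReindexSketch s = new ReindexSketch();
        s.addDeferred("n1");
        s.checkIndexingQueue();                 // nothing finished yet, index unchanged
        s.finishExtraction("n1", "hello world");
        s.checkIndexingQueue();                 // delete + add for n1
        System.out.println(s.ops + " -> " + s.index);
    }
}
```

Documents whose extraction is still running stay in the queue untouched, which mirrors the "no waiting" behaviour of getFinishedDocuments described in the code comments above.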

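Overall, the logic is fairly clear. The tricky part is the somewhat convoluted logic in the TextExtractorReader class analyzed in the previous article, but read together with this article it becomes easy to follow.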

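After these two articles, we finally have a fairly deep understanding of text extraction in Jackrabbit. Of course, it may well still hide some subtleties, waiting for us to discover and dig out.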
