Devoxx 2009: Full Text Search for Hibernate

17/11/2009, University sessions, Emmanuel Bernard

Search solutions:
  • categorize upfront
  • show detailed search screen
  • use single search box (preferred)
Plain SQL search limits:
  • performance: like '%...%' causes a full table scan
  • no support for approximate matching or synonyms
  • no proximity concept
  • lacking relevance scoring
  • no simple multi-column search
Full-text search solutions:
  • word based
  • captures / indexes frequency and position
  • solutions:
    • RDBMS (e.g. Oracle Text):
      • less flexible
      • not portable (vendor-specific API and behavior)
    • standalone: Lucene
      • text-only
      • no synchronization with model objects
Hibernate Search, general features:
  • LGPL
  • uses Hibernate core
  • uses Lucene under the hood
  • solves object vs text mismatch
  • converts objects to text documents (and back) → the Hibernate application works with objects, not text
  • convention over configuration
  • heavily built on annotations
  • optimizes Lucene access:
    • update Lucene docs on commit
    • object graphs are consolidated to single Lucene docs to provide relevant searches
    • avoid flooding Lucene indexer:
      • batch Lucene updates on commit
      • optionally trigger the Lucene indexer asynchronously
    • support clustering (JMS)
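The "batch Lucene updates on commit" idea above can be sketched in plain Java. This is only an illustration under stated assumptions: the class IndexWorkQueue and its methods are hypothetical names, not Hibernate Search classes, and the "indexer" is simulated by a list of batches.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; these names are not part of Hibernate Search. */
final class IndexWorkQueue {
    // Changes seen in the current transaction, not yet sent to the indexer.
    private final List<String> pending = new ArrayList<>();
    // Each entry is one batch handed to the (simulated) indexer.
    private final List<List<String>> indexedBatches = new ArrayList<>();

    /** Called for every entity change event; it only records the work. */
    void onEntityChanged(String documentId) {
        pending.add(documentId);
    }

    /** On commit, hand all pending work to the indexer as a single batch. */
    void commit() {
        if (!pending.isEmpty()) {
            indexedBatches.add(new ArrayList<>(pending));
            pending.clear();
        }
    }

    /** On rollback, discard the work so the index never sees it. */
    void rollback() {
        pending.clear();
    }

    List<List<String>> indexedBatches() {
        return indexedBatches;
    }

    public static void main(String[] args) {
        IndexWorkQueue queue = new IndexWorkQueue();
        queue.onEntityChanged("Book#1");
        queue.onEntityChanged("Author#7");
        queue.commit();                    // one batch with two documents
        queue.onEntityChanged("Book#2");
        queue.rollback();                  // never reaches the index
        System.out.println(queue.indexedBatches());
    }
}
```

The point of the design is that the indexer is touched once per transaction instead of once per change, and rolled-back changes never reach the index.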
Hibernate Search Annotations:
  • @Indexed
  • @Field: the conversion to text is tunable with, among others, @FieldBridge, e.g. to convert a number to a zero-padded string
  • @IndexedEmbedded
  • @Boost: promote a particular field in the relevance score (can be at indexing time or at query time)
  • @Analyzer: e.g. anagram-support
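Why zero-padding a number helps can be shown without any library: when values are compared as text, "9" sorts after "10", while the padded forms sort in numeric order. The class PaddedNumberText below is a hypothetical helper written for this note, not the @FieldBridge API itself, and the fixed width of 10 digits is an assumption.

```java
import java.util.Locale;

/** Hypothetical helper illustrating number-to-text conversion; not a Hibernate Search class. */
final class PaddedNumberText {
    // Assumed fixed width: enough digits for any non-negative int.
    private static final int WIDTH = 10;

    /** Converts a non-negative number to a zero-padded string. */
    static String toIndexedText(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    /** Reverse conversion: from the padded text back to the number. */
    static int fromIndexedText(String text) {
        return Integer.parseInt(text);
    }

    public static void main(String[] args) {
        // Unpadded text comparison puts "9" after "10".
        System.out.println("9".compareTo("10") > 0);
        // Padded text comparison agrees with numeric order.
        System.out.println(toIndexedText(9).compareTo(toIndexedText(10)) < 0);
        System.out.println(toIndexedText(42));
    }
}
```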
Lucene Index as used by Hibernate Search:
  • event based
  • batches updates per transaction (=at commit time)
  • sync or async mode (optimizes use of Lucene's locking mechanism)
Querying:
  • HQL
  • Full Text (Lucene syntax) e.g. with the ~ operator
  • JPA2 criteria
  • native SQL
  • → always returns Objects, not Lucene documents.
Advanced stuff:
  • tokenizer: splits text into words, removes common words
  • complex searches: combination of indexing and querying
  • fuzzy search:
    • “Levenshtein distance”: quantifies similarity
    • “n-gram”: the word is split into groups of 3 letters → the number of matching groups determines the score (the demo looked a bit hacky)
  • phonetic search (soundex-like): disappointing in practice
  • synonyms: use your application-specific list
  • stemming: → 'reduction' of words to their root form
    • Porter Algorithm
    • Snowball stemmer
  • filters: provide efficient and pluggable support for
    • security, categories, temporal data, caching...
  • “explain” query result
  • clustering / Scaling Lucene
    • one Lucene writer at a given time
    • use a JMS queue for indexing (→ 'Master') → small delay, but very scalable
    • Distributed in-memory index (Infinispan 4.0) – technical preview
    • index optimizations:
      • sharding
      • defragmenting or re-indexing
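The two fuzzy-search measures listed under "Advanced stuff" can be sketched in plain Java. This is a minimal illustration, not what Lucene or Hibernate Search do internally: the class FuzzyMatching is a hypothetical name, and scoring n-grams as shared groups divided by all distinct groups (Jaccard similarity) is an assumption, since the notes only say that matching groups determine the score.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch of two fuzzy-matching measures; not Lucene code. */
final class FuzzyMatching {

    /** Levenshtein distance: minimum number of single-character edits turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building b's prefix from the empty string takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing a's prefix to the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Splits a word into its distinct groups of 3 consecutive letters. */
    static Set<String> trigrams(String word) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= word.length(); i++) {
            grams.add(word.substring(i, i + 3));
        }
        return grams;
    }

    /** Assumed scoring: shared groups divided by all distinct groups of both words. */
    static double trigramSimilarity(String a, String b) {
        Set<String> ga = trigrams(a);
        Set<String> gb = trigrams(b);
        if (ga.isEmpty() && gb.isEmpty()) {
            return a.equals(b) ? 1.0 : 0.0; // words too short to form any group
        }
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        Set<String> all = new HashSet<>(ga);
        all.addAll(gb);
        return (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting"));
        System.out.println(trigrams("hibernate"));
        System.out.println(trigramSimilarity("hibernate", "hibernat"));
    }
}
```

Edit distance is a count where lower means more similar, while the n-gram score here is a ratio between 0 and 1 where higher means more similar.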