Lucene core
- full-text search library
- concepts
- inverted index:
- term + proximity
- documents
- fields
- field-ids: e.g. category, title, name...
- types: number, date, text...
- unique keys: unique id per document
- terms (aka tokens):
- processed through filters
- synonyms
- ignore words
- stemming
- scoring relevancy
- term frequency
- inverse document frequency
- field length normalization → control how field length / # occurrences affects scoring
- boost factors: favor or boost some fields (e.g. titles)
- core
- standalone jar
- core index
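
The inverted index and scoring ideas listed above can be sketched in a few lines of Python. This is only an illustration, not Lucene's implementation: the whitespace `tokenize` function, the sample documents, and the exact weighting (square-root term frequency, logarithmic inverse document frequency, and a 1/sqrt(length) norm) are assumptions chosen to mirror the bullets.

```python
import math
from collections import defaultdict


def tokenize(text):
    # Hypothetical analyzer: lowercase and split on whitespace only
    # (no synonyms, stop words, or stemming).
    return text.lower().split()


def build_index(docs):
    """Build an inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def score(index, docs, query):
    """Rank documents by a simplified tf * idf * length-norm sum."""
    n_docs = len(docs)
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Rare terms weigh more (inverse document frequency).
        idf = 1.0 + math.log(n_docs / len(postings))
        for doc_id, positions in postings.items():
            tf = math.sqrt(len(positions))  # term frequency
            # Shorter fields score higher (field length normalization).
            norm = 1.0 / math.sqrt(len(tokenize(docs[doc_id])))
            scores[doc_id] += tf * idf * norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    1: "solr search server",
    2: "lucene search library for full text search",
    3: "apache lucene core",
}
index = build_index(docs)
ranking = score(index, docs, "lucene search")
```

Storing positions alongside each term is what makes proximity and phrase matching possible; a boost factor would simply be one more multiplier in the per-term sum.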
Solr
- search server
- based on Apache Lucene
- → Lucene exposed over http
- spell checking
- highlighting
- extensible
- scalable
- caching
- replication
- master/slave distributed search → sharding
- multiple inputs
- version 1.4
- setup:
- solrconfig.xml:
- cache settings
- Lucene indexing parameters
- API:
- RequestHandlers:
- mini-servlets
- flexible responses:
- http GET/POST
- JSON
- SolrJ
- ruby, php, …
- content streams (must be shielded)
- indexing / deleting a document
- through api: xml document with commands
- POST or GET with request parameters
- other actions:
- commit / rollback: batching document indexing
- optimize
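
As a rough sketch of the XML command documents mentioned above, the following builds `add`, `delete`, and `commit` messages with Python's standard library. The field names and values are made up for illustration, and actually POSTing the XML to an update handler is left out.

```python
import xml.etree.ElementTree as ET


def add_command(doc):
    """Build an <add><doc>...</doc></add> message from a dict of fields."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def delete_command(doc_id):
    """Build a <delete><id>...</id></delete> message for one unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


# Hypothetical document: the field names "id" and "title" are assumptions.
add_xml = add_command({"id": "42", "title": "Lucene & Solr"})
delete_xml = delete_command("42")
# Added or deleted documents become visible to searches only after a commit.
commit_xml = ET.tostring(ET.Element("commit"), encoding="unicode")
```

Building the XML with a library rather than string concatenation takes care of escaping characters such as `&` in field values.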
- search request: simple GET, with optional parameters
- debug
- lucene explanation
- pagination: start / rows
- score: lucene score
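
A search request of this kind could be assembled as follows. This is a minimal sketch: the host, port, and `/solr/select` path are assumptions about a local default install, and the helper name `build_search_url` is hypothetical. The `q`, `start`, `rows`, `fl`, `wt`, and `debugQuery` parameters are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_search_url(query, page=0, page_size=10, debug=False,
                     base="http://localhost:8983/solr/select"):
    """Build a GET URL for a Solr search with pagination."""
    params = {
        "q": query,
        "start": page * page_size,  # offset of the first result
        "rows": page_size,          # number of results per page
        "fl": "*,score",            # return all fields plus the Lucene score
        "wt": "json",               # response format
    }
    if debug:
        # Ask Solr to include the Lucene score explanation.
        params["debugQuery"] = "true"
    return base + "?" + urlencode(params)


url = build_search_url("title:lucene", page=2, page_size=10, debug=True)
```

Page numbers here are zero-based, so `page=2` with ten rows per page asks for results 20 to 29.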
- DataImportHandler
- import from RDBMS, xml and e-mail
- incremental indexing
- extensible
- debug console
- Solr Cell: uses Lucene Tika:
- index Word, pdf, html ...
- ExtractingRequestHandler
- Query parser framework with pluggable parsers:
- Lucene syntax:
- powerful
- but user-unfriendly syntax
- exceptions visible to end-users
- Dismax query parser
- simplified syntax
- standard: query, facet, mlt, highlight, stats, debug
- others: elevation, clustering, term, term vector
- faceting
- counts subset within results
- group 'facets' of a document (like a category field)
- spell checking
- pluggable distance algorithms: Levenshtein or JaroWinkler
- highlighting: custom prefix and suffix → response is highlighted
- query elevation → elevate.xml: boost or exclude a document
- clustering: grouping of documents into labeled sets
- enumerate terms for a field
- term vectors: term frequency, document frequency, position, offset
- statistics: stats.jsp (in RAM); returns xml
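
To make the spell-checking item above more concrete, here is a small sketch of Levenshtein edit distance and a naive suggestion function built on it. The `suggest` helper, its `max_distance` threshold, and the sample dictionary are assumptions for illustration; Solr's spell checker works against an index and is more involved.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary terms within max_distance edits, closest first."""
    candidates = sorted((levenshtein(word, term), term) for term in dictionary)
    return [term for distance, term in candidates if distance <= max_distance]


suggestions = suggest("lucen", ["lucene", "solr", "lucent", "tika"])
```

Swapping in another metric such as Jaro-Winkler would only change the distance function, which is the sense in which the algorithm is pluggable.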
- scaling:
- replication:
- master is polled
- replicant pulls Lucene index / config files
- replicate + load balance
- distributed search: single index is too large → sharding
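
One way to picture sharding is sketched below. Documents are assigned to a shard by hashing their unique key, and a distributed query lists all shards in the `shards` request parameter. The host names and the CRC32-based routing are assumptions made for this example; the notes above do not say how documents are distributed.

```python
import zlib

# Hypothetical shard addresses, in the host:port/path form used by `shards`.
SHARDS = ["host1:8983/solr", "host2:8983/solr", "host3:8983/solr"]


def shard_for(doc_id, shards=SHARDS):
    """Pick the shard that should index a document, based on its unique key."""
    # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]


def shards_param(shards=SHARDS):
    """Build the comma-separated value for the `shards` query parameter."""
    return ",".join(shards)


target = shard_for("doc-1")
param = shards_param()
```

With modulo routing, changing the number of shards moves most documents to a different shard, so the shard count is best chosen with growth in mind.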
- agile, iterative process works best:
- basic schema
- bring in data
- check requirement gaps
- adjust solr