The calculus of spans.
A span is a <doc,startPosition,endPosition> tuple.
The following span query operators are implemented:
- A SpanTermQuery matches all spans
containing a particular Term.
- A SpanNearQuery matches spans
which occur near one another, and can be used to implement things like
phrase search (when constructed from SpanTermQueries) and inter-phrase
proximity (when constructed from other SpanNearQueries).
- A SpanOrQuery merges spans from a
number of other SpanQueries.
- A SpanNotQuery removes spans
matching one SpanQuery which overlap spans matching
another. This can be used, e.g., to implement within-paragraph
search.
- A SpanFirstQuery matches spans
matching q whose end position is less than n. This can be used to
constrain matches to the first part of the document.
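The filtering behaviour of the last two operators can be sketched without Lucene at all. The following is only an illustration under stated assumptions, not the library's implementation: the Span class and the not and first helpers are hypothetical names, end positions are assumed to be exclusive, and the "less than n" wording above is taken literally.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanFilterSketch {
    /** Hypothetical span value: a document id plus start and end term positions. */
    static final class Span {
        final int doc, start, end;
        Span(int doc, int start, int end) {
            this.doc = doc;
            this.start = start;
            this.end = end;
        }
    }

    /** Assumption: ends are exclusive, so two spans in one document overlap
     *  when each starts before the other ends. */
    static boolean overlaps(Span a, Span b) {
        return a.doc == b.doc && a.start < b.end && b.start < a.end;
    }

    /** SpanNotQuery-like filter: keep spans from include that overlap no span in exclude. */
    static List<Span> not(List<Span> include, List<Span> exclude) {
        List<Span> kept = new ArrayList<>();
        for (Span candidate : include) {
            boolean excluded = false;
            for (Span x : exclude) {
                if (overlaps(candidate, x)) {
                    excluded = true;
                    break;
                }
            }
            if (!excluded) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    /** SpanFirstQuery-like filter: keep spans whose end position is less than n. */
    static List<Span> first(List<Span> spans, int n) {
        List<Span> kept = new ArrayList<>();
        for (Span s : spans) {
            if (s.end < n) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```

For within-paragraph search, the exclude list would hold the spans of whatever marker separates paragraphs, so any candidate that straddles a marker is dropped.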
In all cases, output spans are minimally inclusive. In other words, a
span formed by combining a span from x with a span from y starts at the
lesser of the two starts and ends at the greater of the two ends.
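As a small worked illustration of the rule, the sketch below computes the minimally inclusive span of two matches. It does not use Lucene; the Span class and the cover method are hypothetical names introduced only for this example.

```java
final class SpanCoverSketch {
    /** Hypothetical span value: a document id plus start and end term positions. */
    static final class Span {
        final int doc, start, end;
        Span(int doc, int start, int end) {
            this.doc = doc;
            this.start = start;
            this.end = end;
        }
    }

    /** Smallest span containing both inputs: the lesser start and the greater end. */
    static Span cover(Span x, Span y) {
        if (x.doc != y.doc) {
            throw new IllegalArgumentException("spans must come from the same document");
        }
        return new Span(x.doc, Math.min(x.start, y.start), Math.max(x.end, y.end));
    }
}
```

For instance, combining a match at positions 3 to 5 with a match at positions 9 to 11 in the same document yields a span from 3 to 11, regardless of the order in which the two are passed.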
For example, a span query which matches "John Kerry" within ten
words of "George Bush" within the first 100 words of the document
could be constructed with:
SpanQuery john = new SpanTermQuery(new Term("content", "john"));
SpanQuery kerry = new SpanTermQuery(new Term("content", "kerry"));
SpanQuery george = new SpanTermQuery(new Term("content", "george"));
SpanQuery bush = new SpanTermQuery(new Term("content", "bush"));

SpanQuery johnKerry =
    new SpanNearQuery(new SpanQuery[] {john, kerry}, 0, true);

SpanQuery georgeBush =
    new SpanNearQuery(new SpanQuery[] {george, bush}, 0, true);

SpanQuery johnKerryNearGeorgeBush =
    new SpanNearQuery(new SpanQuery[] {johnKerry, georgeBush}, 10, false);

SpanQuery johnKerryNearGeorgeBushAtStart =
    new SpanFirstQuery(johnKerryNearGeorgeBush, 100);
Span queries may be freely intermixed with other Lucene queries.
So, for example, the above query can be restricted to documents which
also use the word "iraq" with:
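// Declared as BooleanQuery so that add() is available on the variable.
BooleanQuery query = new BooleanQuery();
// Arguments to add() are (clause, required, prohibited): both clauses must match.
query.add(johnKerryNearGeorgeBushAtStart, true, false);
query.add(new TermQuery(new Term("content", "iraq")), true, false);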