---
title: "module analysis::text::search::Lucene"
id: Lucene
slug: /Packages/org.rascalmpl.rascal-lucene/API/analysis/text/search/Lucene
---

<div class="theme-doc-version-badge badge badge--secondary">rascal-0.41.2</div> <div class="theme-doc-version-badge badge badge--secondary">org.rascalmpl.rascal-lucene-0.1.0</div>

Simple interface to the Lucene text analysis library.

#### Usage

```rascal
import analysis::text::search::Lucene;
```



This module wraps the Apache Lucene framework for text analysis.

* It integrates deeply by exposing Lucene's analysis extension points (Analyzers, Tokenizers, and Filters) as Rascal callback functions.
* It provides access to the full library of Lucene's text analyzers via their class names.
* It is a work in progress. Some of Lucene's configurability is not yet exposed, for example programmable weights for fields and the definition of similarity functions per document field. Query expressions are not yet exposed either.
* It abstracts fully over source locations: both the directory of the index and the locations of input documents can be expressed as any Rascal `loc`.


## data Document {#analysis-text-search-Lucene-Document}
A Lucene document has a `src` and an open set of keyword fields, which are also indexed.

```rascal
data Document  
     = document(loc src, real score=.0)
     ;
```


A Lucene document has a `src` origin and an open set of keyword fields.
Add as many keyword fields to a document as you want. They are added to the Lucene document as "Fields".

* fields of type `str` will be stored and indexed as-is
* fields of type `loc` will be indexed but not stored
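
As a minimal sketch of the open set of keyword fields described above: the field names `title` and `notes`, their defaults, and the file locations are hypothetical and chosen only for illustration. The sketch assumes that extra fields are declared as common keyword parameters on `Document`.

```rascal
module DocumentFieldsExample

import analysis::text::search::Lucene;

// Hypothetical extra fields, declared once as common keyword parameters of Document.
data Document(str title = "", loc notes = |unknown:///|);

// `title` is a `str` field (stored and indexed as-is);
// `notes` is a `loc` field (indexed but not stored).
Document exampleDocument()
  = document(|file:///tmp/report.txt|,
             title = "Quarterly report",
             notes = |file:///tmp/report-notes.txt|);
```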

## data Analyzer {#analysis-text-search-Lucene-Analyzer}

```rascal
data Analyzer  
     = analyzerClass(str analyzerClassName)
     | analyzer(Tokenizer tokenizer, list[Filter] pipe)
     ;
```

## data Analyzer {#analysis-text-search-Lucene-Analyzer}
A `fieldsAnalyzer` uses keyword fields to declare which `Analyzer` to use for which `Document` field.

```rascal
data Analyzer  
     = fieldsAnalyzer(Analyzer src)
     ;
```


The `src` parameter of `fieldsAnalyzer` aligns with the `src` parameter of a `Document`: this analyzer is used
to analyze the `src` field. Any other keyword field of type `Analyzer` is applied to the contents of the
`Document` keyword field of type `loc` or `str` that has the same name.
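
The name-based pairing can be sketched as follows. The `title` field is hypothetical, and the sketch assumes that both `Document` and `Analyzer` are extended with a keyword field of the same name.

```rascal
module FieldsAnalyzerExample

import analysis::text::search::Lucene;

// Hypothetical Document field to be indexed.
data Document(str title = "");

// Analyzer keyword field with the same name as the Document field.
data Analyzer(Analyzer title = standardAnalyzer());

Analyzer perFieldAnalyzer()
  = fieldsAnalyzer(
      standardAnalyzer(),            // analyzes the `src` field
      title = whitespaceAnalyzer()   // analyzes the `title` field
    );
```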

## data Term {#analysis-text-search-Lucene-Term}

```rascal
data Term  
     = term(str chars, loc src, str kind)
     ;
```

## data Tokenizer {#analysis-text-search-Lucene-Tokenizer}

```rascal
data Tokenizer  
     = tokenizer(list[Term] (str input) tokenizerFunction)
     | tokenizerClass(str tokenizerClassName)
     ;
```

## data Filter {#analysis-text-search-Lucene-Filter}

```rascal
data Filter  
     = \editFilter(str (str term) editor)
     | \removeFilter(bool (str term) accept)
     | \splitFilter(list[str] (str term) splitter)
     | \synonymFilter(list[str] (str term) generator)
     | \tagFilter(str (str term, str current) tagger)
     | \filterClass(str filterClassName)
     ;
```
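
A hedged sketch of how the `Analyzer`, `Tokenizer`, and `Filter` constructors might be combined. The hyphen-stripping editor is a made-up example, the order in which the filters in `pipe` are applied is an assumption, and whether the named Lucene class is available depends on the Lucene version bundled with the library.

```rascal
module CustomAnalyzerExample

import analysis::text::search::Lucene;
import String;

// A built-in tokenizer followed by a filter pipeline:
// a class-backed lower-casing filter, then a Rascal callback that edits each term.
Analyzer hyphenFreeAnalyzer()
  = analyzer(classicTokenizer(), [
      lowerCaseFilter(),
      editFilter(str (str term) { return replaceAll(term, "-", ""); })
    ]);

// An analyzer referenced by its Java class name (assumed to be on the classpath).
Analyzer byClassName()
  = analyzerClass("org.apache.lucene.analysis.standard.StandardAnalyzer");
```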

## function createIndex {#analysis-text-search-Lucene-createIndex}

Creates a Lucene index at the given folder location from the given set of `Document`s, using the given text analyzers.

```rascal
void createIndex(loc index, set[Document] documents, Analyzer analyzer = standardAnalyzer(), str charset="UTF-8", bool inferCharset=!(charset?))
```
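
A minimal usage sketch; the input files and the index folder are hypothetical locations.

```rascal
module CreateIndexExample

import analysis::text::search::Lucene;

void buildIndex() {
  // Hypothetical input documents.
  set[Document] docs = {
    document(|file:///tmp/docs/a.txt|),
    document(|file:///tmp/docs/b.txt|)
  };

  // Index them with a non-default analyzer; `charset` keeps its default of "UTF-8".
  createIndex(|file:///tmp/index|, docs, analyzer = simpleAnalyzer());
}
```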

## function searchIndex {#analysis-text-search-Lucene-searchIndex}

Searches the Lucene index at the given `index` location by analyzing the query with the given text analyzers and then matching the query against the index.

```rascal
set[Document] searchIndex(loc index, str query, Analyzer analyzer = standardAnalyzer(), int max = 10)
```
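
A minimal usage sketch, assuming an index already exists at the hypothetical location `|file:///tmp/index|` and was built with the default analyzer. The query word is arbitrary.

```rascal
module SearchIndexExample

import analysis::text::search::Lucene;
import IO;

void findHits() {
  // Ask for at most five matching documents.
  set[Document] hits = searchIndex(|file:///tmp/index|, "lucene", max = 5);

  for (Document d <- hits) {
    // Every result carries its source location and a score.
    println("<d.src> scored <d.score>");
  }
}
```

Using the same analyzer for searching as for indexing is usually the safer choice, since the query is analyzed before it is matched.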

## function searchDocument {#analysis-text-search-Lucene-searchDocument}

Searches a document for a query by analyzing it with a given analyzer and listing the hits inside the document, for debugging and reporting purposes.

```rascal
list[loc] searchDocument(loc doc, str query, Analyzer analyzer = standardAnalyzer(), int max = 10, str charset="UTF-8", bool inferCharset=!(charset?))
```

## function analyzeDocument {#analysis-text-search-Lucene-analyzeDocument}

Simulates analyzing a document at a source location, as `createIndex` would do, for debugging purposes.

```rascal
list[Term] analyzeDocument(loc doc, Analyzer analyzer = standardAnalyzer())
```

## function analyzeDocument {#analysis-text-search-Lucene-analyzeDocument}

Simulates analyzing a document given as a source string, as `createIndex` would do, for debugging purposes.

```rascal
list[Term] analyzeDocument(str doc, Analyzer analyzer = standardAnalyzer())
```
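
A small debugging sketch using the string variant; the input text is arbitrary, and the exact terms produced depend on the chosen analyzer.

```rascal
module AnalyzeDocumentExample

import analysis::text::search::Lucene;
import IO;

void showTerms() {
  // Print each term together with its kind and source location.
  for (Term t <- analyzeDocument("Hello Lucene World", analyzer = whitespaceAnalyzer())) {
    println("<t.chars> (<t.kind>) at <t.src>");
  }
}
```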

## function listTerms {#analysis-text-search-Lucene-listTerms}

Inspect the terms stored in an index for debugging purposes (what did the analyzers do to the content of the documents?)

```rascal
rel[str chars, int frequency] listTerms(loc index, str field, int max = 10)
```

## function listFields {#analysis-text-search-Lucene-listFields}

Inspect the fields stored in an index for debugging purposes (which fields have been indexed, for how many documents, and how many terms?)

```rascal
rel[str field, int docCount, int sumTotalTermFreq] listFields(loc index)
```
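
The two inspection functions, `listFields` and `listTerms`, can be combined as in this sketch. The helper name `inspectIndex` is hypothetical, and the index is assumed to exist already.

```rascal
module InspectIndexExample

import analysis::text::search::Lucene;
import IO;

// Hypothetical helper: prints every indexed field with up to five of its terms.
void inspectIndex(loc index) {
  for (<str field, int docCount, int sumTotalTermFreq> <- listFields(index)) {
    println("<field>: <docCount> documents, <sumTotalTermFreq> term occurrences");

    for (<str chars, int frequency> <- listTerms(index, field, max = 5)) {
      println("  <chars>: <frequency>");
    }
  }
}
```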

## function classicAnalyzer {#analysis-text-search-Lucene-classicAnalyzer}

```rascal
Analyzer classicAnalyzer()
```

## function simpleAnalyzer {#analysis-text-search-Lucene-simpleAnalyzer}

```rascal
Analyzer simpleAnalyzer()
```

## function standardAnalyzer {#analysis-text-search-Lucene-standardAnalyzer}

```rascal
Analyzer standardAnalyzer()
```

## function whitespaceAnalyzer {#analysis-text-search-Lucene-whitespaceAnalyzer}

```rascal
Analyzer whitespaceAnalyzer()
```

## function classicTokenizer {#analysis-text-search-Lucene-classicTokenizer}

```rascal
Tokenizer classicTokenizer()
```

## function lowerCaseTokenizer {#analysis-text-search-Lucene-lowerCaseTokenizer}

```rascal
Tokenizer lowerCaseTokenizer()
```

## function lowerCaseFilter {#analysis-text-search-Lucene-lowerCaseFilter}

```rascal
Filter lowerCaseFilter()
```

