Tuesday, June 14, 2005

Suggest with Lucene II

As you saw in an earlier post, I put together a small scheme for producing suggestions with Lucene. I have now finally finished it and put it to work in one of the systems I worked on recently. It is remarkable how Lucene lets you build interesting things without much complexity; simplicity really is a strong point of this API. To start, I refactored the Suggest class so that it stores the original word, the suggestion, and the number of occurrences of that suggestion. The final result is:

public class Suggest implements Comparable<Suggest> {

    private String suggest;
    private String original;
    private int occurrences;

    /**
     * @param suggest the suggested word
     * @param original the word as it appeared in the query
     * @param occurrences the number of occurrences of the suggestion
     */
    public Suggest ( String suggest, String original, int occurrences ) {

        this.suggest = suggest;
        this.original = original;
        this.occurrences = occurrences;
    }

    /**
     * Two suggests are equal when they hold the same word (ignoring case)
     * and the same number of occurrences.
     *
     * @see java.lang.Object#equals(java.lang.Object)
     */
    public boolean equals( Object obj ) {

        boolean result = false;

        if (obj instanceof Suggest) {

            Suggest s = (Suggest) obj;

            String sWord = s.suggest.toLowerCase();
            String thisWord = suggest.toLowerCase();

            result = sWord.equals(thisWord) && occurrences == s.occurrences;
        }

        return result;
    }

    /**
     * @see java.lang.Object#hashCode()
     */
    public int hashCode() {

        return suggest.toLowerCase().hashCode() * occurrences;
    }

    /**
     * @see java.lang.Object#toString()
     */
    public String toString() {

        return suggest + " - " + occurrences;
    }

    /**
     * Orders suggests with more occurrences first.
     *
     * @see java.lang.Comparable#compareTo(T)
     */
    public int compareTo( Suggest other ) {

        return other.occurrences - occurrences;
    }

    // getters only (omitted here); the object is immutable

}
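To make the ordering defined by compareTo concrete, here is a minimal, self-contained sketch. RankedSuggest and SuggestOrderingDemo are hypothetical stand-ins written only for this illustration (they are not part of the original code), and the words and counts are made up. It uses Integer.compare, which needs a newer Java than the listing above targets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for Suggest, reduced to what the ordering needs.
final class RankedSuggest implements Comparable<RankedSuggest> {
    final String suggest;
    final String original;
    final int occurrences;

    RankedSuggest(String suggest, String original, int occurrences) {
        this.suggest = suggest;
        this.original = original;
        this.occurrences = occurrences;
    }

    // Same rule as Suggest.compareTo: more occurrences sorts first.
    @Override
    public int compareTo(RankedSuggest other) {
        return Integer.compare(other.occurrences, occurrences);
    }
}

class SuggestOrderingDemo {
    // Returns the suggested words ordered from most to least frequent.
    static List<String> byFrequency(List<RankedSuggest> suggests) {
        List<RankedSuggest> sorted = new ArrayList<>(suggests);
        Collections.sort(sorted);
        List<String> words = new ArrayList<>();
        for (RankedSuggest s : sorted) {
            words.add(s.suggest);
        }
        return words;
    }

    public static void main(String[] args) {
        List<RankedSuggest> found = new ArrayList<>();
        found.add(new RankedSuggest("lucent", "luceni", 3));
        found.add(new RankedSuggest("lucene", "luceni", 42));
        found.add(new RankedSuggest("lucerne", "luceni", 7));

        System.out.println(byFrequency(found));             // [lucene, lucerne, lucent]
        System.out.println(Collections.min(found).suggest); // lucene
    }
}
```

Because the comparison is reversed, Collections.min returns the most frequent suggestion, not the least frequent one.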

Keeping the original word is useful because it tells me whether a suggestion was actually made. For a single word that is not really necessary, but for phrases it matters, because I want to highlight only the suggestions and not every word in the query. I also refactored parts of the Suggestor class to keep it from suggesting the word itself, made some small performance improvements, added suggestions for phrases, and made the simple adaptations that followed from refactoring Suggest. The nice part was seeing the client classes of Suggestor work perfectly after all of that. Anyway, the new Suggestor code is:

public class Suggestor {

    /**
     * Pattern used to split a phrase into words: runs of punctuation
     * and whitespace.
     */
    public static final String PUNCTUATION_SPLITER = "[\\p{Punct}\\s]+";

    private static final Log logger = LogFactory.getLog(Suggestor.class);

    private Directory lucenePath;
    private Analyzer analyzer;
    private float minSimilarity = FuzzyQuery.defaultMinSimilarity;

    /**
     * @param path the index directory
     * @param analyzer the analyzer used to tokenize phrases
     * @param similarity the minimum similarity for fuzzy matches
     */
    public Suggestor ( Directory path, Analyzer analyzer, float similarity ) {

        lucenePath = path;
        minSimilarity = similarity;
        this.analyzer = analyzer;
    }

    /**
     * @param path the resource pointing at the index directory
     * @param analyzer the analyzer used to tokenize phrases
     * @param similarity the minimum similarity for fuzzy matches
     * @throws IOException if the directory cannot be opened
     */
    public Suggestor ( Resource path, Analyzer analyzer, float similarity )
            throws IOException {

        this(FSDirectory.getDirectory(path.getFile(), false), analyzer, similarity);
    }

    /**
     * Suggestions for a word, most frequent first.
     */
    public List<Suggest> suggestsByFrequency( String word, String field ) {

        List<Suggest> suggests = suggestsBySimilarity(word, field);

        Collections.sort(suggests);

        return suggests;
    }

    /**
     * Suggestions for a word, most frequent first, limited to
     * maxSuggestions entries.
     */
    public List<Suggest> suggestsByFrequency( String word, String field,
            int maxSuggestions ) {

        List<Suggest> suggests = suggestsByFrequency(word, field);

        suggests = chopList(maxSuggestions, suggests);

        return suggests;
    }

    /**
     * Suggestions for a word in the order the fuzzy enumeration returns
     * them, limited to maxSuggestions entries.
     */
    public List<Suggest> suggestsBySimilarity( String word, String field,
            int maxSuggestions ) {

        List<Suggest> suggests = suggestsBySimilarity(word, field);

        suggests = chopList(maxSuggestions, suggests);

        return suggests;
    }

    /**
     * Suggestions for a word in the order the fuzzy enumeration returns them.
     *
     * @throws LuceneIndexException if the index cannot be read
     */
    public List<Suggest> suggestsBySimilarity( String word, String field ) {

        IndexReader reader = null;
        List<Suggest> suggests = new ArrayList<Suggest>();

        try {

            synchronized (LuceneMonitor.LUCENE_MONITOR) {

                reader = IndexReader.open(lucenePath);
                Term term = new Term(field, word);

                FuzzyTermEnum termEnum;
                termEnum = new FuzzyTermEnum(reader, term, minSimilarity);

                suggests = termEnumToList(word, termEnum);
            }

        } catch (IOException ex) {

            throw new LuceneIndexException(ex);

        } finally {

            try {

                if (reader != null) reader.close();

            } catch (Exception ex) {

                if (logger.isDebugEnabled()) {
                    logger.debug("Could not close reader", ex);
                }
            }
        }

        filterSuggests(suggests, word);

        return suggests;
    }

    // Removes every suggestion that is just one of the words already typed.
    private void filterSuggests( List<Suggest> suggests, String word ) {

        Predicate predicate = new AvoidWordItseltPredicate(word);
        CollectionUtils.filter(suggests, predicate);
    }

    /**
     * One suggestion (the most frequent one) for each token of the phrase
     * that has fuzzy matches in the index.
     */
    public List<Suggest> phrasalSuggest( String phrase, String field ) {

        List<Suggest> suggests = new ArrayList<Suggest>();

        IndexReader indexReader = null;
        TokenStream tokenStream = null;

        try {

            Reader reader = new StringReader(phrase);
            tokenStream = analyzer.tokenStream(field, reader);

            synchronized (LuceneMonitor.LUCENE_MONITOR) {

                indexReader = IndexReader.open(lucenePath);

                Token token;
                while ((token = tokenStream.next()) != null) {

                    Term term = new Term(field, token.termText());

                    FuzzyTermEnum termEnum;
                    termEnum = new FuzzyTermEnum(indexReader, term, minSimilarity);

                    List<Suggest> temp = termEnumToList(token.termText(), termEnum);

                    if (!temp.isEmpty()) {

                        // compareTo puts the most frequent suggest first,
                        // so min() is the one with the most occurrences
                        suggests.add(Collections.min(temp));
                    }
                }
            }

        } catch (IOException ex) {

            throw new LuceneIndexException(ex);

        } finally {

            try {

                if (indexReader != null) indexReader.close();

            } catch (IOException ex) {

                if (logger.isDebugEnabled()) {
                    logger.debug("Could not close reader", ex);
                }
            }
        }

        filterSuggests(suggests, phrase);

        return suggests;
    }

    // Turns each enumerated term into a Suggest carrying its document frequency.
    private List<Suggest> termEnumToList( String word, TermEnum termEnum )
            throws IOException {

        List<Suggest> suggests = new ArrayList<Suggest>();

        while (termEnum.next()) {

            Term term = termEnum.term();

            String termValue = term.text();
            int frequency = termEnum.docFreq();

            suggests.add(new Suggest(termValue, word, frequency));
        }

        return suggests;
    }

    private List<Suggest> chopList( int maxSuggestions, List<Suggest> suggests ) {

        if (suggests.size() > maxSuggestions) {
            suggests = suggests.subList(0, maxSuggestions);
        }

        return suggests;
    }

    /**
     * Accepts only suggests whose word is not one of the given words.
     *
     * @author Marcos Silva Pereira - marcos.pereira@vicinity.com.br
     *
     * @since 24/05/2005
     * @version $Id$
     */
    static class AvoidWordItseltPredicate implements Predicate {

        private Set<String> words;

        /**
         * @param words a word or phrase; it is split into lower-cased words
         */
        public AvoidWordItseltPredicate( String words ) {

            this(makeSet(words));
        }

        /**
         * @param words the lower-cased words to reject
         */
        public AvoidWordItseltPredicate( Set<String> words ) {

            this.words = words;
        }

        /**
         * @see org.apache.commons.collections.Predicate#evaluate(java.lang.Object)
         */
        public boolean evaluate( Object obj ) {

            boolean result = false;

            if (obj instanceof Suggest) {

                Suggest suggest = (Suggest) obj;
                String word = suggest.getSuggest().toLowerCase();

                result = !words.contains(word);
            }

            return result;
        }

        private static Set<String> makeSet( String words ) {

            Set<String> set = new HashSet<String>();

            String[] strings = words.split(PUNCTUATION_SPLITER);

            for (String string : strings) {
                set.add(string.toLowerCase());
            }

            return set;
        }

    }
}
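The self-suggestion filter can also be tried in isolation, without Lucene or Commons Collections. The sketch below is only an illustration under assumptions: SelfSuggestionFilter and its methods are hypothetical names, it works on plain strings instead of Suggest objects, it uses removeIf from newer Java versions, and it lower-cases with Locale.ROOT where the original uses the default locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SelfSuggestionFilter {
    // Same splitting pattern as Suggestor.PUNCTUATION_SPLITER.
    static final String SPLITTER = "[\\p{Punct}\\s]+";

    // Lower-cased words of the query; empty pieces from leading separators are skipped.
    static Set<String> queryWords(String query) {
        Set<String> words = new HashSet<>();
        for (String word : query.split(SPLITTER)) {
            if (!word.isEmpty()) {
                words.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    // Returns the candidates that are not already words of the query.
    static List<String> withoutQueryWords(List<String> candidates, String query) {
        Set<String> words = queryWords(query);
        List<String> kept = new ArrayList<>(candidates);
        kept.removeIf(candidate -> words.contains(candidate.toLowerCase(Locale.ROOT)));
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("Jakarta", "lucene", "lucent");
        // "Jakarta" is dropped because the query already contains "jakarta".
        System.out.println(withoutQueryWords(candidates, "jakarta luceni")); // [lucene, lucent]
    }
}
```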

The phrasalSuggest method uses the Analyzer to tokenize the words, which keeps me from trying to make suggestions for stop words, for example. AvoidWordItseltPredicate is a simple implementation of Predicate, an interface from Jakarta Commons Collections, and it is what filters the set of suggestions so that the word itself is never suggested. With those classes changed, I created a helper that generates HTML from a phrase and its set of suggestions, nothing special, as you can see below:

public class HTMLPhrasalSuggest {

    private HTMLPhrasalSuggest () {

        // private constructor to avoid instantiation...
    }

    /**
     * Rebuilds the phrase with each suggested word wrapped in the given tag.
     *
     * @param phrase the phrase as typed
     * @param suggests the suggestions, in the order of the words they replace
     * @param tag the name of the HTML tag used to highlight suggestions
     */
    public static String htmlPhrasalSuggest( String phrase,
            List<Suggest> suggests, String tag ) {

        StringBuilder result = new StringBuilder();

        String[] wordsInPhrase = phrase.split(Suggestor.PUNCTUATION_SPLITER);

        int i = 0;
        for (String string : wordsInPhrase) {

            if (hasSuggest(string, suggests)) {

                result.append(openTag(tag));
                result.append(suggests.get(i++).getSuggest());
                result.append(closeTag(tag));

            } else {

                result.append(string);
            }

            result.append(" ");
        }

        return result.toString().trim();
    }

    /**
     * Rebuilds the phrase as plain text, with the suggested words in place.
     *
     * @param phrase the phrase as typed
     * @param suggests the suggestions, in the order of the words they replace
     */
    public static String phrasalSuggest( String phrase, List<Suggest> suggests ) {

        StringBuilder result = new StringBuilder();

        String[] wordsInPhrase = phrase.split(Suggestor.PUNCTUATION_SPLITER);

        int i = 0;
        for (String string : wordsInPhrase) {

            if (hasSuggest(string, suggests)) {

                result.append(suggests.get(i++).getSuggest());

            } else {

                result.append(string);
            }

            result.append(" ");
        }

        return result.toString().trim();
    }

    // True when some suggest was made for this word of the phrase.
    private static boolean hasSuggest( String word, List<Suggest> suggests ) {

        boolean result = false;

        for (Suggest suggest : suggests) {

            if (suggest.getOriginal().equalsIgnoreCase(word)) {

                result = true;
                break;
            }
        }

        return result;
    }

    private static String openTag( String tag ) {

        return "<" + tag + ">";
    }

    private static String closeTag( String tag ) {

        return "</" + tag + ">";
    }

}
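The helper above pairs words and suggestions by position (the i counter), which relies on the suggestion list being in the same order as the words that matched. A variant that looks each word up by its original form avoids that dependency. This is only a sketch under assumptions: PhraseRewriter is a hypothetical name, and the suggestions are passed as a plain map from the lower-cased original word to the chosen suggestion instead of a list of Suggest objects.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class PhraseRewriter {
    // Rebuilds the phrase, swapping each word that has a suggestion and
    // wrapping the replacement in <tag>...</tag>; other words are copied as-is.
    static String rewrite(String phrase, Map<String, String> suggestionByOriginal, String tag) {
        StringBuilder result = new StringBuilder();
        for (String word : phrase.split("[\\p{Punct}\\s]+")) {
            if (word.isEmpty()) {
                continue; // leading separators produce an empty first piece
            }
            if (result.length() > 0) {
                result.append(' ');
            }
            String suggestion = suggestionByOriginal.get(word.toLowerCase(Locale.ROOT));
            if (suggestion != null) {
                result.append('<').append(tag).append('>')
                      .append(suggestion)
                      .append("</").append(tag).append('>');
            } else {
                result.append(word);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Keys are lower-cased original words, values the chosen suggestion.
        Map<String, String> suggestions = new LinkedHashMap<>();
        suggestions.put("luceni", "lucene");

        System.out.println(rewrite("jakarta luceni", suggestions, "em")); // jakarta <em>lucene</em>
    }
}
```

A repeated misspelling is replaced every time it appears, since the lookup does not consume the suggestion.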

The private method hasSuggest shows why keeping the original word is useful. If the word appears as the original of some entry in the set of suggestions, a suggestion was made for it. For example, "jakarta luceni" produces a set of suggestions containing only "lucene", since "jakarta" is spelled correctly. Finally, all of this is used through a tag I wrote on top of the WebWork tag support, with OGNL and everything:

public class LuceneSuggestTag extends WebWorkTagSupport {

    private String query;
    private String field;
    private String url;
    private String tag;

    private Suggestor suggestor;

    /**
     * Writes a link whose target is url plus the plain-text suggestion and
     * whose label is the highlighted suggestion.
     *
     * @see javax.servlet.jsp.tagext.BodyTagSupport#doEndTag()
     */
    public int doEndTag() throws JspException {

        try {

            List<Suggest> suggests = suggestor.phrasalSuggest(query, field);

            String htmlSuggest;
            htmlSuggest = HTMLPhrasalSuggest.htmlPhrasalSuggest(query, suggests, tag);

            String textSuggest;
            textSuggest = HTMLPhrasalSuggest.phrasalSuggest(query, suggests);

            StringBuilder toShow = new StringBuilder();
            toShow.append("<a href=\"").append(url).append(textSuggest);
            toShow.append("\">").append(htmlSuggest).append("</a>");

            Writer writer = pageContext.getOut();
            writer.write(toShow.toString());

        } catch (Exception ex) {

            throw new JspException(ex.getMessage(), ex);
        }

        return EVAL_PAGE;
    }

    /**
     * Looks up the Suggestor in the servlet context and evaluates the
     * query and field attributes as OGNL expressions.
     *
     * @see javax.servlet.jsp.tagext.BodyTagSupport#doStartTag()
     */
    public int doStartTag() throws JspException {

        try {

            ServletContext servletContext = pageContext.getServletContext();

            suggestor = (Suggestor) servletContext.getAttribute("luceneSuggestor");

            query = String.valueOf(getStack().findValue(query, String.class));
            field = String.valueOf(getStack().findValue(field, String.class));

            return SKIP_BODY;

        } catch (Exception ex) {

            throw new JspException(ex.getMessage(), ex);
        }
    }

    // setters and getters for the attributes.

}
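The tag writes the query and the suggestion into the page verbatim. A rough sketch of how the link could be built with escaping follows. SuggestLinkBuilder and its methods are hypothetical names for this illustration only. For simplicity it uses the plain-text suggestion as the link label; to keep the highlight tags, each word would have to be escaped before being wrapped.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SuggestLinkBuilder {
    // Replaces the characters that are significant in HTML text and attributes.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Builds <a href="url + URL-encoded query">escaped query</a>.
    static String link(String url, String query) {
        String encoded;
        try {
            encoded = URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 should always be available", ex);
        }
        return "<a href=\"" + escapeHtml(url + encoded) + "\">" + escapeHtml(query) + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(link("Search.pc?query=", "jakarta lucene"));
        System.out.println(link("Search.pc?query=", "<b>lucene"));
    }
}
```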

And the usage, in my case in a JSP:
<%@ taglib prefix="ww" uri="webwork" %>
<%@ taglib prefix="lucene" uri="lucene" %>

...

<lucene:suggest field="'PageText'" query="query" tag="em" url="Search.pc?query=" />

...

Now, things I still need to improve:
1. Spare the view from having to name the field searched for suggestions (PageText);
2. Suggestions for words glued together (comunidadeblastemica -> comunidade blastemica);
3. Make the tag library compatible with the HTTP POST method;
4. HTML escaping, so that someone malicious cannot send queries like <javascript bla bla bla>
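
For item 2, one possible starting point is to try every split position and keep the splits where both halves are known words. The sketch below is an assumption-laden illustration, not something already implemented: GluedWordSplitter is a hypothetical name and the known words live in a plain in-memory set. Against a real index, the membership test could presumably be a document-frequency lookup on each half.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class GluedWordSplitter {
    // Returns every "left right" split in which both halves are known words.
    static List<String> splits(String glued, Set<String> knownWords) {
        List<String> result = new ArrayList<>();
        for (int i = 1; i < glued.length(); i++) {
            String left = glued.substring(0, i);
            String right = glued.substring(i);
            if (knownWords.contains(left) && knownWords.contains(right)) {
                result.add(left + " " + right);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up vocabulary for the example from the list above.
        Set<String> known = new HashSet<>(Arrays.asList("comunidade", "blastemica", "com"));
        System.out.println(splits("comunidadeblastemica", known)); // [comunidade blastemica]
    }
}
```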

That's it. Suggestions and comments are very welcome.

Cheers...
