String Search in Java (using the JDK-API)

String search is a common problem, appearing every time we check that a string is contained in another. The JDK Standard API provides some basic functionality for this, yet there are more efficient algorithms for larger patterns and larger texts.

In this posting I present an overview over the string searching features of the Java API, starting with a short definition of string searching and ending with an overview of libraries that provide more features for string searching (more details will follow in later postings).

String-Searching

String searching (or String matching) tries to find a pattern in a text. The result is the set of matches in the text. Contrary to regular expression searching the pattern must be a pure string (not containing wildcards or operators). In the following examples we will call the pattern variable  needle , and the text haystack.

String Searching with java.lang.String

The most familiar way to use string searching in Java is provided by the class java.lang.String – as the method String.indexOf. If the position in the pattern is not necessary the method String.contains  is a even simpler alternative. A typical string matching would look like this:

Java uses a naive string searching algorithm, to determine the position of a match. Naive means that the text is processed character by character, looking for a match with the first character of the pattern. After encountering the first character match a new subroutine is started, comparing each following character with the following characters in the pattern and breaking on a character mismatch. This algorithm is sufficient for small patterns and small texts, but performs quite inefficient if used for long texts with real words.

String Searching with java.util.Matcher

Another not so obvious way to use string searching is provided by  java.util.Matcher with the method java.util.Matcher.find. Properly this method searches for regular expressions, so one would not expect string search here. Yet each string literal has its equivalent as regular expression and so it is technically possible to search strings with the regular expression API:

The concrete implementation would probably look different, but the main question is: Why should one use the slow Matcher.find method for text search:

The main reason is that java.util.Matcher.find does not execute in regex mode if given a fixed string as argument. This method first checks if the given argument is free from regex operators and if so it uses a Boyer-Moore algorithm to process the search. The performance is quite ok – even for longer texts and patterns.

More about String Search …

Generally the Java Standard API covers only a small part of the subject of string search. There are may algorithms, each optimized for other purpose, e.g. in the dimensions of:

  • the size of the alphabet (smaller alphabets vs larger alphabets)
  • the length of the pattern (shorter patters vs longer patterns)
  • the length of the texts (short texts vs long texts)

I will probably discuss some algorithms in my blog in future.

… and Multi-String Search

Besides there exists another related problem, that is not solved by the java API – the Multi-String Search. Instead of searching only one pattern in the text, here multiple patterns are searched – with the expectation that it performs more efficient than multiple single pattern searches in sequence. I wished that calling Matcher.find  with the expression (pattern1|pattern2|...|patternN)  would be optimized in a way as with pattern1, but unfortunately the regex feature of java does optimize only constant string search but not the search for finite languages (regular languages with finite words).

Libraries for String Search

I suppose that string search is a subject that is usually solved by implementing an algorithm from scratch. The few projects that provide string searching features do not resemble the maturity of common popular open source projects – some can be considered at most experimental. To rate such libaries I have developed the benchmark library StringBench with following results:

Features Supported Algorithms Benchmark Passed?
Java-API String Search,
Regular Expressions
naive algorithm,
Boyer-Moore
 yes
ByteSeek String Search,
Wildcard Search,
Reguläre Ausdrücke
Boyer-Moore-Horspool,
Sunday
 yes
StringSearch String Search,
Wildcard Search
Boyer-Moore-Horspool,
Boyer-Moore-Horspool-Raita,
BNDM,
Shift-Or
 no
AhoCorasick Multi-String Search Aho-Corasick  no
StringSearchAlgorithms String Search,
Multi-String Search,
Regular Expression Search (experimental)
Knuth-Morris-Pratt,
Shift-And,
Horspool,
Sunday,
BNDM,
BOM,
Aho-Corasick,
Set-Horspool,
Wu-Manber,
Set-BOM,
 yes

It is obvious that many libraries do not pass the benchmark. Most of the failing libraries fail with timeouts (search takes longer than a minute). This is not acceptable with look on the naive algorithm of the Java API, that passes the benchmark (taking a few seconds). Generally these libraries do their job fine, but probably the ordering/post processing of matches is not done efficiently (feeling like quadratic effort related to the number of matches). This performance problem will hopefully be changed by updates in future.

4 thoughts on “String Search in Java (using the JDK-API)

    • In theory BNDM can be easily extended to support case insensitive search, yet none of the implementations known to me do this job well.

      So just transform both, pattern and text, to lower case. Then any algorithm will do. E.g. with StringSearchAlgorithms, prepare a lower-case pattern, implement a LowerCaseCharProvider and apply them together.

Leave a Reply

Your email address will not be published. Required fields are marked *