Java - Way to improve n-gram generation?


I know there are many threads with this name. I have code to generate n-grams. Can it be improved for better speed when handling thousands of strings?

Example: string = "abcdefghijkl1245ty789"

    public static String[] ngrams(String s) {
        int len = 12;
        String[] parts = s.split("(?!^)");
        String[] result = new String[parts.length - len + 1];
        for (int i = 0; i < parts.length - len + 1; i++) {
            StringBuilder sb = new StringBuilder();
            for (int k = 0; k < len; k++) {
                sb.append(parts[i + k]);
            }
            result[i] = sb.toString();
        }
        return result;
    }

The above code takes a string and generates n-grams of a given length, in this case 12.

Sure:

    public static String[] ngrams(String str, int length) {
        char[] chars = str.toCharArray();
        final int resultCount = chars.length - length + 1;
        String[] result = new String[resultCount];
        for (int i = 0; i < resultCount; i++) {
            // copies 'length' chars starting at offset i
            result[i] = new String(chars, i, length);
        }
        return result;
    }

The changes I made:

  • Instead of splitting via a regexp, I used String#toCharArray(), which does a single array copy and is therefore faster.
  • Instead of rebuilding the resulting strings with a StringBuilder, I used an appropriate String constructor which, again, does only a single arraycopy.
  • (Not needed for performance, but still) I changed the method signature to take length as a parameter, for testing purposes. Feel free to change it back - just make sure you rename the method from ngrams() to ngrams12() or something.

Or drop the char array altogether and use the naïve approach with String#substring(), which does similar work under the hood:

    public static String[] ngramsSubstring(String str, int length) {
        final int resultCount = str.length() - length + 1;
        String[] result = new String[resultCount];
        for (int i = 0; i < resultCount; i++) {
            result[i] = str.substring(i, i + length);
        }
        return result;
    }
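
One thing to watch for with either version: the result array is sized as the string length minus the n-gram length plus one, which is negative whenever the input is shorter than the n-gram length, and Java then throws a NegativeArraySizeException. If some of your thousands of strings can be that short, a guard may help. Below is a minimal sketch of the substring variant with such a guard; the class name NGramDemo and the method name ngramsSafe are made up for illustration, and returning an empty array for short inputs is an assumption about what you want in that case.

```java
import java.util.Arrays;

final class NGramDemo {

    // Hypothetical variant of the substring approach with input guards.
    // Assumption: inputs shorter than 'length' should yield an empty array.
    static String[] ngramsSafe(String str, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        final int resultCount = str.length() - length + 1;
        if (resultCount <= 0) {
            return new String[0]; // input too short for even one n-gram
        }
        String[] result = new String[resultCount];
        for (int i = 0; i < resultCount; i++) {
            result[i] = str.substring(i, i + length);
        }
        return result;
    }

    public static void main(String[] args) {
        // The example string from the question has 21 chars, so 10 12-grams.
        String[] grams = ngramsSafe("abcdefghijkl1245ty789", 12);
        System.out.println(grams.length);            // prints 10
        System.out.println(Arrays.toString(grams));
        System.out.println(ngramsSafe("short", 12).length); // prints 0
    }
}
```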

By the way, if you ever have to use a regexp in the future, try compiling it once and reusing it instead of compiling it every time the method gets called. For example, your code would look like:

    private static final Pattern EVERY_CHAR = Pattern.compile("(?!^)");

And then, in the method, instead of String#split, you'd use

    String[] parts = EVERY_CHAR.split(str);
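
Put together, the precompiled-pattern idea could look roughly like the sketch below. The class name PrecompiledSplitDemo and the helper splitChars are hypothetical names used only for this example; the only library calls are Pattern.compile and Pattern#split from java.util.regex.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class PrecompiledSplitDemo {

    // Compiled once when the class is loaded, then reused by every call.
    // "(?!^)" matches the empty string at every position except the start.
    private static final Pattern EVERY_CHAR = Pattern.compile("(?!^)");

    // Hypothetical helper: splits a string into one-character strings.
    static String[] splitChars(String str) {
        return EVERY_CHAR.split(str);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitChars("abc"))); // prints [a, b, c]
    }
}
```

For n-gram generation itself the char array or substring versions are still the better fit; this only shows the compile-once pattern for cases where a regexp is really needed.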
