I know there are many threads with this name. I have code that generates n-grams, and I would like to know whether it can be improved for better speed when handling thousands of strings.
Example: String s = "abcdefghijkl1245ty789";
    public static String[] ngrams(String s) {
        int len = 12;
        // Split into one-character strings (zero-width split everywhere except the start).
        String[] parts = s.split("(?!^)");
        String[] result = new String[parts.length - len + 1];
        for (int i = 0; i < parts.length - len + 1; i++) {
            StringBuilder sb = new StringBuilder();
            for (int k = 0; k < len; k++) {
                sb.append(parts[i + k]);
            }
            result[i] = sb.toString();
        }
        return result;
    }
The above code takes a string and generates n-grams of a given length, in my case 12.
Sure:
    public static String[] ngrams(String str, int length) {
        char[] chars = str.toCharArray();
        final int resultCount = chars.length - length + 1;
        String[] result = new String[resultCount];
        for (int i = 0; i < resultCount; i++) {
            // Copies `length` chars starting at index i into a new String.
            result[i] = new String(chars, i, length);
        }
        return result;
    }
The changes I made:
- Instead of splitting via regexp, I used String#toCharArray(), which does a single array copy and is therefore faster.
- Instead of rebuilding the resulting strings with StringBuilder, I used an appropriate String constructor which, again, does only a single array copy.
- (Not needed for performance, but still) I changed the method signature to have length as a parameter, for my testing purposes. Feel free to change it back; just make sure you rename the method from ngrams() to ngrams12() or something.
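For readers unfamiliar with the constructor mentioned above, here is a minimal sketch of how String(char[] value, int offset, int count) behaves. The class name CharSliceDemo and the helper slice() are made up for this illustration and are not part of the answer's code.

```java
final class CharSliceDemo {
    // Hypothetical helper: copies `count` chars starting at `offset` into a new String.
    static String slice(String s, int offset, int count) {
        char[] chars = s.toCharArray();          // one copy of the string's characters
        return new String(chars, offset, count); // one copy of the requested range
    }

    public static void main(String[] args) {
        System.out.println(slice("abcdefgh", 2, 3)); // prints "cde"
    }
}
```

In the n-gram method the char[] is created once and reused for every slice, so each n-gram costs one range copy.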
Or drop it altogether and use a naïve approach with String#substring(), which does similar work under the hood:
    public static String[] ngramsSubstring(String str, int length) {
        final int resultCount = str.length() - length + 1;
        String[] result = new String[resultCount];
        for (int i = 0; i < resultCount; i++) {
            result[i] = str.substring(i, i + length);
        }
        return result;
    }
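One edge case worth noting: if the input is shorter than the requested length, resultCount becomes negative and the array allocation throws NegativeArraySizeException. Below is a sketch of the substring variant with a guard added. The class name NGrams and the choice to return an empty array for short input are assumptions made for this example, not something the question specifies.

```java
final class NGrams {
    // Returns all contiguous substrings of `length` chars, in order of position.
    // Assumption: inputs shorter than `length` yield an empty array.
    static String[] ngrams(String str, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        final int resultCount = str.length() - length + 1;
        if (resultCount <= 0) {
            return new String[0];
        }
        String[] result = new String[resultCount];
        for (int i = 0; i < resultCount; i++) {
            result[i] = str.substring(i, i + length);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] grams = ngrams("abcdefghijkl1245ty789", 12);
        System.out.println(grams.length);             // prints 10
        System.out.println(grams[0]);                 // prints "abcdefghijkl"
        System.out.println(grams[grams.length - 1]);  // prints "jkl1245ty789"
    }
}
```

The example string has 21 characters, so with length 12 there are 21 - 12 + 1 = 10 n-grams.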
By the way, if you ever have to use a regexp in the future, try compiling it once and reusing it, instead of compiling it every time the method gets called. For example, your code would look like:
    private static final Pattern EVERY_CHAR = Pattern.compile("(?!^)");
and then, in the method, instead of String#split, you'd use
    String[] parts = EVERY_CHAR.split(str);
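Put together, the precompiled-pattern variant of the original method might look like the sketch below. The class name RegexNGrams is invented for this example, and the length parameter follows the answer's signature rather than the question's hard-coded 12. This only illustrates the compile-once idea; the char[] and substring versions above avoid the regexp entirely.

```java
import java.util.regex.Pattern;

final class RegexNGrams {
    // Compiled once when the class is loaded, then reused on every call.
    private static final Pattern EVERY_CHAR = Pattern.compile("(?!^)");

    static String[] ngrams(String str, int length) {
        // Zero-width split at every position except the start: "abc" -> ["a", "b", "c"].
        String[] parts = EVERY_CHAR.split(str);
        final int resultCount = parts.length - length + 1;
        String[] result = new String[resultCount];
        for (int i = 0; i < resultCount; i++) {
            StringBuilder sb = new StringBuilder(length);
            for (int k = 0; k < length; k++) {
                sb.append(parts[i + k]);
            }
            result[i] = sb.toString();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", ngrams("abcde", 3))); // prints "abc,bcd,cde"
    }
}
```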