check update 3 below found out issue ran related known serious problem c# string comparers .net 4.0, 4.0 client , 4.5, lead inconsistent sort order of lists of strings (causing output depend on order in input , sort algorithm used). problem reported microsoft in dec 2012 , closed "won't fixed". work around available, slower hardly practical large collections.
while implementing immutable patriciatrie, wanted compare performance system.collections.generic.sortedlist. used following file https://github.com/rkapsi/patricia-trie/blob/master/src/test/resources/org/ardverk/collection/hamlet.txt create input wordlist testing.
when inserting each of words in c# sortedlist, using either comparer<string>.default
or stringcomparer.invariantculture
key comparer, number of entries inserted cannot retrieved using normal search methods (e.g. containskey
returns false) key present in list observed iterating list.
even more curious, comparer returns value '0' when comparing key retrieved sorted list search key not found using containskey
.
the complete example below demonstrates issue on system.
using system; using system.io; using system.linq; using system.collections.generic; class program { static void main(string[] args) { // problem possibly related comparison. var fail = true; var comparer = fail ? stringcomparer.invariantculture : stringcomparer.ordinal; // read hamlet (contains duplicate words) var words = file .readalllines("hamlet.txt") .selectmany(l => l.split(new[] { ' ', '\t' }, stringsplitoptions.removeemptyentries)) .select(w => w.trim()) .where(w => !string.isnullorempty(w)) .distinct(comparer) .toarray(); // insert hamlet's words in sorted list. var list = new sortedlist<string, int>(comparer); var ndx = 0; foreach (var word in words) list[word] = ndx++; // search each of added words. foreach (var keytosearch in words) { if (!list.containskey(keytosearch)) { // inserted, cannot retrieved. console.writeline("error - key not found: \"{0}\"", keytosearch); // when iterate on list, see entry present var prefix = keytosearch.substring(0, math.min(keytosearch.length, 3)); foreach (var wordclosetosearchkey in list.keys.where(s => s.startswith(prefix))) { // , using sortedlist's supplied comparison returns 0, signaling equality var comparisonresult = list.comparer.compare(wordclosetosearchkey, keytosearch); console.writeline("{0} - comparison result = {1}", wordclosetosearchkey, comparisonresult); } } } // check sort order of list.keys correct var keys = list.keys.toarray(); binarysearchall("list.keys", keys, list.comparer); checkcorrectsortorder("list.keys", keys, list.comparer); // check sort order of array.sort(list.keys) correct var arraysortedkeys = copysortsearchandcheck("array.sort(list.keys)", keys, list.comparer); // check sort order of array.sort(input words) correct var sortedinput = copysortsearchandcheck("array.sort(input words)", words, list.comparer); console.readline(); } static string[] copysortsearchandcheck(string arraydesc, string[] input, icomparer<string> comparer) { // copy input var sortedinput = new string[input.length]; array.copy(input, sortedinput, sortedinput.length); // sort array.sort(sortedinput, comparer); // check can find keys in array using bin. search binarysearchall(arraydesc, sortedinput, comparer); // check sort order correct checkcorrectsortorder(arraydesc, sortedinput, comparer); return sortedinput; } static void binarysearchall(string arraydesc, string[] sortedinput, icomparer<string> comparer) { // check each key in input can found using bin. search foreach (var word in sortedinput) { var ix = array.binarysearch(sortedinput, word, comparer); if (ix < 0) // , appears cannot! console.writeline("error - {0} - key not found: \"{1}\"", arraydesc, word); } } static void checkcorrectsortorder(string arraydesc, string[] sortedkeys, icomparer<string> comparer) { (int n = 0; n < sortedkeys.length; n++) { (int = n + 1; < sortedkeys.length; up++) { var cmp = comparer.compare(sortedkeys[n], sortedkeys[up]); if (cmp >= 0) { console.writeline( "{0}[{1}] = \"{2}\" not < {0}[{3}] = \"{4}\" - cmp = {5}", arraydesc, n, sortedkeys[n], up, sortedkeys[up], cmp); } } (int down = n - 1; down > 0; down--) { var cmp = comparer.compare(sortedkeys[n], sortedkeys[down]); if (cmp <= 0) { console.writeline( "{0}[{1}] = \"{2}\" not > {0}[{3}] = \"{4}\" - cmp = {5}", arraydesc, n, sortedkeys[n], down, sortedkeys[down], cmp); } } } } }
does have explanation unexpected , odd behaviour?
when changing comparer used sortedlist stringcomparer.ordinal
(by e.g. changing fail
false
in above example) problem disappears, seems point comparison issue, don't quite understand why.
updated noted sébastien issue described here not show on .net 3.5 , 3.5 client profiles. on .net 4.0, 4.0 client , 4.5.
after more digging, noticed if take sorted keys list , run array.binarysearch
on keys, returns negative (not found) values same keys not found using sortedlist.containskey
. suggest sort order of keys not correct.
if take sorted keys list , sort them using array.sort
sort order of output different keys problematic.
so added function check (using list's comparer) if sort order of given array correct (i.e. preceding key smaller, succeeding key larger) , restricted input words distinct according comparer. applied function on 3 different inputs (all using same comparer):
- the sortedlist's keys collection.
- the output of array.sort on these keys.
- the output of array.sort on input file.
the output of (2) , (3) identical , different (1). performing array.binarysearch on array.sort outputs of (2) , (3) again fails find same keys (returning < 0). function checks correct sort order indicates 3 cases, sort order around involved problematic keys not correct.
at point hoping did incredibly stupid , there simple explanation. can point out me.
the example code updated additional troubleshooting experiments, screenshot of output can found here http://imgur.com/du8scsa.
update 2 ok, have narrowed problem down seems me serious problem c# string comparers introduced of .net 4.0.
in summary, suppose have 3 values: a1, a2 , a3. kind of sorting work correctly, expect if a1 < a2
, a2 < a3
in order comparison consistent, consequence following holds: a1 < a3
.
this not case c# string comparers (at least comparer<string>.default
, stringcomparer.invariantculture
).
the little program below illustrates exact problem:
class program { static void main(string[] args) { var comparer = stringcomparer.invariantculture; var a1 = "a"; var a2 = "a\'"; var a3 = "\'a"; printcomparison("a1", a1, "a2", a2, comparer); printcomparison("a2", a2, "a3", a3, comparer); printcomparison("a1", a1, "a3", a3, comparer); console.readline(); } public static void printcomparison(string firstsymbol, string first, string secondsymbol, string second, icomparer<string> comparer) { var cmp = comparer.compare(first, second); var result = cmp == 0 ? "=" : cmp < 0 ? "<" : ">"; console.writeline("{0} {1} {2} ({3} {1} {4})", firstsymbol, result, secondsymbol, first, second); } }
this output:
a1 < a2 (a < a') a2 < a3 (a' < 'a) a1 > a3 (a > 'a)
the conclusion seem not safe rely on sort order determined using c# string comperators, or missing something?
update 3 issue seems have been reported ms in december 2012, , closed status "won't fixed", rather disappointing; refer link posted in comments below (it appears can't post in here due limited reputation points). lists workaround, have implemented , used verify indeed resolves problems observed standard comparers.
public class workaroundstringcomparer : stringcomparer { private static readonly func<compareinfo, string, compareoptions, int> _gethashcodeofstring; private readonly compareinfo _compareinfo; private readonly compareoptions _compareoptions; static workaroundstringcomparer() { // need internal method compute hashcode // iequalitycomparer implementation. _gethashcodeofstring = buildgethashcodeofstringdelegate(); } static func<compareinfo, string, compareoptions, int> buildgethashcodeofstringdelegate() { var compareinfotype = typeof(compareinfo); var argtypes = new[] { typeof(string), typeof(compareoptions) }; var flags = bindingflags.nonpublic | bindingflags.instance; var methods = compareinfotype.getmethods(flags).toarray(); ; var method = compareinfotype.getmethod("gethashcodeofstring", flags, null, argtypes, null); var instance = expression.parameter(compareinfotype, "instance"); var stringarg = expression.parameter(typeof(string), "string"); var optionsarg = expression.parameter(typeof(compareoptions), "options"); var methodcall = expression.call(instance, method, stringarg, optionsarg); var expr = expression.lambda<func<compareinfo, string, compareoptions, int>>(methodcall, instance, stringarg, optionsarg); return expr.compile(); } public workaroundstringcomparer() : this(cultureinfo.invariantculture) { } public workaroundstringcomparer(cultureinfo cultureinfo, compareoptions compareoptions = compareoptions.none) { if (cultureinfo == null) throw new argumentnullexception("cultureinfo"); this._compareinfo = cultureinfo.compareinfo; this._compareoptions = compareoptions; } public override int compare(string x, string y) { if (referenceequals(x, y)) return 0; if (referenceequals(x, null)) return -1; if (referenceequals(y, null)) return 1; var sortkeyfor_x = _compareinfo.getsortkey(x, _compareoptions); var sortkeyfor_y = _compareinfo.getsortkey(y, _compareoptions); return sortkey.compare(sortkeyfor_x, sortkeyfor_y); } public override bool equals(string x, string y) { return compare(x, y) == 0; } public override int gethashcode(string obj) { return _gethashcodeofstring(_compareinfo, obj, _compareoptions); } }
the problem workaround is hardly practical sizable collections, because order of magnitude slower e.g. stringcomparer.invariantculture
.
time taken when sorting given word list 1000 times using both comparers:
stringcomparer.invariantculture : 00:00:15.3120013 workaroundstringcomparer : 00:01:35.8322409
so still hoping either microsoft reconsiders or knows viable alternative. otherwise option remains falling on using stringcomparer.ordinal
.
could related .net framework 4/4.5? have adapted example .net 3.5 this:
var words = readfile("hamlet.txt"); //... private static string[] readfile(string path) { list<string> lines = new list<string>(); using (streamreader sr = new streamreader(path)) { string text = sr.readtoend(); lines.add(text); } return lines.selectmany(l => l.split(new[] { ' ', '\t' }, stringsplitoptions.removeemptyentries).select(w => w.trim())) .where(w => !(w.tochararray().all(c => c == ' '))) .toarray(); }
and both comparers work fine on xp using .net 3.5.
Comments
Post a Comment