regex - XSLT - Regular Expression Parsing


My current project revolves around translating a number of test cases in document form into XML compatible with a test case management system. In many of these cases, the title is prefixed with a number of ticket identifiers, document location numbers, and so on, which need to be removed before the cases can be uploaded to the system.

Given that many of these ticket identifiers can also exist elsewhere in the title, where they are valid, I've written the translation in its current form so that only the start of the string is checked against the regular expressions. I have written two approaches, with varying results.

Sample input

1.

<case-name>3.1.6 (c0) tid#eiiy chm-2213 bz-7043 client side java upgrade r8</case-name> 

2.

<case-name>4.2.7    (c1) tid#f1dr – aip - ehd-319087 - bz6862 - datalink builder res...</case-name> 

Desired output

1.

<tr:summary>client side java upgrade r8</tr:summary> 

2.

<tr:summary>datalink builder res...</tr:summary> 

First approach

    <xsl:template match="case-name">
        <tr:summary>
            <xsl:variable name="start">
                <xsl:apply-templates/>
            </xsl:variable>
            <xsl:variable name="start"       select="normalize-space($start)"/>
            <!-- Each step strips one optional leading ID, then one optional leading dash. -->
            <xsl:variable name="nofloat"     select="normalize-space(fn:remfirstregex($start,       '^[0-9]+([.][0-9]+)*'))"/>
            <xsl:variable name="nofloatdash" select="normalize-space(fn:remfirstregex($nofloat,     '^[\p{Pd}]'))"/>
            <xsl:variable name="noc"         select="normalize-space(fn:remfirstregex($nofloatdash, '^\(c[0-2]\)'))"/>
            <xsl:variable name="nocdash"     select="normalize-space(fn:remfirstregex($noc,         '^[\p{Pd}]'))"/>
            <xsl:variable name="notid"       select="normalize-space(fn:remfirstregex($nocdash,     '^(tid)(#|\p{Pd})(\w+)'))"/>
            <xsl:variable name="notiddash"   select="normalize-space(fn:remfirstregex($notid,       '^[\p{Pd}]'))"/>
            <xsl:variable name="noaip"       select="normalize-space(fn:remfirstregex($notiddash,   '^aip'))"/>
            <xsl:variable name="noaipdash"   select="normalize-space(fn:remfirstregex($noaip,       '^[\p{Pd}]'))"/>
            <xsl:variable name="nochm"       select="normalize-space(fn:remfirstregex($noaipdash,   '^(chm)[\p{Pd}]([0-9]+)'))"/>
            <xsl:variable name="nochmdash"   select="normalize-space(fn:remfirstregex($nochm,       '^[\p{Pd}]'))"/>
            <xsl:variable name="noehd"       select="normalize-space(fn:remfirstregex($nochmdash,   '^(ehd)[\p{Pd}]([0-9]+)'))"/>
            <xsl:variable name="noehddash"   select="normalize-space(fn:remfirstregex($noehd,       '^[\p{Pd}]'))"/>
            <xsl:variable name="nobz"        select="normalize-space(fn:remfirstregex($noehddash,   '^(bz)(((#|\p{Pd})[0-9]+)|[0-9]+)'))"/>
            <xsl:variable name="nobzdash"    select="normalize-space(fn:remfirstregex($nobz,        '^[\p{Pd}]'))"/>
            <xsl:variable name="nott"        select="normalize-space(fn:remfirstregex($nobzdash,    '^(tt)[#](\w)+'))"/>
            <xsl:variable name="nottdash"    select="normalize-space(fn:remfirstregex($nott,        '^[\p{Pd}]'))"/>
            <xsl:variable name="nobrack"     select="normalize-space(fn:remfirstregex($nottdash,    '^\[(.*?)\]'))"/>
            <xsl:variable name="nobrackdash" select="normalize-space(fn:remfirstregex($nobrack,     '^[\p{Pd}]'))"/>
            <xsl:value-of select="normalize-space($nobrackdash)"/>
        </tr:summary>
    </xsl:template>

    <xsl:function name="fn:remfirstregex">
        <xsl:param name="instring"/>
        <xsl:param name="regex"/>

        <!-- Split on separator characters and rebuild the string word by word. -->
        <xsl:variable name="words" select="tokenize($instring, '\p{Z}')"/>
        <xsl:variable name="outstring">
            <xsl:for-each select="$words">
                <xsl:if test="not(matches(., $regex)) or index-of($words, .) > 1">
                    <xsl:value-of select="."/><xsl:text> </xsl:text>
                </xsl:if>
            </xsl:for-each>
        </xsl:variable>

        <xsl:value-of select="string-join($outstring, '')"/>
    </xsl:function>

Note: the namespace fn is, for the purposes of this translation, "function/namespace", which I use to write my own functions.

First results

1. success

<tr:summary>client side java upgrade r8</tr:summary> 

2. failure

<tr:summary>- ehd-319087 - bz6862 - datalink builder resolution selector may drop leading zeros on coordinate seconds</tr:summary> 

Second approach

<xsl:function name="fn:remfirstregex">
    <xsl:param name="instring"/>
    <xsl:param name="regex"/>

    <xsl:analyze-string select="$instring" regex="$regex">
        <xsl:non-matching-substring>
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:function>

This approach fails completely, but I'm including it here because it's the more obvious solution and it did not work at all.

It should be noted that there are a large number of regular expressions in the above solution, to account for all the possible IDs that might come through. Mercifully, the IDs seem to come in a consistent order.

The problem, I have concluded, is the dashes. I have noted that in every case in the documents where the translation has failed, the failing ID has been both preceded and followed by a dash. If a dash only precedes it, it goes through fine. If a dash only follows it, no issues. With both, it falls down and, curiously, the dash still shows up, even though it has seemingly been eliminated from the string.
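
One possible reading of this symptom, offered as a sketch rather than a confirmed diagnosis: in the first approach, `index-of($words, .)` returns the positions of every word equal to the current one, and a general comparison such as `(1, 3, 5) > 1` is true as soon as any item satisfies it. A leading dash that appears again later in the title would therefore always be kept. The variant below tests `position()` instead. The `xs` prefix, the typed parameters, and the `\p{Z}+` tokenizer are assumptions added for illustration and are not part of the original stylesheet.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="function/namespace">

    <!-- Drops the first separator-delimited word when it matches $regex.
         Later words are always kept, even if they are equal to the first one. -->
    <xsl:function name="fn:remfirstregex" as="xs:string">
        <xsl:param name="instring" as="xs:string"/>
        <xsl:param name="regex" as="xs:string"/>
        <xsl:variable name="words" select="tokenize($instring, '\p{Z}+')"/>
        <xsl:sequence select="string-join($words[position() > 1 or not(matches(., $regex))], ' ')"/>
    </xsl:function>

</xsl:stylesheet>
```

Under this reading, a word is judged only by where it sits in the sequence, so a repeated dash later in the title no longer protects the leading one.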

There are two kinds of dashes at play here: the normal dash (&#8211;, an en dash) and the minus sign (&#45;, the ASCII hyphen-minus).

Paradoxically: sorry for the long question, and let me know if I've missed anything out.

Edit: I forgot to say, the regular expressions, with the exception of the dashes, have been tested elsewhere and are known to work on the input.

Edit II: Following @acheong87's solution, I tried to run the following:

<xsl:template match="case-name">
    <tr:summary>
        <xsl:variable name="regex" select=
            "'^[\s\p{Pd}]*(\d+([.]\d+)*)?[\s\p{Pd}]*(\(c[0-2]\))?([\s\p{Pd}]*(tid|aip|chm|ehd|bz|tt)((#|\p{Pd}|)\w+|))*[\s\p{Pd}]*(\[.*?\])?'"/>
        <xsl:analyze-string select="string(.)" regex="{$regex}">
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </tr:summary>
</xsl:template>

And Saxon gives me the following error:

Error at xsl:analyze-string at line (for our purposes, 5): XTDE1150: The regular expression must not be one that matches a zero-length string

I can see why that would come up, given that everything is optional. Is there a way of running this that won't give me the error?

Thanks again.

Here are the main components that would go into a single regex. I've rewritten some of the expressions.

\d+([.]\d+)*
\(c[0-2]\)
tid(#|\p{Pd})\w+
aip
chm[\p{Pd}]\d+
ehd[\p{Pd}]\d+
bz(#|\p{Pd}|)\d+
tt#\w+
\[.*?\]

Each component should be wrapped in (...)? to make it optional, and the components should be joined by a separator, [\s\p{Pd}]*. This produces:

^[\s\p{Pd}]*(\d+([.]\d+)*)?[\s\p{Pd}]*(\(c[0-2]\))?[\s\p{Pd}]*(tid(#|\p{Pd})\w+)?[\s\p{Pd}]*(aip)?[\s\p{Pd}]*(chm[\p{Pd}]\d+)?[\s\p{Pd}]*(ehd[\p{Pd}]\d+)?[\s\p{Pd}]*(bz(#|\p{Pd}|)\d+)?[\s\p{Pd}]*(tt#\w+)?[\s\p{Pd}]*(\[.*?\])?

You can see in this Rubular demo that the above expression indeed matches the two examples.
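
The wrap-and-join recipe can also be expressed directly in XPath 2.0, which keeps the component list readable. This is only a sketch: the variable names `sep`, `components`, and `prefix` are made up for illustration, and the components are held in global variables so that the curly braces in `\p{Pd}` never end up inside an attribute value template.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <!-- Separator allowed before and between the components. -->
    <xsl:variable name="sep" as="xs:string" select="'[\s\p{Pd}]*'"/>

    <!-- The components, in the order the IDs are expected to appear. -->
    <xsl:variable name="components" as="xs:string+" select="(
        '\d+([.]\d+)*',
        '\(c[0-2]\)',
        'tid(#|\p{Pd})\w+',
        'aip',
        'chm[\p{Pd}]\d+',
        'ehd[\p{Pd}]\d+',
        'bz(#|\p{Pd}|)\d+',
        'tt#\w+',
        '\[.*?\]')"/>

    <!-- Anchor, separator, then each component made optional and joined by the separator. -->
    <xsl:variable name="prefix" as="xs:string" select="
        concat('^', $sep,
               string-join(for $c in $components return concat('(', $c, ')?'), $sep))"/>

</xsl:stylesheet>
```

Because every component is optional, the resulting expression can also match an empty string, which matters for XSLT 2.0 functions and instructions that reject such patterns.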


There may be a more elegant simplification that you may be interested in.

\d+([.]\d+)*
\(c[0-2]\)
(tid|aip|chm|ehd|bz|tt)((#|\p{Pd}|)\w+|)
\[.*?\]

Maybe codes like aip should be kept separate, but you can see the spirit of this version. That is, it's unlikely that valid titles would begin with such codes; in fact, there are probably more possible combinations missing from your examples, such as ehd#, that may appear in the future and that a past-based formulation would miss. (Of course, this point is irrelevant if there is no future, and the data you have is all the data you'll ever need to process.) If there is a future though, IMO, it's better in this case to loosen the rigor of the expression to capture potential related combinations.

The above would then become:

^[\s\p{Pd}]*(\d+([.]\d+)*)?[\s\p{Pd}]*(\(c[0-2]\))?([\s\p{Pd}]*(tid|aip|chm|ehd|bz|tt)((#|\p{Pd}|)\w+|))*[\s\p{Pd}]*(\[.*?\])?

Here is the Rubular demo.
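
Because every part of this expression is optional, it can match the empty string. XSLT 2.0 rejects such a pattern in `xsl:analyze-string` (the XTDE1150 error quoted in the question) and in `replace()`. One possible workaround, sketched under assumptions: append `(.+)$` so that the overall pattern must consume at least one character, then keep only that final group. The `tr` namespace URI below is a placeholder, since the real one is not given, and `$9` relies on the expression above containing exactly eight capturing groups.

```xml
<!-- xmlns:tr is a placeholder URI; substitute the real namespace. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tr="urn:example:tr">

    <!-- The optional prefix, kept in a variable so the braces in \p{Pd}
         are never read as an attribute value template. -->
    <xsl:variable name="prefix" as="xs:string" select="
        '^[\s\p{Pd}]*(\d+([.]\d+)*)?[\s\p{Pd}]*(\(c[0-2]\))?([\s\p{Pd}]*(tid|aip|chm|ehd|bz|tt)((#|\p{Pd}|)\w+|))*[\s\p{Pd}]*(\[.*?\])?'"/>

    <xsl:template match="case-name">
        <tr:summary>
            <!-- Group 9 is the appended (.+): whatever remains after the prefix. -->
            <xsl:value-of select="replace(normalize-space(.), concat($prefix, '(.+)$'), '$9')"/>
        </tr:summary>
    </xsl:template>

</xsl:stylesheet>
```

If the pattern does not match at all, for instance on an empty title, `replace()` returns its input unchanged. Adding or removing a parenthesized group in the prefix changes the group number, so the `$9` would need adjusting to match.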

