C# - Regular expression to get all hosts from HTML


I'm trying to match URLs with one regular expression, and I'm using this pattern:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/   

However, the regex returns pages/files instead of hosts. Instead of having to run a second regular expression, I'm hoping someone here can help.

Currently this returns http://www.yoursite.com/index.html

I'm attempting to return yoursite.com.

Also, the regex is parsing HTML, and the hosts will be checked afterwards, so 100% accuracy isn't crucial.

Assuming that the regex:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

actually does parse URLs (I haven't checked it), you can use a named capture group to get the host:

/^(https?:\/\/)?(?<host>([\da-z\.-]+)\.([a-z\.]{2,6}))([\/\w \.-]*)*\/?$/ 

When you get a match result, you can examine Groups["host"] to get the host name.
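A minimal sketch of reading the named group, assuming a single URL per input string. The class and method names (HostGroupDemo, GetHost) are made up for illustration, the surrounding / delimiters are dropped because .NET patterns are plain strings, and RegexOptions.IgnoreCase is an added assumption since the pattern only lists lowercase letters:

```csharp
using System;
using System.Text.RegularExpressions;

public static class HostGroupDemo
{
    // Same pattern as above, minus the /.../ delimiters.
    static readonly Regex UrlPattern = new Regex(
        @"^(https?:\/\/)?(?<host>([\da-z\.-]+)\.([a-z\.]{2,6}))([\/\w \.-]*)*\/?$",
        RegexOptions.IgnoreCase);

    // Returns the text captured by the "host" group, or null if there is no match.
    public static string GetHost(string url)
    {
        Match m = UrlPattern.Match(url);
        return m.Success ? m.Groups["host"].Value : null;
    }

    public static void Main()
    {
        // Prints "www.yoursite.com"
        Console.WriteLine(GetHost("http://www.yoursite.com/index.html"));
    }
}
```

Note that the group captures the full host, including any www. prefix, so getting yoursite.com would still need a further step.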

But you're better off, in my opinion, using Uri.TryCreate, although you'll need a little logic around the possible lack of a scheme. That is:

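// Prepend a scheme if the line doesn't already start with one.
// A verbatim string is used because "\/" is not a valid C# string escape.
if (!Regex.IsMatch(line, @"^https?://"))
    line = "http://" + line;

Uri uri;
if (Uri.TryCreate(line, UriKind.Absolute, out uri))
{
    // It's a valid URL.
    host = uri.Host;
}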

Parsing URLs is a pretty tricky business. For example, no individual dotted segment can exceed 63 characters, and there's nothing preventing the last dotted segment from having numbers or hyphens. Nor is it limited to 6 characters. You're better off passing the entire string to Uri.TryCreate than trying to duplicate the craziness of URL parsing with a single regular expression.

It's possible that the rest of the URL (after the host name) is trash. If you want to keep that bit from causing a problem, extract everything up to the end of the host name:

^https?:\/\/[^\/]* 

Then run that through Uri.TryCreate.
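The two steps might be combined roughly like this. It's only a sketch: HostPrefixDemo and TryGetHost are hypothetical names, the IgnoreCase option is an assumption, and input without an http:// or https:// scheme is simply rejected here rather than patched up:

```csharp
using System;
using System.Text.RegularExpressions;

public static class HostPrefixDemo
{
    // Scheme plus everything up to (not including) the first slash after it.
    static readonly Regex SchemeAndHost =
        new Regex(@"^https?:\/\/[^\/]*", RegexOptions.IgnoreCase);

    // Returns the host name, or null if the line has no usable scheme and host prefix.
    public static string TryGetHost(string line)
    {
        Match m = SchemeAndHost.Match(line);
        if (!m.Success)
            return null;

        // Only the trimmed prefix is handed to Uri.TryCreate,
        // so a malformed path or query cannot make the parse fail.
        Uri uri;
        return Uri.TryCreate(m.Value, UriKind.Absolute, out uri) ? uri.Host : null;
    }

    public static void Main()
    {
        Console.WriteLine(TryGetHost("http://www.yoursite.com/index.html"));
    }
}
```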

