regex - Matching Unicode Dashes in Java Regular Expressions?


I'm trying to write a Java regular expression that splits strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I have constructed the following regular expression:

  private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");

which, if I'm reading the Pattern documentation correctly, should capture any of the Unicode dashes or the ASCII dash when it is surrounded by whitespace on both sides. I'm using the pattern as follows:

  String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);

No joy. For the sample input below, the dash is not detected, and titleSegmentSeparator.matcher(sectionTitle).find() returns false!

To make sure I wasn't missing any unusual character entities, I printed some debug information to System.out. The output is as follows. Each character is followed by the result of (int) char, which should be its Unicode code point, no?

Sample input:

Study Summary (1 of 10) – Competition

S(83)t(116)u(117)d(100)y(121) (32)S(83)u(117)m(109)m(109)a(97)r(114)y(121) (32)((40)1(49) (32)o(111)f(102) (32)1(49)0(48))(41) (32)–(8211) (32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
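
For reference, a dump in this format can be produced with a loop along the following lines. This is only a minimal sketch: the class name CodePointDump and the method dump are hypothetical and do not come from the original code. Note that (int) applied to a char yields the UTF-16 code unit, which equals the Unicode code point for characters in the Basic Multilingual Plane, such as the dashes discussed here.

```java
class CodePointDump {
    // Hypothetical helper: appends each char followed by its numeric value in parentheses.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // (int) c is the UTF-16 code unit, printed in DECIMAL
            sb.append(c).append('(').append((int) c).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The string literal contains an en dash (U+2013) between two spaces.
        System.out.println(dump("10) \u2013 C"));
    }
}
```

The values printed this way are decimal, which matters when they are copied into a regular expression escape.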

It looks to me like that dash is code point 8211, which the regex should match, but it doesn't! What's going on here?

You're mixing up decimal (8211) and hexadecimal (0x8211).

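\x and \u both expect a hexadecimal number, so you must use \u2014 to match the em-dash, not \u8211 (and \x2D for the normal hyphen, etc.). The dash in your sample, decimal 8211, is the en-dash, which is \u2013 in hexadecimal.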
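
A minimal sketch of the corrected explicit pattern, assuming the intended dashes are the hyphen-minus, en-dash, em-dash and horizontal bar (the class name HexEscapeDemo and the constant SEPARATOR are illustrative only):

```java
import java.util.regex.Pattern;

class HexEscapeDemo {
    // Hexadecimal escapes: U+002D hyphen-minus, U+2013 en dash,
    // U+2014 em dash, U+2015 horizontal bar.
    static final Pattern SEPARATOR =
            Pattern.compile("\\s(\\x2D|\\u2013|\\u2014|\\u2015)\\s");

    public static void main(String[] args) {
        // Convert the decimal value from the debug dump to hexadecimal.
        System.out.println(Integer.toHexString(8211));

        // The sample title contains an en dash surrounded by spaces.
        String title = "Study Summary (1 of 10) \u2013 Competition";
        for (String part : SEPARATOR.split(title)) {
            System.out.println(part);
        }
    }
}
```

Integer.toHexString is a quick way to turn the decimal values from a debug dump into the form that \u expects.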

But why not simply use the Unicode property "Dash punctuation"?

As a Java string: "\\s\\p{Pd}\\s"
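
As a rough illustration of how that pattern could be used for the split in the question (the class name DashPunctuationDemo and the method splitTitle are hypothetical):

```java
import java.util.regex.Pattern;

class DashPunctuationDemo {
    // \p{Pd} matches any character in the Unicode "Dash punctuation" category,
    // which includes the ASCII hyphen-minus, the en dash and the em dash.
    static final Pattern SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        String[] titles = {"foo - bar", "foo \u2013 bar", "foo \u2014 bar"};
        for (String title : titles) {
            // Each title splits into "foo" and "bar"
            System.out.println(String.join(" | ", splitTitle(title)));
        }
    }
}
```

Because the pattern requires whitespace on both sides, a hyphenated word such as "foo-bar" is left unsplit.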

