How can I replace non-printable Unicode characters in Java?

Question

The following will replace ASCII control characters (shorthand for [\x00-\x1F\x7F]):

my_string.replaceAll("\\p{Cntrl}", "?");

The following will replace everything that is not a printable ASCII character (\p{Print} is shorthand for [\p{Graph}\x20]), which means accented characters are replaced as well:

my_string.replaceAll("[^\\p{Print}]", "?");

However, neither works for Unicode strings. Does anyone have a good way to remove non-printable characters from a Unicode string?

User · Answer

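I have used this simple function for this. It strips everything outside the printable ASCII range (space through tilde), so accented and other non-ASCII characters are removed too:

import java.util.regex.Pattern;

private static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("[^ -~]");

private static String cleanTheText(String text) {
    // replaceAll removes every match; a single find() would only handle the first offending character.
    return NON_PRINTABLE_ASCII.matcher(text).replaceAll("");
}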

Hope this is useful.

User · Answer

Based on the answers by Op De Cirkel and noackjr, the following is what I do for general string cleaning:

1. trimming leading or trailing whitespace
2. dos2unix
3. mac2unix
4. removing all "invisible Unicode characters" except whitespace

myString.trim().replaceAll("\r\n", "\n").replaceAll("\r", "\n").replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "")

Tested with Scala REPL.
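A Java rendering of the four steps above might look like the following sketch. The class name StringCleaner and the method name clean are hypothetical, and literal String.replace is swapped in for the two line-ending replaceAll calls, since no regex is needed for them.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are assumptions for illustration.
final class StringCleaner {
    // Control, format, private-use and unassigned code points,
    // intersected with "not whitespace" so tabs and newlines survive.
    private static final Pattern INVISIBLE_NON_WHITESPACE =
            Pattern.compile("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]");

    static String clean(String input) {
        String normalized = input.trim()   // 1. trim (only strips chars <= U+0020)
                .replace("\r\n", "\n")     // 2. dos2unix
                .replace('\r', '\n');      // 3. mac2unix
        // 4. remove invisible characters that are not whitespace
        return INVISIBLE_NON_WHITESPACE.matcher(normalized).replaceAll("");
    }

    public static void main(String[] args) {
        // U+200B (ZERO WIDTH SPACE) and U+0007 (BEL) are removed; line breaks are kept.
        System.out.println(clean("  first\r\nsecond\u200B\rthird\u0007  "));
    }
}
```

Precompiling the pattern avoids recompiling the regex on every call, which String.replaceAll would otherwise do.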

User · Answer

You may be interested in the Unicode categories "Other, Control" and possibly "Other, Format" (unfortunately the latter seems to contain both unprintable and printable characters).

In Java regular expressions you can check for them using \p{Cc} and \p{Cf} respectively.
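As a rough illustration of the two categories, the sketch below replaces every Cc or Cf code point with a caller-supplied string. The class and method names are assumptions made up for this example, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are assumptions for illustration.
final class ControlAndFormatStripper {
    // Matches code points in "Other, Control" (Cc) or "Other, Format" (Cf).
    private static final Pattern CC_OR_CF = Pattern.compile("[\\p{Cc}\\p{Cf}]");

    static String replaceControlAndFormat(String input, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement from being interpreted.
        return CC_OR_CF.matcher(input).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // U+0007 (BEL) is Cc, U+200B (ZERO WIDTH SPACE) is Cf, the accented e is a letter.
        String sample = "caf\u00E9\u0007 au\u200Blait";
        System.out.println(replaceControlAndFormat(sample, "?"));
    }
}
```

Keep in mind that tab, carriage return and line feed are also in Cc, so they are replaced by this pattern unless they are excluded explicitly.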

User · Answer

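I have redesigned the code for phone numbers such as 9 987 124124 (see "Extract digits from a string in Java"):

import java.nio.CharBuffer;

public static String stripNonDigitsV2(CharSequence input) {
    if (input == null) {
        return null;
    }
    if (input.length() == 0) {
        return "";
    }

    char[] result = new char[input.length()];
    int cursor = 0;
    CharBuffer buffer = CharBuffer.wrap(input);
    int i = 0;
    while (i < buffer.length()) {
        char chr = buffer.get(i);
        if (chr == 'u') {
            // Jump five positions past a 'u' (apparently to skip the four hex digits of an escape sequence).
            i = i + 5;
            if (i >= buffer.length()) {
                break; // guard against running past the end of the input
            }
            chr = buffer.get(i);
        }
        // Keep characters with codes 40..57: the digits plus ( ) * + , - . /
        if (chr > 39 && chr < 58) {
            result[cursor++] = chr;
        }
        i = i + 1;
    }
    return new String(result, 0, cursor);
}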

User · Answer

Op De Cirkel is mostly right. His suggestion will work in most cases:

myString.replaceAll("\\p{C}", "?");

But if myString might contain non-BMP codepoints then it's more complicated. \p{C} contains the surrogate codepoints of \p{Cs}. The replacement method above will corrupt non-BMP codepoints by sometimes replacing only half of the surrogate pair. It's possible this is a Java bug rather than intended behavior.

Using the other constituent categories is an option:

myString.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");

However, solitary surrogate characters that are not part of a pair (each surrogate character has an assigned codepoint) will not be removed. A non-regex approach is the only way I know to properly handle \p{C}:

StringBuilder newString = new StringBuilder(myString.length());
for (int offset = 0; offset < myString.length();) {
    int codePoint = myString.codePointAt(offset);
    offset += Character.charCount(codePoint);

    // Replace invisible control characters and unused code points
    switch (Character.getType(codePoint)) {
        case Character.CONTROL:     // \p{Cc}
        case Character.FORMAT:      // \p{Cf}
        case Character.PRIVATE_USE: // \p{Co}
        case Character.SURROGATE:   // \p{Cs}
        case Character.UNASSIGNED:  // \p{Cn}
            newString.append('?');
            break;
        default:
            newString.append(Character.toChars(codePoint));
            break;
    }
}
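For reuse, the loop above could be wrapped in a method along the following lines. This is only a sketch: the class name InvisibleCharReplacer, the method name replaceInvisible and the replacement parameter are assumptions added for illustration. The sample in main contains a valid surrogate pair (U+1F600), a lone high surrogate and a NUL character, to exercise the cases the answer discusses.

```java
// Hypothetical wrapper; the class, method and parameter names are assumptions.
final class InvisibleCharReplacer {
    static String replaceInvisible(String input, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        for (int offset = 0; offset < input.length(); ) {
            // Read a whole code point so surrogate pairs are never split.
            int codePoint = input.codePointAt(offset);
            offset += Character.charCount(codePoint);

            switch (Character.getType(codePoint)) {
                case Character.CONTROL:     // Cc
                case Character.FORMAT:      // Cf
                case Character.PRIVATE_USE: // Co
                case Character.SURROGATE:   // Cs (only unpaired surrogates reach here)
                case Character.UNASSIGNED:  // Cn
                    out.append(replacement);
                    break;
                default:
                    out.appendCodePoint(codePoint);
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Valid pair for U+1F600, then a lone high surrogate, then NUL.
        String sample = "A\uD83D\uDE00B\uD83DC\u0000";
        System.out.println(replaceInvisible(sample, '?'));
    }
}
```

Because codePointAt returns the surrogate itself when it is not followed by its partner, an unpaired surrogate is classified as SURROGATE and replaced, while a well-formed pair is classified by the supplementary code point it encodes.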

User · Answer

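I propose removing the characters, as below, instead of replacing them. Note that this method drops supplementary (non-BMP) code points rather than non-printable characters in general:

private String removeNonBMPCharacters(final String input) {
    StringBuilder strBuilder = new StringBuilder();
    input.codePoints().forEach(i -> {
        if (Character.isSupplementaryCodePoint(i)) {
            // Assumed empty replacement (the literal is not legible in the source), so the code point is dropped.
            strBuilder.append("");
        } else {
            strBuilder.append(Character.toChars(i));
        }
    });
    return strBuilder.toString();
}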

User · Answer

my_string.replaceAll("\\p{C}", "?");

See more about Unicode regex. java.util.regex.Pattern and String.replaceAll support them.

User · Answer

The methods below should cover your goal:

public static String removeNonAscii(String str) {
    return str.replaceAll("[^\\x00-\\x7F]", "");
}

public static String removeNonPrintable(String str) { // All control chars
    return str.replaceAll("[\\p{C}]", "");
}

public static String removeSomeControlChar(String str) { // Some control chars
    return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}

public static String removeFullControlChar(String str) {
    return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
}
