Finding and removing non ascii characters from an Oracle Varchar2

Question

We are currently migrating one of our oracle databases to UTF8 and we have found a few records that are near the 4000 byte varchar limit   When we try and migrate these record they fail as they contain characters that become multibyte UF8 characters  What I want to do within PL SQL is locate these characters to see what they are and then either change them or remove them   I would like to do    SELECT REGEXP REPLACE COLUMN      ascii          but Oracle does not implement the   ascii   character class   Is there a simple way doing what I want to do

User · Answer

You can try something like following to search for the column containing non-ascii character    select   from your table where your col  lt  gt  asciistr your col

User · Answer

I had similar requirement  to avoid this ugly ORA-31061  XDB error  special char to escaped char conversion failed     but had to keep the line breaks   I tried this from an excellent comment      -    space       but got this ORA-12728  invalid range in regular expression    but it lead me to my solution   select t    regexp replace deta       print     space           from        select  -    lt - strangest thing here  and I want to keep line break after -  deta from dual   t   displays  in my TOAD tool  as      replace all that      is not in the sets  of printing   print   or space    space   chars

User · Answer

I found the answer here   http   www squaredba com remove-non-ascii-characters-from-a-column-255 html  CREATE OR REPLACE FUNCTION O1DW RECTIFY NON ASCII INPUT STR IN VARCHAR2  RETURN VARCHAR2 IS str VARCHAR2 2000   act number   0  cnt number   0  askey number   0  OUTPUT STR VARCHAR2 2000   begin str         TO CHAR INPUT STR          cnt  length str   for i in 1    cnt loop askey   0  select ascii substr str i 1   into askey from dual  if askey  lt  32 or askey  gt  127 then str          REPLACE str  CHR askey        end if  end loop  OUTPUT STR    trim ltrim rtrim trim str                  RETURN  OUTPUT STR   end      Then run this to update your data  update o1dw rate ipselect p 20110505 set NCANI   RECTIFY NON ASCII NCANI

User · Answer

Answer given by Francisco Hayoz is the best  Don t use pl sql functions if sql can do it for you   Here is the simple test in Oracle 11 2 03  select s        regexp replace s       chr 1    -   chr 127            rep  1-127         dump regexp replace s      chr 127    -   chr 225             rep 127-255  from   select listagg c      within group  order by c  s   from  select 127 level l chr 127 level  c from dual connect by level  lt  129     And  rep 127-255  is        Typ 1 Len 30  226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255  i e for some reason this version of Oracle does not replace char 226  and above  Using      chr 127    -   chr 225       gives the desired result  If you need to replace other characters just add them to the regex above or use nested replace regexp replace if the replacement is different then     null string

User · Answer

Try the following   -- To detect select 1 from dual where regexp like trim  xx test text                  xmx    number          chr 128    -   chr 255        in    -- To strip out select regexp replace trim  xx test text                  xmxmx    number          chr 128    -   chr 255          1 0  in   from dual

User · Answer

Thanks  this worked for my purposes  BTW there is a missing single-quote in the example  above   REGEXP REPLACE  COLUMN         CHR  32      -     CHR  127                  I used it in a word-wrap function  Occasionally there was an embedded NewLine  NL   CHR 10    0A in the incoming text that was messing things up

User · Answer

In a single-byte ASCII-compatible encoding  e g  Latin-1   ASCII characters are simply bytes in the range 0 to 127  So you can use something like   x80- xFF  to detect non-ASCII characters

User · Answer

The following also works   select dump a 1016   a from   SELECT REGEXP REPLACE             CONVERT                 3735844533120  03                    US7ASCII                 WE8ISO8859P1                          lt  gt     amp       alnum    blank      a   FROM DUAL

User · Answer

I m a bit late in answering this question  but had the same problem recently  people cut and paste all sorts of stuff into a string and we don t always know what it is   The following is a simple character whitelist approach   SELECT est clients ref    TRANSLATE                 est clients ref                   abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890     amp      -          lt  gt                        REPLACE                             TRANSLATE                                        est clients ref                                       abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890     amp      -          lt  gt                                                                                                                                                                                                                                                      abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890     amp      -          lt  gt                            clean ref   FROM edms staging table est

User · Answer

Do this  it will work   trim replace ntwk slctor key txt  chr 0

User · Answer

I wouldn t recommend it for production code  but it makes sense and seems to work   SELECT REGEXP REPLACE COLUMN         CHR 1      -     CHR 127

User · Answer

Please note that whenever you use   regexp like column    A-Z      Oracle s regexp engine will match certain characters from the Latin-1 range as well  this applies to all characters that look similar to ASCII characters like   - A    - O    - U  etc   so that  A-Z  is not what you know from other environments like  say  Perl   Instead of fiddling with regular expressions try changing for the NVARCHAR2 datatype prior to character set upgrade   Another approach  instead of cutting away part of the fields  contents you might try the SOUNDEX function  provided your database contains European characters  i e  Latin-1  characters only  Or you just write a function that translates characters from the Latin-1 range into similar looking ASCII characters  like         a       a       o   of course only for text blocks exceeding 4000 bytes when transformed to UTF-8

User · Answer

There s probably a more direct way using regular expressions  With luck  somebody else will provide it  But here s what I d do without needing to go to the manuals   Create a PLSQL function to receive your input string and return a varchar2    In the PLSQL function  do an asciistr   of your input  The PLSQL is because that may return a string longer than 4000 and you have 32K available for varchar2 in PLSQL    That function converts the non-ASCII characters to  xxxx notation  So you can use regular expressions to find and remove those  Then return the result

User · Answer

I had a similar issue and blogged about it here  I started with the regular expression for alpha numerics  then added in the few basic punctuation characters I liked   select dump a 1016   a  b from   select regexp replace COLUMN     alnum         gt  -            a           COLUMN b   from TABLE  where a is not null order by a    I used dump with the 1016 variant to give out the hex characters I wanted to replace  which I could then user in a utl raw cast to varchar2

User · Answer

The select may look like the following sample   select nvalue from table where length asciistr nvalue    length nvalue    order by nvalue

User · Answer

If you use the ASCIISTR function to convert the Unicode to literals of the form  nnnn  you can then use REGEXP REPLACE to strip those literals out  like so     UPDATE table SET field   REGEXP REPLACE ASCIISTR field         xdigit    4             where field and table are your field and table names respectively

User · Answer

I think this will do the trick    SELECT REGEXP REPLACE COLUMN       print

[regex] Finding and removing non ascii characters from an Oracle Varchar2

Examples related to regex

Examples related to oracle

Examples related to ascii