Handling Strings with special characters in Java

Question:

I'm implementing a string-matching algorithm that has to handle strings with special characters. On one side of the match, the strings were prepared in Python and then passed through Java. On the other side, they were prepared in another environment. I'm now matching them in my Java program (the strings are retrieved from JSON inputs).

While some of the characters are handled, I have issues handling many others.

For instance, I get a MATCH for this pair (both are shown on my console as >> AS IT COMES CRUMBLING):

"text":"\u003e\u003e AS IT COMES CRUMBLING"
"caption":">> AS IT COMES CRUMBLING"

But these are reported as NON-MATCH:

"text":"What if you had fewer headaches\nand migraines a month?"
"text":"What if you had fewer headaches\\nand migraines a month?"

Or this one:

"text":"Effects of BOTOX® may spread"
"text":"Effects of BOTOX\\xc2\\xae may spread"

Or this:

"text":"Let's also rethink how\nwe care for ourselves."
"text":"Let'\\xe2\\x80\\x99s also rethink how\\nwe care for ourselves."

In my code, I use JSONPath to read JSON inputs from both sides, put them in an ArrayList, and then compare one against all items in the list.

boolean found = false;
myText foundText = null;
for (int i = 0; i < scheduledText.size(); i++) {
    // exact, character-by-character comparison
    if (current.text.equals(scheduledText.get(i).text)) {
        found = true;
        foundText = scheduledText.get(i);
        break;
    }
}
if (found) {
    System.out.println("MATCH");
} else {
    System.out.println("NON_MATCH");
}

I'm frustrated. What should I do? How can I handle these?


Answer 1:

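My proposed solution is to clean both strings with a function like the one below in your Java code, so that the special characters and leftover escape sequences no longer affect the comparison.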

private static String cleanTextContent(String text)
{
    // strips off all non-ASCII characters
    text = text.replaceAll("[^\\x00-\\x7F]", "");

    // erases all the ASCII control characters except \r, \n and \t
    text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

    // removes non-printable characters from Unicode
    text = text.replaceAll("\\p{C}", "");

    // keeps only printable ASCII (space through tilde)
    text = text.replaceAll("[^ -~]", "");

    text = text.replaceAll("[^\\p{ASCII}]", "");

    // removes literal "\xNN" escape text such as \xc2\xae
    text = text.replaceAll("\\\\x\\p{XDigit}{2}", "");

    // removes literal "\n" escape text (a backslash followed by n)
    text = text.replaceAll("\\\\n", "");

    text = text.replaceAll("[^\\x20-\\x7e]", "");
    return text.trim();
}

Once both strings have been cleaned with this function, you can use the Apache Commons Codec library to convert each one to an MD5 hash, something like this:

// requires: import org.apache.commons.codec.digest.DigestUtils;
private static String hashMyString(String text) {
    // hex-encoded MD5 of the text, upper-cased
    return DigestUtils.md5Hex(text).toUpperCase();
}

Finally just compare the two hashes in your main program.
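
The comparison step could be sketched roughly as follows. This is only an illustration under a few assumptions: the class NormalizedMatcher and its methods normalize, index and findMatch are made-up names, normalize is a shortened stand-in for cleanTextContent, and the cleaned strings are compared directly instead of through their hashes (equal cleaned strings always give equal hashes, so the outcome is the same).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original answer.
final class NormalizedMatcher {

    // Shortened stand-in for cleanTextContent: drops literal "\xNN" and "\n"
    // escape text, then every character outside printable ASCII.
    static String normalize(String text) {
        return text
                .replaceAll("\\\\x\\p{XDigit}{2}", "")
                .replaceAll("\\\\n", "")
                .replaceAll("[^\\x20-\\x7e]", "")
                .trim();
    }

    // Maps each cleaned text to the first original string that produced it.
    static Map<String, String> index(List<String> scheduled) {
        Map<String, String> byKey = new HashMap<>();
        for (String s : scheduled) {
            byKey.putIfAbsent(normalize(s), s);
        }
        return byKey;
    }

    // Returns the matching scheduled string, or null when there is none.
    static String findMatch(Map<String, String> index, String current) {
        return index.get(normalize(current));
    }

    public static void main(String[] args) {
        Map<String, String> idx = index(Arrays.asList(
                "Effects of BOTOX\\xc2\\xae may spread",
                "What if you had fewer headaches\\nand migraines a month?"));
        String hit = findMatch(idx, "Effects of BOTOX\u00ae may spread");
        System.out.println(hit != null ? "MATCH" : "NON_MATCH");
    }
}
```

Building the map once and looking each incoming string up in it avoids rescanning the whole list for every comparison, which may matter if the scheduled list is long.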

Edit: If you are using Maven, this is the dependency that provides DigestUtils.

<!-- https://mvnrepository.com/artifact/commons-codec/commons-codec -->
<dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.9</version>
</dependency>
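
If adding the dependency is not an option, roughly the same upper-case hex MD5 can be produced with the JDK's own java.security.MessageDigest. The sketch below is an assumption-laden substitute, not the answer's code: the class Md5Sketch and the method md5HexUpper are invented names, and the text is encoded as UTF-8 before hashing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stdlib-only replacement for the DigestUtils call.
final class Md5Sketch {

    // Returns the MD5 digest of the UTF-8 bytes of text as upper-case hex.
    static String md5HexUpper(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // %02X prints each byte as two upper-case hex digits
                hex.append(String.format("%02X", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is expected to be present in every standard JDK
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5HexUpper("abc")); // 900150983CD24FB0D6963F7D28E17F72
    }
}
```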

Edit: My full test code:

import org.apache.commons.codec.digest.DigestUtils;

public class App
{
    public static void main( String[] args )
    {
        // sides one and three hold the real characters; sides two and four
        // hold the same text with literal \xNN and \n escape sequences
        String sideOneString = "Effects of BOTOX® may spread";
        String sideTwoString = "Effects of BOTOX\\xc2\\xae may spread";
        String sideThreeString = "BOTOX injections take about\n15 mins";
        String sideFourString  = "BOTOX\\xc2\\xae injections take about\\n15 mins";

        System.out.println( hashMyString(cleanTextContent(sideOneString)));
        System.out.println( hashMyString(cleanTextContent(sideTwoString)));
        System.out.println( hashMyString(cleanTextContent(sideThreeString)));
        System.out.println( hashMyString(cleanTextContent(sideFourString)));
    }

    private static String hashMyString(String text) {
        // hex-encoded MD5 of the text, upper-cased
        return DigestUtils.md5Hex(text).toUpperCase();
    }

    private static String cleanTextContent(String text)
    {
        // strips off all non-ASCII characters
        text = text.replaceAll("[^\\x00-\\x7F]", "");

        // erases all the ASCII control characters except \r, \n and \t
        text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

        // removes non-printable characters from Unicode
        text = text.replaceAll("\\p{C}", "");

        // keeps only printable ASCII (space through tilde)
        text = text.replaceAll("[^ -~]", "");

        text = text.replaceAll("[^\\p{ASCII}]", "");

        // removes literal "\xNN" escape text such as \xc2\xae
        text = text.replaceAll("\\\\x\\p{XDigit}{2}", "");

        // removes literal "\n" escape text (a backslash followed by n)
        text = text.replaceAll("\\\\n", "");

        text = text.replaceAll("[^\\x20-\\x7e]", "");
        return text.trim();
    }
}

Result:

F928A529F380EB59575AC8A175FDFE79
F928A529F380EB59575AC8A175FDFE79
B4740299C53E18C9ECAF18BA35151D43
B4740299C53E18C9ECAF18BA35151D43
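
Note that the cleaning function throws the special characters away, so "BOTOX®" and "BOTOX" end up identical. If that is too lossy, one alternative (a different technique from the answer above, sketched here under the assumption that the \xNN sequences are UTF-8 bytes written out as text) is to decode the escapes back into real characters before comparing. EscapeDecoder, decode and canonical are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes escape text instead of deleting it.
final class EscapeDecoder {

    private static boolean isHex(char ch) {
        return Character.digit(ch, 16) >= 0;
    }

    // Turns literal "\xNN" into the byte it names and literal "\n" into a
    // newline, then decodes the resulting byte stream as UTF-8.
    static String decode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\\' && i + 3 < text.length() && text.charAt(i + 1) == 'x'
                    && isHex(text.charAt(i + 2)) && isHex(text.charAt(i + 3))) {
                out.write(Integer.parseInt(text.substring(i + 2, i + 4), 16));
                i += 4;
            } else if (c == '\\' && i + 1 < text.length() && text.charAt(i + 1) == 'n') {
                out.write('\n');
                i += 2;
            } else {
                // ordinary character: copy its UTF-8 bytes unchanged
                int cp = text.codePointAt(i);
                byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(bytes, 0, bytes.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Decoded text with every run of whitespace collapsed to a single space.
    static String canonical(String text) {
        return decode(text).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String a = canonical("Effects of BOTOX\u00ae may spread");
        String b = canonical("Effects of BOTOX\\xc2\\xae may spread");
        System.out.println(a.equals(b) ? "MATCH" : "NON_MATCH");
    }
}
```

With this approach the two sides are compared via canonical(...).equals(...), and strings that differ only in a special character stay distinct.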

Posted on 2019-02-20 02:45