Java: String replaceAll method usage causes infinite loop?

Discussion in 'Mac Programming' started by dejo, Dec 7, 2007.

  1. dejo Moderator

    dejo

    Staff Member

    Joined:
    Sep 2, 2004
    Location:
    The Centennial State
    #1
    I have some code that processes a String and replaces the path for SRC attributes (from HTML) with a new path. But with certain attribute values the replaceAll method seems to go into an infinite loop rather than just not matching (or matching). If you run the following Java, rather than getting output for the modifiedString, it just appears to hang. Is this a legitimate bug with Java's replaceAll method or am I doing something wrong?

    Code:
    class ReplaceAllBugApp {
        public static void main(String[] args) {
            String originalString = new String("<img src=\"/images/subdir/company_logo_final.jpg />");
            System.out.println("Original String: " + originalString); // Display the original string.
            String modifiedString = originalString.replaceAll("(?i)src ?= ?\"((/?[\\w\\d\\-@%]+)*/)?([\\w\\d\\.\\_\\-@]+\\.[\\w\\d]+)\"", "src=\"/webapp/UserFiles/Images/$3\"");
            System.out.println("Modified String: " + modifiedString); // Display the modified string.
        }
    }
    
    P.S. I am running Java 5 on Windows, in case it matters.
     
  2. robbieduncan Moderator emeritus

    robbieduncan

    Joined:
    Jul 24, 2002
    Location:
    London
    #2
    I'd not be surprised if it could end up in an infinite loop if the string you replace ends up inserting the string you are replacing...
     
  3. dejo thread starter Moderator

    dejo

    Staff Member

    Joined:
    Sep 2, 2004
    Location:
    The Centennial State
    #3
    But the strings are definitely different in this example case. We're going from src="/images/subdir/company_logo_final.jpg to src="/webapp/UserFiles/Images/company_logo_final.jpg".

    And here's some fun stuff: if I add a (proper) closing " to the end of the src value (it's missing in the original example), the code works as hoped. But it shouldn't hang in the original case. It should just not match and do no replacing.
     
  4. jeremy.king macrumors 603

    jeremy.king

    Joined:
    Jul 23, 2002
    Location:
    Fuquay Varina, NC
    #4
    Are you simply looking to replace the path on the images?
     
  5. dejo thread starter Moderator

    dejo

    Staff Member

    Joined:
    Sep 2, 2004
    Location:
    The Centennial State
    #5
    Yes.
     
  6. kylos macrumors 6502a

    kylos

    Joined:
    Nov 8, 2002
    Location:
    MI
    #6
    Run it through a debugger and see where it hangs. I'd recommend Eclipse, as the version I'm using has the source for java.util.regex (used by String.replaceAll()), so you can step through the code and possibly see the actual problem.

    From Matcher.replaceAll (used by String.replaceAll()):
    This method first resets this matcher. It then scans the input sequence looking for matches of the pattern. Characters that are not part of any match are appended directly to the result string; each match is replaced in the result by the replacement string. The replacement string may contain references to captured subsequences as in the appendReplacement method.

    This para seems to indicate a sequential match, not a recursive one, so there should be no issues with rematching an already matched tag.
     
  7. jeremy.king macrumors 603

    jeremy.king

    Joined:
    Jul 23, 2002
    Location:
    Fuquay Varina, NC
    #7
    Does this help at all?

    Code:
    String modifiedString = originalString.replaceAll("(?i)src ?= ?\"(/?[^/]+/)*([^\"]+\")", "src=\"/webapp/UserFiles/Images/$2");
    
    I tend to approach negative logic approach when using regexp (ie. everything but some character)

    Tested fine in my limited test cases...

    I think the where the regexp gets all weird is when you don't have that closing quote in your test case, you expose the closing slash of the image tag to the matcher and it bombs somehow. I suggest you submit that test program to sun as a bug.
     
  8. dejo thread starter Moderator

    dejo

    Staff Member

    Joined:
    Sep 2, 2004
    Location:
    The Centennial State
    #8
    Thanks, kingjr3, that should do it! I think I will heed your advice and revisit some of my other regular-expressions and see if I can rework with the negative logic approach. And I will submit this to Sun as a bug. I know it's not the best regex but it should never hang. I expect no matching when I mess up my regex.

    P.S. Now to just adjust the regex so that the closing double-quote is not the only allowed delimiter of the src= value but also whitespace too. That way, replacement WILL occur for either src="..." or src=... (or even src="... or src=...").

    P.P.S. How does this look?
    Code:
    String modifiedString = originalString.replaceAll("(?i)src ?= ?\"?(/?[^/]+/)*([^\\s|\"]+)\"?\\s", "src=\"/webapp/UserFiles/Images/$2\" ");
    
    This way I force the new src value to be surrounded by "s.
     
  9. kylos macrumors 6502a

    kylos

    Joined:
    Nov 8, 2002
    Location:
    MI
    #9
    That'll work better in case your site isn't yet up to snuff with quotes. Technically, it accepts invalid input, but then it make such input valid, so there's nothing wrong with that.
     
  10. dejo thread starter Moderator

    dejo

    Staff Member

    Joined:
    Sep 2, 2004
    Location:
    The Centennial State
    #10
    Re: Proposed Solution

    So, seems that once I try to apply this regex to a much larger HTML string (220K), I'm getting a stack overflow exception. Thoughts?
     
  11. kylos macrumors 6502a

    kylos

    Joined:
    Nov 8, 2002
    Location:
    MI
    #11
    Does it work on a much smaller file? Something with more than one src attribute but say only 1k in size? The reason I ask is because you are using greedy operators in your regex which read in the entire file and then back off until they find an acceptable match. Depending on the implementation of the regex engine, this could result in a lot of data being pushed on to the stack.

    Check the Sun tutorial on quantifiers for more info.

    I would recommend using the following

    Code:
    "(?i)src ?= ?\"(/?[^/]+?/)*?([^\"]+?\")"
    instead of

    Code:
    "(?i)src ?= ?\"(/?[^/]+/)*([^\"]+\")"
    The question marks following the (+,*) operators turn them into reluctant operators, meaning they will start with the smallest possible match and then increase until they fail, instead of starting with the largest string and stopping when it becomes a valid match.
     
  12. jeremy.king macrumors 603

    jeremy.king

    Joined:
    Jul 23, 2002
    Location:
    Fuquay Varina, NC
  13. SC68Cal macrumors 68000

    Joined:
    Feb 23, 2006
    #13
    Welcome to the limitations of Java. Your VM is probably too small to handle the task. Allocate some more memory.
     
  14. kylos macrumors 6502a

    kylos

    Joined:
    Nov 8, 2002
    Location:
    MI
    #14
    While increasing the VM stack using the VM option -Xss may help avoid stack overflows for this situation, the stack overflow is only indirectly related to the file size While local references to objects are stored on the stack, the actual objects are stored on the heap. In other words, the String data associated with the file is not stored on the stack, so the file size does not directly cause the stack overflow.

    In Java, only method calls, primitive types, and references are stored on the the stack. Thus, in Java, stack overflows can only plausibly be caused by deep recursion. Using greedy quantifiers on large files when only a small portion of the data will be matched will cause massive backtracking, which will not only take much longer to process, but will also potentially cause stack overflows due to apparently deep recursion in the regex engine.

    More information on Java memory structure.

    It would be interesting if the OP would post a stack trace to see what is causing the overflow.

    In the meantime, I would wager that using reluctant quantifiers would not only dramatically speed up execution, but would also eliminate stack problems.
     
  15. jeremy.king macrumors 603

    jeremy.king

    Joined:
    Jul 23, 2002
    Location:
    Fuquay Varina, NC
    #15
    I'll put my money on Pattern.matches or Matcher.matches!

    Overly ambitious regexp will almost always cause stack overflow because of the recursive nature of the Pattern/Matcher classes.

    I don't think its unreasonable to try the reluctant quantifiers, modify the pattern to include end of line terminator, or simply read the file line by line (or tag by tag) and apply the pattern accordingly...

    Several ways to skin this cat :)
     
  16. kylos macrumors 6502a

    kylos

    Joined:
    Nov 8, 2002
    Location:
    MI
    #16
    Agreed. :)
     

Share This Page