Find links in NSString with html

Discussion in 'Mac Programming' started by GRMrGecko, Nov 18, 2008.

  1. GRMrGecko macrumors member

    Joined:
    Jun 7, 2008
    Location:
    Nowhere and everywhere
    #1
    Hello, I am trying to find links in an NSString that contains HTML. I am working on a web crawler and need to find all the links so that I can add them to my database.
    I tried RegexKit, but I couldn't get it to work at all.

    I know how to do this in PHP using preg_match_all:
    Code:
    <?php
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL,"http://example.com/");
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_5_5; en-us) AppleWebKit/528.5+ (KHTML, like Gecko) Version/4.0dp1 Safari/526.11.2");
    $result = curl_exec($ch);
    curl_close($ch);
    $links = array();
    preg_match_all("/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>.*<\/a>/siU", $result, $links);
    print_r($links);
    ?>
    Thanks for any help.
     
  2. kainjow Moderator emeritus

    Joined:
    Jun 15, 2000
    #2
    A quick and dirty way is to use the rangeOfString:options:range: method with NSCaseInsensitiveSearch. You can then use the returned NSRange in a loop to find all instances of your search string.
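
    The loop described above can be sketched as follows. This is only an illustration of the logic, written in Python rather than Cocoa: str.find stands in for the range-limited search, lowercasing both strings stands in for NSCaseInsensitiveSearch, and the helper name find_all is made up for the example.

```python
def find_all(haystack, needle):
    """Return the start offset of every case-insensitive match of needle."""
    if not needle:
        return []  # an empty search string would never advance the loop
    # Lowercase both sides once; this plays the role of a case-insensitive
    # search option (assumes lowercasing keeps offsets stable, as for ASCII).
    lowered, target = haystack.lower(), needle.lower()
    offsets = []
    start = 0
    while True:
        pos = lowered.find(target, start)  # search only from `start` onward
        if pos == -1:
            break                          # no more matches
        offsets.append(pos)
        start = pos + len(target)          # resume just past this match
    return offsets


html = '<a href="/a">A</a> <A HREF="/b">B</A>'
print(find_all(html, "href="))
```

    Each offset marks where an href= begins; pulling out the URL itself still means scanning forward to the closing quote, which is why this approach is quick but dirty.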
     
  3. GRMrGecko thread starter macrumors member

    #3
    I've decided to use NSXML to parse the HTML and XPath to get all the links.
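
    As a rough illustration of the parse-then-query idea (not the NSXML code itself), here is a minimal sketch in Python using the standard library's ElementTree. It assumes the markup is already well-formed and has no XML namespace; real-world HTML would need a tidy step first. The function name hrefs_from_xhtml and the sample page are made up for the example.

```python
import xml.etree.ElementTree as ET


def hrefs_from_xhtml(markup):
    """Return the href of every <a> element that has one, in document order."""
    root = ET.fromstring(markup)  # raises ParseError on malformed markup
    # ".//a[@href]" is ElementTree's limited XPath for "any descendant <a>
    # element that carries an href attribute".
    return [a.get("href") for a in root.iterfind(".//a[@href]")]


page = (
    '<html><body>'
    '<a href="/a">A</a>'
    '<p><a href="/b">B</a></p>'
    '<a name="top">no href here</a>'
    '</body></html>'
)
print(hrefs_from_xhtml(page))
```

    On the Cocoa side, the equivalent query would presumably go through nodesForXPath:error: with an expression along the lines of //a/@href.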

    But it would still be nice to know how to use regex like in preg_match_all.

     
  4. kainjow Moderator emeritus

    #4
    I've used several Cocoa ways of scraping web pages and they're all fairly slow (including NSXML with the tidy option, NSRanges, etc.).

    What I used to do was use NSTask to pipe the HTML to a Perl script, which would then use regex. That was the fastest method I found, even faster than C-based regex libraries (maybe I wasn't using them right?). It's a bit of a hassle to do that, though, and writing Perl is no fun ;)
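
    The pipe-to-a-script pattern looks roughly like the sketch below. It is not the original setup: Python stands in for both sides (subprocess.run for NSTask, a small inline Python script for the Perl one), and the regex is a simplified version of the preg_match_all pattern from post #1. The names CHILD_SCRIPT and links_via_child are made up for the example.

```python
import subprocess
import sys

# The child script: reads HTML on stdin, prints one href per line.
# Group 1 captures an optional opening quote, group 2 the URL, and \1
# requires the same (possibly empty) quote to close it.
CHILD_SCRIPT = r'''
import re, sys
pattern = re.compile(r'<a\s[^>]*?href=("?)([^" >]*)\1', re.I | re.S)
for match in pattern.finditer(sys.stdin.read()):
    print(match.group(2))
'''


def links_via_child(html):
    """Pipe html to the child script and return the links it prints."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],  # launch the helper process
        input=html,                            # written to the child's stdin
        capture_output=True,                   # collect stdout and stderr
        text=True,                             # str in, str out
        check=True,                            # raise if the child fails
    )
    return result.stdout.splitlines()


html = '<a href="/a">A</a> <A HREF=/b>B</A>'
print(links_via_child(html))
```

    The parent only ever sees plain text on a pipe, so the helper can be swapped for any language with a fast regex engine without touching the calling code.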
     
