MySQL / PHP String Comparison

Discussion in 'Web Design and Development' started by SChaput, Jul 22, 2009.

  1. SChaput macrumors regular

    Joined:
    Jul 2, 2008
    Location:
    United States
    #1
    Good morning all,

    I'd appreciate your guidance on this issue, i currently have a website in which people are allowed to post anything they want regarding certain topics. I use php to output the information from my mysql database onto the main page of site.

    From there i display a link, bringing the user to another page, while passing the ID of the original post through the URL. On this new page i'd like to compare , every word of the original post (which i can put into a variable by querying the database for that ID number) to every word of another paragraph, from another table in my database. If it finds some sort of similarity's between the two, to print out the paragraph from the second table.

    I originally coded this using the mysql LIKE while incorporating the use of % and _, until i realized this was comparing the entire string to the entire string, with wild-cards on the end.

    From there i managed to explode the original post but am just giving my headache trying to figure out how to compare each word of the 1st post, to every word in the 2nd post.

    Any help is greatly appreciated, Thanks!
     
  2. SrWebDeveloper macrumors 68000

    SrWebDeveloper

    Joined:
    Dec 7, 2007
    Location:
    Alexandria, VA, USA
    #2
    Actually, using "like %$string%" in PHP/MySQL will return a record if $string exists in that table at any position, no need for additional wildcards. Unless I am mis-understanding you, please post a highly simplified example using sentences instead of paragraphs for each and an example query you've tried which doesn't do as you expected.

    -jim
     
  3. SChaput thread starter macrumors regular

    Joined:
    Jul 2, 2008
    Location:
    United States
    #3
    Alright, i dont have access to my direct code right now, but ill try and post a simple example of what im trying to accomplish.

    Table Posts
    ---------------------
    ID:15
    ----------------------
    POST: Today i did this and That
    ---------------------

    Table Compare
    ----------------------
    comparetext: something something this something
    ------------------------

    Code:
    $id=$_GET['id'];
    $sqlFindOriginal = "SELECT * from posts WHERE id=$id"
    		mysql_query(sqlFindCompare);
    ///put $post into variable called, $originalpost
    
    
    $sqlFindCompare = "SELECT * from compare WHERE comparetext LIKE %$originalpost%'";   
    mysql_query(sqlFindCompare);
    
    ///Output anything from compare that is similar to Posts
    
    Hope this helps. From what i can remember from my testing these sort of queries only returned items that had one word similarities. For instance if the table structure was,
    Table Posts
    ---------------------
    ID:15
    ----------------------
    POST: test
    ---------------------

    Table Compare
    ----------------------
    comparetext: test
    ------------------------

    It would work...
    However, if the table structure was

    Table Posts
    ---------------------
    ID:15
    ----------------------
    POST: test
    ---------------------

    Table Compare
    ----------------------
    comparetext: something testing
    ------------------------
     
  4. SrWebDeveloper macrumors 68000

    SrWebDeveloper

    Joined:
    Dec 7, 2007
    Location:
    Alexandria, VA, USA
    #4
    I follow, working on a solution, will post in a bit...
     
  5. SChaput thread starter macrumors regular

    Joined:
    Jul 2, 2008
    Location:
    United States
    #5
    That is excellent to hear, thank you very much for your assistance.
     
  6. SrWebDeveloper macrumors 68000

    SrWebDeveloper

    Joined:
    Dec 7, 2007
    Location:
    Alexandria, VA, USA
    #6
    Tested with my own database, works for me...

    First the code, then I'll explain:

    PHP:
    // You'll extract this from your Posts table, I'm simply setting a variable for testing...
    $post="Jim was here! 'testing'";  

    // PHP to extract each word from $post into an array, loop through it creating proper SQL syntax for REGEXP
    $sql_post='';
    $sql_post_arr=str_word_count(trim(strtolower($post)),1);
    for (
    $counter=0$counter<count($sql_post_arr); $counter++) {
        
    $sql_post.=trim($sql_post_arr[$counter],"'\"");
        if (
    $counter<count($sql_post_arr)-1) {$sql_post.="|";}
    }
    // Query to pull data from compare.comparetext if post words exist (adjust SQL for your needs, add your own connection/DB select code of course)...
    $sql="select comparetext from compare where lower(comparetext) REGEXP '$sql_post'";
    $result=mysql_query($sql);
    while (
    $data=mysql_fetch_assoc($result)) {
        print 
    $data['comparetext'];
    }
    The comments explain most of it, but for simple testing I set $post to simulate the post data you'll extract from your Posts table. I then use the handy PHP str_word_count function to find all the words and stuff them into a simple array. I loop through the array creating a string named $sql_post which holds the proper SQL syntax (each word appended by a pipe character "|") for the powerful REGEXP expression in the select statement a few lines down. I then include that string in the query and run it, returning any comparetext rows where the expression is true.

    The actual query will look like this at runtime:

    Hopefully this is what you want, otherwise you get the concept and should be able to take it from here. If not, ask and I"ll be happy to help further.

    Technical notes:

    This method uses one query, but if the post data contains alot of words (hundreds) the query will be quite long, obviously. You might want to remove dupe words in $post before running the query if this is true and the query is run often, wasting your resources. Also, other folks might suggest using the MySQL "match" function, but bear in mind that function requires fulltext fields and you didn't say you had them, so my method works universally. Also, faster, as REGEXP is known for its speed.

    Thanks! :cool:

    -jim
     
  7. SChaput thread starter macrumors regular

    Joined:
    Jul 2, 2008
    Location:
    United States
    #7
    Looks great! I'm at work for the next few hours but i'll try the code out when I get home and post if I'm succsesfull or not.
    Thanks alot for your quick and very helpful responses!
     
  8. memco macrumors 6502

    Joined:
    May 1, 2008
    #8
    Just a note to add to Jim's response–you can just pull the data from the two tables into an array and do the comparisons on each that way. This way, you can close the DB connection quicker and you can let PHP handle the heavy lifting required by the regexp. I don't have the syntax down for PHP very well, but I know you'd use mysql_fetch_array for this.
     
  9. SrWebDeveloper macrumors 68000

    SrWebDeveloper

    Joined:
    Dec 7, 2007
    Location:
    Alexandria, VA, USA
    #9
    Respectfully, that is not advised if the data set is large - and in this case it's not just one table with each record containing one or more paragraphs, it's two. I might concede it would be fine for this specific user if 1) fulltext was not involved and 2) small website, not alot of traffic, small number of records now and in the future. In short, the database is VERY efficient at doing things like this due to query optimization as a result of indexing techniques. In this case REGEXP is fast (due to the where clause which uses indexes) as opposed to STRCMP or MATCH which don't use indexing (and neither does PHP when looping through/comparing arrays). I hinted at REGEXP in MySQL being a fast technique in my earlier comments, and acknowledged dupe checking would be wise on the PHP side for $post prior to the query. Since you raised the issue now you see technically why I said that about REGEXP.

    I'm not saying there there aren't other solutions out there, and I thank you for your input, memco, as always.

    -jim
     
  10. savar macrumors 68000

    savar

    Joined:
    Jun 6, 2003
    Location:
    District of Columbia
    #10
    If I understand your objective, you want to be able to show text that is related to the text that the user submitted?

    Querying word-by-word is going to be incredibly slow once you build up a medium sized dataset.

    A really nice solution would be to use the MySQL fulltext index.

    I can't get on their website right now, but I found this: http://www.databasejournal.com/feat.../Using-Fulltext-Indexes-in-MySQL---Part-1.htm

    Pay attention to the section about "relevance"... I think you could use that to guage how similar two records are.

    If this doesn't cut it, then the only remaining solution would be to build your own index that does exactly what you need. Part of the problem is going to be how you define "similarity" -- how do you quantify similarity and measure it? How do you perform a 3-way comparison to order which of 2 articles is most similar to the 1st?? Lots of thorny questions that you'll need to consider.
     
  11. savar macrumors 68000

    savar

    Joined:
    Jun 6, 2003
    Location:
    District of Columbia
    #11
    This is a good example of why not to learn programming on a forum. Especially a non-programmers forum.

    Your solution may be workable, but in practice it would be the worst possible way to solve the problem in a real world business.

    Edit: Also, for the OP, I took a look at the code you posted. I hope you realize that it has SQL injection vulnerabilities.
     
  12. SrWebDeveloper macrumors 68000

    SrWebDeveloper

    Joined:
    Dec 7, 2007
    Location:
    Alexandria, VA, USA
    #12
    @savar

    You were a little more blunt than I in responding to memco, but let's keep it friendly and also we all know this forum isn't the University of PHP, nobody earns credits here! This education is free and with all things free, you get what you pay for! I think most of us realize that, the rest won't get it anyway and are beyond help! As to the other issues you raised with the OP, all good points but maybe tone it down a bit, so newbies are encouraged to participate and learn. Thanks. I'll refrain from any further replies until the OP tests the code and we'll go from there as to fulltext and fixing SQL injection issues.

    -jim
     
  13. SChaput thread starter macrumors regular

    Joined:
    Jul 2, 2008
    Location:
    United States
    #13
    I appreciate it Jim.

    I also appreciate the code you supplied above working wonders with what i intended on doing. Im very thankful.

    I'm trying to edit it a little bit in order to make sure that when there are no matches it displays something to the effect of, "none found' and maybe even leave out certain words, such as 'he' 'she' 'the' etc etc.. I'm trying to find the number of rows but get the error,

    Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /compare on line 172

    Code:
    $sql="select entry1 from compare where lower(entry) REGEXP '$sql_post'";
    $result=mysql_query($sql);
    $number=mysql_num_rows($sql); 
    
    Also, i'm not sure the source of this other error, it may be because i havent impleted the above solution yet but heres what gets displayed, sometimes.

    Warning: mysql_fetch_assoc(): supplied argument is not a valid MySQL result resource in /compare on line 175

    Thanks Again!
     
  14. angelwatt Moderator emeritus

    angelwatt

    Joined:
    Aug 16, 2005
    Location:
    USA
    #14
    I believe they're referring mostly to SQL injection attacks. Here's some starter reading from Wikipedia.
     
  15. ChrisA macrumors G4

    Joined:
    Jan 5, 2006
    Location:
    Redondo Beach, California
    #15
    OK I've read the comments above. If you really want to compare one paragraph to potentially tens of thousands of other paragraphs you can NOT read each of the other paragraphs one at a time and compare them. You need to query this out of the DBMS in one simple query.

    Here is the way it is commonly done at place like Google

    What you do is when you store the paragraph you first create a "word list" that contains the unique words in the paragraph.

    Also you have a "dictionary" table in the DBMS that has one row for every unique word you have seen so far in all stored paragraphs. This table might grow to 20,000 rows or so but I bet only about 5K. But in any case even 50,000 is small.

    You have another table that keeps the paragraph text and a "paragraph ID" number

    A third table relates the paragraph ID to the unique word ID. This table will grow to become truely HUGE. (But disk are cheap now, $100 per TB.) A simple query of this table can find paragraphs that use any set of words. It is fast if you keep the columns indexed. So now after I did step #1 and have a word list for my new para. I can query the huge table for "all para_ID where word_ID=x or word_ID=y or word_ID=Z and find them all.

    The trick is to query only for ther "Importence" words and NOT every word in the new para. Hence the "statistics" in the dictionary. You might look up the "Importence" of each word then query only for the N most importent words. or the Nth percentile most importent or whatever

    The in the SQL also try "order by" the count of the matches and "limit" to the first N found.

    So there you have it, you've search 100,000 paragraphs in just one SQL query

    Also you might want a synonym table and so other refinements so you don't do the match on common words.

    Remember to keep all these table in normal form. ou need them normailed.

    In short the KEY to what Google does is to keep a list of unque words an a huge set of points that point to the location where each word is used.
     
  16. SrWebDeveloper macrumors 68000

    SrWebDeveloper

    Joined:
    Dec 7, 2007
    Location:
    Alexandria, VA, USA
    #16
    @OP:

    Glad the code worked for you, and you're expanding it for your needs. Others have responded with advanced (and strongly suggested) techniques that are probably above your current skill level, but worth researching, such as fulltext, setting up an index to speed searches and SQL injection to make you a better overall coder.

    BTW, "$number=mysql_num_rows($sql);" in your code should be "$number=mysql_num_rows($result);" - pass the result set not the query.

    -jim
     
  17. ChrisA macrumors G4

    Joined:
    Jan 5, 2006
    Location:
    Redondo Beach, California
    #17
    When you test this the one thing to avoid is using an small "test" database. The reason is that typically with this kind of application performance will be quite good with a small number of paragraphs but then as the database grows a threshold is cross and the performance falls by a factor of 1,000. No kidding you cross a line inthe sand and the system crawls to a stop. The reason is that MySQL or any other database tend to cache stuff in RAM. When tables are small the entire table fits in RAM. But at some point it will no longer fit and every read thn requires a disk access which is literally 1000 times slower. I've been bitten by this more than once. Something that once took a half second now takes an hour. Users will not wait an hour.

    The solution is to write a program that will fill the database tables with test data with can be giberish. I use random number generators. You have to test with a realistic amount of data. The reason is to detect problems easy. Find them before investing much of your time in a dead end design path.

    The solution I proposed is really the only thing that has been found to work well with large collections of text and it is a well know approach. You can read more here.
    http://en.wikipedia.org/wiki/Inverted_index

    You do not have to write this code. As I said this is a common, well known algorithm and you can either choose a database that directly implements inverted indexes and full text searches using them, or you can use a free open source search library to handle this for you. It is like a math library. No one should have to code a square root function. Same here. Spend a few days hunting around using Google and choose tools that are well matched to your task. Yes you can use PHP and MySQL but that is like wring a square root function yourself. It would likely take months.
     
  18. memco macrumors 6502

    Joined:
    May 1, 2008
    #18
    *Insert foot in mouth*

    @Jim and Savar: Thanks for the explanation on that.
     
  19. savar macrumors 68000

    savar

    Joined:
    Jun 6, 2003
    Location:
    District of Columbia
    #19
    I'm just not a big fan of people who build some toys and then start applying for $80k programming jobs. Believe me, I've interviewed plenty of "Senior" developers who didn't even know the syntax of the language that they were applying for.

    If you're a novice programmer, teaching yourself by building stuff -- that's AWESOME -- and I wish you the best. But those programmers need to be open to criticism, since they essentially don't know anything yet. Learning the basics is just a small step down the complex path of fully understanding software construction.

    Anyway, I try to always be constructive, but I will admit that I am blunt about it. My greatest fear is that somebody else will visit this thread a year from now and copy-paste the code into their own script, not realizing that it has security vulnerabilities.

    Security is nothing to laugh at, because computers affect everything we do all day long. Most beginning programmers aren't aware of the security ramifications -- partly because nobody (or no books) teach them.
     
  20. SrWebDeveloper macrumors 68000

    SrWebDeveloper

    Joined:
    Dec 7, 2007
    Location:
    Alexandria, VA, USA
    #20
    Criticism = good
    Concerns = good
    Copying bad code = unstoppable

    On the last point, we can't ask the OP to re-edit their post because you're concerned. Better left as is, and also the thread follows a natural evolution to resolve such matters, so anyone dumb enough to copy/paste without reading the ENTIRE thread deserves the pain and suffering. But it's beyond our control.

    And on other points you raised, agreed entirely and let's all remember not everyone is at the same skill level so the best thing we can do is post helpful links resolve problems as well as pointing them out.

    I'm outta here, cya guys.

    -jim
     

Share This Page