1. Welcome to the new MacRumors forums. See our announcement and read our FAQ

MySQL Question Re: Complicated Comparisons

Discussion in 'Mac Programming' started by scem0, Aug 12, 2013.

  1. macrumors 604


    Thanks for taking the time to help:

    I have a table of media each with a unique ID.

    *  ID  *  Filename   *
    *  56  *  Cat.jpg    *
    *  57  *  Dog.jpg    *
    *  58  *  Hitler.gif *
    I have a table of tags, each with a unique ID.

    *  ID  *  Attribute  *
    *  20  *  Funny      *
    *  21  *  Sad        *
    *  22  *  Confusing  *
    I have a table that matches the tags to the media:

    * Media ID * Tag ID *
    *  58      *  22    *
    *  58      *  21    *
    *  56      *  20    *
    So, in this example, hitler.gif is confusing and sad. Cat.jpg is just funny.

    This table that associates tags with media will end up being HUGE. I don't mind it being huge, but I do need to be extra efficient when I work with it to keep CPU load down. I need an efficient way to find photos with very specific attributes. For example, I might need to find all photos that are both Funny (tag 20) AND Confusing (tag 22) but that are not also tagged as offensive (tag 25) OR NSFW (tag 30).

    I'm an SQL newbie, so I'm used to "SELECT x from y where Tag = 22 as a generic simple call to find specific data. But I can't just say where tag = 22 AND tag = 24. No ands will ever return a result because each association between tag and image is a separate row. No row tests true for two attributes. So, how do I do comparisons on multi-row data? Specifically, how do I do comparisons on multi-row data that combines looking for media with certain tags, specifically without certain tags, potentially with "ands", and potentially with "ors". Hopefully that makes sense. I'm trying to avoid doing this logic part with PHP, if possible.

    Sorry if the answer is obvious. I tried to google for this, and couldn't find anything because I wasn't quite sure what keywords to search for. :p

    Thank you, thank you, thank you for the help.
  2. macrumors 68040


    select a.media_id from media_tag a
    left outer join media_tag b on a.media_id = b.media_id and b.tag_id = 21
    left outer join media_tag c on a.media_id = c.media_id and c.tag_id = 24
    left outer join media_tag d on a.media_id = d.media_id and d.tag_id = 30
    where (a.tag_id = 20 OR b.tag_id is not null) and c.tag_id is null and d.tag_id is null;

    It's been a long time, so this is likely wrong, and definitely inefficient. What I'd hope this would yield is all media_ids where it is tagged with 20 or 21, but is not tagged with 24 or 30.

    I'm confident a DBA or SQL wunderkind will point out a clause or function that will turn this into a tiny, super efficient query. I thought I'd try to stretch my brain.

  3. macrumors 604


    Very enlightening, Lee. I'm the last person to be able to judge the efficiency of your code, so if some one else wants to weigh in, please do so. But I'm very thankful to have a direction, at least. So, thank you!

    – Emerson
  4. macrumors newbie


    You seem to have constructed a many to many table structure when I think you want a many to one scenario.

    As a media column can have many tags but an individual tag column can only have one media associated with it.

    How about if you structure the data like this:

    Keep the media table
    Drop the media id / tag id join table

    create a new table for each of:
    Funny Tag
    Sad Tag
    Confusing Tag

    each of these tables will have a column called 'media_id'

    representing each media objects id that is tagged for that particular category

    Then your queries will be much simpler.
  5. macrumors 68000

    This seems like it would be bad having to create a table for each tag. Wouldn't the other way work just fine, especially if it was indexed well?
  6. macrumors 6502a

    May I add that you can prefix each SQL statement with EXPLAIN and MySQL will tell you what it had to do to obtain the result.

    EXPLAIN SELECT data FROM table WHERE this > that;

    It will tell you about use of indexes, etc.

    Just a comment.
  7. macrumors 604


    Thanks pmau. I'll definitely try this out.

    Mattofsmeg –*thanks for the response. I don't think this will be practical for me because, in reality, I could have thousands of tags. For the example I only have three, but chances are I'll have a couple hundred to 1000+.
  8. macrumors 6502


    Another way of doing this is with self-joins and subquery expressions (lets assume that 'media_tag' is the table of relations between media_id and tag_id):

        (SELECT DISTINCT media_id FROM media_tag WHERE tag_id = 24) as A JOIN
        (SELECT DISTINCT media_id FROM media_tag WHERE tag_id = 22) as B ON A.media_id = B.media_id
        LEFT OUTER JOIN (SELECT DISTINCT media_id FROM media_tag WHERE tag_id = 25 OR tag_id = 30) as C ON A.media_id = C.media_id
        WHERE C.media_id IS NULL;
    This should result in a set of all media_ids which have a relation to tag_id 24 and 22, but do not have a relation to tag_id 25 or 30.

    The first inner-join enforces the AND part of the desired query (has both tags 24 and 22), while the LEFT OUTER JOIN ... WHERE ... IS NULL is a sort of anti-join that excludes the OR part of the query (does not have tags 25 or 30).
  9. 960design, Aug 14, 2013
    Last edited: Aug 14, 2013

    macrumors 65816

    I dislike complex SQL statements. I use multiple simple SQL statements.
    Far easier to debug and modify over time, at least for me.

    You've created the most efficient table structure for a many to many ( well if you have unique ids for the media_tag table ) relationship.

    HUGE is not a problem for indexed SQL. A very loose definition for huge is a little over 2 Billion elements in an SQL table.
  10. macrumors member

    Well, not sure if you can do something like this and if I understand your requirement correctly. :rolleyes:

    my tag table
    20 funny
    21 sad
    22 confusing
    25 offensive
    30 NSFW

    my match table
    58 22
    58 21
    56 20
    58 25
    SELECT *
      FROM (SELECT media_id,
              FROM match a,
                   media b,
                   tags c
             WHERE a.media_id = b.ID
               AND a.tag_id = c.ID) a
     WHERE a.tag_id IN (20, 22) -- get media_id that has these tags.    
      -- AND a.tag_id NOT IN (25); -- don't do this because we need to look as a 'group'. Use below subquery. 
       AND NOT EXISTS -- do not include media_id that has these tags
              (  SELECT 'x'
                   FROM match a1
                  WHERE a1.tag_id IN (25,30)
                    AND a.media_id = a1.media_id
               GROUP BY media_id);
    media_id tag_id attribute filename
    56 20 funny cat.jpg

    PS. I tried this on Oracle, and never worked on MySQL or mobile database.
  11. Moderator


    Staff Member

    2 responses mention indexes. The OP does not.

    Are these tables indexed?
  12. aly
    macrumors member

    I came up against a similar type of problem a while back. I ended up with a table with millions of rows and while this is of course fine, I found that using SELECT statements and trying to join tables lead to significant waits after a query was sent. My solution was to break the table up in to multiple tables which is similar to the suggestion above to have a table per tag.

    Alternatively have you thought of creating a table with multiple columns where each column is a different tag and each row corresponds to a Media ID? Then it is very simple to return all tags of one Media ID or return all Media IDs meeting a particular tag or set of tags. Managing the adding of tags is a case of adding a column to this table. Each of the tag columns can also be a simple bit type.

    I guess this method depends on the column limitations of a table? Which I don't know.
  13. macrumors newbie

    Ok sweet


    Ok, do you have any of your own ideas?
  14. macrumors member


    I don't have much experience in MySQL but I have done some complex SQL statements in DB2, SQL Server and MS Access.

    Here are 2 examples in DB2 format that would acheive what you want.
    I would expect that some variation of these would work in MySQL.
    If you are only dealing with a few thousand records, I would expect it to be quick, but indexing the tables would improve the speed.

    1. Join to a summarised sub-select:
    select m.ID, m.Filename
    from media as m
    join ( select tm.MediaID,
    sum(if tm.TagID=20 then 1 else 0 end) as s20,
    sum(if tm.TagID=22 then 1 else 0 end) as s22,
    sum(if tm.TagID=25 then 1 else 0 end) as s25,
    sum(if tm.TagID=30 then 1 else 0 end) as s30
    from tag_media as tm
    group by tm.MediaID ) as s on s.MediaID = m.ID
    where s.s20>0 and s.s22>0 and s.s25=0 and s.s30=0;

    2. Check for presence of media ID in 4 lists of media IDs, 1 for each tag selection:
    select m.ID, m.Filename
    from media as m
    m.ID in ( select tm1.MediaID from tag_media as tm1 where tm1.TagID=20 ) and
    m.ID in ( select tm2.MediaID from tag_media as tm2 where tm1.TagID=22 ) and
    m.ID not in ( select tm3.MediaID from tag_media as tm3 where tm3.TagID=25 ) and
    m.ID not in ( select tm4.MediaID from tag_media as tm4 where tm4.TagID=30 );
  15. macrumors 6502a

    Some really interesting approaches in this thread.

    In MySQL, 'select in' queries can have terrible performance. It has something to do with the (select ... ) statement being executed once for every row of the main select query. The approach I take is to run the ( select ... ) queries individually, and then build the sql with implodes, e.g. (not working code!):

    $result1 = execute('select tm2.MediaID from tag_media as tm2 where tm1.TagID=22');
    $result2 = execute('select tm2.MediaID from tag_media as tm2 where tm1.TagID=25');
    $result3 = execute('select tm2.MediaID from tag_media as tm2 where tm1.TagID=30');

    $query = 'select m.ID, m.Filename
    from media m
    m.ID in ( implode(',', $result1 )
    and m.ID in ( implode(',', $result2 )
    and m.ID not in ( implode(',', $result3 )'
  16. sreeves, Aug 14, 2013
    Last edited: Aug 14, 2013

    macrumors newbie

    GROUP_CONCAT is a fun solution

    In the following example we search for pics with tags 1 and 3 but not 2 or 4:
    SELECT picid, GROUP_CONCAT(tagid ORDER BY tagid) AS thispicstagids FROM matches
    GROUP BY `picid`
    HAVING thispicstagids REGEXP '^([0-9]*,)*1(,[0-9]*)*,3(,[0-9]*)*$'
    AND thispicstagids NOT REGEXP '(^|,)2(,|$)'
    AND thispicstagids NOT REGEXP '(^|,)4(,|$)'
    We create an ordered list of all the tags of each pic with the GROUP_CONCAT function.
    Then we use a regular expression to search that list. Here's a breakdown of that expression:
    '^' starts us at the beginning of the string. If this wasn't here, due to the nature of the next part, we might match numbers that ended with the figures (like allowing '24' when you only wanted '4').
    '([0-9]*,)*' searches for any amount (the last asterisk) of the group in parenthesis, being any amount of digits (the class declared by the square brackets) followed by a comma.
    '1' searches for our first search id. Replace with the first figure in your array.
    '(,[0-9]*)*' searches for any amount of the group of a comma and any number of digits. This is where the search should end if you are searching for only one tag.
    ',' then there should be at least one comma between the first tag and the second. Only insert this comma if you're looking for more than one tag.
    '3' searches for our second search id. Replace with the second and following figures in your array.
    '(,[0-9]*)*' searches for any amount of the group of a comma and any number of digits, allowing for tags to appear after your last search figure. If this wasn't here, we'd match 3 when 33 was present.
    '$' matches the end of the string, for the same reason we matched the beginning.
    The negative searches use NOT REGEXP, and search for our figures 2 and 4 preceded exclusively either by the beginning of the string or a comma, and followed exclusively either by a comma or the end of the string

    I haven't tested this as thoroughly as one might, but you might enjoy using GROUP_CONCAT for this.

    Note I am only working with the matches table, because that's where all this magic happens.

    I should also post this on my wiki at http://shawnreeves.net/wiki/index.php?title=SQL_recipes


    I just realized that what precedes the first tag could simply be the beginning of the string or a comma, so I shortened the REGEX. Likewise, final search term could also be followed by a search for a comma or the end of the string, shown in this example of searching for three tag ids (1,3,72) instead of two:
    SELECT picid, GROUP_CONCAT(tagid ORDER BY tagid) AS thispicstagids FROM matches
    GROUP BY `picid`
    HAVING thispicstagids REGEXP '(^|,)1(,[0-9]*)*,3(,[0-9]*)*,73(,|$)'
    AND thispicstagids NOT REGEXP '(^|,)2(,|$)'
    AND thispicstagids NOT REGEXP '(^|,)4(,|$)'
    Also, note you need to order the tags in the regexp in the same order they'd appear in the concatenation.
  17. Irfin, Aug 21, 2013
    Last edited: Aug 21, 2013

    macrumors newbie

    Try IN operator

    Suppose you want to retrieve all media which have tags funny(20), confusing (22), but not offensive (25) or NSFW (30). You can try to use the 'IN' operator:

    SELECT media_id FROM tags_to_media
    WHERE tag_id IN (20, 22)
      AND tag_id NOT IN (25,30)
    Of course you need to first make list for the tags you want to include and excluded. Perhaps querying them first and then construct the SQL in string form, and then execute it.

    Sorry bad english here.

Share This Page