MySQL question: disambiguation

BlueRevolution · Jul 13, 2009

MySQL isn't my strong point, but normally I do all right. This one is beyond my knowledge though.

Background: I'm starting a website akin to Identifont to identify types of aircraft (737, A320, etc) based on a series of questions: wing shape, number and type of engines, and so forth.

Basically, the site content will be contributed by a group of privileged users who can add and categorize new aircraft and attributes. New attributes will be needed to distinguish between similar aircraft to ensure that the user isn't presented with 40 possible candidates based on a given set of responses. I need a query that can identify these lumps of data, sets of user responses that would result in a large number of possibilities returned. That information would then be presented in an admin panel, with the largest lumps listed first.

More details on the problem here and the database here. I'm sorry if the question is a bit confused, that's probably just the questioner's confusion coming across.

trule · Jul 14, 2009

I would have a long hard think about what you are wanting in your UI and then work from there. You will be wanting JOINs between the three tables or possibly a sub query.

I think you want to SELECT all ATTRIBUTES that are not existing in DESCRIPTION for a particular AIRCRAFT ID

SELECT * FROM ATTRIBUTES WHERE ID NOT IN(SELECT ATTID FROM DESCRIPTION WHERE AIRID = ?)

something like that

BertyBoy · Jul 14, 2009

I like a challenge, I'll see what I can come up with tonight, while I'm not busy on Diablo II.

Don't worry, I do this for a living (I wish it was for playing Diablo II) with very large databases.

Edit:

A quick think over a cup of tea.

I see a (basic) solution over 4 tables. Simplifying the names to keep SQL to a minimum and to avoid confusion. From your other posts I'm presuming that you want to limit the answers available to the questions (the attributes the site users can select values for), at least for some qttributes (ie. engines, wheels, etc, but it would be tiresome for others, eg. max seats, max range, etc).
Table A - Aircraft
Table B - Attribute Names or Questions, what you've called "Attributes", ie. "Number of Engines" is Question 1.
Table C - Answers, ie. for question 1 there will be records for "1", "2", "3", "4", "6" and "8". Keys will be the question number and the id of the answer record. id doesn't have to be unique across the whole table, just within the question, but it makes no difference if it is unique across the table.
Table D - Data, what you've called "Description". This is the actual data on the aircraft, contains an aircraft id, and the question id and answer id from Table C.

So a query like (where DD is an alias for table D):
SELECT * FROM D WHERE NOT EXISTS (
SELECT 1 FROM DD
WHERE DD.aircraft_id != D.aircraft_id
AND DD.question_id = D.question_id
AND DD.answer_id != D.answer_id )

It's basic, but it's looking for any (other) aircraft where there isn't a different answer to the same question. It's not going to be quick when you get into many thousands of aircraft, think you mentioned 3,000 with six or seven attributes ("questions") each, even then it won't be too slow. And if you add an index to Table D on question it'll help.
Each match will be reported twice, once for each aircraft in the match, you could cut down on this by restricting the fields chosen in the outer SELECT (currently *) and grouping by aircraft_id.

The query as is has limitations, it's not going to list aircraft where you have set no attributes at all. And it's not going to easily tell you which aircraft it clashes with - not helped by allowing aircraft to miss attributes.
But with a specific aircraft in mind, you can report on the other aircraft it clashes with by a simple adjustment to the original query (with my_id as the aircraft_id we want to report on):

SELECT distinct D.aircraft_id FROM D
WHERE D.aircraft_id != my_id
AND NOT EXISTS (
SELECT 1 FROM DD
WHERE DD.aircraft_id = my_id
AND DD.question_id = D.question_id
AND DD.answer_id != D.answer_id )

Apologies for it being rather basic, my cup of tea is finished so the grey cells are settling down for the night. We could look at a more Data Warehouse approach, but I don't think that's what you want, it's more disk space, more periodic processing for what may be little benefit.

BlueRevolution · Aug 24, 2009

Sorry for the thread necromancy, but I'm finally getting back into this project and am going to have another go at the tough SQL questions.

BertyBoy said:
I see a (basic) solution over 4 tables. Simplifying the names to keep SQL to a minimum and to avoid confusion. From your other posts I'm presuming that you want to limit the answers available to the questions (the attributes the site users can select values for), at least for some qttributes (ie. engines, wheels, etc, but it would be tiresome for others, eg. max seats, max range, etc).
Table A - Aircraft
Table B - Attribute Names or Questions, what you've called "Attributes", ie. "Number of Engines" is Question 1.
Table C - Answers, ie. for question 1 there will be records for "1", "2", "3", "4", "6" and "8". Keys will be the question number and the id of the answer record. id doesn't have to be unique across the whole table, just within the question, but it makes no difference if it is unique across the table.
Table D - Data, what you've called "Description". This is the actual data on the aircraft, contains an aircraft id, and the question id and answer id from Table C.

I'd considered adding what you're calling Table C before, but decided against it. As I mentioned on the project's forum:

Lord Lucan said:
You need another table to hold multiple alternative answers, not multiple fields within a single table. Otherwise you have the same problem you were trying to avoid when adding attributes--when a question pops up that has 8 possible answers you have to modify your database.

BlueRevolution said:
I'd considered that, but I don't see the point. It adds an extra layer of joins, and for what? If a question has more than 7 possible responses, it needs to be rephrased and divided into multiple questions lest it be too confusing for the reader. I kind of like the artificial constraint in this case.

Numerical questions like maximum range and maximum seating capacity aren't terribly useful because the process is designed to identify aircraft based on observations. If the technical information is available to the user, they likely already know the model number.

BertyBoy said:
And it's not going to easily tell you which aircraft it clashes with - not helped by allowing aircraft to miss attributes.

Unfortunately, this is exactly what I need. It's simple enough to run queries to find aircraft matching (or not matching) certain criteria. However, I'm envisioning a sort of admin panel which outlines what parts of the database need attention. One list would show the top 10 aircraft with unanswered questions in descending order, since when a new question is added it's not feasible to provide answers for every aircraft in the database. Another would show the reverse, the top questions sorted by the number of missing aircraft, again descending. Both of those are easy enough to write.

However, by far the most important is a listing of the cases where a particular set of responses will lead to a large list of aircraft. Having two or three possibilities to read through is not a problem, but twenty or thirty is unacceptable and indicates that new questions need to be added to distinguish between them. Unfortunately, while users will encounter this situation on a regular basis, I can't figure out how to make the information readily available to administrators. Without this information, I fear the project is doomed to failure.

BertyBoy · Sep 9, 2009

No problem, you can drop Table C without changing the SQL too much. So instead of an "answer_id" on Table D, it'll be the actual value in a field called "answer" or similar.
Table C was just there to help provide a multiple choice for answering the questions in an online search, and for limiting the values that can be entered. But for non-integer numerical and alphanumeric answers it would provide some sort of data integrity, ie. a "First Built" question can now have any number of answers keyed representing the same date - "May 1970", "may 1970", "1970", "05/1970", etc. Or "Engine Manufacturer" could have many different spellings for "Rolls-Royce", which wouldn't come up as a match in your search for non-unique attributes. If you can live with that, it's not essential.

I can't formulate a single query to give you a list of names of aircraft and all the names of the aircraft that it is indistinguishable from based on all of it's attributes.
But I can just give you a list of names of aircraft that are indistinguishable from one or more other aircraft. That is the first SQL provided before.

Then for each of these aircraft in turn you can run a query (the second query above) to produce a list of all aircraft that it is indistinguishable from based on all its attributes.
For a HTML/PHP front end on a MySQL database, this shouldn't be an issue, you will be able to retrieve a list of Aircraft Ids with the query, and for each retrieve details of Aircraft Name and number of attributes set. If your web page has the space, you could even query Table B and Table D, listing the unanswered questions for that aircraft.

Again, no single query is going to give you everything, but a series of queries after the first query above will give you everything.

BlueRevolution · Sep 13, 2009

Okay, then I'm lost. The query you gave me seems to find the attributes which have the same values for every aircraft. If the value is different for even one aircraft, it will not be returned. For example, for the following data set:

aircraft | attribute | value
1 | 1 | 0
1 | 2 | 0
1 | 3 | 0
2 | 1 | 0
2 | 2 | 0
2 | 3 | 0
3 | 1 | 0
3 | 2 | 1
3 | 3 | 1

the query returns

aircraft | attribute | value
1 | 1 | 0
2 | 1 | 0
3 | 1 | 0

where I'd like it to identify that aircraft 1 and 2 share all of the same values while aircraft 3 is distinct. Even in a brute-force solution I'm not sure how the information above would be helpful.

Thanks for your patience.

BertyBoy · Sep 14, 2009

Understand you now.

So I'll write it out exactly, rather than just hinting at changes to be made. Still tables A, B and D.

So we have to identify aircraft where any other aircraft has no different attribute values. Our first attempt will be:

SELECT A.aircraft_id FROM A
WHERE NOT EXISTS (
SELECT 1 FROM D, DD
WHERE D.aircraft_id = A.aircraft_id
AND DD.aircraft_id != D.aircraft_id
AND DD.question_id = D.question_id
AND DD.answer_id != D.answer_id )

And with that result array, we can then investigate individual (indistinguishable) aircraft one at a time.

Thinking ahead to when you want to "score" aircraft depending on how few of their attributes are set, or how many aircraft they match against (if any), the queries may not be so simple. And may take many iterations to get to produce the results you want.

You may have to think quite hard about which criteria you want to use. ie. An aircraft with 5 engines may be unique in a database of thousands of aircraft, and this may be the only attribute set, but if your users don't key that as one of the search criteria it doesn't help. From the earlier posts I do think you're planning to tackle this, but non-uniqueness may be your priority.

savar · Sep 14, 2009

I agree with BertyBoy. You definitely want that C table to hold answers to the attributes. As a professional developer, I can tell you that the solution without C is very amateur. I realize that you probably are an amateur (as you describe yourself), but I'm telling you, you will eventually regret it, and it will take more work to fix it then than it will now.

But honestly, what you're trying to do is a very difficult problem. It's going to be nearly impossible to solve in a relational database, and even if you solve it with a custom-written program, it is still quite complicated.

Think of it this way. You're looking for the following situation: Aircraft A and B have identical answers 1, 2, 3, and 4. (Notice that if you structure the schema with table C, as I recommended above, you don't even have to care what the questions are. All you care about are the answers.) For the sake of this exercise, let's assume there are many other aircraft (C through Z) and many other answers (5 through 1000), but none of those are related to the aircraft A-B or answers 1-4.

A is joined to 1, 2, 3 and 4 by a foreign key, and B is joined to 1, 2, 3, and 4. What you are asking for is, "show me all aircraft that have identical answers 1, 2, 3 and 4." And if you run this query, you'll get the right answer.

But then your next question would be, "do the same for 2, 3, 4, 5". And then you'll ask "do the same for 3, 4, 5, 6". Essentially, the problem here is that the number of joins (permutations of answers) you need to do is n! (n factorial), where n is the number of answers that you have. For a handful of attributes, this is possible. But once you have even 50 answers to all of the questions, 50! is 3 x 10^64, which is a number so huge that even if you had a computer fast enough to check a million permutations per second, it would take you longer than the history of the universe to check all 50! permutations.

So in short, you're biting off an insanely hard problem and you're using the wrong tools. Now then, if you want to keep trying, I can point you in the right direction. Your problem is actually a type of clustering problem (http://en.wikipedia.org/wiki/Cluster_analysis) and would probably be solvable by a "nearest neighbors" algorithm: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm.

Search

Search

MySQL question: disambiguation

BlueRevolution

macrumors 603

trule

macrumors 6502

BertyBoy

macrumors 6502

BlueRevolution

macrumors 603

BertyBoy

macrumors 6502

BlueRevolution

macrumors 603

BertyBoy

macrumors 6502

savar

macrumors 68000

Our Staff