PDA

View Full Version : SQL SELECT question




MarkCollette
Jan 30, 2006, 01:40 PM
Hello, I'm using PostgreSQL, and have a table like this:


TABLE PartOrders
partNumber : integer
orderStatus : integer
orderTime : timestamp


So, there can be many rows with the same partNumber, because over time there will be reorders for that product. And every row will not necessarily have a unique orderTime, because orders can be inputted in batches. Although, those same times will probably correspond to different parts.

I'm trying to come up with an SQL query that will give me a result set with one row for each partNumber, where that row has the latest orderTime for the part. So, the latest order for each part.

I tried GROUP BY partNumber, but it complains that I haven't used an aggregate on orderStatus. And all the examples I find are for returning a single column, but I need to return every column in the table. See, there are more columns than I mentionned, but I've simplified the example.

Right now I'm getting the distinct partNumbers in one query, and looping over that, doing separate queries per partNumber, and then just taking the row with the latest orderTime. This is a total hack!

Can someone please help me? Or at least point me to some complex examples that might help? I don't do much SQL so any help would be appreciated.



mwpeters8182
Jan 30, 2006, 01:50 PM
Hello, I'm using PostgreSQL, and have a table like this:


TABLE PartOrders
partNumber : integer
orderStatus : integer
orderTime : timestamp


So, there can be many rows with the same partNumber, because over time there will be reorders for that product. And every row will not necessarily have a unique orderTime, because orders can be inputted in batches. Although, those same times will probably correspond to different parts.

I'm trying to come up with an SQL query that will give me a result set with one row for each partNumber, where that row has the latest orderTime for the part. So, the latest order for each part.

I tried GROUP BY partNumber, but it complains that I haven't used an aggregate on orderStatus. And all the examples I find are for returning a single column, but I need to return every column in the table. See, there are more columns than I mentionned, but I've simplified the example.

Right now I'm getting the distinct partNumbers in one query, and looping over that, doing separate queries per partNumber, and then just taking the row with the latest orderTime. This is a total hack!

Can someone please help me? Or at least point me to some complex examples that might help? I don't do much SQL so any help would be appreciated.

Try the MAX() function:

SELECT partNumber, orderStatus, MAX(orderTime) FROM PartOrders ORDER BY partNumber

That might work, but it's worth a shot.

whocares
Jan 30, 2006, 01:53 PM
Try the MAX() function:

SELECT partNumber, orderStatus, MAX(orderTime) FROM PartOrders ORDER BY partNumber

That might work, but it's worth a shot.

That should list the whole table, with the newest order time in each row. I think you need to GROUP BY partNumber, so:

SELECT partNumber, MAX(orderTime) FROM PartOrders GROUP BY partNumber

This probably won't work, but it's also worth a try (if you haven't already) ;)
Also, I would be a good idea to add a column to the table calles something like "id" that auto-increments with each new row added. This will make every entry unique. :)


[Edit:] Just tried that query on a similarly structured MySQL table and it worked.

MarkCollette
Jan 30, 2006, 02:39 PM
That should list the whole table, with the newest order time in each row. I think you need to GROUP BY partNumber, so:

SELECT partNumber, MAX(orderTime) FROM PartOrders GROUP BY partNumber

This probably won't work, but it's also worth a try (if you haven't already) ;)
Also, I would be a good idea to add a column to the table calles something like "id" that auto-increments with each new row added. This will make every entry unique. :)


[Edit:] Just tried that query on a similarly structured MySQL table and it worked.

But did you try getting orderStatus as well, particularly from a dataset where orderStatus has the same value a lot of the time? This is what makes PostgreSQL complain that I need to use an aggregate function (things like MAX) on orderStatus.

- Mark

whocares
Jan 30, 2006, 03:14 PM
But did you try getting orderStatus as well, particularly from a dataset where orderStatus has the same value a lot of the time? This is what makes PostgreSQL complain that I need to use an aggregate function (things like MAX) on orderStatus.

- Mark

Sorry, I missed that point, try:

SELECT partNumber, MAX(orderTime), DISTINCT(orderStatus) FROM PartOrders GROUP BY partNumber

I don't use PostgreSQL, but I guess it works the same as MySQL...

firestarter
Jan 30, 2006, 05:19 PM
Sorry, I missed that point, try:

SELECT partNumber, MAX(orderTime), DISTINCT(orderStatus) FROM PartOrders GROUP BY partNumber

I don't use PostgreSQL, but I guess it works the same as MySQL...

Nope, that's not what you want. Distinct isn't an aggregate either.

firestarter
Jan 30, 2006, 05:35 PM
But did you try getting orderStatus as well, particularly from a dataset where orderStatus has the same value a lot of the time? This is what makes PostgreSQL complain that I need to use an aggregate function (things like MAX) on orderStatus.

- Mark

You can't do that in a single select.

I've never used postgres - but I do Sybase for a living.

Firtly, you're on to a looser if you have no unique id on your table. You really need this. On Sybase you can call a column an identity, and have the database assign it unique successive values as you post to the table. Not sure what the same is in Postgres.

Let's assume then that you've modified your table, and now it's:
TABLE PartOrders
orderNumber: integer
partNumber : integer
orderStatus : integer
orderTime : timestamp

Whocares is right with this solution:
SELECT partNumber, MAX(orderTime) FROM PartOrders GROUP BY partNumber

And answers your original question - but now you want the order status too. The problem with this, is that the 'group by' and aggregate functions kind of decouple the orderTime row from the partNumber. It's possible to show the max orderStatus and max orderTime for a given partNumber, but in a single select it's not possible to show the exact orderStatus related to a given max orderTime.

With the modified table (given that orderTime is useless at selecting unique rows) we can get the orderNumber for each partNumber with maximum orderTime. (Actually max orderNumber for a given partNumber should correspond to partNumber with max orderTime, if orderNumber is assigned consecutively).

select max(orderNumber) from PartOrders group by partNumber

This should return a list of orderNumbers, the max for each partNumber. This can be then used as a subquery to drive the main query, and give you the partNumber, orderStatus and orderTime that you want:

select partNumber, orderStatus, orderTime
from PartOrders
where orderNumber in
(select max(orderNumber) from PartOrders group by partNumber)

I hope that:
1/ You follow
2/ Postgres can do this
3/ I didn't make any stupid goof :D

Not at work at the moment, so I'm afraid I haven't been able to check this.

MarkCollette
Jan 30, 2006, 05:59 PM
You can't do that in a single select.

I've never used postgres - but I do Sybase for a living.

<snip>


Thanks a lot. I'll try this tonight. In PostgreSQL there's an implied "oid" column that's globally unique, not just to that table. I'll try what you described with s/orderNumber/oid

Unfortunately the version of PostgreSQL I use does not do sequences easily, so I tend to use oid, even though it makes the code less portable.

MarkCollette
Jan 30, 2006, 06:08 PM
Hmm, firestarter, I was thinking that maybe if the system is down, and old orders have to be manually inputted, that the sequencing of orderTime and orderNumber (or oid) might not be the same.

How about:

SELECT * FROM PartOrders WHERE orderNumber IN
( SELECT orderNumber, MAX(orderTime) FROM PartOrders GROUP BY partNumber );


This might sound dumb, but in the sub-select, would the orderNumber come from the same row as the maximum orderTime?

Thanks again.

firestarter
Jan 30, 2006, 06:41 PM
Hmm, firestarter, I was thinking that maybe if the system is down, and old orders have to be manually inputted, that the sequencing of orderTime and orderNumber (or oid) might not be the same.


A valid concern.

Would it be more appropriate just to key on date (no time) and order entry number - with orders fulfilled in strict entry order?

Who provides the timestamp, the computer or the order entry clerk?

If orders get entered in a random order after the system was offline, and the computer provides timestamp - then really you're hosed both ways.

There's a fine line between paranoid engineering (good) and overengineering (bad).


How about:

SELECT * FROM PartOrders WHERE orderNumber IN
( SELECT orderNumber, MAX(orderTime) FROM PartOrders GROUP BY partNumber );


This might sound dumb, but in the sub-select, would the orderNumber come from the same row as the maximum orderTime?

Thanks again.

You'd need a group by on the orderNumber too (everything in the select of a statement using group by must be in the group, or an aggregate function - ignore this and you're in danger of getting big cartesian products).

Another problem here is that the nested select should just produce one column of values - 'IN' won't work with the two used here.

I think to do what you want to do in this more complex scenario it gets more complex still. Break it down, use stored procedures and temporary tables to store the interim result if necessary (and cleaner).

MarkCollette
Jan 30, 2006, 08:06 PM
A valid concern.

Would it be more appropriate just to key on date (no time) and order entry number - with orders fulfilled in strict entry order?

Who provides the timestamp, the computer or the order entry clerk?

If orders get entered in a random order after the system was offline, and the computer provides timestamp - then really you're hosed both ways.

There's a fine line between paranoid engineering (good) and overengineering (bad).


The system provides the time, so yeah, that's probably overengineering.



SELECT * FROM PartOrders WHERE orderNumber IN
( SELECT orderNumber, MAX(orderTime) FROM PartOrders GROUP BY partNumber );


You'd need a group by on the orderNumber too (everything in the select of a statement using group by must be in the group, or an aggregate function - ignore this and you're in danger of getting big cartesian products).

Another problem here is that the nested select should just produce one column of values - 'IN' won't work with the two used here.

I think to do what you want to do in this more complex scenario it gets more complex still. Break it down, use stored procedures and temporary tables to store the interim result if necessary (and cleaner).

Thanks for setting me straight on this. I think now that I've accepted insertion order at my definition of latest, your SELECT suggestion a few posts back is quite a good fit. I lost sight that the timestamp was for system arrival time, and not some arbitrary timing.

Thanks. You've inspired me to go to the bookstore and try to get a better SQL book for my intro level.

ChrisA
Jan 30, 2006, 08:53 PM
You will have to join the table with a subset of itself.

SELECT *
FROM partOrders,
(
SELECT partNumber as PN ,
max(orderTime) f
FROM PartOrders
GROUP BY (partNumber)
} as latest

WHERE partOrders.partNumber = latest.PN;

So you have two tables (1) partNumber and (2) a subset of partNumber that contains only the latest orders.

I don't know your application but I worry you may have a bug. Why do you want the lastest orders? Unless you are working inside a transaction the rows returned may not remain the "latest" as the table could be updated by another user while your querry is processing. If you are simply driving a screen display of the "hotest parts" then is OK but if you use this information to modify the database your in trouble




Hello, I'm using PostgreSQL, and have a table like this:


TABLE PartOrders
partNumber : integer
orderStatus : integer
orderTime : timestamp


So, there can be many rows with the same partNumber, because over time there will be reorders for that product. And every row will not necessarily have a unique orderTime, because orders can be inputted in batches. Although, those same times will probably correspond to different parts.

I'm trying to come up with an SQL query that will give me a result set with one row for each partNumber, where that row has the latest orderTime for the part. So, the latest order for each part.

I tried GROUP BY partNumber, but it complains that I haven't used an aggregate on orderStatus. And all the examples I find are for returning a single column, but I need to return every column in the table. See, there are more columns than I mentionned, but I've simplified the example.

Right now I'm getting the distinct partNumbers in one query, and looping over that, doing separate queries per partNumber, and then just taking the row with the latest orderTime. This is a total hack!

Can someone please help me? Or at least point me to some complex examples that might help? I don't do much SQL so any help would be appreciated.

zimv20
Jan 31, 2006, 01:08 AM
i've never been the world's greatest DB designer or SQL programmer, but by the time we're doing either two calls to a table or a join to itself, shouldn't we be looking at the design of table and asking ourselves if perhaps the table wasn't properly designed to do what we need it to do?

MarkCollette
Jan 31, 2006, 01:28 AM
i've never been the world's greatest DB designer or SQL programmer, but by the time we're doing either two calls to a table or a join to itself, shouldn't we be looking at the design of table and asking ourselves if perhaps the table wasn't properly designed to do what we need it to do?

I guess I could split the PartOrders table into HistoricalPartOrders and LatestPartOrders, and do INSERTs into HistoricalPartOrders and UPDATEs into LatestPartOrders. But that sounds like it's have a bunch of tradeoffs, and I'd have to do a query from both to get all the data out.

Anyways, I still have to try out everyone's suggestions (had to fix some other code first).

zimv20
Jan 31, 2006, 01:42 AM
all right, granted i'm not involved in the project and have no idea the full requirements of the system, but from the little i've seen and applying my rusty and shoddy DB design skills, here's where i'd start:

i'd have tables to track Parts and Orders, then another table to track which parts go on which order, like this (using very generic data types):

Parts
------
part no: id
description: string
weight: double
manufacturer: id
num in stock: int
most recently ordered: date

Orders
--------
order no: id
customer: id
status: id
when placed: date
when fulfilled: date

OrderedParts
---------------
order no: id
part no: id

in practice, there would be one row for each unique part, one row for each unique order, and for each order containing n parts, n rows in OrderedParts.

further, when the entry to Order is made (and the entries to OrderedParts are made), each part for that order has its 'most recently ordered' field updated in Parts. a little kludgy, but at least that query is straightforward.

thoughts?

firestarter
Jan 31, 2006, 08:58 AM
3/ I didn't make any stupid goof :D

Not at work at the moment, so I'm afraid I haven't been able to check this.

OK, I checked and my grouped subquery works great (in Sybase!)


1> select * from ProductOrders
2>
orderId orderStatus productCode
----------- ----------- -----------
1 0 100
2 0 100
3 1 100
4 1 100
5 1 200
6 0 200

(6 rows affected)
1> select max(orderId) from ProductOrders group by productCode
2>

-----------
4
6

(2 rows affected)
1> select * from ProductOrders where orderId in (select max(orderId) from ProductOrders group by productCode)
2>
orderId orderStatus productCode
----------- ----------- -----------
4 1 100
6 0 200

(2 rows affected)

savar
Jan 31, 2006, 09:36 AM
i've never been the world's greatest DB designer or SQL programmer, but by the time we're doing either two calls to a table or a join to itself, shouldn't we be looking at the design of table and asking ourselves if perhaps the table wasn't properly designed to do what we need it to do?

YES! This is exactly what I was about to say.

Sub-queries are horrible and should never, ever be used.*


*OF course, when you're just a low-level coder and the database design is already written in stone, then you have to resort to hacks to get things working.

firestarter
Jan 31, 2006, 10:35 AM
YES! This is exactly what I was about to say.

Sub-queries are horrible and should never, ever be used.*

*OF course, when you're just a low-level coder and the database design is already written in stone, then you have to resort to hacks to get things working.

BS!

If you're finding the latest of anything on a table, you'll be using groups and aggregates, and you will possibly have to use a subquery (or a temp table or suchlike) to make use of the results.

Subqueries give you a load of powerful techniques (and a good speedup with 'not exists' statements). They're essential in non-toy applications (and in this instance, still needed on the improved schema cited).

devman
Jan 31, 2006, 11:00 AM
BS!

If you're finding the latest of anything on a table, you'll be using groups and aggregates, and you will possibly have to use a subquery (or a temp table or suchlike) to make use of the results.

Subqueries give you a load of powerful techniques (and a good speedup with 'not exists' statements). They're essential in non-toy applications

You are correct. And so are the others that said the schema should be fixed. That's the correct place to start.

MarkCollette
Jan 31, 2006, 02:09 PM
You will have to join the table with a subset of itself.

SELECT *
FROM partOrders,
(
SELECT partNumber as PN ,
max(orderTime) f
FROM PartOrders
GROUP BY (partNumber)
} as latest

WHERE partOrders.partNumber = latest.PN;


Sorry, but that didn't give the right results.

MarkCollette
Jan 31, 2006, 02:15 PM
Parts
------
part no: id
description: string
weight: double
manufacturer: id
num in stock: int
most recently ordered: date

Orders
--------
order no: id
customer: id
status: id
when placed: date
when fulfilled: date

OrderedParts
---------------
order no: id
part no: id


I was reading that kind of thing in my SQL book, about reducing redundancy in a given table, by splitting things into separate tables. So, where I had PartOrders, you have Orders + OrderedParts (I already had Parts). One drawback is that it looks like the SQL queries would be harder to come up with, since every query would involve some kind of joining.

Thank you for the suggestion, but I think that's outside of my capabilities right now.

MarkCollette
Jan 31, 2006, 02:15 PM
OK, I checked and my grouped subquery works great (in Sybase!)

And it works great in PostgreSQL too :)

zimv20
Jan 31, 2006, 02:42 PM
Thank you for the suggestion, but I think that's outside of my capabilities right now.
joins aren't that difficult. most of the ones i write a pretty much boilerplate, so i only had to figure out the first one.

here's an example from one of my sites, where given the id of an album, i display the songs from that album. note that the songs table lists all existing songs, and because a given song can exist on more than one album, there's no column 'album' in the songs table:


function displaySongsPerAlbum($album_id, $conn)
{
// grab the songs for the given album

$row_query = 'SELECT a.song_name, a.version, a.song_id'
. ' FROM songs a, albums b, bands c, band_album_assign d,'
. ' album_song_assign e'
. ' WHERE b.album_id = '. $album_id . ' AND d.band_id = c.band_id AND'
. ' d.album_id = b.album_id AND e.album_id = d.album_id AND'
. ' e.song_id = a.song_id'
. ' ORDER BY e.play_order';

if (!($row_result = @ mysql_query($row_query, $conn)))
showerror();

// display each song

while ($row = mysql_fetch_array($row_result))
{
displaySong($row, $conn);
}
}

MarkCollette
Jan 31, 2006, 02:59 PM
Abbreviated quote:

joins aren't that difficult. most of the ones i write a pretty much boilerplate, so i only had to figure out the first one.


$row_query = 'SELECT a.song_name, a.version, a.song_id'
. ' FROM songs a, albums b, bands c, band_album_assign d,'
. ' album_song_assign e'
. ' WHERE b.album_id = '. $album_id . ' AND d.band_id = c.band_id AND'
. ' d.album_id = b.album_id AND e.album_id = d.album_id AND'
. ' e.song_id = a.song_id'
. ' ORDER BY e.play_order';


It looks like a consequence of your pulling the pairing out into a separate table, is that instead a 1-n correspondence, yours supports m-n. Like if multiple bands worked together on a single album, or a song in in multiple albums, like greatest hits collections, or best of the 80s collections.

In my case, it would allow me to make meta-orders that could include sub-orders for multiple parts.

Thanks, I'm going to save your SELECT example, in case that dynamism is needed.

zimv20
Jan 31, 2006, 05:42 PM
It looks like a consequence of your pulling the pairing out into a separate table, is that instead a 1-n correspondence, yours supports m-n. Like if multiple bands worked together on a single album, or a song in in multiple albums, like greatest hits collections, or best of the 80s collections.
yes, that's it exactly. as i said, i'm not really a DB designer, but that's the flexibility i put in the design. i've written some php code for it (bands member can submit songs, create albums, and put songs on that album, plus all the music pages (http://www.boxsetauthentic.com/music.php) are working), but i've come nowhere near writing all the code to take advantage of that flexibility.

maybe some day...

ChrisA
Jan 31, 2006, 06:03 PM
You need to read up on DBMS Design theory. Specifically
what they call "Database Normalization". Not understanding
the basics of DBMS theory is likely to result in painting yourself
into a corner. DBMS system have a way of starting out simple and then growing. Starting with a fully normalized design means that you can
grow the application without needing to start over.

Use Google to find "Database Normalization"
but one refference is
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html

Your suggestion amounts to a kind of "anti-normalization". About the last thing you'd want to do.

Do not be afraid of doing Joins. the Join operation is the whole reason
we use _relational_ databases, because they enable Joins.

One "Red Flag" in any DBMS design is when you find that your code needs to make one query, hold a result from it and use that result to make a second query where the result of the second querry is really what you want. A design such as that will break due to concurrentcy or if you try to use locks it will not scale and could evn rsult in deadlocks unless yu understand theory related to deadlock prevention and detection.

The second "Red Flag" is any design that stores the same information twice
Like spliting the table as suggested.

OK Put all the above into use:

You need to make several tables
(1) A Parts table. It will have one row for each part. The primary key is "part_number"
(2) An "Order Table" it has one row for each order. Primary ket is "order number"
(3) An Oder Detail table. This has two colums "order number" and "part number"

The above design is a "clasic" used in many text book examples

You can extend it to include a "customer" table and an "saleman" table
and so on.

A good "rule of thumb" in DBMS design, if you don't want to bother with formal normalization theory is to make a table row represend ONE (and one ONE) object or event. for example a prt or a transaction or a person. You table does not do this. It describbes a part AND when it was ordered. So it is not surprize that the only way to querry the info you anted was to join to subsets of the table together.
.

I guess I could split the PartOrders table into HistoricalPartOrders and LatestPartOrders, and do INSERTs into HistoricalPartOrders and UPDATEs into LatestPartOrders. But that sounds like it's have a bunch of tradeoffs, and I'd have to do a query from both to get all the data out.

Anyways, I still have to try out everyone's suggestions (had to fix some other code first).

devman
Feb 1, 2006, 04:56 AM
I was reading that kind of thing in my SQL book, about reducing redundancy in a given table, by splitting things into separate tables. So, where I had PartOrders, you have Orders + OrderedParts (I already had Parts). One drawback is that it looks like the SQL queries would be harder to come up with, since every query would involve some kind of joining.

Yes - that's precisely the point. Seriously, this is database 101. Go get a good book on relational database and start learning. In general - things are much more difficult if the data is not normalised.

edit: ah my oops. didn't see there was a page 2 and hence ChrisA above said this better than I.

ChrisA
Feb 2, 2006, 11:28 AM
Originally Posted by MarkCollette
I was reading that kind of thing in my SQL book, about reducing redundancy in a given table, by splitting things into separate tables. So, where I had PartOrders, you have Orders + OrderedParts (I already had Parts). One drawback is that it looks like the SQL queries would be harder to come up with, since every query would involve some kind of joining.
.

OK so it take a join. This is not an expensive operation. May time the DBMS can work with the index files and the result is very fast. The query optimizer is quite good.

The thing about data bases is that thier usfullness grows over time. As son as you decide to keep inventory into it then yu start thinging about putting a workstion in the wharehouse so "stuff can be loggin in and out and then the sale staff wants access and the execs want reports maed from it and then you decide to link it with a customer table and before long youhave 50 people depending on it every day.

Now you can see why you wil so nuts if you don't design it right from day one. Eventually you will be forced to normalized the data and going back and changing after numerous apps depend on it is very painfull.

Also if you do go and read some of the more advanced DBMS books, mostly university level textbooks, you can work though a serious of theroms and proofs (using some advanced algabra) where they prove that if the DBMS has certain charicteristics that it is possable to extract _any_ specifiable set of information. That's a very powerful thing: Amathmatical proof that you have not painted yourself into a corner.

Relational database system where for years a mathmatical construct studied by computer scientists and people though you could never actually build one that it would remain just something like a "turring machine" that is used to advance the mathmatics but never actually built. And then Stonebreaker (sp??) wrote Ingres. and 30 or so years later here we are.