Wednesday, March 7, 2012

Fast way to get values from previous record

Hello,

I know that I've seen this question asked on here before, but I can't
find an answer that gives me the performance that I need.

I have a table that stores events for users:

CREATE TABLE Lead_Action_History (
lead_action_seq INT IDENTITY NOT NULL,
lead_action_date DATETIME NOT NULL,
lead_seq INT NULL,
operator_id VARCHAR(20) NOT NULL,
call_time INT NOT NULL,
CONSTRAINT PK_Lead_Action_History PRIMARY KEY (lead_action_seq) )
GO

The table has a foreign key to another table through the lead_seq
column:

CREATE TABLE Lead_Master (
lead_seq INT IDENTITY NOT NULL,
state CHAR(2) NOT NULL,
CONSTRAINT PK_Lead_Master PRIMARY KEY (lead_seq) )
GO

I need to write a query that will give me a sum of call_time broken
down by a column that is in the table joined through the lead_seq.
However, if the lead_seq for a row is NULL then I need to use the
lead_seq for the previous row (based on lead_action_date) for the same
operator.

This is what I came up with:

SELECT LM.state, SUM(call_time)
FROM Lead_Action_History LAH
INNER JOIN Lead_Master LM ON (LM.lead_seq = LAH.lead_seq)
OR (LAH.lead_seq IS NULL
AND LM.lead_seq = (SELECT TOP 1
LAH2.lead_seq
FROM
Lead_Action_History LAH2
WHERE
LAH2.operator_id = LAH.operator_id
AND LAH2.lead_seq
IS NOT NULL
ORDER BY
LAH2.lead_action_date DESC))
GROUP BY LM.state

The problem is that Lead_Action_History has millions of records and
any solution that I've found involves one or more subqueries on it
which kills performance. I am going to look at using a covering index
with the solution above, but I thought that someone here might have
another way of doing this.

I can't really change the structure, but I can play with the indexing.
I would still be curious though how other people model this type of
temporal data in a way that makes it easy to work with.

Thanks!
-Tom.tom_hummel@.hotmail.com (Thomas R. Hummel) wrote in message news:<a2c0eeb8.0401140902.723b743e@.posting.google.com>...
> Hello,
> I know that I've seen this question asked on here before, but I can't
> find an answer that gives me the performance that I need.

<snip
Could you post some sample data (INSERT statements) and the expected
result set? It's not totally clear (at least to me) what you require
when the Lead_Seq is NULL - your code suggests you want to count the
previous call_time twice?

Also, if you're having performance issues, you may want to post the
execution plan for your current query.

Simon|||Hi Tom,

Try this <disclaimer> completely untested </disclaimer>...

SELECT STATE
, SUM(CALL_SUM)
FROM (SELECT LM.state
, SUM(LAH.call_time) AS CALL_SUM
FROM Lead_Master LM
INNER JOIN
(SELECT ,L1.lead_action_seq
,L1.lead_action_date
,L2.lead_seq
,L1.operator_id
,L1.call_time
FROM Lead_Action_History L1
INNER JOIN
(SELECT operator_id
, MAX(lead_seq) as max_opr_ls
FROM lead_action_history
WHERE lead_seq IS NOT NULL
GROUP BY
operator_id) L2
ON L1.operator_id = L2.operator_id
AND L1.lead_seq IS NULL) LAH
ON LM.lead_seq = LAH.lead_seq
GROUP BY LM.state
UNION ALL
SELECT LM.state
, SUM(call_time) AS CALL_SUM
FROM Lead_Action_History LAH
INNER JOIN
Lead_Master LM
ON (LM.lead_seq = LAH.lead_seq)
GROUP BY LM.state) AS T
GROUP BY STATE;

Christian.|||>> I can't really change the structure, but I can play with the
indexing. I would still be curious though how other people model this
type of temporal data in a way that makes it easy to work with. <<

Then you're probably dead, because the problem is in the temporal
model. The idea of a time is that it is a contimuum and not a set of
discrete points. This goes back to Zeno, then to Einstein. You can
get some good SQL temporal programming tricks in "Developing
Time-Oriented Database Applications in SQL" by Rick Snodgrass.

The principle is that you need to model events as durations. So we
might re-write your table like this:

>I can't really change the structure, but I can play with the
indexing. I would still be curious though how other people model this
type of temporal data in a way that makes it easy to work with. <<

Then you're probably dead, because the problem is in the temporal
model. The idea of a time is that it is a contimuum and not a set of
discrete points. This goes back to Zeno, then to Einstein. You can get
some good SQL temporal programming tricks in "Developing Time-Oriented
Database Applications in SQL" by Rick Snodgrass.

The principle is that you need to model events as durations. So we
might re-write your table like this:

tom_hummel@.hotmail.com (Thomas R. Hummel) wrote in message news:<a2c0eeb8.0401140902.723b743e@.posting.google.com>...
Hello,

I know that I've seen this question asked on here before, but I can't
find an answer that gives me the performance that I need.

I have a table that stores events for users:

CREATE TABLE Lead_Action_History
(operator_id VARCHAR(20) NOT NULL
REFERENCES Personnel (operator_id)
ON UPDATE CASCADE,
action_start_date DATETIME NOT NULL,
action_finish_date DATETIME, null means ongoing
action_code CHAR(5) NOT NULL, -- needs constraints
CHECK (action_start_date <= action_finish_date)
state CHAR(2) DEFAULT '??' NOT NULL
CHECK (state IN ('Al', 'AK', )),
PRIMARY KEY (operator_id, action_start_date));

By definition, IDENTITY can never be a key. Your data integrity is
probably a mess because of that design flaw. Why does this table not
have an action code of some kind? The length of a call can be
computed from its start and finish times, so that column is redundant
and subject to errors imprecision.

The join to the LeadMAster table seems redundant since it is just a
list of two letter codes. Why not have a '??' token for the unknown
and avoid NULLs?

>> I need to write a query that will give me a sum of call_time broken
down by state <<

That is now very easy:

SELECT state, SUM(DATEDIFF(SS, action_start_date,
COALESCE(action_finish_date, CURRENT_TIMESTAMP))
AS total_time
FROM Lead_Action_History
GROUP BY state;

If you do most of the reports by dates, then use PRIMARY KEY
(action_start_date, operator_id) so the hidden index will be built in
that order. You can also use UNIQUE (action_start_date,
action_finish_date, operator_id) to get a covering index.|||And after proof reading...

SELECT STATE
, SUM(CALL_SUM)
FROM (SELECT LM.state
, SUM(LAH.call_time) AS CALL_SUM
FROM Lead_Master LM
INNER JOIN
(SELECT L1.lead_action_seq
,L1.lead_action_date
,max_opr_ls AS lead_seq
,L1.operator_id
,L1.call_time
FROM Lead_Action_History L1
INNER JOIN
(SELECT operator_id
, MAX(lead_seq) as max_opr_ls
FROM lead_action_history
WHERE lead_seq IS NOT NULL
GROUP BY
operator_id) L2
ON L1.operator_id = L2.operator_id
AND L1.lead_seq IS NULL) LAH
ON LM.lead_seq = LAH.lead_seq
GROUP BY LM.state
UNION ALL
SELECT LM.state
, SUM(call_time) AS CALL_SUM
FROM Lead_Action_History LAH
INNER JOIN
Lead_Master LM
ON (LM.lead_seq = LAH.lead_seq)
GROUP BY LM.state) AS T
GROUP BY STATE;

Christian.

No comments:

Post a Comment