Friday, March 9, 2012

Faster Remove Duplicate SQL

I have a table containing over 100,000 email addresses. This email table gets duplicates in it, and our customers don't want a second (or third or fourth) copy of our news letter. To prevent this, we run the following SQL to kill the duplicates:

Code Snippet

DELETE FROM _email WHERE _email.eid IN
(
SELECT tbl1.eid FROM _email AS tbl1 WHERE Exists
(
SELECT emailaddress, Count(eid) FROM _email WHERE _email.emailaddress = tbl1.emailaddress GROUP BY _email.emailaddress HAVING Count(_email.eid) > 1
)
)
AND _email.eid NOT IN
(
SELECT Min(eid) FROM _email AS tbl1 WHERE Exists
(
SELECT emailaddress, Count(eid) FROM _email WHERE _email.emailaddress = tbl1.emailaddress GROUP BY _email.emailaddress HAVING Count(_email.eid) > 1
)
GROUP BY emailaddress
);


This query takes about 2hrs to run which is really hurting our server preformance. Is there any way to do this faster?

I am running SQL Server 2000

Thanks in advance

Create a unique non clustered index and turn on IGNORE_DUP_KEY after you run your code to delete all the existing duplicates. Then SQL Server will not insert duplicates to the column, try the link below for index options.

http://msdn2.microsoft.com/en-us/library/ms186869.aspx

|||

You can try the following queries,

Code Snippet

delete from _email

from _email A

join (select min(eid) eid,email from _email group by email) B

on A.email = B.email

Where

b.eid <> a.eid

Code Snippet

delete from _email

from _email E

Where exists

(

select 1 from _email A

join (select min(eid) eid,email from _email group by email) B

on A.email = B.email

Where

b.eid <> a.eid

and E.eid=A.eid

)

|||

If you are using SS 2005, then try:

;with cte

as

(

select *, row_number() over(partition by emailaddress order by eid ASC) as rn

from _email

)

delete cte

where rn > 1;

AMB

No comments:

Post a Comment