Wednesday, June 2, 2010

Deleting duplicate SQL rows

I know this is not really GWT related, but I ran into a problem with one of my database tables, where I somehow managed to get duplicate rows.

I don't know how it happened, but to prevent it happening again, I wanted to add a unique constraint on the 'username' column.

Before I could do that, however, I had to delete the duplicate entries.

I searched around using Google, and found a few solutions that seemed complicated, and not quite what I wanted. They seemed to involve copying data to another table, deleting dupes, deleting the original table rows, and then copying it back.

So, here is what I think is an easier way to do it, that is a bit more flexible:

  • Find the rows you want to keep
  • Delete the other rows.

Selecting the good rows

My table in question is 'usersettings'. It has 9 columns, but all I care about are the first two, 'id', and 'username'.

To find the rows I want to keep, I am going to select the lowest id per username. To do this, I used aggregation, with the group by clause, and the 'min' operator.


select min(id) as id from usersettings group by username;

This returns the lowest id for all unique usernames. These are the rows I want to keep.
This could easily be modified to take the highest id, using max.
I am sure there are lots of other ways to implement the criteria to select the ids that you want to keep.

We can dump this result set into a temporary table, or even just use it as a subquery expression.
To maximize database compatibility, I will just use a subquery.


Deleting the bad rows

Okay, first we will just 'select' the rows to be deleted, to make sure we are deleting the proper rows:
select * from usersettings where id not in (select min(id) as id from usersettings group by username);


Sure enough, this returns a result set containing the duplicate rows that have higher ids.

So, now that we know we are selecting the correct data, I can go ahead and delete it:
delete from usersettings where id not in (select min(id) as id from usersettings group by username);

Using a temp table

If you want to use a temporary table instead, with postgres you could do the following:

select min(id) as id into temp table goodids from usersettings group by username;
delete from usersettings where id not in (select * from goodids);


As always, you should really back up your database before deleting stuff!!!

Now that my usersettings.username only contains unique data, I am able to apply the unique constraint to it.

No comments: