Interesting questions: Can you delete 20 Million Records in SQL?
Mastering Massive Data Deletions: Safely Handling the Deletion of 20 Million Records in SQL
Introduction
Hi folks. This week I got an interesting question in an interview I attended. After a short introduction, I was asked some basic backend questions. Then the interviewer asked how I would handle deleting the user activity log from the user_logs table, which has 20 million rows in a MySQL database.
In this article, we will explore ways to delete a huge amount of data from a production database.
Do you know the best way to delete the records? Comment your answers.
Intuition - 1
At first glance, it may sound easy: just execute a simple DELETE query with a WHERE clause.
DELETE FROM user_logs WHERE 1;
While the table holds little data, this works fine without any issues. The problem gets interesting when the number of rows to delete is 20 million. The statement then runs as one long transaction: it takes a long time to complete, holds its locks for the whole duration, and generates a large amount of undo and transaction log data, which can stall other work on the table.
Intuition - 2
Now you might think: fine, why not delete in chunks, using LIMIT or a stored procedure?
DELETE FROM `table`
WHERE (whatever criteria)
ORDER BY `id`
LIMIT 1000;
The stored procedure may seem complex, but the basic idea behind it is simple: chunk the data and delete one chunk at a time, repeating the statement until it affects no more rows.
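The chunking idea can also be driven from application code. The following is a minimal sketch, not a definitive implementation: it uses Python's built-in sqlite3 module as a stand-in for a MySQL connection so that it runs anywhere, and the function name delete_in_batches, the table layout, and the batch size are assumptions made for illustration. Many SQLite builds lack DELETE ... LIMIT, so the sketch selects each chunk with a subquery on id; against MySQL the loop would issue the DELETE ... ORDER BY id LIMIT statement shown above instead. Each chunk is committed on its own, so no single transaction holds more than one batch of row locks.

```python
import sqlite3


def delete_in_batches(conn, batch_size=1000):
    """Delete every row of user_logs in chunks; return the total rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM user_logs WHERE id IN "
            "(SELECT id FROM user_logs ORDER BY id LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # commit per chunk so locks are released between batches
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
    return total


# Demo on an in-memory database with a hypothetical table layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_logs (id INTEGER PRIMARY KEY, action TEXT)")
conn.executemany(
    "INSERT INTO user_logs (action) VALUES (?)",
    [("login",)] * 2500,
)
conn.commit()

deleted = delete_in_batches(conn, batch_size=1000)
remaining = conn.execute("SELECT COUNT(*) FROM user_logs").fetchone()[0]
print(deleted, remaining)
```

In a production setting, a short pause between batches is a common addition so that other queries get a chance to run.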
Intuition - 3
Now you may wonder whether we can simply truncate the table. As per the interviewer's question, we want to delete every row in the user_logs table, with no constraints or conditions.
TRUNCATE TABLE user_logs;
Finally, I thought I had a solution. Then the interviewer asked another question: what is the difference between TRUNCATE and DELETE?
What is a DELETE Command?
DELETE is a DML (Data Manipulation Language) command that removes particular records from a table. You can use a WHERE clause with the DELETE command to remove specific records.
What is a TRUNCATE Command?
The TRUNCATE statement is a DDL (Data Definition Language) command that removes all data from a table without removing the table structure. You cannot use a WHERE clause with this command, so you cannot filter the records.
The Key Difference Between Delete and Truncate
DELETE | TRUNCATE |
DELETE removes particular records from a table and accepts a WHERE clause. | TRUNCATE removes all data from the table and does not accept a WHERE clause. |
DELETE acquires a lock on every row it deletes, so it needs more locks and resources. | TRUNCATE needs fewer locks and resources because it works on the table or its data pages as a whole rather than row by row. |
DELETE is slow because it works row by row and logs each operation. | TRUNCATE is fast: SQL Server only logs the deallocation of data pages, and MySQL drops and recreates the table. |
Every deleted row is recorded in the transaction log, so the operation can be rolled back. | Individual rows are not recorded; in MySQL, TRUNCATE also causes an implicit commit and cannot be rolled back. |
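The DELETE side of this comparison can be tried out directly. The sketch below is an illustration under assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table layout and the action column are hypothetical. It shows that DELETE accepts a WHERE filter and, being DML, can be undone before the transaction commits. SQLite has no TRUNCATE statement, so that side is not shown here; in MySQL, TRUNCATE TABLE causes an implicit commit and cannot be rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_logs (id INTEGER PRIMARY KEY, action TEXT)")
conn.executemany(
    "INSERT INTO user_logs (action) VALUES (?)",
    [("login",), ("logout",), ("login",)],
)
conn.commit()

# DELETE can target specific rows with a WHERE clause.
cur = conn.execute("DELETE FROM user_logs WHERE action = ?", ("login",))
deleted_rows = cur.rowcount
count_inside_txn = conn.execute("SELECT COUNT(*) FROM user_logs").fetchone()[0]

# The delete belongs to an open transaction, so it can still be undone.
conn.rollback()
count_after_rollback = conn.execute("SELECT COUNT(*) FROM user_logs").fetchone()[0]

print(deleted_rows, count_inside_txn, count_after_rollback)
```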
What would be the right approach?
There is no single right way in software engineering. All the intuitions we discussed above are key to understanding a database and its internal operations. Let's dissect the solutions to understand the delete operation.
Safely Handling the Deletion of 20 Million Records in SQL
The SQL DELETE statement is a data manipulation (DML) command used to remove rows from a table. When we execute a DELETE statement, the database engine performs many activities, and several factors affect its locking and performance behavior:
Number of indexes in the table
Poorly indexed foreign keys
Lock escalation mode
Isolation levels
Temporal tables
The solution depends on the scope of the project. When only a small share of the rows has to survive, the simplest way to delete millions of records from an SQL database is to follow these steps:
Create a temporary table containing only the rows you want to keep (for example, only the unique rows).
Truncate the original table.
Insert the rows back into the original table from the temporary table.
Make sure the data is correct.
Then drop the temporary table.
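Example:
-- Replace the WHERE condition with one that selects only the rows to keep.
CREATE TABLE user_logs_temp AS SELECT * FROM user_logs WHERE 1;
TRUNCATE TABLE user_logs;
INSERT INTO user_logs SELECT * FROM user_logs_temp;
-- After verifying the data, drop the temporary table.
DROP TABLE user_logs_temp;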
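The same steps can be walked through end to end with a real filter. This is a minimal sketch under assumptions, not the definitive procedure: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the action column and the rule "keep only the error rows" are hypothetical, and because SQLite has no TRUNCATE statement, an unqualified DELETE FROM user_logs plays that role here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_logs (id INTEGER PRIMARY KEY, action TEXT)")
conn.executemany(
    "INSERT INTO user_logs (action) VALUES (?)",
    [("error",) if i % 100 == 0 else ("login",) for i in range(1000)],
)
conn.commit()

# 1. Copy only the rows to keep (hypothetical rule: keep 'error' rows).
conn.execute(
    "CREATE TABLE user_logs_temp AS "
    "SELECT * FROM user_logs WHERE action = 'error'"
)
kept = conn.execute("SELECT COUNT(*) FROM user_logs_temp").fetchone()[0]

# 2. Empty the original table (TRUNCATE TABLE user_logs in MySQL).
conn.execute("DELETE FROM user_logs")

# 3. Copy the kept rows back into the original table.
conn.execute("INSERT INTO user_logs SELECT * FROM user_logs_temp")

# 4. Verify the data before discarding the copy.
restored = conn.execute("SELECT COUNT(*) FROM user_logs").fetchone()[0]
assert restored == kept

# 5. Drop the temporary table.
conn.execute("DROP TABLE user_logs_temp")
conn.commit()

print(kept, restored)
```

Note that the original table is empty between steps 2 and 3, so this approach suits a maintenance window better than a table that is being written to continuously.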
You may wonder why a SELECT from the user_logs table doesn't create a performance issue. To understand this, you should be aware of locks in databases. Check out this Medium article about locks to understand them.
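Any delete operation takes locks at the row level. In SQL Server, once a single statement holds about 5,000 locks on a table, the engine attempts lock escalation and may lock the entire table in exclusive mode. MySQL's InnoDB engine does not escalate row locks to a table lock, but a large delete still holds every row lock until the transaction commits.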
In conclusion
In conclusion, I hope you learned something new. Learn new things daily to become a good software engineer.
Share and comment your thoughts for more content like this.