SQL Query Optimization Framework Pt.2
This article is the second in a multi-part series focusing on optimizing SQL queries for structure, performance, and readability.
Performance Optimization
Proper Indexing
Running queries on large tables without proper indexing leads to slow performance. Always check if indexes on JOIN and WHERE conditions are needed.
Below is a scenario that demonstrates proper indexing in SQL. The goal is to efficiently retrieve customer orders using indexes to speed up the query.
Scenario:
You have two tables in a relational database:
customers (id, name, email)
orders (id, customer_id, order_date, total_amount)
You frequently run a query to fetch customers who placed orders within the last 30 days.
Step 1: Create Tables with Proper Indexing
The customers.id and orders.id columns are already indexed as primary keys.
The orders.customer_id should be indexed for efficient JOIN operations.
The orders.order_date should be indexed for fast date-based queries.
Step 2: Assign Indexes
Now lets review the optimized query that can be run using the newly created indexes.
Why Is This Efficient?
orders.customer_id being indexed makes the JOIN operation faster.
orders.order_date being indexed enables quick filtering without scanning the whole table.
ORDER BY o.order_date DESC works efficiently by using the existing index to sort the results.
2. Proper SELECT Usage
Good practice in SQL is to avoid SELECT * and explicitly specify only the necessary columns. This improves performance, reduces memory usage, and makes the query more readable.
Scenario:
You have two tables in a relational database:
customers (id, name, email)
orders (id, customer_id, order_date, total_amount)
You need to retrieve the customer name, email, order ID, and total amount for all orders placed in the last 30 days.
Example of Bad Query Using SELECT *:
Why Is This Inefficient?
Using SELECT * Retrieves unnecessary columns (e.g., customers.created_at, orders.order_date) wasting memory.
Upstream changes to the source tables, like adding new columns, can cause unintended side effects to downstream processes.
Example of Proper Query:
Why Is This Efficient?
The query is faster due to only retrieving necessary columns reducing memory usage and data transferred.
Showing the necessary columns in the query improves readability.
When upstream schemas change they are not automatically pushed to downstream tables which improves control over what gets retrieved.
3. Proper WHERE and HAVING Usage
In a SQL statement WHERE and having are both used to filter results of a query based on specific conditions. These statements are used differently in that WHERE filters a query before aggregation and HAVING filters aggregated results after a GROUP BY is applied.
Scenario:
You have a table called orders with the following structure:
You want to find customers who have spent more than $500 in total, only considering orders placed in the last 30 days.
Example of Improper Use of HAVING:
Example of Proper Use of HAVING:
Why is this correct?
WHERE filters out irrelevant orders (only recent ones) before aggregation.
HAVING ensures only customers with total spending > $500 appear in the results.
Filtering early (WHERE) reduces the number of rows processed in aggregation.
4. Correct Usage of JOINs
1. INNER JOIN
Returns only the matching rows from both tables. If there's no match, the row is excluded.
Example:
2. LEFT JOIN (or LEFT OUTER JOIN)
Returns all rows from the left table and only matching rows from the right table. If no match is found, NULLs are returned for the right table.
Example:
3. RIGHT JOIN (or RIGHT OUTER JOIN)
Returns all rows from the right table and only matching rows from the left table. If no match is found, NULLs are returned for the left table.
Example:
4. FULL JOIN (or FULL OUTER JOIN)
Returns all rows when there is a match in either table. If there's no match, NULLs are returned for the missing values.
Example:
5. CROSS JOIN
Returns the Cartesian product (every possible combination) of rows from both tables. It does not require a join condition.
Example:
6. SELF JOIN
A table is joined with itself, usually using aliases to differentiate instances.
Example:
Which Join Should You Use?
Use INNER JOIN when you only want matching records.
Use LEFT JOIN when you need all records from the left table.
Use RIGHT JOIN when you need all records from the right table.
Use FULL JOIN when you need all records from both tables.
Use CROSS JOIN only when every combination of rows is required.
Use SELF JOIN for hierarchical relationships.
5. Understanding and optimizing subqueries vs. CTEs (Common Table Expressions)
Subquery vs. CTE in SQL
Both Subqueries and Common Table Expressions (CTEs) allow you to break down complex queries by structuring intermediate results, but they function differently.
1. Subquery (Nested Query)
A subquery is a query inside another query. It can be used in SELECT, FROM, or WHERE clauses.
Example 1: Subquery in SELECT (Scalar Subquery)
Returns a single value per row.
Example 2: Subquery in WHERE Clause
Filters results based on another query.
Example 3: Subquery in FROM Clause (Derived Table)
Used as a temporary table.
2. Common Table Expression (CTE)
A CTE is a named temporary result set that exists only for the duration of the query. It is defined using the WITH keyword and can be referenced multiple times in the main query.
Example 1: Basic CTE
Example 2: Recursive CTE
Used for hierarchical or tree-structured data.
Subquery vs. CTE: When to Use What?
Which One Should You Use?
Use a Subquery for simple cases where you don’t need to reuse the result.
Use a CTE when:
The query is complex and needs better readability.
You need to reference the result multiple times.
You're working with recursive queries (e.g., hierarchical data).