Query Parsing Pipeline: Soft vs Hard Parse
The Parsing Pipeline
Consider this query: SELECT name, email FROM users WHERE status = 'active' AND created_at > '2024-01-01'. Lexical analysis breaks this into tokens: keywords (SELECT, FROM, WHERE), identifiers (name, email, users, status, created_at), operators (=, >, AND), and literals ('active', '2024-01-01'). Syntax analysis builds an Abstract Syntax Tree (AST), a hierarchical structure where the root is SELECT, with child nodes for columns (name, email), table (users), and filter conditions. Semantic analysis validates that the users table exists, that name, email, status, and created_at are valid columns, and that the date comparison is type compatible.
Soft Parse vs Hard Parse
The critical performance distinction is between soft parse and hard parse. When you execute SELECT * FROM orders WHERE user_id = 12345, the database normalizes the query text, computes a hash, and checks the plan cache. A soft parse finds a cached plan and returns in 0.1 to 0.5 milliseconds. A hard parse requires full parsing and optimization, taking 5 to 50 milliseconds for complex queries. In OLTP (Online Transaction Processing) systems handling 50,000 queries per second, hard parsing 5% of queries wastes significant CPU.
Parameterized Queries
Instead of embedding literals directly like SELECT * FROM orders WHERE user_id = 12345, use parameters: SELECT * FROM orders WHERE user_id = ? or SELECT * FROM orders WHERE user_id = $1. All queries with the same structure share one cached plan. A system executing 10,000 variations of the same query parses once and reuses 9,999 times, reducing parse overhead by 99.99%. This also prevents SQL injection: malicious input like '; DROP TABLE users; -- becomes a literal string value rather than executable SQL.