When Using Bind Variables is not Enough: Dynamic IN Lists

Spread the love
  •  
  •  
  •  
  •  
  •  
  •  
  •  
  •  
  •  
  •  

In a previous blog post, I wrote about why you should (almost) always default to using bind variables. There are some exceptions, which I will cover in another follow-up post, but by default, bind variables are the right choice, both from a performance and from a security perspective.
In this article, I will show an example where regrettably, bind variables are not enough, and you can still run into significant performance issues in production. That’s when you create dynamic IN lists.
What’s the problem with dynamic IN lists
Again, this blog post is applicable for most databases, but I’ll use Oracle as an example, only.
An IN predicate that takes a list is a handy abbreviation for an equivalent OR predicate. The following two queries are semantically equivalent:

— IN predicate
SELECT *
FROM actor
WHERE actor_id IN (1, 2, 3)

— OR predicate
SELECT *
FROM actor
WHERE actor_id = 1
OR actor_id = 2
OR actor_id = 3

The execution plans are, respectively:
IN list

———————————————————
| Id | Operation | Name | Rows |
———————————————————
| 0 | SELECT STATEMENT | | 3 |
| 1 | INLIST ITERATOR | | |
| 2 | TABLE ACCESS BY INDEX ROWID| ACTOR | 3 |
|* 3 | INDEX UNIQUE SCAN | PK_ACTOR | 3 |
———————————————————

Predicate Information (identified by operation id):
—————————————————

3 – access(“ACTOR_ID”=1 OR “ACTOR_ID”=2 OR “ACTOR_ID”=3)

OR predicate

———————————————————
| Id | Operation | Name | Rows |
———————————————————
| 0 | SELECT STATEMENT | | 3 |
| 1 | INLIST ITERATOR | | |
| 2 | TABLE ACCESS BY INDEX ROWID| ACTOR | 3 |
|* 3 | INDEX UNIQUE SCAN | PK_ACTOR | 3 |
———————————————————

Predicate Information (identified by operation id):
—————————————————

3 – access(“ACTOR_ID”=1 OR “ACTOR_ID”=2 OR “ACTOR_ID”=3)

We get the exact same plan. And the plan is also optimal. When looking for 3 actors, we’ll run 3 primary key lookups and get a cardinality estimate of 3. Perfect, what could be wrong about this?
Remember execution plan caching
In that previous blog post, we’ve seen the detrimental effects we can get under load when execution plans are no longer cached in the execution plan cache (or in Oracle: cursor cache). But even when we’re using bind variables, we can run into these issues. Consider the following dynamic SQL queries:

SELECT count(*) FROM actor WHERE actor_id IN (?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?)
SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?)

Now, let’s generate these SQL queries using dynamic SQL in PL/SQL. PL/SQL is not a very dynamic SQL friendly language as can be seen in the following snippet:

SET SERVEROUTPUT ON
DECLARE
v_count NUMBER;

FUNCTION binds (n NUMBER) RETURN VARCHAR2 IS
v_result VARCHAR2(1000);
BEGIN
FOR i IN 1..n LOOP
IF v_result IS NULL THEN
v_result := ‘:b’ || i;
ELSE
v_result := v_result || ‘, :b’ || i;
END IF;
END LOOP;

RETURN v_result;
END binds;

FUNCTION vals (n NUMBER) RETURN VARCHAR2 IS
v_result VARCHAR2(1000);
BEGIN
FOR i IN 1..n LOOP
IF v_result IS NULL THEN
v_result := i;
ELSE
v_result := v_result || ‘, ‘ || i;
END IF;
END LOOP;

RETURN v_result;
END vals;
BEGIN
FOR i IN 1..5 LOOP
EXECUTE IMMEDIATE ‘
DECLARE
v_count NUMBER;
BEGIN
EXECUTE IMMEDIATE ”
SELECT count(*)
FROM actor
WHERE actor_id IN (‘ || binds(i) || ‘)

INTO v_count
USING ‘ || vals(i) || ‘;

:v_count := v_count;
END;

USING OUT v_count
;

dbms_output.put_line(v_count);
END LOOP;
END;
/

Running this snippet yields, as expected:

1
2
3
4
5

There are a few noteworthy things to say about the above:
The functions BINDS() and VALS() generate strings like :b1, :b2, :b3 and 1, 2, 3. We need these to generate dynamically a list of bind variables for the IN predicates, as well as a list of bind values for the USING clause of the nested EXECUTE IMMEDIATE statement
The EXECUTE IMMEDIATE .. USING clause does not take dynamically sized bind value lists, so we have to generate that as well, in a nested EXECUTE IMMEDIATE statement
An appropriate gif at this point when nesting EXECUTE IMMEDIATE like we did is this one:

Or also:

So, now that we got this covered…
… let’s look at execution plans:

SELECT s.sql_id, p.*
FROM v$sql s, TABLE (
dbms_xplan.display_cursor (
s.sql_id, s.child_number, ‘ALLSTATS LAST’
)
) p
WHERE lower(s.sql_text) LIKE ‘%actor_id in %’
AND lower(s.sql_text) NOT LIKE ‘%v$sql%’
AND lower(s.sql_text) NOT LIKE ‘%execute_immediate%’;

And the result is (some additional formatting applied):

SQL_ID cwm09k8zqp2at, child number 0
————————————-
SELECT count(*) FROM actor WHERE actor_id IN (:b1)

Plan hash value: 1971314150

————————————————
| Id | Operation | Name | E-Rows |
————————————————
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
|* 2 | INDEX UNIQUE SCAN| PK_ACTOR | 1 |
————————————————

Predicate Information (identified by operation id):
—————————————————

2 – access(“ACTOR_ID”=:B1)

SQL_ID 1wypurx1x0mrp, child number 0
————————————-
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2)

Plan hash value: 577114894

————————————————-
| Id | Operation | Name | E-Rows |
————————————————-
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | INLIST ITERATOR | | |
|* 3 | INDEX UNIQUE SCAN| PK_ACTOR | 2 |
————————————————-

Predicate Information (identified by operation id):
—————————————————

3 – access((“ACTOR_ID”=:B1 OR “ACTOR_ID”=:B2))

SQL_ID 2y06nwgpxqttn, child number 0
————————————-
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3)

Plan hash value: 577114894

————————————————-
| Id | Operation | Name | E-Rows |
————————————————-
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | INLIST ITERATOR | | |
|* 3 | INDEX UNIQUE SCAN| PK_ACTOR | 3 |
————————————————-

Predicate Information (identified by operation id):
—————————————————

3 – access((“ACTOR_ID”=:B1 OR “ACTOR_ID”=:B2 OR “ACTOR_ID”=:B3))

SQL_ID d4nn8qf9n22yt, child number 0
————————————-
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4)

Plan hash value: 577114894

————————————————-
| Id | Operation | Name | E-Rows |
————————————————-
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | INLIST ITERATOR | | |
|* 3 | INDEX UNIQUE SCAN| PK_ACTOR | 4 |
————————————————-

Predicate Information (identified by operation id):
—————————————————

3 – access((“ACTOR_ID”=:B1 OR “ACTOR_ID”=:B2 OR “ACTOR_ID”=:B3 OR
“ACTOR_ID”=:B4))

SQL_ID 4n9b4zgxr5cwj, child number 0
————————————-
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4, :b5)

Plan hash value: 577114894

————————————————-
| Id | Operation | Name | E-Rows |
————————————————-
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | INLIST ITERATOR | | |
|* 3 | INDEX UNIQUE SCAN| PK_ACTOR | 5 |
————————————————-

Predicate Information (identified by operation id):
—————————————————

3 – access((“ACTOR_ID”=:B1 OR “ACTOR_ID”=:B2 OR “ACTOR_ID”=:B3 OR
“ACTOR_ID”=:B4 OR “ACTOR_ID”=:B5))

So, again, we’re down to having 5 different SQL_IDs and statements in the execution plan cache / cursor cache, even if they all derive from a single logical statement. Now, in PL/SQL, this is so hard to do, you’re probably not doing it at all.
But if you’re using a query builder (or wrote your own), like JPA’s Criteria Query, or jOOQ, then you will be writing something like this (using jOOQ, because in fact, using the Criteria Query API, dynamic SQL is about as hard to do as in PL/SQL):

for (int i = 1; i = N and M(X – 1) < N The latter approach would, instead of generating the original queries: SELECT count(*) FROM actor WHERE actor_id IN (?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?) … generate these ones: SELECT count(*) FROM actor WHERE actor_id IN (?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?, ?, ?, ?) Where the last bind variable is always repeated (padded) until the list is filled up to a length that is a power of 2 (and in Oracle, of course, we’d split lists longer than 1000 items into several lists). This padding is what jOOQ already offers:https://www.jooq.org/doc/latest/manual/sql-building/dsl-context/custom-settings/settings-in-list-padding/ And maybe, a future Hibernate will offer it too, after I’ve suggested the feature to Vlad Mihalcea from the Hibernate team:https://hibernate.atlassian.net/browse/HHH-12469 Here’s the benchmark logic. It’s a bit more complex than previously: SET SERVEROUTPUT ON ALTER SYSTEM FLUSH SHARED_POOL; ALTER SYSTEM FLUSH BUFFER_CACHE; CREATE TABLE results ( run NUMBER(2), stmt NUMBER(2), elapsed NUMBER ); DECLARE v_ts TIMESTAMP WITH TIME ZONE; v_repeat CONSTANT NUMBER := 20000; v_first_name actor.first_name%TYPE; v_last_name actor.last_name%TYPE; FUNCTION binds (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS v_result VARCHAR2(32767); v_n_pad NUMBER(10) := power(2, ceil(ln(n)/ln(2))); BEGIN FOR i IN 1..n LOOP IF v_result IS NULL THEN v_result := ':b' || i; ELSIF mod(i, 1000) = 0 THEN v_result := v_result || ') OR actor_id IN (:b' || i; ELSE v_result := v_result || ', :b' || i; END IF; END LOOP; IF pad THEN FOR i IN n + 1 .. v_n_pad LOOP IF mod(i, 1000) = 0 THEN v_result := v_result || ') OR actor_id IN (:b' || i; ELSE v_result := v_result || ', :b' || i; END IF; END LOOP; END IF; RETURN v_result; END binds; FUNCTION vals (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS v_result VARCHAR2(32767); v_n_pad NUMBER(10) := power(2, ceil(ln(n)/ln(2))); BEGIN FOR i IN 1..n LOOP IF v_result IS NULL THEN v_result := i; ELSE v_result := v_result || ', ' || i; END IF; END LOOP; IF pad THEN FOR i IN n + 1 .. v_n_pad LOOP v_result := v_result || ', ' || n; END LOOP; END IF; RETURN v_result; END vals; PROCEDURE run(i NUMBER, pad BOOLEAN) IS BEGIN EXECUTE IMMEDIATE ' DECLARE v_count NUMBER; BEGIN EXECUTE IMMEDIATE '' SELECT count(*) FROM actor WHERE actor_id IN (' || binds(i, pad) || ') '' INTO v_count USING ' || vals(i, pad) || '; END; '; END run; BEGIN -- Repeat the whole benchmark several times to avoid warmup penalty FOR r IN 1..5 LOOP v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP run(mod(i, 100) + 1, FALSE); END LOOP; INSERT INTO results VALUES (r, 1, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP run(mod(i, 100) + 1, TRUE); END LOOP; INSERT INTO results VALUES (r, 2, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); END LOOP; FOR rec IN ( SELECT run, stmt, CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio FROM results ) LOOP dbms_output.put_line('Run ' || rec.run || ', Statement ' || rec.stmt || ' : ' || rec.ratio); END LOOP; END; / DROP TABLE results; Bear in mind the following: We run the benchmark 5 times In each run, we run 20000 queries with each technique Those 20000 queries have between 1 and 100 elements in the IN list The results show that both approaches fare equally well: Run 1, Statement 1 : 1.16944 Run 1, Statement 2 : 1.11204 Run 2, Statement 1 : 1.06086 Run 2, Statement 2 : 1.03591 Run 3, Statement 1 : 1.03589 Run 3, Statement 2 : 1.03605 Run 4, Statement 1 : 1.33935 Run 4, Statement 2 : 1.2822 Run 5, Statement 1 : 1 Run 5, Statement 2 : 1.04648 This means that there is no significant overhead caused by the padding in this benchmark. That’s good news. Running the following query shows the number of executions per statement: SELECT executions, regexp_count(sql_text, ':'), sql_text FROM v$sql WHERE lower(sql_text) LIKE '%actor_id in %' AND lower(sql_text) NOT LIKE '%v$sql%' AND lower(sql_text) NOT LIKE '%execute_immediate%' ORDER BY sql_text; The result being: EXECS BINDS SQL_TEXT 2000 1 SELECT count(*) FROM actor WHERE actor_id IN (:b1) 2000 2 SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2) 1000 3 SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3) 3000 4 ... 1000 5 ... 1000 6 ... 1000 7 ... 5000 8 ... 1000 9 ... 1000 10 ... 1000 11 ... 1000 12 ... 1000 13 ... 1000 14 ... 1000 15 ... 9000 16 ... ... 1000 100 ... 36000 128 ... The version that creates exactly sized IN lists is always executed 1000 times. The version that has padded IN lists sized to the next power of 2 (M = 2 ^ X) is executed 1000 * X times (plus another 1000 times because there’s also an IN list of that exact size) At the end, there are some “lone” 128-element sized IN lists. This is expected. The number of different SQL statements is: 100 for the exactly sized IN lists 7 for the padded IN lists (128 = 2 ^ 7) So, thus far, we haven’t gained anything, but also not lost much. What if the lists get longer The real problem isn’t the individual query execution time, which is fine in both cases, but the effect that dynamic IN lists have on a system. Let’s generate IN lists of up to size 4096, and bearing in mind that Oracle won’t allow this, the lists will be of the form: SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, ..., :b1000) OR actor_id IN (:b1001, :b1002, ..., :b2000) OR actor_id IN (:b2001, ...) Now, with a maximum size of 4096, there are quite a few distinct statements that need to be parsed. As opposed to only 12 when we use the padded in lists. The padded size grows logarithmically, so we’ll never really run into this same problem again, at least not for a single IN list. How does the comparison fare in another benchmark? We must be careful not to overdo things. Running 2x 20000 queries with variable sized IN lists can easily run for more than an hour on Oracle. Perhaps, it’s Oracle’s way of punishing you for doing something you shouldn’t: Having super-long IN lists. I’ve played around with different sets of parameters to the benchmark and the clear message here is that having long bind variable lists creates a bottleneck per se (i.e. the binding of the variables takes time), so the difference in parse times isn’t as drastic as it was before, for individual query executions. The overall improvement is this: Run 1, Statement 1 : 1.42696 Run 1, Statement 2 : 1 So, we get around a 1.5x increase of performance when repeating the bind values. Yet still, the fact that we have far less cached execution plans does matter, especially when we start exhausting the cursor cache, because with this approach, at least this exhaustion is not going to happen. Alternative options The above shows that if you really cannot get around an IN list, a simple, viable workaround is to pad the lists. But often, you can get around them. In databases like Oracle and PostgreSQL, that offer array types, you can consider using those once your lists start growing. I’ve blogged about this in a previous blog post:https://blog.jooq.org/2017/03/30/sql-in-predicate-with-in-list-or-with-array-which-is-faster If your lists become huge, a bulk/batch inserting the IDs into a temporary table might be a far better option, and then semi-join that one instead: SELECT count(*) FROM actor WHERE actor_id IN ( SELECT id FROM temp_table ) And ideally, if your list is really not dynamic, but originates from some other table, then please, stop passing around IDs and semi-join the original query that produced the IDs in the first place: SELECT count(*) FROM actor WHERE actor_id IN ( SELECT id FROM other_table WHERE ... ) That last option should always be your first choice as it will very likely outperform all the others.

X ITM Cloud News

Emily

Next Post

Truth First, or Why You Should Mostly Implement Database First Designs

Sun Nov 24 , 2019
Spread the love          In this much overdue article, I will explain why I think that in almost all cases, you should implement a “database first” design in your application’s data models, rather than a “Java first” design (or whatever your client language is), the latter approach leading to a long road […]
X- ITM

Cloud Computing – Consultancy – Development – Hosting – APIs – Legacy Systems

X-ITM Technology helps our customers across the entire enterprise technology stack with differentiated industry solutions. We modernize IT, optimize data architectures, and make everything secure, scalable and orchestrated across public, private and hybrid clouds.

This image has an empty alt attribute; its file name is x-itmdc.jpg

The enterprise technology stack includes ITO; Cloud and Security Services; Applications and Industry IP; Data, Analytics and Engineering Services; and Advisory.

Watch an animation of  X-ITM‘s Enterprise Technology Stack

We combine years of experience running mission-critical systems with the latest digital innovations to deliver better business outcomes and new levels of performance, competitiveness and experiences for our customers and their stakeholders.

X-ITM invests in three key drivers of growth: People, Customers and Operational Execution.

The company’s global scale, talent and innovation platforms serve 6,000 private and public-sector clients in 70 countries.

X-ITM’s extensive partner network helps drive collaboration and leverage technology independence. The company has established more than 200 industry-leading global Partner Network relationships, including 15 strategic partners: Amazon Web Services, AT&T, Dell Technologies, Google Cloud, HCL, HP, HPE, IBM, Micro Focus, Microsoft, Oracle, PwC, SAP, ServiceNow and VMware

.

X ITM