Performance: Clustering Data

In my previous post I spoke about using TABLE_CACHED_BLOCKS to make the CLUSTERING_FACTOR for indexes more realistic by having Oracle assume there was more than 1 block from the table in the buffer cache, helping the optimizer select indexes more appropriately.

But a way to seriously improve performance is to keep the data in the table ordered by how you intend to access it. Before I start, I need to say that doing this automatically is somewhat limited in it’s scope, and generally better suited to Data Warehouse/Lake-type operations than OLTP systems.

I have used a simple table to demonstrate the advantage of keeping data sorted, consisting of a primary key, some data, and an unindexed column to force the query to always read the table (i.e. so it’s not a covering index)

You may notice a “new” clause on the CREATE TABLE command:
CLUSTERING BY LINEAR ORDER (data) WITHOUT MATERIALIZED ZONEMAP;

This tells Oracle to store data in this table ORDERED BY the DATA column (rather than randomly – but approximately by the ID Primary Key column).

However, this table constraint is ONLY honoured for DIRECT PATH INSERTS. It is ignored for “normal” inserts so, given I have populated the table using normal inserts it will have made NO DIFFERENCE! Lets see:

			
DROP TABLE cost_tab_lo PURGE;
DROP SEQUENCE cost_tab_lo_seq;
CREATE SEQUENCE cost_tab_lo_seq;
CREATE TABLE cost_tab_lo (id number NOT NULL, data number NOT NULL, force_tab_lookup_col VARCHAR2(20) NOT NULL)
       CLUSTERING BY LINEAR ORDER (data) WITHOUT MATERIALIZED ZONEMAP;
ALTER  TABLE cost_tab_lo ADD CONSTRAINT cost_tab_lo_pk PRIMARY KEY (ID);
CREATE INDEX cost_tab_lo_IX_data ON cost_tab_lo (data);
CREATE OR REPLACE PROCEDURE cost_tab_lo_data as
 begin
   for i in 1..1000000 loop
    insert into cost_tab_lo values (cost_tab_lo_seq.nextval, mod(i,10000), 'need_tab_access');
    commit;
   end loop;
 end;
/
I then, as in the previous post, run EXEC cost_tab_lo_data in 3 sessions simultaneously

		

			
> info COST_TAB_LO
TABLE: COST_TAB_LO
         LAST ANALYZED:2026-04-23 10:42:34.0
         ROWS         :3000000
         SAMPLE SIZE  :3000000
         INMEMORY     :DISABLED
         COMMENTS     :
Columns
NAME                   DATA TYPE           NULL  DEFAULT    COMMENTS
*ID                    NUMBER              No
 DATA                  NUMBER              No
 FORCE_TAB_LOOKUP_COL  VARCHAR2(20 BYTE)   No
Indexes
INDEX_NAME                  UNIQUENESS    STATUS    FUNCIDX_STATUS    COLUMNS
___________________________ _____________ _________ _________________ __________
NEIL.COST_TAB_LO_PK         UNIQUE        VALID     ~                 ID
NEIL.COST_TAB_LO_IX_DATA    NONUNIQUE     VALID     ~                 DATA

		

If we look at the DATA column and which block it’s in, and it’s CLUSTERING_FACTOR, it’s not an efficient index

			
SELECT id
       ,data
       ,DBMS_ROWID.ROWID_BLOCK_NUMBER    (rowid) blk
  FROM cost_tab_lo
 WHERE data = 1
 ORDER BY id
 FETCH FIRST 20 ROWS ONLY;
        ID       DATA        BLK
---------- ---------- ----------
         1          1       2455
      4165          1      25010
     12882          1      25503
     17684          1      25530
     28772          1      25380
     43341          1     191734
     47524          1     191693
     59239          1     191654
     74165          1     193507
     76088          1     193461
     88874          1     193443
    105552          1     194225
    105780          1     194219
    119101          1     194213
    133620          1     195962
    138328          1     195909
    148491          1     195850
    163820          1     196307
    167800          1     196281
    179250          1     196257

		

			
> SELECT table_name,num_rows,blocks,avg_row_len,floor(8192/avg_row_len) rows_per_block
    FROM user_tab_statistics
   WHERE table_name = 'COST_TAB_LO';
TABLE_NAME        NUM_ROWS     BLOCKS AVG_ROW_LEN ROWS_PER_BLOCK
--------------- ---------- ---------- ----------- --------------
COST_TAB_LO        3000000      13157          26            315
> SELECT table_name,index_name,num_rows,leaf_blocks,clustering_factor
    FROM user_ind_statistics
   WHERE table_name = 'COST_TAB_LO'
   ORDER BY INDEX_NAME;
TABLE_NAME      INDEX_NAME             NUM_ROWS CLUSTERING_FACTOR
--------------- -------------------- ---------- -----------------
COST_TAB_LO     COST_TAB_LO_IX_DATA     2946777           2942233
COST_TAB_LO     COST_TAB_LO_PK          2917994           2234534

		

For index COST_TAB_LO_IX_DATA, the CLUSTERING_FACTOR (2942233) is almost identical to the NUM_ROWS (2946777) – very inefficient to the optimizer.

			
> set autotrace traceonly
> SELECT id,data,force_tab_lookup_col 
    FROM cost_tab_lo 
  WHERE data BETWEEN 1 AND 20;
6000 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3091393023
---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |  6301 |   159K|  3581   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| COST_TAB_LO |  6301 |   159K|  3581   (1)| 00:00:01 |
---------------------------------------------------------------------------------

		

And we can see that a query retrieving only 6,000 of the 3,000,000 rows (0.2%) is using an FULL TABLE SCAN as, given the layout of the data on disk, it’s the most efficient method (remember, full table scans and index fast full scans use MULTIBLOCK reads, which reads 1MB from disk at a time and it much more efficient at reading large quantities of data than single block reads – see system statistics for the ratio used by the optimizer)

So, what if I MOVE the table to re-org it and re-gather statistics?
This is a DIRECT PATH operation, and we could see very different results…

			
> ALTER TABLE cost_tab_lo MOVE ONLINE;
Table altered.
  
> BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER
                                 ,TABNAME=>'COST_TAB_LO'
                                 ,ESTIMATE_PERCENT=>DBMS_STATS.AUTO_SAMPLE_SIZE
                                 ,METHOD_OPT=>'FOR ALL COLUMNS SIZE 1');
  END;
  /
    
PL/SQL procedure successfully completed.
NOTE: the method_opt is just being used to ensure no histograms and keep everything a little cleaner 

		

Oh my, it worked and the table has been reorganized by the DATA column

			
SELECT id
      ,data
      ,DBMS_ROWID.ROWID_BLOCK_NUMBER    (rowid) blk
  FROM cost_tab_lo
 WHERE data = 1
 ORDER BY id
 FETCH FIRST 20 ROWS ONLY;
        ID       DATA        BLK
---------- ---------- ----------
         1          1     243484
      4165          1     243484
     12882          1     243484
     17684          1     243484
     28772          1     243484
     43341          1     243484
     47524          1     243484
     59239          1     243484
     74165          1     243484
     76088          1     243484
     88874          1     243484
    105552          1     243484
    105780          1     243484
    119101          1     243484
    133620          1     243484
    138328          1     243484
    148491          1     243484
    163820          1     243484
    167800          1     243484
    179250          1     243484

		

			
> SELECT table_name,index_name,num_rows,leaf_blocks,clustering_factor
    FROM user_ind_statistics
   WHERE table_name = 'COST_TAB_LO'
   ORDER BY INDEX_NAME;
TABLE_NAME      INDEX_NAME             NUM_ROWS CLUSTERING_FACTOR
--------------- -------------------- ---------- -----------------
COST_TAB_LO     COST_TAB_LO_IX_DATA     3121912             18498
COST_TAB_LO     COST_TAB_LO_PK          2757883           2680218

		

The index COST_TAB_LO_IX_DATA now has a clustering factor of 18498 – much closer to the number of blocks in the table than the number of rows. It’s now a much more efficient index.

And what if we run the query against the DATA column again

			
> set autotrace traceonly
>
> SELECT id,data,force_tab_lookup_col 
    FROM cost_tab_lo 
   WHERE data BETWEEN 1 AND 20;
6000 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2093323550
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                     |  6301 |   159K|    55   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| COST_TAB_LO         |  6301 |   159K|    55   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | COST_TAB_LO_IX_DATA |  6301 |       |    16   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------

		

We are not using the INDEX, and the query cost had reduced from 3581 to 55.

Quite brilliant, wildly more efficient.

However, the efficiency of every other index on the table will have changed. You can only order by a few columns in a compound way. The data will primarily only be in the order of 1 column, and sorted within that for subsequent columns specified. In this small test case, if the Primary Key index is only used in equality predicates, this won’t matter. In more normal scenario’s, you are probably querying by several columns with several indexes. Changing the order of 1 column to make query against that column efficient may well make indexes on other columns inefficient.

Also, in this scenario it would probably have made sense to have specified ordering by 2 columns, so we could also get some clustering of ID

			
CREATE TABLE cost_tab_lo2 (id number NOT NULL, data number NOT NULL, force_tab_lookup_col VARCHAR2(20) NOT NULL)
       CLUSTERING BY LINEAR ORDER (data,id) WITHOUT MATERIALIZED ZONEMAP;

But, it usually isn’t overly practical to keep re-orging the table/partition. You may have an occasional window where it’s possible. The ordering will slowly degrade as new rows are inserted conventionally until the next re-org but the benefits may outweigh the resource usage to perform the re-org, and work for your application.

In Data Warehouse-type applications there are frequently staging operations with CREATE TABLE AS SELECT (CTAS), which are DIRECT PATH. There’s direct path SQL*Loader DIRECT=TRUE operations, Inserts using /*+ APPEND */ hints, and more.

The real value in this type of data clustering constraint is related to performing direct path operations, and also when you have a lot of partitions to play with. You can re-org older partitions independently, so keeping the majority of data in the order you want.

Keeping data ordered “on disk” can have huge performance benefits.

But as with practically every option you implement, there’s no free lunch…

There is a price to pay in the DIRECT PATH operations. There is a SORT step during the re-org. Sorting is expensive, making the re-org slower. This could be a very heavy operation for a large table/partition. If I explain a DIRECT_PATH insert, we can see line 3 is a SORT ORDER BY and takes a lot of resources to achieve.

			
> truncate table cost_tab_lo;
Table truncated.
> insert /*+ APPEND */ into cost_tab_lo SELECT * from cost_tab_lo_sav;
3000000 rows created.
Execution Plan
----------------------------------------------------------
Plan hash value: 3009853467
------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name            | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |                 |  3000K|    74M|       | 25660   (1)| 00:00:02 |
|   1 |  LOAD AS SELECT                  | COST_TAB_LO     |       |       |       |            |          |
|   2 |   OPTIMIZER STATISTICS GATHERING |                 |  3000K|    74M|       | 25660   (1)| 00:00:02 |
|   3 |    SORT ORDER BY                 |                 |  3000K|    74M|   103M| 25660   (1)| 00:00:02 |
|   4 |     TABLE ACCESS FULL            | COST_TAB_LO_SAV |  3000K|    74M|       |  3505   (1)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

		

Clustering Data by a particular column may be practical for you, and may provide amazing benefits but it is a rare implementation – few companies use it (partly because it’s not a very well known facility!) but those who can will get major performance rewards.