In my previous post I spoke about using TABLE_CACHED_BLOCKS to make the CLUSTERING_FACTOR for indexes more realistic by having Oracle assume there was more than 1 block from the table in the buffer cache, helping the optimizer select indexes more appropriately.
But a way to seriously improve performance is to keep the data in the table ordered by how you intend to access it. Before I start, I need to say that doing this automatically is somewhat limited in it’s scope, and generally better suited to Data Warehouse/Lake-type operations than OLTP systems.
I have used a simple table to demonstrate the advantage of keeping data sorted, consisting of a primary key, some data, and an unindexed column to force the query to always read the table (i.e. so it’s not a covering index)
You may notice a “new” clause on the CREATE TABLE command:
CLUSTERING BY LINEAR ORDER (data) WITHOUT MATERIALIZED ZONEMAP;
This tells Oracle to store data in this table ORDERED BY the DATA column (rather than randomly – but approximately by the ID Primary Key column).
However, this table constraint is ONLY honoured for DIRECT PATH INSERTS. It is ignored for “normal” inserts so, given I have populated the table using normal inserts it will have made NO DIFFERENCE! Lets see:
DROP TABLE cost_tab_lo PURGE;DROP SEQUENCE cost_tab_lo_seq;CREATE SEQUENCE cost_tab_lo_seq;CREATE TABLE cost_tab_lo (id number NOT NULL, data number NOT NULL, force_tab_lookup_col VARCHAR2(20) NOT NULL) CLUSTERING BY LINEAR ORDER (data) WITHOUT MATERIALIZED ZONEMAP;ALTER TABLE cost_tab_lo ADD CONSTRAINT cost_tab_lo_pk PRIMARY KEY (ID);CREATE INDEX cost_tab_lo_IX_data ON cost_tab_lo (data);CREATE OR REPLACE PROCEDURE cost_tab_lo_data as begin for i in 1..1000000 loop insert into cost_tab_lo values (cost_tab_lo_seq.nextval, mod(i,10000), 'need_tab_access'); commit; end loop; end;/I then, as in the previous post, run EXEC cost_tab_lo_data in 3 sessions simultaneously
> info COST_TAB_LOTABLE: COST_TAB_LO LAST ANALYZED:2026-04-23 10:42:34.0 ROWS :3000000 SAMPLE SIZE :3000000 INMEMORY :DISABLED COMMENTS :ColumnsNAME DATA TYPE NULL DEFAULT COMMENTS*ID NUMBER No DATA NUMBER No FORCE_TAB_LOOKUP_COL VARCHAR2(20 BYTE) NoIndexesINDEX_NAME UNIQUENESS STATUS FUNCIDX_STATUS COLUMNS___________________________ _____________ _________ _________________ __________NEIL.COST_TAB_LO_PK UNIQUE VALID ~ IDNEIL.COST_TAB_LO_IX_DATA NONUNIQUE VALID ~ DATA
If we look at the DATA column and which block it’s in, and it’s CLUSTERING_FACTOR, it’s not an efficient index
SELECT id ,data ,DBMS_ROWID.ROWID_BLOCK_NUMBER (rowid) blk FROM cost_tab_lo WHERE data = 1 ORDER BY id FETCH FIRST 20 ROWS ONLY; ID DATA BLK---------- ---------- ---------- 1 1 2455 4165 1 25010 12882 1 25503 17684 1 25530 28772 1 25380 43341 1 191734 47524 1 191693 59239 1 191654 74165 1 193507 76088 1 193461 88874 1 193443 105552 1 194225 105780 1 194219 119101 1 194213 133620 1 195962 138328 1 195909 148491 1 195850 163820 1 196307 167800 1 196281 179250 1 196257
> SELECT table_name,num_rows,blocks,avg_row_len,floor(8192/avg_row_len) rows_per_block FROM user_tab_statistics WHERE table_name = 'COST_TAB_LO';TABLE_NAME NUM_ROWS BLOCKS AVG_ROW_LEN ROWS_PER_BLOCK--------------- ---------- ---------- ----------- --------------COST_TAB_LO 3000000 13157 26 315> SELECT table_name,index_name,num_rows,leaf_blocks,clustering_factor FROM user_ind_statistics WHERE table_name = 'COST_TAB_LO' ORDER BY INDEX_NAME;TABLE_NAME INDEX_NAME NUM_ROWS CLUSTERING_FACTOR--------------- -------------------- ---------- -----------------COST_TAB_LO COST_TAB_LO_IX_DATA 2946777 2942233COST_TAB_LO COST_TAB_LO_PK 2917994 2234534
For index COST_TAB_LO_IX_DATA, the CLUSTERING_FACTOR (2942233) is almost identical to the NUM_ROWS (2946777) – very inefficient to the optimizer.
> set autotrace traceonly> SELECT id,data,force_tab_lookup_col FROM cost_tab_lo WHERE data BETWEEN 1 AND 20;6000 rows selected.Execution Plan----------------------------------------------------------Plan hash value: 3091393023---------------------------------------------------------------------------------| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |---------------------------------------------------------------------------------| 0 | SELECT STATEMENT | | 6301 | 159K| 3581 (1)| 00:00:01 ||* 1 | TABLE ACCESS FULL| COST_TAB_LO | 6301 | 159K| 3581 (1)| 00:00:01 |---------------------------------------------------------------------------------
And we can see that a query retrieving only 6,000 of the 3,000,000 rows (0.2%) is using an FULL TABLE SCAN as, given the layout of the data on disk, it’s the most efficient method (remember, full table scans and index fast full scans use MULTIBLOCK reads, which reads 1MB from disk at a time and it much more efficient at reading large quantities of data than single block reads – see system statistics for the ratio used by the optimizer)
So, what if I MOVE the table to re-org it and re-gather statistics?
This is a DIRECT PATH operation, and we could see very different results…
> ALTER TABLE cost_tab_lo MOVE ONLINE;Table altered. > BEGIN DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER ,TABNAME=>'COST_TAB_LO' ,ESTIMATE_PERCENT=>DBMS_STATS.AUTO_SAMPLE_SIZE ,METHOD_OPT=>'FOR ALL COLUMNS SIZE 1'); END; / PL/SQL procedure successfully completed.NOTE: the method_opt is just being used to ensure no histograms and keep everything a little cleaner
Oh my, it worked and the table has been reorganized by the DATA column
SELECT id ,data ,DBMS_ROWID.ROWID_BLOCK_NUMBER (rowid) blk FROM cost_tab_lo WHERE data = 1 ORDER BY id FETCH FIRST 20 ROWS ONLY; ID DATA BLK---------- ---------- ---------- 1 1 243484 4165 1 243484 12882 1 243484 17684 1 243484 28772 1 243484 43341 1 243484 47524 1 243484 59239 1 243484 74165 1 243484 76088 1 243484 88874 1 243484 105552 1 243484 105780 1 243484 119101 1 243484 133620 1 243484 138328 1 243484 148491 1 243484 163820 1 243484 167800 1 243484 179250 1 243484
> SELECT table_name,index_name,num_rows,leaf_blocks,clustering_factor FROM user_ind_statistics WHERE table_name = 'COST_TAB_LO' ORDER BY INDEX_NAME;TABLE_NAME INDEX_NAME NUM_ROWS CLUSTERING_FACTOR--------------- -------------------- ---------- -----------------COST_TAB_LO COST_TAB_LO_IX_DATA 3121912 18498COST_TAB_LO COST_TAB_LO_PK 2757883 2680218
The index COST_TAB_LO_IX_DATA now has a clustering factor of 18498 – much closer to the number of blocks in the table than the number of rows. It’s now a much more efficient index.
And what if we run the query against the DATA column again
> set autotrace traceonly>> SELECT id,data,force_tab_lookup_col FROM cost_tab_lo WHERE data BETWEEN 1 AND 20;6000 rows selected.Execution Plan----------------------------------------------------------Plan hash value: 2093323550-----------------------------------------------------------------------------------------------------------| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |-----------------------------------------------------------------------------------------------------------| 0 | SELECT STATEMENT | | 6301 | 159K| 55 (0)| 00:00:01 || 1 | TABLE ACCESS BY INDEX ROWID BATCHED| COST_TAB_LO | 6301 | 159K| 55 (0)| 00:00:01 ||* 2 | INDEX RANGE SCAN | COST_TAB_LO_IX_DATA | 6301 | | 16 (0)| 00:00:01 |-----------------------------------------------------------------------------------------------------------
We are not using the INDEX, and the query cost had reduced from 3581 to 55.
Quite brilliant, wildly more efficient.
However, the efficiency of every other index on the table will have changed. You can only order by a few columns in a compound way. The data will primarily only be in the order of 1 column, and sorted within that for subsequent columns specified. In this small test case, if the Primary Key index is only used in equality predicates, this won’t matter. In more normal scenario’s, you are probably querying by several columns with several indexes. Changing the order of 1 column to make query against that column efficient may well make indexes on other columns inefficient.
Also, in this scenario it would probably have made sense to have specified ordering by 2 columns, so we could also get some clustering of ID
CREATE TABLE cost_tab_lo2 (id number NOT NULL, data number NOT NULL, force_tab_lookup_col VARCHAR2(20) NOT NULL) CLUSTERING BY LINEAR ORDER (data,id) WITHOUT MATERIALIZED ZONEMAP;
But, it usually isn’t overly practical to keep re-orging the table/partition. You may have an occasional window where it’s possible. The ordering will slowly degrade as new rows are inserted conventionally until the next re-org but the benefits may outweigh the resource usage to perform the re-org, and work for your application.
In Data Warehouse-type applications there are frequently staging operations with CREATE TABLE AS SELECT (CTAS), which are DIRECT PATH. There’s direct path SQL*Loader DIRECT=TRUE operations, Inserts using /*+ APPEND */ hints, and more.
The real value in this type of data clustering constraint is related to performing direct path operations, and also when you have a lot of partitions to play with. You can re-org older partitions independently, so keeping the majority of data in the order you want.
Keeping data ordered “on disk” can have huge performance benefits.
But as with practically every option you implement, there’s no free lunch…
There is a price to pay in the DIRECT PATH operations. There is a SORT step during the re-org. Sorting is expensive, making the re-org slower. This could be a very heavy operation for a large table/partition. If I explain a DIRECT_PATH insert, we can see line 3 is a SORT ORDER BY and takes a lot of resources to achieve.
> truncate table cost_tab_lo;Table truncated.> insert /*+ APPEND */ into cost_tab_lo SELECT * from cost_tab_lo_sav;3000000 rows created.Execution Plan----------------------------------------------------------Plan hash value: 3009853467------------------------------------------------------------------------------------------------------------| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |------------------------------------------------------------------------------------------------------------| 0 | INSERT STATEMENT | | 3000K| 74M| | 25660 (1)| 00:00:02 || 1 | LOAD AS SELECT | COST_TAB_LO | | | | | || 2 | OPTIMIZER STATISTICS GATHERING | | 3000K| 74M| | 25660 (1)| 00:00:02 || 3 | SORT ORDER BY | | 3000K| 74M| 103M| 25660 (1)| 00:00:02 || 4 | TABLE ACCESS FULL | COST_TAB_LO_SAV | 3000K| 74M| | 3505 (1)| 00:00:01 |------------------------------------------------------------------------------------------------------------
Clustering Data by a particular column may be practical for you, and may provide amazing benefits but it is a rare implementation – few companies use it (partly because it’s not a very well known facility!) but those who can will get major performance rewards.









Leave a comment