In this blog post, I will make a case for specifying a DISTRIBUTION KEY clause even for a table that is about to be created on a single partition in a DPF database. This doesn't make sense… does it? At first, the idea did not make any sense to me either. However, the more I explored the topic, the more I realized there is a big benefit to taking this approach. Hence the blog post 🙂 Please read on …
The whole idea of picking a distribution key for a table is to influence the physical distribution of rows across multiple partitions. So why would we specify the DISTRIBUTION KEY clause when creating a table on a single partition?
There are multiple advantages to doing this. Here is a scenario that is common in most shops:
Suppose a developer created this table in DEV. Data volumes are low in DEV, so the developer chooses to simply create the table on a single database partition. When it is time for you (the DBA) to promote the table to QA/Production, you realize that the table has to be spread across multiple database partitions because, in a nutshell, that makes for a better physical design than keeping the table on a single database partition.
Assuming the DBA's work is being done over a weekend, the developer is not available. Now you are not sure which column(s) to pick as the distribution (hash) key. Usually, I hash on the columns that make up the primary key, but what if we wanted to co-locate two tables? Picking hash keys in the co-location case is not straightforward, as you need to know the workload across multiple tables.
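To see why co-location constrains the choice of hash keys, here is a toy Python sketch. This is not DB2's actual hashing (DB2 maps the hashed key through a distribution map), and the ORDERS/CUSTOMERS tables, partition count, and row counts are all made-up illustrations; the property that matters is simply "same key value, same partition".

```python
NUM_PARTITIONS = 4  # hypothetical partition group size

def partition_for(key):
    # Stand-in for DB2's hashing; the real engine maps the hashed
    # key through a distribution map, but the principle holds:
    # the same value always lands on the same partition.
    return hash(key) % NUM_PARTITIONS

# Hypothetical ORDERS table: (order_id, customer_id), 10 customers.
orders = [(order_id, order_id % 10) for order_id in range(100)]

# If ORDERS and CUSTOMERS are both distributed on customer_id,
# each order row sits on the same partition as its customer row,
# so a join on customer_id is fully co-located.

# Distribute ORDERS on order_id instead and co-location is lost:
# matching rows often live on different partitions, and the join
# has to ship rows between partitions at query time.
cross_partition = sum(
    partition_for(order_id) != partition_for(customer_id)
    for order_id, customer_id in orders
)
print(f"{cross_partition} of {len(orders)} joined rows are not co-located")
```

With this toy data, half the joined rows end up on the wrong partition once the two tables hash on different columns, which is exactly the kind of workload knowledge you need before picking keys.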
How about eliminating the guesswork and planning for this right from the development stage?
The added advantage of taking this approach is that you are NOT letting DB2 pick the default distribution key (column), which may or may not be optimal for performance. A non-optimal distribution key results in data skew across database partitions. If you follow the approach suggested in this blog post, you will be planning for data growth and fixing problems caused by data skew before they even arise.
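To see why key cardinality matters for skew, here is a toy Python sketch. Again, this is not DB2's actual algorithm, and the partition count and row counts are invented for illustration; it only demonstrates how a low-cardinality key starves most partitions.

```python
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition group size

def partition_for(key):
    # Stand-in for DB2's hashing; "same key -> same partition"
    # is the property that matters here.
    return hash(key) % NUM_PARTITIONS

rows = range(100_000)

# High-cardinality key (unique per row): rows spread evenly.
even = Counter(partition_for(row_id) for row_id in rows)

# Low-cardinality key (say, a 2-value status column): every row
# lands on at most 2 of the 4 partitions -- that is data skew.
skewed = Counter(partition_for(row_id % 2) for row_id in rows)

print("partitions used, unique key :", len(even))    # 4
print("partitions used, 2-value key:", len(skewed))  # 2
```

The unique key spreads the 100,000 rows evenly across all four partitions, while the 2-value key piles everything onto two of them, leaving the others empty.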
Here is an example to illustrate this:
Part 1) Let us create a test table db2inst1.test_hash on a single database partition with the DISTRIBUTE BY HASH clause. For the distribution key, I will pick the column 'col2' as it is the primary key (high cardinality).
db2 "create table db2inst1.test_hash (col1 int, col2 int) DISTRIBUTE BY HASH (col2)"
DB20000I  The SQL command completed successfully.
Generating the DDL for this table, we get:
#db2look -d $DEVDB -e -z db2inst1 -t test_hash

CREATE TABLE "DB2INST1"."TEST_HASH"  (
                  "COL1" INTEGER ,
                  "COL2" INTEGER )
                 DISTRIBUTE BY HASH("COL2")
                   IN "TBSP_NO_HASH" ;
DB2 runs through a list of checks before it picks the default tablespace; I will skip those details in this blog post. In this case, DB2 picked the default tablespace 'TBSP_NO_HASH' for our table.
From the SQL below, it is obvious that 'TBSP_NO_HASH' is a single-partition tablespace (i.e., it exists on only a single database partition):
#db2 "select char(tbsp_name, 20) tablespace, char(DBPGNAME, 20) nodegroup, DBPARTITIONNUM from sysibmadm.tbsp_utilization where tbsp_name = 'TBSP_NO_HASH' with ur"

TABLESPACE           NODEGROUP       DBPARTITIONNUM
-------------------- --------------- --------------
TBSP_NO_HASH         SDPG                         0

  1 record(s) selected.
Part 2) We will promote this table to QA. Our goal is to create this table on multiple database partitions. We will do this in 2 steps:
a) We will first create the table in QA exactly as it is on DEV and then
b) Use the ADMIN_MOVE_TABLE procedure in QA to move table into a new tablespace that spans across multiple database partitions.
Step a) is straightforward: extract the DDL with db2look and execute it against the QA database.
For Step b), we will first create a tablespace in a partition group that spans multiple partitions (here: PDPG) and then let the stored procedure do the job for us.
db2 "create tablespace tbsp_hash in pdpg"
DB20000I  The SQL command completed successfully.

db2 "call sysproc.admin_move_table ('DB2INST1', 'TEST_HASH', 'TBSP_HASH', 'TBSP_HASH', 'TBSP_HASH', '', '', '', '', '', 'MOVE')"

  Result set 1
  --------------

  KEY                              VALUE
  -------------------------------- ---------------------------
  STATUS                           COMPLETE
  AUTHID                           DB2INST1
  VERSION                          09.07.0007
  INIT_START                       2015-04-19-23.03.58.089715
  INDEXSCHEMA
  INDEXNAME
  CLEANUP_END                      2015-04-19-23.03.59.912148
  INIT_END                         2015-04-19-23.03.59.069356
  COPY_START                       2015-04-19-23.03.59.121358
  COPY_OPTS                        ARRAY_INSERT,NON_CLUSTER
  COPY_TOTAL_ROWS                  0
  COPY_END                         2015-04-19-23.03.59.152560
  REPLAY_START                     2015-04-19-23.03.59.152967
  REPLAY_TOTAL_ROWS                0
  REPLAY_TOTAL_TIME                0
  REPLAY_END                       2015-04-19-23.03.59.580873
  SWAP_START                       2015-04-19-23.03.59.594976
  SWAP_END                         2015-04-19-23.03.59.766892
  SWAP_RETRIES                     0
  CLEANUP_START                    2015-04-19-23.03.59.772106

  20 record(s) selected.
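Conceptually, what ADMIN_MOVE_TABLE did in step b) is re-home every row according to the target partition group's distribution map. Here is a toy Python sketch of that rehash; the real procedure does this online with staging and replay phases, and the hash function, partition numbers, and sample rows below are all made up for illustration.

```python
NUM_TARGET_PARTITIONS = 4  # hypothetical target partition group size

def partition_for(key):
    # Stand-in for hashing against the new partition group's map.
    return hash(key) % NUM_TARGET_PARTITIONS

# All rows of (col1, col2) currently live on the single source
# partition; col2 is the distribution key from the DDL.
source_rows = [(1, 101), (2, 102), (3, 103), (4, 104)]

# Rehash each row on col2 to find its new home partition.
target = {p: [] for p in range(NUM_TARGET_PARTITIONS)}
for col1, col2 in source_rows:
    target[partition_for(col2)].append((col1, col2))

for p, rows in sorted(target.items()):
    print(f"partition {p}: {rows}")
```

Because the distribution key was already declared in DEV, the move simply reuses it; no DDL editing is needed along the way.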
That is it. We did not have to modify the DDL in DEV to add distribution keys, nor edit the tablespace name in QA. How easy is that?! There is no chance for typos, and the process can easily be scripted and is fail-safe.
Now, let us confirm that the table is actually in the desired tablespace (TBSP_HASH) in QA. We can use db2look to do this for us:
#db2look -d $QADB -e -z DB2INST1 -t TEST_HASH

CREATE TABLE "DB2INST1"."TEST_HASH"  (
                  "COL1" INTEGER ,
                  "COL2" INTEGER )
                 DISTRIBUTE BY HASH("COL2")
                   IN "TBSP_HASH" ;
We could also confirm that the tablespace ‘TBSP_HASH’ is on multiple database partitions as expected.
#db2 "select char(tbsp_name, 20) tablespace, char(DBPGNAME, 20) nodegroup, DBPARTITIONNUM from sysibmadm.tbsp_utilization where tbsp_name = 'TBSP_HASH' with ur"

TABLESPACE           NODEGROUP       DBPARTITIONNUM
-------------------- --------------- --------------
TBSP_HASH            PDPG                         1
TBSP_HASH            PDPG                         2
TBSP_HASH            PDPG                         3
TBSP_HASH            PDPG                         4

  4 record(s) selected.
No ‘DISTRIBUTE BY’ clause in DEV:
What if the developer did not include the DISTRIBUTE BY HASH clause in the DDL for the table in DEV? Remember that this clause is not mandatory when creating a table.
In this case, the DDL would be like below:
CREATE TABLE "DB2INST1"."TEST_HASH"  (
                  "COL1" INTEGER ,
                  "COL2" INTEGER )
                 IN "TBSP_NO_HASH" ;
Do you have to drop and re-create this table to add a distribution key? No, not really. Fortunately, DB2's ALTER TABLE statement lets you add a distribution key (as long as the table resides in a single-partition database partition group).
db2 "select count(*) from db2inst1.no_hash"

1
-----------
       5509

  1 record(s) selected.

db2 "alter table db2inst1.no_hash add distribute by hash (col1)"
DB20000I  The SQL command completed successfully.

## Let us see if we could read from the table after this alter
db2 "select count(*) from db2inst1.no_hash"

1
-----------
       5509

  1 record(s) selected.
Now, let us see if we can write to this table after hashing it – just checking that the alter did not put the table into an inoperative state.
db2 "insert into db2inst1.no_hash (select colcount, colcount from syscat.tables)"
DB20000I  The SQL command completed successfully.

db2 "select count(*) from db2inst1.no_hash"

1
-----------
      11018

  1 record(s) selected.
I hope I have made the case for adding the DISTRIBUTE BY clause to your table's DDL right from the development stage. Hope this helps! Thanks for stopping by db2talk.com