Why you should specify a DISTRIBUTION KEY for tables on a single database partition in a DB2 DPF database

In this blog post, I will make a case for specifying a DISTRIBUTION KEY clause for a table that is about to be created on a single partition in a DPF database. That doesn’t seem to make sense, does it? At first, the idea did not make any sense to me either. However, the more I explored the topic, the more I realized there is a real benefit to this approach. Hence the blog post 🙂 Please read on …

The whole point of picking a distribution key for a table is to influence the physical distribution of rows across multiple partitions. So why would we want to specify the DISTRIBUTION KEY clause when creating a table on a single partition?

There are multiple advantages to doing this. Here is a scenario that is common in most shops:

Suppose a developer created this table in DEV. Data volumes are small in DEV, so the developer chooses to simply create the table on a single database partition. When it is time for you (the DBA) to promote the table to QA/Production, you realize that the table has to be spread across multiple database partitions because, in a nutshell, that makes for a better physical design than keeping it on a single database partition.

Since the DBA’s work is typically done over a weekend, the developer is not available. Now you are not sure which column(s) to pick as the distribution (hash) key. Usually, I hash on the columns that make up the primary key, but what if we want to co-locate two tables? Picking hash keys in the co-location case is not straightforward, as you need to know the workload across multiple tables.
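
Co-location means that rows of two tables which join on the distribution key land on the same database partition, so the join can be resolved locally without shipping data between partitions. DB2 actually routes a hash of the key through a partitioning map; the Python sketch below is only a simplified simulation of the principle (the table names ORDERS/ORDER_ITEMS are hypothetical, not from this example): as long as both tables hash the same key values through the same function, matching rows always land on the same partition.

```python
# Simplified simulation of hash co-location. This is NOT DB2's real
# algorithm (DB2 routes hash(key) through a partitioning map); it only
# shows that the SAME key value always maps to the SAME partition.
NUM_PARTITIONS = 4

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Partition assignment for a row, based on its distribution key."""
    return hash(key) % num_partitions

# Hypothetical ORDERS and ORDER_ITEMS tables, both distributed on order_id.
orders = list(range(100))
order_items = [(oid, item) for oid in orders for item in range(3)]

# Every item row lands on the same partition as its parent order row,
# so an equi-join on order_id never ships data between partitions.
collocated = all(
    partition_for(oid) == partition_for(parent_oid)
    for parent_oid in orders
    for (oid, item) in order_items
    if oid == parent_oid
)
print(collocated)  # True
```

This is why co-locating two tables forces a shared distribution key choice: once the key differs, matching join rows can land on different partitions and the join requires data movement.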

How about eliminating the guesswork and planning for this right from the development stage?

The added advantage of taking this approach is that you are NOT letting DB2 pick the default distribution key (column), which may or may not be optimal for performance. A non-optimal distribution key results in data skew across database partitions. If you follow the approach suggested in this blog post, you will be planning for data growth and fixing problems caused by data skew before they even arise.
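
To see how a poor key choice causes skew, here is a small Python simulation (an illustration only; DB2's real distribution goes through a hash and a partitioning map). Hashing a high-cardinality column spreads rows evenly, while hashing a two-value status column can never use more than two partitions, no matter how large the table grows:

```python
from collections import Counter

NUM_PARTITIONS = 4

def distribute(keys, num_partitions=NUM_PARTITIONS):
    """Count rows per partition when distributing on the given key values.
    Simplified stand-in for DB2's hash + partitioning-map lookup."""
    return Counter(hash(k) % num_partitions for k in keys)

# High-cardinality key (e.g. a primary key): rows spread evenly.
good = distribute(range(100_000))

# Low-cardinality key (e.g. a 2-value status flag): at most 2 of the
# 4 partitions ever receive data, however many rows arrive.
bad = distribute(["OPEN", "CLOSED"][i % 2] for i in range(100_000))

print(sorted(good.values()))  # four equal partition counts
print(len(bad))               # at most 2 partitions used
```

The skewed case leaves half the partitions empty while the others do all the I/O, which is exactly the problem a deliberately chosen, high-cardinality distribution key avoids.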

Here is an example to illustrate this:

Part 1) Let us create a test table db2inst1.test_hash on a single database partition with the DISTRIBUTE BY HASH clause. For the distribution key, I will pick the column ‘col2’ as it is the primary key (high cardinality).

db2 "create table db2inst1.test_hash (col1 int, col2 int) 
DISTRIBUTE BY HASH (col2)"
DB20000I  The SQL command completed successfully.

Generating the DDL for this table, we get:


#db2look -d $DEVDB -e -z db2inst1 -t test_hash

CREATE TABLE "DB2INST1"."TEST_HASH"  (
                  "COL1" INTEGER ,
                  "COL2" INTEGER )
                 DISTRIBUTE BY HASH("COL2")
                   IN "TBSP_NO_HASH" ;

DB2 runs through a list of checks to pick the default tablespace; I will skip those details in this blog post. Since we did not specify a tablespace, DB2 picked the default tablespace ‘TBSP_NO_HASH’ for our table.

The SQL below confirms that the tablespace ‘TBSP_NO_HASH’ is a single-partition tablespace (i.e., it exists on only a single database partition):


#db2 "select char(tbsp_name, 20) tablespace, 
char(DBPGNAME, 20) nodegroup, 
DBPARTITIONNUM 
from sysibmadm.tbsp_utilization 
where tbsp_name = 'TBSP_NO_HASH' with ur"

TABLESPACE           NODEGROUP        DBPARTITIONNUM
-------------------- ---------------  --------------
TBSP_NO_HASH         SDPG                          0

  1 record(s) selected.

Part 2) We will promote this table to QA. Our goal is to create the table on multiple database partitions. We will do this in two steps:

a) First, create the table in QA exactly as it is in DEV, and then
b) use the ADMIN_MOVE_TABLE procedure in QA to move the table into a new tablespace that spans multiple database partitions.

Step a) is straightforward: extract the DDL with db2look and execute it against the QA database.
For step b), we first create a tablespace in a partition group (here: PDPG) that spans multiple partitions, and then let the stored procedure do the job for us.


db2 "create tablespace tbsp_hash in pdpg"
DB20000I  The SQL command completed successfully.

db2 "call sysproc.admin_move_table ('DB2INST1',
'TEST_HASH',
'TBSP_HASH',
'TBSP_HASH',
'TBSP_HASH','','','','','','MOVE')"

  Result set 1
  ---------------------
  KEY                              VALUE
  ------------------------------------------------------------
  STATUS                           COMPLETE
  AUTHID                           DB2INST1
  VERSION                          09.07.0007
  INIT_START                       2015-04-19-23.03.58.089715
  INDEXSCHEMA
  INDEXNAME
  CLEANUP_END                      2015-04-19-23.03.59.912148
  INIT_END                         2015-04-19-23.03.59.069356
  COPY_START                       2015-04-19-23.03.59.121358
  COPY_OPTS                        ARRAY_INSERT,NON_CLUSTER
  COPY_TOTAL_ROWS                  0
  COPY_END                         2015-04-19-23.03.59.152560
  REPLAY_START                     2015-04-19-23.03.59.152967
  REPLAY_TOTAL_ROWS                0
  REPLAY_TOTAL_TIME                0
  REPLAY_END                       2015-04-19-23.03.59.580873
  SWAP_START                       2015-04-19-23.03.59.594976
  SWAP_END                         2015-04-19-23.03.59.766892
  SWAP_RETRIES                     0
  CLEANUP_START                    2015-04-19-23.03.59.772106

  20 record(s) selected.

That is it! We did not have to modify the DDL in DEV to add distribution keys, nor edit the tablespace name in QA. How easy is that?! There is no chance for typos, and the process can easily be scripted, making it fail-safe.

Now, let us confirm that the table actually landed in the desired tablespace (TBSP_HASH) in QA. We can use db2look to verify:

#db2look -d $QADB -e -z DB2INST1 -t TEST_HASH
CREATE TABLE "DB2INST1"."TEST_HASH"  (
                  "COL1" INTEGER ,
                  "COL2" INTEGER )
                 DISTRIBUTE BY HASH("COL2")
                   IN "TBSP_HASH" ;

We can also confirm that the tablespace ‘TBSP_HASH’ spans multiple database partitions, as expected:

#db2 "select char(tbsp_name, 20) tablespace,
char(DBPGNAME, 20) nodegroup,
DBPARTITIONNUM
from sysibmadm.tbsp_utilization
where tbsp_name = 'TBSP_HASH' with ur"

TABLESPACE           NODEGROUP        DBPARTITIONNUM
-------------------- ---------------  --------------
TBSP_HASH            PDPG                          1
TBSP_HASH            PDPG                          2
TBSP_HASH            PDPG                          3
TBSP_HASH            PDPG                          4

  4 record(s) selected.

No ‘DISTRIBUTE BY’ clause in DEV:

What if the developer did not include the DISTRIBUTE BY HASH clause in the DDL for the table in DEV? Remember, this clause is not mandatory when creating a table.

In this case, the DDL would look like this:

CREATE TABLE "DB2INST1"."TEST_HASH"  (
                  "COL1" INTEGER ,
                  "COL2" INTEGER )
                   IN "TBSP_NO_HASH" ;

Do you have to drop and re-create the table to add a distribution key? No, not really. Fortunately, DB2’s ALTER TABLE statement allows you to add a distribution key to a table that resides on a single database partition.


db2 "select count(*) from db2inst1.no_hash"
1
-----------
       5509
  1 record(s) selected.

db2 "alter table db2inst1.no_hash add distribute by hash (col1)"
DB20000I  The SQL command completed successfully.

## Let us see if we can still read from the table after this alter
db2 "select count(*) from db2inst1.no_hash"
1
-----------
       5509
  1 record(s) selected.

Now, let us see if we can write into this table after hashing it, just to check that the table has not been left in an inoperative state.


db2 "insert into db2inst1.no_hash (select colcount, colcount from syscat.tables)"
DB20000I  The SQL command completed successfully.

db2 "select count(*) from db2inst1.no_hash"
1
-----------
      11018
  1 record(s) selected.

I hope I have made the case for adding the DISTRIBUTE BY clause to your tables’ DDL right from the development stage. Hope this helps! Thanks for stopping by db2talk.com
