Dynamic SQL

I must confess I had never heard of dynamic SQL. But suddenly everyone around me started talking about dynamic SQL. What is dynamic SQL?
Dynamic SQL is a SQL statement that is created at runtime. Such is the definition.
I realised that I had already created quite some dynamic SQL without realising that such a construction is called dynamic SQL. As an example, I have created a data warehouse that is fed via SQL statements. A number of these statements are only created at runtime. One may think of a construction like:

set $targettable = XX

insert into $targettable
  select * from source

In the example above, the SQL that is fired will be “insert into XX select * from source”. However, if $targettable is set to YY, the generated SQL will be “insert into YY select * from source”.

Hence dynamic SQL allows you to write one skeleton SQL statement that can be used in different circumstances. This allows you to write smaller scripts that are easier to maintain.
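
In Teradata, such a skeleton can be wrapped in a stored procedure that builds the statement at runtime and executes it as dynamic SQL. The sketch below is only an illustration: the procedure name is made up and a table called source is assumed.

REPLACE PROCEDURE load_target (IN targettable VARCHAR(128))
BEGIN
  DECLARE sqltext VARCHAR(1000);

  -- Build the skeleton statement around the table name that is only known at runtime.
  SET sqltext = 'INSERT INTO ' || targettable || ' SELECT * FROM source';

  -- Execute the generated SQL.
  CALL DBC.SysExecSQL(sqltext);
END;

CALL load_target('XX');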

Another situation where dynamic SQL can be used is the handling of user input.

Suppose one has input fields where a user may enter some data. The user input is captured in a variable that is subsequently inserted into the database. Something like:

set $input = "data from input form"

insert into targettable
  inputfield ($input)
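
In Teradata, one way to sketch this is a parameterised macro: the SQL text is fixed and the user input is supplied as a parameter at runtime (the table and column names below are made up):

REPLACE MACRO ins_user_input (inputvalue VARCHAR(100)) AS (
  INSERT INTO targettable (inputfield) VALUES (:inputvalue);
);

EXEC ins_user_input('data from input form');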

Around me, some pushback on dynamic SQL could be heard. The main objection is that dynamic SQL might be difficult to maintain, as one may not know exactly what SQL is generated. I am inclined to see that as sloppy programming. A good programmer can always capture the generated SQL and show it to the user. This comes in handy when the script must be debugged.
Another issue was that the statements that generate the SQL can be quite complicated. This can certainly be the case. I have seen scripts that were needlessly complicated just to generate a SQL statement. However, good programming techniques should prevent this.

MOLAP and ROLAP

I currently work in an organisation that is debating whether to use MOLAP or ROLAP. But first of all: what is being discussed here?
ROLAP and MOLAP are two different techniques to store data that are meant for OLAP analysis. In ROLAP, the data are stored in tables in a relational database, and each OLAP question is translated into a SQL query that is fired upon the relational database. In MOLAP, the data are stored in a proprietary structure that is fully optimised to return OLAP outcomes.
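
As an example, an OLAP question such as “total sales per region per year” would, in a ROLAP setting, be translated into an aggregate SQL query along these lines (the table and column names are made up):

SELECT region,
       sales_year,
       SUM(sales_amount) AS total_sales
FROM   sales_fact
GROUP BY region, sales_year;
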
Both systems have their advantages. The big advantage of ROLAP is that one takes full advantage of an existing DBMS. The data are stored in a DBMS, and any SQL query may use the computing power of that DBMS without the need to install yet another programme. Moreover, in a modern environment with an existing data warehouse, one may directly use the tables that reside in that data warehouse. Hence ROLAP doesn’t need to copy the data into a proprietary structure to retrieve OLAP outcomes.
The big advantage of MOLAP is that a separate proprietary data structure is created that is geared towards returning OLAP outcomes quickly. To do so, the data structure needs to anticipate every possible OLAP question. This implies that with an increasing number of different outcomes, the proprietary data structure increases in size.
This shows that one may have a case to advocate MOLAP if:
- the number of different outcomes from OLAP is not too big (hence the proprietary data structure is not too big in size)

In a ROLAP structure, each OLAP question is translated into a SQL query. If the same OLAP question is raised multiple times, multiple SQL queries will be fired. As these SQL queries are identical, it might be that the DBMS undertakes the same actions multiple times (assuming results are not cached). In a MOLAP structure, the possible OLAP outcomes are pregenerated in a proprietary data structure. Hence, complex calculations are not only doable; as they are pregenerated, they return quickly once they are stored in the MOLAP structure. An update can be done after a fresh data load in the data warehouse. Once the data structure is ready, each OLAP question can be answered from that data structure. If the same OLAP question is raised again, the same data are retrieved from that proprietary data structure.
This shows that one has a case to advocate MOLAP if:
- the queries in the database are complicated (a sketch of one such query is given after this list); examples of such complex queries are

  • queries with a complex CASE statement,
  • CASE in the group by,
  • queries that generate running totals,
  • queries that process large chunks of data
  • queries with complex joins
  • etc.

- the same OLAP question is repeated multiple times
- the data are refreshed at large time intervals
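
As an illustration of such a complex query, consider a running total, which is exactly the kind of outcome a MOLAP structure could pregenerate (the table and column names below are made up):

SELECT sales_date,
       SUM(sales_amount) OVER (ORDER BY sales_date
                               ROWS UNBOUNDED PRECEDING) AS running_total
FROM   sales_fact;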

One uses an OLAP structure to undertake an analysis that summarises data and, if required, allocates data to lower levels of detail. I realise that an OLAP structure is also used as a framework to retrieve a series of reports. Within these reports, no summarisations or allocations are done. Hence, one may say that only a part of the OLAP functionality is used. That also means that the proprietary data structure that is created in MOLAP is only partly used if no summarisations are undertaken. In that case MOLAP might not be the best solution.
This shows that one has a case to advocate MOLAP if:
- one undertakes many OLAP movements, like summarisation, consolidation and detailing

Duplicate records in Teradata tables

Teradata offers the user the choice whether or not a check is made for duplicate records. Let’s first look at some code that allows duplicate records to be inserted.
The code below has two elements that enable duplicate records:

  1. it contains a primary index that is not unique.
  2. the table is a multiset table.
CREATE MULTISET  TABLE SAN_D_FAAPOC_01.EMPLOYEE
(
EMP_NUM INTEGER NOT NULL,
EMP_NAME CHAR(30) NOT NULL,
DEPT_NO INTEGER NOT NULL
)
 PRIMARY  INDEX(EMP_NUM);
 
INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123456,'VINAY',101);
INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123457,'SACHIN',102);
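-- This insert is identical to the previous one; in a MULTISET table it is accepted.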
INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123457,'SACHIN',102);

select * from SAN_D_FAAPOC_01.EMPLOYEE;

Two identical records can be inserted.
The table is a so-called multiset table. This type of table allows duplicate records. Note that this is not the Teradata default. The default is a SET table, created with “CREATE SET TABLE …”, which doesn’t allow duplicate records.

CREATE SET  TABLE SAN_D_FAAPOC_01.EMPLOYEE
(
EMP_NUM INTEGER NOT NULL,
EMP_NAME CHAR(30) NOT NULL,
DEPT_NO INTEGER NOT NULL
)
 PRIMARY  INDEX(EMP_NUM);
 
INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123456,'VINAY',101);
INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123457,'SACHIN',102);
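-- This insert is identical to the previous one; in a SET table it is rejected with a duplicate row error.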
INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123457,'SACHIN',102);

select * from SAN_D_FAAPOC_01.EMPLOYEE;

This set of inserts generates an error, as an attempt is made to insert two identical records. However, it takes time to do such a check. Hence in a data warehouse environment, such a check, with its accompanying overhead, might be too expensive. This explains why a multiset table is often preferred over the Teradata default.
Finally, in most cases the primary index is defined as unique. As the attribute that is used in the primary index is unique, the whole record is always unique as well. In that case an additional check on whether the record is unique is not needed. Hence the definition below leads to unique records because the primary index is unique. Again, this uniqueness is enforced without an explicit test on the full records.

CREATE MULTISET  TABLE SAN_D_FAAPOC_01.EMPLOYEE
(
EMP_NUM INTEGER NOT NULL,
EMP_NAME CHAR(30) NOT NULL,
DEPT_NO INTEGER NOT NULL
--,CONSTRAINT FOREIGN_EMP_DEPT FOREIGN KEY ( DEPT_NO)  REFERENCES  WITH NO CHECK OPTION DEPARTMENT(DEPT_NO)   
)
UNIQUE  PRIMARY  INDEX(EMP_NUM);

The Teradata answer to materialised views

Teradata has a feature that is designed to increase the performance of queries. This feature is called the “join index”. Such a join index is a structure that stores the outcomes of a query. These outcomes are stored permanently and wait for the moment they are called upon.
The syntax of such a join index is straightforward:

CREATE JOIN INDEX SAN_D_FAAPOC_01.EMP_DEPT
AS
SELECT
DEPT.DEPT_NO,
DEPT.DEPT_NAME,
EMP.EMP_NUM,
EMP.EMP_NAME
FROM
SAN_D_FAAPOC_01.EMPLOYEE EMP
INNER JOIN   
SAN_D_FAAPOC_01.DEPARTMENT DEPT
ON
DEPT.DEPT_NO = EMP.DEPT_NO
PRIMARY INDEX(EMP_NUM); 

The syntax contains the definition of a query. Its outcomes are permanently stored in the database as an object, here called “EMP_DEPT”.

If a record gets added to one of the base tables that are used in the join index, the join index gets updated to reflect the new situation. This is the downside of the join index: each update of a base table is automatically followed by an update of the join index. This involves extra processing overhead.
Hence the benefit of creating a join index must be weighed against the cost of continuously updating the join index as a result of changes in the base tables.

Alternatively, it can be decided to create a permanent table that contains the query outcomes. Such a table can be updated at fixed time intervals (instead of the automatic updates that come with the join index).
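
In Teradata, such a permanent table could be created from the query outcomes along these lines (a sketch that reuses the query from the join index above; the table name EMP_DEPT_TAB is made up):

CREATE TABLE SAN_D_FAAPOC_01.EMP_DEPT_TAB AS (
  SELECT
    DEPT.DEPT_NO,
    DEPT.DEPT_NAME,
    EMP.EMP_NUM,
    EMP.EMP_NAME
  FROM SAN_D_FAAPOC_01.EMPLOYEE EMP
  INNER JOIN SAN_D_FAAPOC_01.DEPARTMENT DEPT
  ON DEPT.DEPT_NO = EMP.DEPT_NO
) WITH DATA
PRIMARY INDEX (EMP_NUM);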

Two final remarks.
One. One cannot address a join index directly. A statement like “select * from join_index” will only return an error. A join index is used indirectly: if a query is written that looks similar to the join index, the optimiser may decide that the join index reduces the time needed to return the outcomes.
Two. One can tell whether a join index is used by reading the explain plan. This explain plan is shown if the query is preceded by the keyword “explain”. In that case a well-written text shows how the optimiser proceeds, and one may see whether the join index is used or not.
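
For instance, prefixing a query with EXPLAIN shows the plan that the optimiser has chosen (a sketch, using the tables from the example above):

EXPLAIN
SELECT EMP.EMP_NUM, EMP.EMP_NAME, DEPT.DEPT_NAME
FROM SAN_D_FAAPOC_01.EMPLOYEE EMP
INNER JOIN SAN_D_FAAPOC_01.DEPARTMENT DEPT
ON DEPT.DEPT_NO = EMP.DEPT_NO;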

Soft RI in Teradata

Teradata has the concept of “Soft RI”. In this concept, a foreign key is created but its restriction is not enforced.
What happens in that situation?

Let’s look at the normal situation of referential integrity. Suppose we have two tables, where one table is referred to by a second table. If a foreign key is created, normal referential integrity limits which records we may insert into the referring table. We cannot insert records with a foreign key value for which no related record exists in the referenced table. That is the normal referential integrity constraint.
Example: we have an “employee” table that has a foreign key “dept_no” that refers to a record in a “department” table. Suppose the department table has records for department 101, 102 and 103. If a foreign key is created, one cannot add records in the employee table that refer to department 104, as this doesn’t exist in the department table.

To enforce this referential integrity, each insert must be followed by a check on whether the inserted record complies with the referenced table. This check costs time. Hence an insert incurs additional processing time.

In a data warehouse environment, this overhead may be prohibitive. Moreover it might not be necessary as we retrieve the records from a source that has already enforced the referential integrity.

In that situation, we may apply “Soft RI”. A foreign key relationship is created, but its referential integrity is not enforced during loading.
We thus avoid the cost of the referential integrity check, which means more records can be loaded in a given time frame.

DROP TABLE SAN_D_FAAPOC_01.EMPLOYEE;
DROP TABLE SAN_D_FAAPOC_01.DEPARTMENT;


CREATE SET TABLE SAN_D_FAAPOC_01.DEPARTMENT ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      DEPT_NO INTEGER NOT NULL,
      DEPT_NAME VARCHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
      DEPT_LOC VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC)
UNIQUE PRIMARY INDEX ( DEPT_NO );



INSERT INTO SAN_D_FAAPOC_01.DEPARTMENT VALUES (101,'SALES','MUMBAI');
INSERT INTO SAN_D_FAAPOC_01.DEPARTMENT VALUES (102,'ACCOUNTS','MUMBAI');
INSERT INTO SAN_D_FAAPOC_01.DEPARTMENT VALUES (103,'HUMAN RESOURCES','MUMBAI');


CREATE  TABLE SAN_D_FAAPOC_01.EMPLOYEE
(
EMP_NUM INTEGER NOT NULL,
EMP_NAME CHAR(30) NOT NULL,
DEPT_NO INTEGER NOT NULL
,CONSTRAINT FOREIGN_EMP_DEPT FOREIGN KEY ( DEPT_NO)  REFERENCES  WITH NO CHECK OPTION DEPARTMENT(DEPT_NO)   
)
UNIQUE PRIMARY INDEX(EMP_NUM);


INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123456,'VINAY',101);
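-- With soft RI, the insert below succeeds even though department 104 does not exist in DEPARTMENT.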
INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123457,'SACHIN',104);

SEL
DEPT.DEPT_NO,
EMP.EMP_NUM,
EMP.EMP_NAME
FROM
SAN_D_FAAPOC_01.EMPLOYEE EMP
INNER JOIN   ----> HERE INNER JOIN DOES SOFT REFERENTIAL INTEGRITY & PICKS ONLY MATCHING ROWS
SAN_D_FAAPOC_01.DEPARTMENT DEPT
ON
DEPT.DEPT_NO = EMP.DEPT_NO;

Different codepages in SQL Server

It is possible in SQL Server to create columns in a table that have different codepages. This is shown in the table below, which has two attributes: char_unicode with a Unicode codepage and char_latin with a Latin codepage that holds West European characters only.

[Screenshot from the original post: the table definition with the char_unicode and char_latin columns.]

The char_latin column cannot contain characters like ł, İ or Φ. Such characters are not part of the Latin codepage, as this codepage only contains characters that are in use in Western European languages. To store such characters, we might use the Unicode codepage. Let us show how such characters are represented.
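
The screenshots from the original post are not reproduced here, but a minimal SQL Server sketch of such a table looks roughly like this (the table name, column sizes and collation are assumptions):

CREATE TABLE dbo.CodepageDemo
(
    char_unicode NVARCHAR(20),                             -- Unicode storage
    char_latin   VARCHAR(20) COLLATE Latin1_General_CI_AS  -- codepage 1252, West European only
);

INSERT INTO dbo.CodepageDemo VALUES (N'İstanbul', N'İstanbul');

-- char_unicode keeps the İ; char_latin falls back to a best-fit replacement (an I).
SELECT char_unicode, char_latin FROM dbo.CodepageDemo;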

[Screenshot from the original post: the query results for both columns.]

The column char_unicode shows the characters İ, Φ, И and ł. In the column char_latin, these characters were inserted but (obviously) could not be represented. It is interesting to see what happens: the İ is replaced by an I, the Φ is replaced by an F, the И is replaced by an unknown character, and the ł is replaced by an l.
This allows an easy logical translation from, say, İstanbul into Istanbul. This is a very interesting property: we could store Polish or Turkish names in their local characters and still have them stored in a logically understandable way when a Latin codepage is used. Nice.

Codepages

Recently, I encountered the issue of codepages again. The problem was that the name of a Turkish city (İstanbul) was not represented correctly. Again a codepage issue: the character İ was not part of the codepage used by the application.

How did these codepages come around?

In the early days, we had ASCII. This had 128 different characters that were represented with 7 bits. In practice an additional bit was used, which led to the situation that 8 bits (a full byte) were used to represent an ASCII character. The set of ASCII characters contains the usual a, b, c, etc., both in lower case and upper case. To this are added the digits (0, 1, 2, 3, ...) and signs like the question mark, comma, dot, etc. What we miss in ASCII are the language-specific characters like È and ç.
To cover the Western European languages, an additional series of 128 characters was created. The original ASCII plus the West European special characters (È, ç) could be represented in 8 bits. Hence one byte was sufficient to display all characters in the Western European languages. This list is known as extended ASCII, Codepage 1252, ISO 8859-1 or Latin1. Roughly, these codepages allow us to represent West European texts like French, English, Spanish, German and Dutch.

However, if one would like to write İstanbul, one encounters the limitations of these codepages. Hence other codepages were created to represent Baltic, Turkish or Czech. In most cases the alphabets that are derived from Latin contain the original ASCII list plus a series of characters (İ, Ď) that are specific to these languages. The idea is that one byte represents one character.
Likewise, other languages like Russian, Hebrew and Arabic got their own lists of characters whereby one byte is used to represent a character.

However, one runs into problems if an attribute should contain both Turkish and French names. What codepage to use? To solve this, a new standard was created: Unicode. Here, a character may take more than one byte, which allows an almost unlimited set of characters to be stored. Standard characters are represented with one byte; more exotic characters use 2, 3 or 4 bytes. One might even store Japanese or Korean in Unicode.

Unicode is certainly the future. As we speak, one sees many migration documents where an ASCII codepage is migrated to Unicode. I am also informed that more and more DBMSs recommend Unicode as their default character set.

Running scripts in Teradata

Let us assume we have a data warehouse in Teradata, and suppose that this data warehouse is loaded with a set of scripts. In principle this can be done. The question then is how these scripts are run. An example may clarify this. The example looks like:

.logon oTDD001.s2.ms.****.com/TOM.VAN-MAANEN, pau26688
.export report file=C:\Users\TOM.VAN-MAANEN\phi.txt
.set retlimit 20
SELECT	'"'||trim(Ident)||'";"'|| trim(Serial)||'";"'|| trim(Node)||'"' FROM	SAN_D_FAAPOC_01.TestUnicode;
.export reset;
.quit

The first line of this script creates a connection to a Teradata machine. The second line allocates a file to which the output will be written. The third line limits the number of returned rows to 20. The fourth line is a standard SELECT whose records are exported to the file. The last lines end the export and close the session.
This script is called by a piped command that sends the script to bteq:

C:\Users\TOM.VAN-MAANEN>bteq < bteq.txt
BTEQ 13.10.00.01 Thu May 22 11:28:08 2014

+---------+---------+---------+---------+---------+---------+---------+----
.logon oTDD001.s2.ms.unilever.com/TOM.VAN-MAANEN,

 *** Logon successfully completed.
 *** Teradata Database Release is 14.00.06.05
 *** Teradata Database Version is 14.00.06.05
 *** Transaction Semantics are BTET.
 *** Session Character Set Name is 'ASCII'.

 *** Total elapsed time was 1 second.

+---------+---------+---------+---------+---------+---------+---------+----
.export report file=C:\Users\TOM.VAN-MAANEN\phi.txt
 *** To reset export, type .EXPORT RESET
+---------+---------+---------+---------+---------+---------+---------+----
.set retlimit 20
+---------+---------+---------+---------+---------+---------+---------+----
insert into SAN_D_FAAPOC_01.TestUnicode(ident,node,serial)
        values('Test├╣─ÖSummary','Test├╣─ÖSummary',' 19');

 *** Insert completed. One row added.
 *** Total elapsed time was 1 second.


+---------+---------+---------+---------+---------+---------+---------+----
SELECT '"'||trim(Ident)||'";"'|| trim(Serial)||'";"'|| trim(Node)||'"'
FROM SAN_D_FAAPOC_01.TestUnicode;

 *** Query completed. 16 rows found. One column returned.
 *** Total elapsed time was 1 second.


+---------+---------+---------+---------+---------+---------+---------+----
.export reset;
 *** Output returned to console.
+---------+---------+---------+---------+---------+---------+---------+----
.quit
 *** You are now logged off from the DBC.
 *** Exiting BTEQ...
 *** RC (return code) = 0

C:\Users\TOM.VAN-MAANEN>

PIN code

A few days ago, I got the suggestion to protect a transaction with a PIN code. But how does that mechanism work? It looks as if this mechanism is used to describe several situations. The simplest situation looks a bit like a password. On the client side, a user enters a PIN code. This PIN code is translated into another value that is sent to the server. For the moment, one may think of a multiplication of the PIN code by the day number. This product is sent to the server. The server undertakes a similar process: it multiplies a stored PIN code by the day number. If this result equals the number that is received, the PIN code is assumed to have been entered by the authorised user. After such verification the user may proceed to the required transaction.
If the network traffic is monitored, the monitor only reveals the calculated code. Suppose the PIN code is multiplied by the day number; someone then sees a number like 337518. If someone monitors the network with the idea of hacking the protected site, such information is worthless. As the hacker doesn’t know the algorithm (multiplying the PIN code by the day number), he can’t work out what PIN code to use. He might try to use the number 337518 as an attempted PIN code, but his attempt will be rejected. From the traffic alone, he doesn’t get the PIN code that he needs.

Strange characters

In some cases, you get unexpected, weird results returned from your database, like: Testù�Summary. One would have expected TestùęSummary, as that is what was inserted. Apparently symbols like ę were not recognised and were subsequently translated into �.
A likely reason is that the so-called codepage is wrong. Characters like ę are not included in the common character set, and an extended character set (like Unicode) must be used.
Fortunately, most DBMSs support Unicode. As an example, we take Teradata. Look at this code:

CREATE SET  TABLE SAN_D_FAAPOC_01.TestUnicode ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
Ident VARCHAR(255) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
Serial INTEGER,
Node VARCHAR(64) CHARACTER SET UNICODE NOT CASESPECIFIC)
PRIMARY INDEX (Serial);

insert into SAN_D_FAAPOC_01.TestUnicode(ident,node,serial)
values('TestùęSummary','TestùęSummary',1235);

SELECT	Ident, Serial, Node
FROM	SAN_D_FAAPOC_01.TestUnicode;

The results are:

	Ident	Serial	Node
1	Testù�Summary	1,235	TestùęSummary

The nice thing about Teradata is that columns can be defined as Unicode columns. Hence nothing extra needs to be done to store such Unicode characters.
A similar situation exists in MySQL. In that DBMS too, we may store data in columns that are defined as Unicode. As an example, one may use this code snippet:

CREATE TABLE t1
(
    col0 CHAR(10),
    col1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci
);

insert into t1 values('TestùęSumm','TestùęSumm');

Here too, we have an illustration of the purpose of Unicode. It is an extension of the standard ASCII character set that aims to include the characters of all living languages. I understand that even Gothic and musical characters are included in Unicode. ASCII is a subset of Unicode; on top of that, characters are included that are not within the ASCII set.
I understand we have different encodings of Unicode. One such encoding is UTF-8. This encoding uses one byte to store the common Latin characters such as ‘A’, ‘B’, ‘1’, etc. For the more exotic characters, more bytes are used. An example is the relatively recently introduced “€”, which takes 3 bytes. Other characters use up to 4 bytes.
On average, a Western text is stored quite efficiently in UTF-8. As most characters only use 1 byte, we end up with a file size (in terms of bytes) that roughly equals the number of characters.
Another encoding is UTF-16, which uses 2 bytes per character. In that case, the file size, in terms of bytes, is double the number of characters. A Western text written in UTF-16 is then twice as big as it would have been in UTF-8.
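
A quick way to see this difference in MySQL is to compare the byte length with the character length of a string (a small sketch, assuming a UTF-8 connection character set):

-- LENGTH() counts bytes, CHAR_LENGTH() counts characters.
SELECT LENGTH('A') AS bytes_a,   CHAR_LENGTH('A') AS chars_a,    -- 1 byte, 1 character
       LENGTH('€') AS bytes_eur, CHAR_LENGTH('€') AS chars_eur;  -- 3 bytes in UTF-8, 1 character
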
As an example, I include two texts, one written in ASCII and one in UTF-8:
[Screenshot from the original post: the same text saved once in ASCII and once in UTF-8.]