Requirements analysis

Recently I read a very interesting book on requirements analysis: “Mastering the Requirements Process: Getting Requirements Right” by Suzanne Robertson and James Robertson. They wrote a practical book that has led to a template for such a document.
They advocate spending enough time on requirements analysis, which can be described shortly as “getting to know what the user wants”. The idea is that a project should be based upon the users’ requirements. If one understands these requirements, one knows the scope of the project. Moreover, priorities within the project can be set better as they are based on the users’ sensitivities.
Formulating the requirements is fundamentally different from writing a design. A design provides a solution and is formulated in technical terms. Requirements provide a list of user wants and are formulated in business terms. Hence a discussion on databases doesn’t belong in a requirements document, as it is part of a solution. The same holds for a user interface: it is part of a solution; it is not a requirement.

The requirements fall into three groups:

  • Functional requirements that indicate what functionality (data, business logic) is delivered by the system. They indicate what the system must do.
  • Non-functional requirements that indicate how the data/business logic is delivered. One may include security requirements and performance requirements here. They indicate a property of the system.
  • Constraints that indicate with which rules a project must comply. An example is the toolset that may be used to implement the solution.

The requirements must comply with a certain format:

  • They must be numbered so as to make them traceable.
  • They must be testable. Hence they must be formulated in a quantified way. An example of a wrongly formulated requirement is: “The application must generate an exciting user experience.” Any experience may be labelled as exciting; hence this is not testable. Better is: “the application can be understood by a user within 30 seconds”, or “the report must be generated within 30 seconds”.

The requirements should fall within the scope of the project. This scope is identified with a context diagramme (see my other post). A requirement falls within the circle that is shown in the context diagramme and which describes the project. This context diagramme also provides a nice starting point for further requirements analysis. The context diagramme displays dataflows to and from the system. These dataflows are indicated by arrows. For each of these arrows, logic must be created in the system that handles these input and output flows. This logic is derived from business requirements.

Context diagrammes – understanding a scope

I came across a very nice technique that can be used to show the scope of a project.
It is called the Context Diagramme.
The idea is that the system is shown as one item. The internal structure of the system is not displayed. Instead the diagramme shows the interaction of the system with external interfaces. The end result is an overview in which the external interfaces are displayed clearly, together with the type of interaction that they have with the system. The nice thing is that a graphical overview of the project is given that can be used for discussion with the end-user. In that discussion, the boundaries of the system will become clearer to all. Moreover, the external interfaces will be made explicit. As the end-user will be involved in discussions on these external interfaces, it is good to be able to inform him upfront about these interfaces and why they are needed.

If the external interfaces are listed explicitly with their interaction with the system, the scope of the system can be made clear. The system should be able to deal with all stimuli and data that stem from the external interfaces. In other words: the system should be able to consume the datastreams, or generate the dataflows, that are indicated in the diagramme. Such streams are shown explicitly in the diagramme. The scope should then include all handling of such streams. In principle, the scope is indicated by the circle in the middle of the diagramme. The external interfaces are outside the scope. Whether the arrows between the external interfaces and the system are within scope is a subject of negotiation between the project and the others that are responsible for the interfaces.


Such a context diagramme uses three types of symbols:

  • A circle that contains the system. The system is understood as one entity whose internal working is not displayed.
  • A block that displays an external party. Think of customers as an external party.
  • An arrow that shows the interaction between an external interface and the system.

The external interfaces can be manifold. Let me give a few examples:

  • Users that actively use the system. An example is a user who issues an order to the system. Likewise, a user who requires a certain report is an example of an external party.
  • Users who depend on the system. An example is an accountant who receives a certain plan that is created by the system. The user doesn’t explicitly require such a plan; he only receives such a plan.
  • An independent body who influences the system. One may think of a body that issues standards that need to be followed. Such standards may influence the system that is described in the diagramme.
  • A system that sends data to the system. The external system then has a role as data provider. In that case the external system doesn’t benefit from the system that is described; it only has to provide data.
  • A system that cooperates with the system in question. An example might be an internet provider who provides the infrastructure that is used by the system.

But one may also think of abstract types of interfaces, such as:

  • A time dependent event, such as the end of a month when sales overviews must be provided.
  • An event, such as the introduction of a new product.
The relations are mostly expressed as datastreams. As an example, one may think of a request for a report. The data that are conveyed to the system are: a report indicator, a period for which the report is generated, a possible filter condition, etc. The idea is that these diagrammes are mostly used for software engineering, and communication between actors can always be described with data.
    This diagramme will be used in the discussion with the end-user. Hence the language that is used must be business terminology. System names should be avoided, as their usage may lead to unnecessary discussions. It is better to use business names for the interfaces involved, as the picture can then be understood more easily.

    [Figure: context diagramme of the data warehouse project described below]

    I took the example of a data warehouse project. This project uses Balance Sheet data, Profit & Loss data, Inventory data and Cost data that are uploaded on a monthly basis to a data warehouse. From that data warehouse, data are sent to external systems and to a reporting environment.

    This context diagramme is often seen as part of a hierarchy. In a related diagramme, the circle that indicates the system is detailed. In that related diagramme, the internal working of the system is shown. When such a hierarchy is used, the context diagramme is labelled as a level-0 Data Flow Diagramme. The diagrammes that show the internal workings of the system are labelled as level-1 or level-2 Data Flow Diagrammes. In the example above, I labelled the items to allow later decomposition. As an example, in a level-1 Data Flow Diagramme, one might have P1.1, P1.2 and P1.3 that are part of the P1 process.
    More or less the same holds for time dependencies. The context diagramme cannot be used to indicate time dependencies. For that purpose, other techniques can be used, such as sequence diagrammes. If one would like to show the time dependencies, the context diagramme could be expanded into sequence diagrammes.

    Yet another checklist

    I often come across checklists. Such checklists are used to assess the quality of, say, a database. See here for an example of such a checklist. In this case, we have a checklist that is created to assess the quality of a Teradata database.
    Such checklists are very interesting. They are generally based on a very good understanding of the subject area. Moreover, such knowledge is written down in a very concise form, which provides the reader with a quick insight as to whether the list might benefit him.

    Teradata: what columns to choose as Primary Index?

    A major question with Teradata tables is what columns to choose when a primary index must be created.
    I understand that 4 different arguments might be used:
    1: Is the column often accessed? If it is often accessed, best usage is made of the distribution of records over the different AMPs.
    2: Is the column used in joins? In that case a similar argument holds. If the column is used in a join condition, usage is made of the distribution of records over the different AMPs.

     

     
    3: Is the column often updated? If a column is updated, the record might be re-allocated to another AMP. Hence, the fewer updates, the better it is to include the column in a primary index.
    4: Is the cardinality high? In that case, it is possible to distribute the records over many AMPs, enabling Teradata to make optimal usage of the primary index capability.
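
    To make these arguments concrete, below is a small sketch. The CUSTOMER table and its columns are made up for illustration; the point is that a column such as CUST_ID, which is frequently used for access and joins, is rarely updated and has a high cardinality, is a good primary index candidate, whereas a low-cardinality column such as REGION is not.

    CREATE TABLE SAN_D_FAAPOC_01.CUSTOMER
    (
    CUST_ID INTEGER NOT NULL,        -- often accessed and joined, rarely updated, high cardinality
    CUST_NAME VARCHAR(100) NOT NULL,
    REGION CHAR(3)                   -- low cardinality: a poor primary index candidate
    )
    PRIMARY INDEX (CUST_ID);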

    Teradata and fall back

    I understand Teradata has a so-called fallback option. The idea is that data are stored twice, each record being stored on two different AMPs.

    I saw a nice picture that describes the situation. Each record (Row 1, Row 2, etc.) is linked to an AMP. This is shown in the scheme below as yellow boxes. A separate copy is made of all records. The copy is shown in an orange box. Each copy is linked to a different AMP than the original record.

    [Figure: Teradata fallback – each row (yellow box) and its copy (orange box) stored on different AMPs]

    Having fallback certainly has advantages. If an AMP fails, a query may use the copies of the original records.

    However, there is a flipside to this. It requires:

    • more table space, as copies must be stored
    • more IO time, as the copies must be written to disk
    • more CPU time, as the copies must be created by the system

    Hence, this option is only used for critical tables where the additional costs are justified. Moreover, it is advised to set this option at table level only. This allows a fine distinction between tables for which such additional costs are warranted and other tables where the additional costs are not justified.
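
    As a sketch (the table names below are hypothetical), the choice is made per table in the CREATE TABLE statement with the FALLBACK / NO FALLBACK option:

    -- a critical table: the extra space, IO and CPU are accepted
    CREATE TABLE SAN_D_FAAPOC_01.CRITICAL_SALES, FALLBACK
    (
    SALES_ID INTEGER NOT NULL,
    AMOUNT DECIMAL(15,2)
    )
    PRIMARY INDEX (SALES_ID);

    -- a less critical table: the additional costs are not justified
    CREATE TABLE SAN_D_FAAPOC_01.STAGING_SALES, NO FALLBACK
    (
    SALES_ID INTEGER NOT NULL,
    AMOUNT DECIMAL(15,2)
    )
    PRIMARY INDEX (SALES_ID);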

     

    Dynamic SQL

    I must confess I had never heard of Dynamic SQL. But suddenly everyone around me started talking about Dynamic SQL. What is dynamic SQL?
    Dynamic SQL is a SQL statement that is created at runtime. Such is the definition.
    I realised that I had already created quite some dynamic SQL without realising that such a construction is Dynamic SQL. As an example, I have created a data warehouse that is fed via SQL statements. A number of these statements are only created at runtime. As an example, one may think of a construction like:

    set $targettable = XX
    
    insert into $targettable
      select * from source
    

    In the example above, the SQL that is fired will be “insert into XX select * from source”. However, if $targettable is set to YY, the generated SQL will be “insert into YY select * from source”.

    Hence dynamic SQL allows you to write one skeleton SQL statement that can be used in different circumstances. This allows you to write smaller scripts that are easier to maintain.
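
    As a sketch of the same idea in Teradata terms: a stored procedure can build the statement as a string and submit it at runtime. The procedure name load_target, its parameter and the table “source” are made up for illustration; if I understand the Teradata documentation correctly, DBC.SysExecSQL is the facility that submits a dynamically built SQL string from within a stored procedure.

    CREATE PROCEDURE load_target (IN targettable VARCHAR(128))
    BEGIN
      DECLARE sqltext VARCHAR(1000);
      -- the statement only exists at runtime, once the table name is known
      SET sqltext = 'INSERT INTO ' || targettable || ' SELECT * FROM source';
      CALL DBC.SysExecSQL(sqltext);
    END;

    A call like “CALL load_target('XX');” then generates and executes “INSERT INTO XX SELECT * FROM source”.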

    Another situation where dynamic SQL can be used is the handling of user input.

    Suppose one has input fields where a user may enter some data. To store these data, the user input is captured in a variable that is subsequently inserted into the database. Something like:

    set $input = "data from input form"
    
    insert into targettable (inputfield)
      values ($input)
    

    Around me, some pushback on dynamic SQL could be heard. The main problem is that dynamic SQL might be difficult to maintain, as one may not know exactly what SQL is generated. I am inclined to see that as sloppy programming. A good programmer can always capture the generated SQL and show it to the user. This can be handy when the script must be debugged.
    Another issue was that the statements that generate the SQL can be quite difficult. This can certainly be the case. I have seen scripts that were needlessly complicated just to generate a SQL statement. However, good programming techniques should prevent this.

    MOLAP and ROLAP

    I currently work in an organisation that has a debate on whether to use MOLAP or ROLAP. But first of all: what is discussed here?
    ROLAP and MOLAP are two different techniques to store data that are meant for OLAP analysis. In ROLAP the data are stored in tables in a relational database and each data search in OLAP is translated into a SQL query that is fired upon the relational database. In MOLAP, the data are stored in a proprietary structure that is fully optimised to return data from OLAP investigations.
    Both systems have their advantages. The big advantage of ROLAP is that one takes full advantage of an existing DBMS. The data are stored in a DBMS. Any SQL query may then use the computing power of that DBMS without the need to install yet another programme. In an environment with an existing data warehouse, one may even directly use the tables that reside in the data warehouse. Hence ROLAP doesn’t need to copy the data into a proprietary structure to allow retrieving OLAP outcomes.
    The big advantage of MOLAP is that a separate proprietary data structure is created that is geared towards returning OLAP outcomes quickly. To do so, the data structure needs to anticipate every possible OLAP question. This implies that with an increasing number of different outcomes, the proprietary data structure increases in size.
    This shows that one may have a case to advocate MOLAP if:
    - the number of different outcomes from OLAP is not too big (hence the proprietary data structure is not too big in size)

    In a ROLAP structure, each OLAP question is translated into a SQL query. If the same OLAP question is raised multiple times, multiple SQL queries will be fired. As these SQL queries are identical, it might be that the DBMS undertakes the same actions multiple times (assuming results are not cached). In a MOLAP structure, the possible OLAP outcomes are pregenerated in a proprietary data structure. Hence, complex calculations are not only doable, they also return quickly once they are stored in the MOLAP structure, as they are pregenerated. An update can be done after a fresh data load in the data warehouse. Once the data structure is ready, each OLAP question can be answered from that data structure. If the same OLAP question is raised again, the same data are retrieved from that proprietary data structure.
    This shows that one has a case to advocate MOLAP if:
    - the queries in the database are complicated; examples of such complex queries are

    • queries with a complex CASE statement,
    • CASE in the group by,
    • queries that generate running totals (see the sketch after this list),
    • queries that process large chunks of data
    • queries with complex joins
    • etc.

    - the same OLAP question is repeated multiple times
    - the data are refreshed at large time intervals
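
    As an illustration of the running-totals case: in ROLAP such a report is typically answered with an ordered analytic function that the DBMS must evaluate again on every request, whereas in MOLAP the cumulative figures would be pregenerated. A minimal sketch (the SALES table and its columns are hypothetical):

    SELECT
    SALES_DATE,
    AMOUNT,
    -- cumulative total up to and including the current row
    SUM(AMOUNT) OVER (ORDER BY SALES_DATE ROWS UNBOUNDED PRECEDING) AS RUNNING_TOTAL
    FROM SAN_D_FAAPOC_01.SALES;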

    One uses an OLAP structure to undertake an analysis that summarises data and, if required, allocates data to lower levels of detail. I realise that an OLAP structure is also used to create a framework to retrieve a series of reports. Within these reports, no summarisations or allocations are done. Hence, one may say that only a part of the OLAP functionality is used. That also means that the proprietary data structure that is created in MOLAP is only partly used if no summarisations are undertaken. In that case MOLAP might not be the best solution.
    This shows that one has a case to advocate MOLAP if:
    - one undertakes many OLAP movements, like summarisation, consolidation and detailing.

    The case for ROLAP can be made if we have:

    • ad hoc reports or infrequently used reports,
    • near real time reports,
    • high-cardinality dimension tables,
    • a request for non-summary reports,
    • reports that only require a simple SQL query that returns results quickly.

     

    Duplicate records in Teradata tables

    Teradata offers the user the choice whether or not a check is made on duplicate records. Let’s first look at some code that allows duplicate records to be inserted.
    The code below has two elements that enable duplicate records:

    1. it contains a primary index that is not unique.
    2. the table is a multiset table.
    CREATE MULTISET  TABLE SAN_D_FAAPOC_01.EMPLOYEE
    (
    EMP_NUM INTEGER NOT NULL,
    EMP_NAME CHAR(30) NOT NULL,
    DEPT_NO INTEGER NOT NULL
    )
     PRIMARY  INDEX(EMP_NUM);
     
    INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123456,'VINAY',101);
    INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123457,'SACHIN',102);
    INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123457,'SACHIN',102);
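    -- the third insert is identical to the second; it is accepted because the table is a MULTISET table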
    
    select * from SAN_D_FAAPOC_01.EMPLOYEE;
    

    Two identical records can be inserted.
    The table is a so-called multiset table. This type of table allows duplicate records. Note that this is not the Teradata default. The default is a set table: “CREATE SET TABLE …”. Such a table doesn’t allow duplicate records.

    CREATE SET  TABLE SAN_D_FAAPOC_01.EMPLOYEE
    (
    EMP_NUM INTEGER NOT NULL,
    EMP_NAME CHAR(30) NOT NULL,
    DEPT_NO INTEGER NOT NULL
    )
     PRIMARY  INDEX(EMP_NUM);
     
    INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123456,'VINAY',101);
    INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123457,'SACHIN',102);
    INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123457,'SACHIN',102);
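    -- the third insert is identical to the second; for this SET table it is rejected with a duplicate row error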
    
    select * from SAN_D_FAAPOC_01.EMPLOYEE;
    

    This set of inserts generates an error as an attempt is made to insert two identical records. However, it takes time to perform such a check. Hence in a data warehouse environment, such a check, with its accompanying overhead costs, might be too expensive. This explains why a multiset table is often preferred over the Teradata default.
    Finally, in most cases the primary index is set to being unique. As the attribute that is used in the primary index is unique, the whole record is always unique as well. In that case an additional check on whether the record is unique is not performed. Hence the definition below will lead to unique records as the primary index is unique. Again, this uniqueness is enforced without an explicit test on the records.

    CREATE MULTISET  TABLE SAN_D_FAAPOC_01.EMPLOYEE
    (
    EMP_NUM INTEGER NOT NULL,
    EMP_NAME CHAR(30) NOT NULL,
    DEPT_NO INTEGER NOT NULL
    --,CONSTRAINT FOREIGN_EMP_DEPT FOREIGN KEY ( DEPT_NO)  REFERENCES  WITH NO CHECK OPTION DEPARTMENT(DEPT_NO)   
    )
    UNIQUE  PRIMARY  INDEX(EMP_NUM);
    

    Hence, we have two mechanisms that may enforce uniqueness of records:

    • using the default set option that prevents inserting duplicate rows;
    • using a unique primary index that prevents duplicate values for the attributes that form the primary index.

    It is easier to check the uniqueness of a primary index, as a primary index is only defined on a part of the record. Hence the combination multiset / non-unique primary index is the easiest: in that case, nothing needs to be checked for duplicates. On the other hand, set / non-unique primary index is the most expensive, as the whole record must be checked.

    The Teradata answer to materialised views

    Teradata has a feature that is designed to increase the performance of queries. This feature is called the “join index”. Such a join index is a structure that stores the outcomes from a query. These outcomes are stored permanently and they wait for the moment when they are called.
    The syntax of such a join index is straightforward:

    CREATE JOIN INDEX SAN_D_FAAPOC_01.EMP_DEPT
    AS
    SELECT
    DEPT.DEPT_NO,
    DEPT.DEPT_NAME,
    EMP.EMP_NUM,
    EMP.EMP_NAME
    FROM
    SAN_D_FAAPOC_01.EMPLOYEE EMP
    INNER JOIN   
    SAN_D_FAAPOC_01.DEPARTMENT DEPT
    ON
    DEPT.DEPT_NO = EMP.DEPT_NO
    PRIMARY INDEX(EMP_NUM); 
    

    The syntax contains the definition of a query. Its outcomes are permanently stored in the database as an object, here called “EMP_DEPT”.

    If a record gets added to one of the base tables that are used in the join index, the join index gets updated to reflect the new situation. This is the downside of the join index: each update of a base table is automatically followed by an update of the join index. This involves extra overhead processing time.
    Hence the benefit of a join index must be weighed against the costs of continuously updating the join index as a result of changes in the base tables.

    Alternatively, it can be decided to create a permanent table that contains the query outcomes. Such a table can be updated at fixed time intervals (instead of the automatic update as with the join index).

    Two final remarks.
    One. One cannot address a join index directly. A statement like “select * from join_index” will only return an error. One only uses such a join index indirectly: it is used if a query is written that looks similar to the join index. In that case, the optimiser decides whether the join index reduces the time to return the outcomes.
    Two. One can see whether a join index is used by reading the explain plan. This explain plan is shown if the query is preceded by the keyword “explain”. A well-written text then shows how the optimiser works. One may then see whether the join index is used or not.
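
    As an illustration with the EMP_DEPT join index defined above: preceding a similar query with the explain keyword shows the plan, and the name of the join index appears in that text if the optimiser decides to use it.

    EXPLAIN
    SELECT
    DEPT.DEPT_NAME,
    EMP.EMP_NAME
    FROM
    SAN_D_FAAPOC_01.EMPLOYEE EMP
    INNER JOIN
    SAN_D_FAAPOC_01.DEPARTMENT DEPT
    ON
    DEPT.DEPT_NO = EMP.DEPT_NO;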

    Soft RI in Teradata

    Teradata has the concept of “Soft RI”. In this concept, a foreign key is created but its restriction is not enforced.
    What happens in that situation?

    Let’s look at the normal situation of referential integrity. Suppose we have two tables: one table is referred to by a second table. If a foreign key is created, we have, under normal referential integrity, a limitation on which records we may insert into the referring table. We cannot insert records that have a foreign key value for which no related record exists in the referenced table. That is the normal referential integrity constraint.
    Example: we have an “employee” table that has a foreign key “dept_no” that refers to a record in a “department” table. Suppose the department table has records for departments 101, 102 and 103. If a foreign key is created, one cannot add records to the employee table that refer to department 104, as this department doesn’t exist in the department table.

    To enforce this referential integrity, each insert must be followed by a check as to whether the inserted record complies with the referenced table. This check costs time. Hence each insert incurs additional processing time.

    In a data warehouse environment, this overhead may be prohibitive. Moreover it might not be necessary as we retrieve the records from a source that has already enforced the referential integrity.

    In that situation, we may apply “Soft RI”: a foreign key relationship is created but its referential integrity is not enforced during loading.
    We then avoid the costs of the referential integrity check, which means more records can be loaded in a given time frame.

    DROP TABLE SAN_D_FAAPOC_01.EMPLOYEE;
    DROP TABLE SAN_D_FAAPOC_01.DEPARTMENT;
    
    
    CREATE SET TABLE SAN_D_FAAPOC_01.DEPARTMENT ,NO FALLBACK ,
         NO BEFORE JOURNAL,
         NO AFTER JOURNAL,
         CHECKSUM = DEFAULT,
         DEFAULT MERGEBLOCKRATIO
         (
          DEPT_NO INTEGER NOT NULL,
          DEPT_NAME VARCHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
          DEPT_LOC VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC)
    UNIQUE PRIMARY INDEX ( DEPT_NO );
    
    
    
    INSERT INTO SAN_D_FAAPOC_01.DEPARTMENT VALUES (101,'SALES','MUMBAI');
    INSERT INTO SAN_D_FAAPOC_01.DEPARTMENT VALUES (102,'ACCOUNTS','MUMBAI');
    INSERT INTO SAN_D_FAAPOC_01.DEPARTMENT VALUES (103,'HUMAN RESOURCES','MUMBAI');
    
    
    CREATE  TABLE SAN_D_FAAPOC_01.EMPLOYEE
    (
    EMP_NUM INTEGER NOT NULL,
    EMP_NAME CHAR(30) NOT NULL,
    DEPT_NO INTEGER NOT NULL
    ,CONSTRAINT FOREIGN_EMP_DEPT FOREIGN KEY ( DEPT_NO)  REFERENCES  WITH NO CHECK OPTION DEPARTMENT(DEPT_NO)   
    )
    UNIQUE PRIMARY INDEX(EMP_NUM);
    
    
    INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123456,'VINAY',101);
    INSERT INTO SAN_D_FAAPOC_01.EMPLOYEE VALUES (123457,'SACHIN',104);
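    -- department 104 does not exist in the DEPARTMENT table, yet the insert is accepted: soft RI does not enforce the foreign key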
    
    SEL
    DEPT.DEPT_NO,
    EMP.EMP_NUM,
    EMP.EMP_NAME
    FROM
    SAN_D_FAAPOC_01.EMPLOYEE EMP
    INNER JOIN   -- here the inner join does soft referential integrity and picks only the matching rows
    SAN_D_FAAPOC_01.DEPARTMENT DEPT
    ON
    DEPT.DEPT_NO = EMP.DEPT_NO;