A nice utility to investigate files on Unix

Today, I worked with the Unix awk utility. This is an extremely powerful utility to investigate text files on a Unix platform. It can be invoked from the terminal command line. The command starts with awk.

The keyword awk is followed by a script that is placed between quotes. After the quotes, the text file is mentioned (say ww-ii-data.txt).

When some items need to be initialised, we have the BEGIN clause. The BEGIN clause is followed by actions between braces {}.
After that, a selection can be made on lines with a pattern between slashes. The actions on the selected lines are then also placed between braces. Finally, after the END keyword, an end clause may be included. We then have:

awk 'BEGIN {} /selection/ {} END {}' file

As an example:

awk '
# initialise the counters
BEGIN {count=0; max=0}
# only process lines that contain 33206
/33206/ {
   temp = substr($0,37,3) + 0;   # take the temperature field (characters 37-39) and force it to a number
   count++;
   if (max < temp)
      max = temp
}
# print the number of matching lines and the maximum, converted from Fahrenheit to Celsius
END {print "lines:  ", count, " max in Celsius", (5/9)*(max-32);}
'   ww-ii-data.txt

I noticed that variables can be used. No declaration is needed. Nice.

Hadoop

Everyone talks about big data and Hadoop. Someone even compared it to teenage sex: everyone talks about it, everyone knows someone who does it, but no one actually does it yet. I just tried Hadoop to see what it is all about.
I made two attempts to install Hadoop.
The first attempt was installing Hadoop 1.0.3. I relied on a tutorial by Michael Noll ( http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ ). I noticed that it is really important to use the correct version of the JDK; I first tried to install Hadoop 1.0.3 with a recent version of the JDK, but that failed. A subsequent attempt with the correct version was successful. This could be verified with the help of the web interface ( localhost:50070 ).
The second attempt was installing Hadoop 2.4. I used a blog post by Matthew Sharpe ( http://dogdogfish.com/2014/04/26/installing-hadoop-2-4-on-ubuntu-14-04/ ) to get the necessary information. After the installation, I created a small example to check whether everything worked. It did.

Hence, I ended up with two working examples of Hadoop.

(Image: Yahoo Hadoop cluster, OSCON 2007)

So, Hadoop is now alive and kicking. I verified this by executing a small example, which worked out OK. But how to continue? Working with Hadoop isn’t trivial. My next step will be finding a convenient means to work with Hadoop.
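
To give an idea of what such a small example can look like, below is a minimal word count in the classic MapReduce style, written as a Python mapper and reducer for Hadoop Streaming. This is a sketch of my own, not the exact example I ran; the script names are arbitrary.

# mapper.py: read lines from stdin and emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py: the input arrives sorted by key, so equal words are adjacent;
# sum the counts per word and print the totals
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The two scripts are then submitted with the hadoop-streaming jar that ships with the installation; the exact jar location and command line options differ per Hadoop version.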

Slowly Changing Dimensions Type 2

Just to get myself acquainted with the new Informatica version, I created a mapping in which SCD Type 2 was implemented. The mapping is shown below.
(Image: Informatica mapping)
In the first step, the input data are read. Let us assume that these records are read; each record contains a number and a name:

number Name
1      Tom
2      ine
3      paula
4      stella
5      Bart

The records may differ in two respects from the records that are stored in the database: a record may be completely new, or it might be an update to an existing record. Let us assume the database contains the following records:

PK number name     DATE                DATE_TO
1  2      'ine'    18-11-2014 16:23:10 NULL
2  3      'paula'  18-11-2014 16:23:10 NULL
3  4      'stella' 18-11-2014 16:23:10 NULL
5  5      'Bartje' 18-11-2014 16:23:10 NULL

The records contain a unique key (PK), the number, a name, a date on which the records were read (DATE) and a date until which the data are valid (DATE_TO).

A comparison between the records that are added and the records that are already stored shows that we have one record that is new (1, Tom) and one record that must be updated (5, Bart). Such a comparison is done in the Informatica flow by a lookup, followed by a step in which it is established whether a record is unchanged (ine, paula, stella), new (Tom) or changed (Bart).

PK    NEW_NAME   OLD_NAME   CHANGED   NEW   DATE
1.0   ine        ine        0         0     2014-11-21 15:11:09.0
2.0   paula      paula      0         0     2014-11-21 15:11:09.0
3.0   stella     stella     0         0     2014-11-21 15:11:09.0
      Tom                   0         1     2014-11-21 15:11:09.0
5.0   Bart       Bartje     1         1     2014-11-21 15:11:09.0

A record that is new must be inserted into the database with a new key. It is established that this key must be higher than 5, the current maximum. Hence “Tom” will be inserted into the database.

A record that is updated leads to two actions: the existing record will be updated to indicate that it is no longer valid, and a new record will be inserted to store “Bart”.

In total,

  • we have one update action: the old “Bartje” record is updated to indicate that it is no longer valid;
  • we have two insert actions, which come from:
    • the update on Bart, which generates a new record;
    • the insert of Tom.

 
We then have this result:

PK   number   name       DATE                     DATE_TO
1    2        'ine'      '18-11-2014 16:23:10.'   NULL
2    3        'paula'    '18-11-2014 16:23:10.'   NULL
3    4        'stella'   '18-11-2014 16:23:10.'   NULL
5    5        'Bartje'   '18-11-2014 16:23:10.'   '21-11-2014 15:52:49.'
9    1        'Tom'      '21-11-2014 15:52:49.'   NULL
10   5        'Bart'     '21-11-2014 15:52:49.'   NULL
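
Outside Informatica, the same SCD Type 2 logic can be sketched in a few lines of code. The sketch below is only my own illustration of the lookup, compare, expire and insert steps applied to the records above; it is not the Informatica mapping itself, and the surrogate keys simply continue from the current maximum, whereas the real run used a sequence (hence the jump to 9 and 10 in the table).

from datetime import datetime

# Existing dimension rows: (PK, number, name, DATE, DATE_TO); DATE_TO = None means "still valid".
dimension = [
    (1, 2, 'ine',    '18-11-2014 16:23:10', None),
    (2, 3, 'paula',  '18-11-2014 16:23:10', None),
    (3, 4, 'stella', '18-11-2014 16:23:10', None),
    (5, 5, 'Bartje', '18-11-2014 16:23:10', None),
]
incoming = [(1, 'Tom'), (2, 'ine'), (3, 'paula'), (4, 'stella'), (5, 'Bart')]

now = datetime.now().strftime('%d-%m-%Y %H:%M:%S')
next_pk = max(row[0] for row in dimension) + 1

# Lookup: the current (not yet closed) version per business key (number).
current = {row[1]: row for row in dimension if row[4] is None}

for number, name in incoming:
    old = current.get(number)
    if old is None:
        # New business key: insert a row with a fresh surrogate key.
        dimension.append((next_pk, number, name, now, None))
        next_pk += 1
    elif old[2] != name:
        # Changed record: close the old version (fill DATE_TO) and insert a new version.
        dimension[dimension.index(old)] = (old[0], old[1], old[2], old[3], now)
        dimension.append((next_pk, number, name, now, None))
        next_pk += 1
    # Unchanged records are left untouched.

for row in dimension:
    print(row)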

Use Case, Business Events and Time Events

In a previous post, I showed the context diagramme. I then continued by saying that each of the arrows that flow to and from the bubble in the middle can be translated into use cases.

But one may take a slightly different view: each of these arrows represents either a business event or a time event. A business event is then understood as an action from outside the system that invokes a flow of information (= an arrow to/from the bubble in the context diagramme). Likewise, a time event is understood as a moment in time that invokes an information stream to/from the bubble.

Hence, in a business event or a time event, one explicitly mentions the reason why the event starts.

Suppose the time event is a monthly moment at which data are sent to the reporting environment.

One has:

  1. Story, begin condition: the moment in time indicates that data must be sent to the reporting environment.
    1. Step 1: collect data in the data warehouse.
    2. Step 2: the system sends data to the Reporting Environment.
    3. Step 3: the Reporting System stores the data in its own environment.
    4. End condition: the Reporting System contains the data for the reporting environment.

One sees: it is a slightly different approach from a use case. The reason why the event starts is indicated more clearly.

Why is it important to stress the reason for starting the business event or time event? Because it shows the boundaries of the system. One is forced to think about what starts a certain flow of information. Is it time that starts a flow of information from the system to the reporting environment? Or is it the availability of data that can be sent to the reporting environment? Or is it a signal from the end user who wants the data? Etc.

When this is done, one gets:

  • a clear list of the actors (a user who requires a report, a system that delivers data, etc.);
  • a clear list of actions that must be covered by the work process (store data, provide reports, etc.).

 

Access odbc

Just a little note.
I discovered that on a 64-bit platform one may access the 32-bit ODBC drivers via C:\Windows\SysWOW64\odbcad32.exe and the 64-bit ODBC drivers via C:\Windows\System32\odbcad32.exe.

This may be handy if one must access a 32-bit application (like Access 2007) via ODBC in a 64-bit environment. The standard ODBC Administrator doesn’t allow you to directly access the 32-bit ODBC connections on a 64-bit platform. It is a bit awkward, but one can circumvent the issue with a direct call to the 32-bit ODBC driver environment.

One should always use the appropriate version of the ODBC Administrator tool. If you build and then run an application as a 32-bit application on a 64-bit operating system, you must create the ODBC data source with the ODBC Administrator tool in %windir%\SysWOW64\odbcad32.exe. The 64-bit ODBC drivers can be accessed directly using C:\Windows\System32\odbcad32.exe. To indicate the type of DSN, you can add "_32" to the 32-bit user DSNs and "_64" to the 64-bit user DSNs.

The 64-bit ODBC Administrator tool can also be invoked from Control Panel to manage user DSNs and system DSNs that are used by 64-bit processes. On a 64-bit operating system, the 32-bit ODBC Administrator tool is used for Windows on Windows 64 (WOW64) processes. You must directly invoke the 32-bit ODBC Administrator tool from the SysWoW64 folder. You can use the 32-bit ODBC Administrator tool to manage user DSNs and system DSNs that are used by WOW64 processes.

On a 32-bit system, the 64-bit drivers cannot be installed (of course). The 32-bit ODBC drivers are the only drivers that can be found; they can be accessed via the Control Panel.

So, what do we have?
In a scheme:

  • 64-bit Administrator: on 64-bit Windows via C:\Windows\System32\odbcad32.exe or Start -> Control Panel -> Data Sources (ODBC); not available on 32-bit Windows.
  • 32-bit Administrator: on 64-bit Windows via C:\Windows\SysWOW64\odbcad32.exe; on 32-bit Windows via C:\Windows\System32\odbcad32.exe or Start -> Control Panel -> Data Sources (ODBC).
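
To make the bitness point concrete, here is a small sketch of my own (not part of the original note) that connects to an Access 2007 file via ODBC from Python with pyodbc. The driver name and the file path are assumptions that depend on what is installed; a 32-bit Access driver requires a 32-bit Python interpreter, which is exactly the 32/64-bit issue described above.

import platform
import pyodbc  # third-party ODBC module; its bitness must match the driver

print("Python architecture:", platform.architecture()[0])  # '32bit' or '64bit'

# Hypothetical driver name and file path; adjust them to the installed Access driver.
conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\example.accdb"
)
cursor = conn.cursor()
for table in cursor.tables(tableType="TABLE"):
    print(table.table_name)  # list the user tables as a quick sanity check
conn.close()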

Use Case

A use case is something that can be written after a context diagramme has been created. A context diagramme provides an overview of the whole system with its interfaces to the outside world. The use cases are written on the basis of the individual interactions between the system and each of its external systems.
As an example, one may take the interaction between the system and the reporting environment, for which an interface exists. Such an interaction is developed into a use case. A use case is either a diagramme or a written overview of how the interaction works.
The use case has the following elements:

  • It has a title that summarises the purpose of the use case.
  • It indicates the main actor that is connected to the interface.
  • It has a story that contains a set of testable statements on how the interaction works. Such a story always starts with a begin condition (what needs to be ready in order to allow the interaction to start), followed by one or several steps and finally an end condition that shows how the system looks after the interaction.

The main story is a success story, sometimes called the sunny day story. It assumes that the interaction reaches its goal. One may also feel the need to describe a situation where the action is cancelled. Such an additional situation may be described in a so-called extension.

Let me try to provide an example.

  • Title: provide data for reporting purposes.
  • Main actor: Reporting System.
  • Story, begin condition: the data must be loaded in the data warehouse.
  • Story, step 1: collect data in the data warehouse.
  • Story, step 2: the system sends data to the Reporting Environment.
  • Story, step 3: the Reporting System stores the data in its own environment.
  • Story, end condition: the Reporting System contains the data for the reporting environment.

 

The most important limitation of the use case is that it does not yet contain a product description. It is focused on understanding the user's view of the interaction between the future system and its external environment.
The product description is a next step. The product can be developed once the use cases are known and understood. At that point, an optimal solution can be designed that answers the demands that are raised by the use cases.

Requirements analysis

Recently I read a very interesting book on requirements analysis, “Mastering the Requirements Process: Getting Requirements Right” by Suzanne Robertson and James Robertson. They wrote a practical book that leads to a template for a requirements document.
They advocate spending enough time on requirements analysis, which can be described briefly as “getting to know what the user wants”. The idea is that a project should be based upon the users’ requirements. If one understands these requirements, one knows the scope of the project. Moreover, the priorities within the project can be set better, as they are based on the users’ sensitivities.
Formulating the requirements is fundamentally different from writing a design. A design provides a solution and is formulated in technical terms. Requirements provide a list of user wants and are formulated in business terms. Hence a discussion on databases doesn’t belong in a requirements document, as it is part of a solution. The same holds for a user interface: it is part of a solution, not a requirement.

The requirements fall into three groups:

  • Functional requirements that indicate which functionalities (data, business logic) are delivered by the system. They indicate what the system must do.
  • Non-functional requirements that indicate how the data and business logic are delivered. One may include security requirements and performance requirements here. These indicate properties of the system.
  • Constraints that indicate with which rules a project must comply. An example is the toolset that may be used to implement the solution.

The requirements must comply with a certain format:

  • They must be numbered so as to make them traceable.
  • They must be testable. Hence they must be formulated in a quantified way. An example of a requirement that is formulated wrongly is: “The application must generate an exciting user experience.” Any experience may be labelled as exciting; hence it is not testable. Better is: the application can be understood by a user within 30 seconds, or the report must be generated within 30 seconds.

The requirements should fall within the scope of the project. This scope is identified with a context diagramme (see my other post). A requirement falls within the circle that is shown in the context diagramme and which describes the project. This context diagramme also provides a nice starting point for further requirements analysis. The context diagramme displays data flows to and from the system. These data flows are indicated by arrows. For each of these arrows, logic must be created in the system that handles these input and output flows. This logic is derived from business requirements.

Context diagrammes – understanding a scope

I came across a very nice technique that can be used to show the scope of a project.
It is called the Context Diagramme.
The idea is that the system is shown as one item. The internal structure of the system is not displayed. Instead, the diagramme shows the interaction of the system with its external interfaces. The end result is an overview in which the external interfaces are displayed clearly, together with the type of interaction that they have with the system. The nice thing is that a graphical overview of the project is given that can be used for discussion with the end-user. In that discussion, the boundaries of the system will become clearer to all. Moreover, the external interfaces will be made explicit. As the end-user will be involved in discussions on these external interfaces, it is good to be able to inform him upfront about these interfaces and why they are needed.

If the external interfaces are listed explicitly with their interaction with the system, the scope of the system can be made clear. The system should be able to deal with all stimuli and data that stem from the external interfaces. In other words: the system should be able to consume the data streams, or generate the data flows, that are indicated in the diagramme. Such streams are shown explicitly in the diagramme. The scope should then include all handling of such streams. In principle, the scope is indicated by the circle in the middle of the diagramme. The external interfaces are outside the scope. Whether the arrows between the external interfaces and the system are within scope is a subject of negotiation between the project and the others who are responsible for the interfaces.


Such a context diagramme uses three types of symbols:

  • A circle that contains the system. The system is understood as one entity where the internal working is not displayed.
  • A block that displays the external party. Think of customers as an external party.
  • An arrow that shows the interaction between an external interface and the system.

 

The external interfaces can be manifold. Let me give a few examples:

  • Users that actively use the system. An example is a user who issues an order to the system. Likewise, a user who requires a certain report is an example of an external party.
  • Users who depend on the system. An example is an accountant who receives a certain plan that is created by the system. The user doesn’t explicitly require such a plan; he only receives such a plan.
  • An independent body who influences the system. One may think of a body that issues standards that need to be followed. Such standards may influence the system that is described in the diagramme.
  • A system that sends data to the system. The external system may have a role as data provider. In that case, the external system doesn’t benefit from the system that is described; it only provides data.
  • A system that cooperates with the system in question. An example might be an internet provider who provides the infrastructure that is used by the system.

But one may also think of abstract types of interfaces, such as:

  • A time dependent event, such as the end of a month when sales overviews must be provided.
  • An event, such as the introduction of a new product.

The relations are mostly expressed as data streams. As an example, one may think of a request for a report. The data that are conveyed to the system are: a report indicator, a period for which the report is generated, a possible filter condition, etc. The idea is that these diagrammes are mostly used for software engineering, and communication between actors can always be described with data.
This diagramme will be used in the discussion with the end-user. Hence the language that is used must be business terminology. System names should be avoided, as such usage may lead to unnecessary discussions. It is better to use business names for the interfaces involved, so the picture can be understood more easily.

(Image: context diagramme of the data warehouse project)

I took the example of a data warehouse project. This project uses Balance Sheet data, Profit & Loss data, Inventory data and Cost data that are uploaded on a monthly basis to a data warehouse. From that data warehouse, data are sent to external systems and used in a reporting environment.

The work package in the above diagramme consists of collecting and storing data. The reporting environment is outside the work package. Hence the reporting interface (O2) is displayed outside the circle W1.

If reporting is part of the work package, the diagramme must be changed into:
(Image: adjusted context diagramme, with reporting inside the work package)

In this diagramme, the work package W1 includes reporting as well. The external interface is then directly with the end users, who send parameters and receive reports.

This context diagramme is often seen as part of a hierarchy. In a related diagramme, the circle that indicates the system is detailed. In that related diagramme, the internal working of the system is shown. When such a hierarchy is used, the context diagramme is labelled as a level-0 Data Flow Diagramme. The diagrammes that show the internal workings of the system are labelled as level-1 or level-2 Data Flow Diagrammes. In the example above, I labelled the items to allow later decomposition. As an example, in a level-1 Data Flow Diagramme, one might have P1.1, P1.2 and P1.3 as parts of the P1 process.
More or less the same holds for time dependencies. The context diagramme cannot be used to indicate time dependencies. For that purpose, other techniques can be used, such as sequence diagrammes. If one would like to show the time dependencies, the context diagramme could be expanded into sequence diagrammes.

Yet another checklist

I often come across checklists. Such checklists are used to assess the quality of, say, a database. See here for an example of such a checklist. In this case, we have a checklist that is created to assess the quality of a Teradata database.
Such checklists are very interesting. They are generally based on a very good understanding of the subject area. Moreover, that knowledge is written down in a very concise form, which gives the reader a quick insight into whether the list might benefit him.

Teradata: what columns to choose as Primary Index?

A major question with Teradata tables is what columns to choose when a primary index must be created.
I understand that 4 different arguments might be used:
1: Is the column often accessed? If it is often accessed, best usage is made of the distribution of records over the different AMPs.
2: Is the column used in joins? In that case a similar argument holds: if the column is used in a join condition, usage is made of the distribution of records over the different AMPs.

 
3: Is the column often updated? If a column is updated, the record might be re-allocated to another AMP. Hence, the fewer the updates, the better the column fits in a primary index.
4: Is the cardinality high? In that case, it is possible to distribute the records over many AMPs, enabling Teradata to make optimal use of the primary index capability.
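
To illustrate the cardinality argument with a toy sketch of my own (this is not Teradata's actual row-hash algorithm): rows are distributed over the AMPs by hashing the primary index value, so a low-cardinality column concentrates the rows on a few AMPs, while a high-cardinality column spreads them evenly.

from collections import Counter
import hashlib

def amp_for(value, n_amps=10):
    # Map a primary index value to one of n_amps AMPs via a stable hash.
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % n_amps

rows = range(10000)

# Low cardinality: a column with only two distinct values.
low = Counter(amp_for(i % 2) for i in rows)
# High cardinality: a unique number per row.
high = Counter(amp_for(i) for i in rows)

print("low cardinality :", sorted(low.items()))    # only two AMPs receive rows
print("high cardinality:", sorted(high.items()))   # every AMP receives roughly 1000 rows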