Security certificates

I got interested in security certificates. What are they? How do they work? Is it possible to use them on your own website? Those were my questions.
The principle is somewhat confusing, but it works. The idea is that a security certificate consists of two elements. One element is a so-called private key. This private key rests with the webserver and is used to decrypt a message that is sent to the server. Likewise, it encrypts the message that is sent by the server. The private key is also used to generate a public key. This public key is sent to any browser. The browser is then able to do the opposite: it can encrypt a message sent to the server and decrypt the message received. The result is that encrypted traffic can be exchanged between browser and webserver. (In practice the keys are used to agree on a session key that does the bulk of the encryption, but the simplified picture is enough to follow the steps below.)
In a sequence, one gets:
[1]: the browser requests a secure page via https://…
[2]: the webserver sends its public key to the browser along with the signed certificate
[3]: the browser checks the certificate. If the certificate cannot be correctly identified, the browser issues a warning. If you create the certificates yourself, you get such a warning.
[4]: the browser uses the public key to encrypt the URL that is requested
[5]: the webserver uses the private key to decrypt the URL
[6]: the webserver encrypts the page that is sent to the browser
[7]: the browser decrypts the page that is received.

One may buy a set of certificates, but it is more fun to create a set of certificates oneself.
To do so, one needs the open-source software programme “openssl”. It is available for free. But I must admit, I had to search for it. It was not really easy to find. Maybe it helps if you know the exact name; in my case it was openssl-0.9.8h-1-setup.exe.
The programme is meant to be used on Linux. There are Windows versions, but then one should set an environment variable; in my case: set OPENSSL_CONF=C:\Program Files\GnuWin32\share\openssl.cnf. It depends of course on where the config file resides.
The creation of the private key is then done with a series of commands:

openssl genrsa -des3 -out tom.key 2048 #generate a passphrase-protected private key
openssl req -new -key tom.key -out tom.csr #generate a certificate signing request
copy tom.key tom.key.orig #keep a copy of the protected key
openssl rsa -in tom.key.orig -out tom.key #strip the passphrase so the webserver can read the key unattended

Subsequently, the self-signed certificate (which contains the public key) is generated with:

openssl x509 -req -days 365 -in tom.csr -signkey tom.key -out tom.crt #creates the self-signed certificate, valid for 365 days

But how do we apply these certificates to the webserver? Somehow, we must tell the webserver that such certificates exist. I now use nginx as webserver and this is what my config file looks like:

worker_processes  1;

events {
    worker_connections  1024;
}

http {
    include       mime.types;

    # HTTP server: redirect every request to HTTPS
    server {
        listen       80;
        server_name  192.168.5.14;
        return       https://$server_name$request_uri;
    }

    # HTTPS server
    server {
        listen       443 ssl;
        server_name  192.168.5.14;

        ssl_certificate      tom.crt;
        ssl_certificate_key  tom.key;

        location / {
            root   html;
            index  index.html index.htm;
            error_page 404 /404.html;
        }

        location ~ \.php$ {
            root           html;
            fastcgi_pass   127.0.0.1:9000;
            fastcgi_index  index.php;
            fastcgi_param  SCRIPT_FILENAME  c:/nginx/html/$fastcgi_script_name;
            include        fastcgi_params;
        }
    }
}

I would say that this config file looks rather straightforward. In the first part, any request is translated into an https request. In the second part, you see how the certificates are made known to the server.
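Once nginx runs with this configuration, one can check from another machine that the server really presents the certificate. Below a minimal Python sketch; it assumes the server from the config above is reachable at 192.168.5.14 and it switches verification off, precisely because the certificate is self-signed and would otherwise be rejected.

import socket
import ssl

HOST, PORT = "192.168.5.14", 443          # the nginx server from the config above

# The certificate is self-signed, so skip CA and hostname verification for this test.
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

with socket.create_connection((HOST, PORT)) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        print("protocol:", tls.version())               # e.g. TLSv1.2
        print("cipher  :", tls.cipher())                 # the negotiated cipher suite
        der_cert = tls.getpeercert(binary_form=True)     # the certificate (tom.crt) in DER form
        print("certificate size:", len(der_cert), "bytes")

If the browser warning mentioned in step [3] appears, this little check still succeeds, because it deliberately does not validate the certificate chain.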

Security and compliance

Security and compliance have many different aspects. In the list below, I try to provide an overview of these aspects. We have seen:

  • Authentication. Is the person really the person he says he is?
  • Principal Propagation. If someone in system A issues a command that system B will execute, system B needs to know whether that someone has the privilege to do so.
  • Encryption. Is the file or the table encrypted? This creates another layer of security. If someone gets access to the file or table without proper privileges, he is still not able to read the data without the encryption key.
  • Compliance. Are legal requirements followed?
  • Authorization. This indicates the privileges someone has.
  • Single Sign On. Some sets of applications need only one log-on procedure. Once logged on, the other applications rely on that first log-on.
  • User management. The maintenance of the list of persons and system accounts and their privileges.
  • Logging. A document or file that contains a list of the events that have taken place.
  • Archiving. Storage of obsolete files.
  • Auditing. The process of verifying that the business process works properly.

A check list for the assessment of a data warehouse

 

Yet another checklist

Regularly (at least in my type of work), one comes across an assignment in which one has to assess a data warehouse. I found a nice checklist that allows a quick overview of a data warehouse. It is composed of several sections, each covering a certain area. Going through this list provides the reader with a blueprint of how the data warehouse is doing.

 

Requirements

The first area is “requirements”. It is composed of 4 items:

  1. Do we have a concise architecture that shows the main components of the data warehouse? It should show what the major data blocks are and how the data blocks are related to the solution that is required from the business. This architecture can be created on one A4 that may be used in discussions on the data warehouse.
  2. What are the major security aspects? Who is given access?
  3. What functional requirements do we have? This could show a list of reports that are required.
  4. What non-functional requirements do we have? This covers timing issues, latency, etc. Questions are raised on the performance of loading, reporting, extracting, etc. Here it may be stated that the data should cover a ten-year period, with the most recent data being that of last month.

These items must be agreed upon by the business. They need to recognise the requirements, and they should have the idea that these lists offer a comprehensive overview.

 

The sources

Another area is a list of items on the usage of the data warehouse as a source. These items centre on the question whether the data warehouse is fit for purpose to supply the necessary data.

  1. (If the data warehouse is used as a source), do we have approval from the owner of the data warehouse? Data in the data warehouse can be used for a variety of purposes. One should have a formal approval for this usage: it should not interfere with other processes in the data warehouse.
  2. Is the purpose of the application that uses data from the data warehouse in the realm of reporting and/ or planning? A data warehouse is created for reporting purposes. The architecture reflects this. If applications use data from the data warehouse they should appreciate such architecture. The purpose of data usage should be in accordance with such architecture. As an example: a real time transaction system can not use the data from a data warehouse.
  3. Is the latency in the data warehouse compliant with the reporting requirement? It takes some time before data from a source system are reflected in the reports. The business requirements for the reports' timeliness must be in accordance with these technical constraints.
  4. Does the reporting system need conformed data? A data warehouse is set up to provide a single version of the facts. As an example, one may think of a customer that is represented in a single version in the data warehouse, despite the fact that multiple versions may be stored in the source systems. The translation from multiple versions in the source systems to a single version is done in the data warehouse. This is an important purpose of the data warehouse. The reporting requirements should take this mechanism into account. If the requirements do not need such a single version, then one may question whether the reports actually need to use the data warehouse as a source. A different source might be more applicable. As an example: if the business requests an overview of purchases from one business process, one may question whether a data warehouse is the most appropriate source.
  5. The service levels of the reporting systems should be lower than the service levels of the data warehouse. The data within the data warehouse should only be used for purposes that are aligned with the stated service levels of the data warehouse. An example may clarify this. Suppose the data warehouse has a service level agreement that only allows for maintenance during office hours. One may not use the data warehouse as a source for an application that requires 7*24 support (such as an ATM application). Intuitively, it makes sense not to allow a data warehouse to act as a source for on-line banking. This item formally deals with this issue.
  6. Do we have an integrated usage of fact and dimension data from the data warehouse? Most likely, so-called fact data are stored with foreign keys that point to dimensions within the data warehouse. One needs to use both the facts and the dimensions from the data warehouse. It is almost impossible to use facts from the data warehouse in combination with dimensions from other sources.
  7. Does the reporting application require only dimensions that are generally used? In some cases, a reporting application requires dimensions that are only maintained in individual source systems. Most likely such local dimensions are not propagated into the data warehouse. It might even be that storage of such localised dimensions is not compliant with the data warehouse architecture. Hence when these requirements exist, one may question whether the data warehouse is the most appropriate source.
  8. Do the technical means whereby data from the data warehouse are retrieved, comply with the general mechanisms employed? One would like to have a single mechanism whereby data are retrieved from the data warehouse. As an example: a general mechanism might be that data are pushed out as flat CSV files. If that general mechanism is employed, an individual reporting project should not deviate from it.
  9. Is the role of the data warehouse that of “System Of Record”? As an example, one may think of a data warehouse that has acquired the role of an overall general ledger because it consolidates several accounting systems and is used for financial reporting. In that case the data warehouse is used as a “System Of Record”. To elaborate this item, the definition of “System of Record” is repeated here: “A system of record (SOR) or Source System of Record (SSoR) is .. the authoritative data source for a given data element or piece of information.” (Definition from Wikipedia.) The most important word in this definition is authoritative. If the data get that role, we may have legal requirements that are related to this role.

 

Allocations

  1. An approval for allocation must exist. Such approval is necessary to avoid performance issues that may result from allocation processes. The idea is that allocation spreads records over lower levels of detail, using other tables in the data warehouse. This may have effects on performance, capacity, etc. An approval from the owner of the data warehouse is therefore necessary.
  2. The allocation process cannot be overly complicated. This too is directed at avoiding performance issues. As the data warehouse is a shared resource, each process must be limited to avoid performance and capacity bottlenecks. This also holds for allocation processes.
  3. The allocation process must be decoupled from other processes. The idea is that the processes in the data warehouses should not depend on other processes in order to be able to take responsibility for the outcomes of such processes. Hence the allocation process should not rely on tables that are outside the data warehouse.
  4. Is the allocation process a scheduled / repeatable process? The idea is that all processes in the data warehouse should be scheduled and repeatable. No room exists for regular manual interventions as such processes may lead to non-traceable data; this then also holds for the allocations within the data warehouse.
  5. The data warehouse is created to support reporting processes. Hence, the allocated data can only be used for reporting purposes.
  6. It is possible to retrieve allocation factors from an external source. In that case the allocation factors are a normal source that should comply with the normal source requirements.
  7. It is possible to use allocated data from the data warehouse as a source for, say, reporting purposes. In that case the same rules apply as for using the data warehouse as a source.
  8. It is possible that the allocated data get the role of “System Of Record”. In that case the legal requirements that are related to a “System Of Record” pertain.

 

Data Acquisition

  1. Is the source identified? This refers to the situation in which the new data that are necessary for the data warehouse are known and identified, and a source for them is known. The data source must also be investigated: a so-called profile must be created. This profile shows which values were encountered.
  2. What is the impact of data capture on the source system? Is such impact in accordance with other service level agreements on this system?
  3. The decision to load new data must be weighed against the possibility of re-using existing data in the data warehouse. It might well be that the data that are required are either already in the data warehouse or already loaded in the staging area. Only if the data are not in either of these environments is a new acquisition necessary.
  4. Do we recognise the special techniques that are needed to load dimension data? Dimension data are special, as most dimension data exist both in the data warehouse and in the source system. The techniques that must be employed are directed at finding the differences between the source and the existing data in the data warehouse, followed by an update of such data in the data warehouse. A minimal sketch of this comparison is given after this list.
  5. Do we recognise special techniques that are needed to load fact data? Fact data stem from new events in the source system. Such new events need to be adequately captured and propagated into the data warehouse. It might be that “Change Data Capture” is employed here.
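To illustrate item 4, a minimal sketch of such a comparison is given below. The dimension content and the key values are invented for the example; in practice this logic usually lives in the ETL tool, but the principle is the same: find new keys and changed rows, and apply them to the data warehouse.

# Hypothetical customer dimension: compare the fresh source extract with the
# version that is currently stored in the data warehouse.
source = {                                   # key -> attributes, as extracted today
    "C001": {"name": "Jansen", "city": "Utrecht"},
    "C002": {"name": "Smit",   "city": "Leiden"},
    "C003": {"name": "de Boer", "city": "Zwolle"},
}
warehouse = {                                # key -> attributes, as currently stored
    "C001": {"name": "Jansen", "city": "Utrecht"},
    "C002": {"name": "Smit",   "city": "Delft"},
}

new_rows = {key: row for key, row in source.items() if key not in warehouse}
changed_rows = {key: row for key, row in source.items()
                if key in warehouse and warehouse[key] != row}

print("to insert:", new_rows)                # C003 is new
print("to update:", changed_rows)            # C002 moved from Delft to Leiden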

 

ETL

  1. Do we have a dedicated staging area that allows us to store the files that come from the sources? The staging area is an exact copy of the files / data structures that are sent by the sources. It is subsequently employed to update the data warehouse.
  2. Is the “System Of Record” used to retrieve the data? The idea is that data must be retrieved from an authoritative source in order to be used in a data warehouse. As an alternative, data from an exact copy can be used, as long as the data do not get changed between the authoritative source and the copy environment. Doing so, we are sure that we only get reliable data in the data warehouse.
  3. Are the data retrieved via the fastest / best known loaders? Also here, an example may clarify it. Oracle tables may be loaded via SQL insert statements or by a direct loader like SQL*Loader. The decision on whether to use one loader or the other should be thought over; this item requests such a decision process: the designer should know the different options to load a table and should make a well-informed decision.
  4. Is a standard naming convention being used in the ETL process? As the mappings, jobs, workflows etc all become part of one comprehensive ETL structure, one recognisable naming convention must be adhered to.
  5. Are standard blocks used for standard operations? As an example: Informatica has so-called “mapplets” that can be re-used for standard operations.
  6. The specific mappings are project specific. As each project has its own mappings, these must be captured explicitly per project.
  7. Are general standards adhered to? As most organisations have a set of standards on ETL, such standards must be applied in the ETL procedures.
  8. Is error handling properly covered?
  9. Is process control in place?
  10. As far as possible, a single tool needs to be used for ETL.
  11. Whenever a choice must be made between processing in the ETL tool or processing in the DBMS, such choice must be carefully made.

 

Data Modelling

  1. Do we have a full mapping on every attribute from the data warehouse model back to the source? In most organisations, we have an Excel document where each row indicates how an attribute from the source is used to populate a data warehouse attribute.
  2. Do we have a document that explains the data model for the business? As the real data model might be too complicated, we should create a separate data model that leaves out unnecessary details. One might think of one A4 that provides a high level overview of the data model.
  3. If revisions have occurred on the data model for the data warehouse, are these revisions implemented in the high level overview as well? One may think of the situation that at some point in time, the high level overview and the data model are perfectly aligned. Each change in either of the two documents must later be aligned to the other document.
  4. If the organisation uses a standard data modeling tool, it must be used to capture the data model of the data warehouse. The usage of this tool must be in conformity with best practices.
  5. When the organisation uses other standard documents, such documents must have been used. This allows documentation that is recognisable for future maintenance.
  6. If the organisation has a central modeling agency, such an organisation must have been informed and engaged.
  7. If codes are used in the model, such usage must have been aligned with other code usage.
  8. If a logical data model is created alongside a physical data model, such data models must be aligned. In some tools, this alignment is supported. Both Sybase Powerdesigner and CA Erwin have a capability where changes in the logical data model are automatically propagated into the physical data model.
  9. If a project only implements part of the total model, such an update must be aligned with the comprehensive data model. I could imagine that a project starts with an up-to-date copy of the data model. The project then continues by adding new entities/tables with their attributes/fields. When the project is finished, the new elements can be merged into the comprehensive model.
  10. The data definition language (DDL) that is used to implement the addition of new tables/fields is generated according to the standards that are in use in the organisation. It might be that audit tables must be updated upon the introduction of new tables; the generated DDL then needs to reflect this.
  11. A new table needs to be performant; some general rules may enhance performance. An example might be a check on having an index on primary and foreign keys in large tables. However other checks may be applicable such as having a primary index in Teradata tables.
  12. Is the standard security on new tables being implemented? I once worked in an organisation where each table had a view. Only the view was accessible to the end user. This was done for security reasons. Hence in that organisation, a check was done on whether such a view was implemented for any new table.
  13. When a datamart is created, the security that is mandatory must be adhered to. An example may clarify this. Assume that an organisation requires that tables in the data warehouse can only be accessed via views. In that case, the tables in a datamart can only be loaded via a view. Such security needs to be adhered to.
  14. The access privileges need to be explicitly mentioned and these privileges need to be compliant with the general security policies in the organisation.
  15. Other structures such as triggers, stored procedures, database links etc. must be explicitly mentioned and they must be compliant with the organisation's policy.
  16. Naming standards for new tables, attributes, views in the data warehouse are adhered to.
  17. Naming standards for the data mart are adhered to.
  18. If groupings like schemas are used, the grouping standards must be complied with. It might be that dimension tables are positioned in dedicated schemas. The new tables must adhere to these standards.
  19. Derived data can only be used if there is a good reason to have them included. This rule prevents addition of intermediate values that have no value for future usage. This rule is also directed at keeping the data warehouse clean and neat. Also, the derived data must be stored in appropriate locations.
  20. If tables are split, we should properly document such a split. As an example: if a table is split according to regions, such regions should be reflected in a table in the data warehouse.
  21. If Control Tables are used, such Control Tables must be created according to general architecture.
  22. All service accounts that are used to start the ETL flows are identified. No personal account may be used to kick off the ETL flows.

 

Data Mart

The set-up of the Data Mart is quite tool-dependent. The tool determines to a large extent the data model, as quite a few data models are not supported.

Generally speaking, most tools require a star-like model: one fact table with some dimension tables. Other data structures are not supported: a recursive relation is not supported; some tools do not support non-hierarchical tables; some tools do not support snowflakes; some tools are not able to deal with large dimension tables; other tools are not able to handle many dimension tables, whereas yet other tools cannot handle dimension tables with too many fields.

Hence, many items in any check-list are related to limitations that are put on the data model in the Data Mart.

  1. Are recursive relations avoided?
  2. Are snowflakes avoided?
  3. Are unbalanced hierarchies avoided?
  4. Is the report related to one fact table only?
  5. Is the report related to dimension tables that do not exceed a certain size threshold?
  6. Are the names of related attributes the same (as to avoid query cross-joins)?
  7. The number of dimensions should not be too large (recommendations mention 7 as an ideal number).

Other checklist items are related to the choice of where logic is implemented. We might have three separate locations in which to put logic:

  • when data are stored in the data mart
  • when data are defined in a cube
  • when the final report is created

The idea is that each area has its advantages and we might use these as follows:

  • when data are stored in the data mart, we might use the ETL tool, which allows us to exploit the power of the ETL tool or of the DBMS,
  • when data are defined in a cube, we might use the summary possibilities of the cube definition,
  • when the final report is created, we might use the formatting possibilities of the reporting tool.

This implies that a well-thought-out choice must be made.

 

Tooling and infrastructure

  1. Is the set of tooling compliant with the general architecture? In many organisations, certain tools are designated to be used for certain purposes: Informatica is meant for ETL; MicroStrategy for reporting, etc. In general, an individual project may not deviate from such a general architecture.
  2. Is the hardware compliant to the general architecture? In many organisations, a certain server is dedicated for web reporting; another is designated for file storage. In general, a project should use such designated components.
  3. In many organisations, a reference architecture is created. This reference architecture indicates which logical components exist and where such components are located. This is quite understandable, as a data warehouse is a shared resource that must be used by different projects. A new project then needs to know where its items must be stored and how such items must be integrated into the existing structure.
  4. If components are decommissioned, such components need to be enumerated. The decommissioning needs to be described and the alternatives need to be indicated.
  5. The design needs to contain a list of external interfaces. All relevant sources need to be indicated on an inventory.
  6. An assessment on future support needs to be given. The main question as to whether the solution can be maintained is addressed here.
  7. Likewise, the costs of maintenance and monitoring need to be assessed.
  8. License costs need to be clear.

 

Security

  1. Do we have a concise document that shows whether the data are sensitive? This requirement shows in one overview what types of data we have and which data should be accessible to whom.
  2. Do we have a person who has responsibility for security within the project?
  3. When data are retrieved from a 3rd party, is that information stream compliant with the security demands as stated by the organisation? As an example, one might think of FTP usage; if FTP usage is prohibited by an organisation, an individual project should refrain from retrieving data via FTP.
  4. The design document should explicitly refer to the organisation's guidelines on security. This reference should be as explicit as possible. Such a reference demonstrates that an individual project is aware of such guidelines and complies with them.
  5. If sensitive data are used, a design should indicate how the general rules from the organisation on security are implemented in a data warehouse project.
  6. If the general rules on security do not cover all security arrangements within a project, the extension should be described explicitly.
  7. Do we have a mapping document that describes who may access which data?
  8. Which non-user system accounts do we have? Such accounts are needed for ETL flows and should be described along with their privileges.
  9. Does the organisation check the implementation of security designs?
  10. If exceptions to policy are necessary, are these agreed upon by a central security department within the organisation?
  11. Are all exceptions to the policy explicitly shown in the design?
  12. Batch applications only use system accounts; no user accounts are used for this purpose.

 

Usage of dimensions

  1. Are standard dimensions used whenever possible?
  2. If additional dimensions are needed that are not yet available as standard dimensions, will such additional dimensions be made available for shared usage?
  3. Is a path foreseen whereby dimensions created within a project, can be promoted to standard dimensions?

 

Data quality

  1. Do we have a data quality environment (tooling/ hard disk space etc.)?
  2. Does the data quality environment have access to the full data volume?
  3. Do the users have access to the data quality tools?
  4. Is it possible to create the data quality reports?
  5. Do we have a set of data quality business rules that must be checked? An example might be that a certain event must always refer to a certain dimension. This can be translated as: a foreign key cannot be null. A rule can be created saying that no records may exist that have a NULL value in the foreign key. Finally a report can be created that shows the records that transgress such a rule. This item requires that such a list of business rules exists. (A minimal sketch of such a check is given after this list.)
  6. How often are the data quality rules checked?
  7. Has the format of reporting on data quality rules been agreed upon?
  8. Who will be contacted when data quality issues are discovered?
  9. Is the implementation of data quality rules part of a release (or is it part of an on-going data quality programme)?
  10. Are the data quality reports scheduled?
  11. Is every data quality rule related to a source system? After all, an error in the data stems from a certain source. Hence, upon discovery, a transgression must lead to a call to that source in order to improve data quality.
  12. Do we have KPIs that enable us to monitor data quality? As an example, one might think of a KPI that calculates the average number of wrong records. Such a KPI allows us to monitor the development of data quality over time.
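As an illustration of item 5, a minimal sketch with an in-memory SQLite database is given below. The table and column names (sales, customer_id) are invented for the example; the rule is the one mentioned above: a sale must always refer to a customer, so the foreign key may not be NULL. The last line also shows a simple KPI as mentioned in item 12.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (sale_id     INTEGER PRIMARY KEY,
                        customer_id INTEGER,              -- foreign key to the customer dimension
                        amount      REAL);
    INSERT INTO sales VALUES (1, 10, 99.95), (2, NULL, 12.50), (3, 11, 7.25);
""")

# Business rule: every sale must refer to a customer, i.e. customer_id may not be NULL.
violations = con.execute(
    "SELECT sale_id, amount FROM sales WHERE customer_id IS NULL").fetchall()
total = con.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

print("records that transgress the rule:", violations)
print("KPI, share of wrong records     :", len(violations) / total)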

 

Unattended loading

  1. We need an automatic purge mechanism that moves or removes data once such data pass the retention period. This mechanism needs to be automated.
  2. We need an automated process that copies data from the production environment to the development environment. It might well be that only a subset is copied. This can be done to save space.
  3. We need quality tests to see whether the loading process works correctly or not. To this end, we might need to check that we do not have duplicates in the target (due to double loading) or missing records (due to records not picked up from the source). Such quality checks must be automated.
  4. We need quality checks to see whether totals from sources can be reconciled with totals from targets. Such tests are closely related to the previous item. The previous item counts numbers of records; this item addresses the content of the records. This content needs to be reconciled between the different areas. (A small sketch of both checks is given after this list.)
  5. Files that are sent to the Staging Area and which are loaded into the data warehouse are subsequently sent to the archive. This should also be an automated process.
  6. A purging mechanism needs to exist for the files that are captured in the archive. After a predefined period, such files can be purged from the archive. This retention period needs to be parameterised to allow for changes in this policy.
  7. Credentials should not be hard coded. Ideally such credentials need to be stored in centrally stored parameters. This allows an easy update of such credentials.
  8. If audit logs and/ or error logs are created, one needs to have a policy on the retention of such files. It might be that such policies already exist on programmatic level; if such policy already exists, each project needs to adhere to it.
  9. All jobs need to be scheduled; the loading process should be an automatic process that only needs manual intervention in case of errors.
  10. The jobs need to be created at a low level of granularity. Each job should concentrate on modifying one table only. Separate jobs can be combined into a batch when actions involve updates on different tables.
  11. The tables that are loaded need to be checked as to whether their keys still contribute to good performance.
  12. The schedule needs to be well documented to allow proper maintenance.
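A small sketch of items 3 and 4 is given below: a count and duplicate check, followed by a reconciliation of the totals of one column. The record layout is invented for the example; in a real data warehouse these checks would run as scheduled jobs against the staging area and the target tables.

# Hypothetical source extract and loaded target, each a list of (key, amount) rows.
source = [(1, 100.0), (2, 250.0), (3, 75.5)]
target = [(1, 100.0), (2, 250.0), (2, 250.0), (3, 75.5)]   # row 2 was loaded twice

# Item 3: count check and duplicate check on the target.
duplicate_keys = len(target) - len(set(key for key, _ in target))
print("row counts (source vs target):", len(source), "vs", len(target))
print("duplicate keys in target     :", duplicate_keys)

# Item 4: reconcile the totals of the amount column between source and target.
source_total = sum(amount for _, amount in source)
target_total = sum(amount for _, amount in target)
print("totals reconcile             :", abs(source_total - target_total) < 0.01)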

 

Standards

The final set of items refers to the standard architectures in an organisation to which a project must adhere. In most organisations, such architecture is translated into a set of documents / checklists. We should have an explicit list in which such documents are mentioned, along with a check as to whether the designs comply with these documents and an accompanying list of deviations from them.

In this list, we should also have a proof that an authoritative person has signed off the requirement of compliance.

 

Skeleton Functional Design

Quite often, I have to write a functional design. Writing such a design can be a tedious job. Part of the time is spent on how to represent the findings. One would like to present a nice / concise report. The writing process itself is made easier if a template is available. To this end, I use this skeleton, which gives me a head start in the writing process.

Data Warehouse and allocated cost data

In a previous post, I showed that costs data may be allocated to other departments or levels of granularity to enable cost analysis.

The allocated cost data are artificial data: the data are only a portion of costs that were collected at other levels of the organisation. These data are intermediate data that sit halfway between the original data that are collected in the business process and the final output data that are used in the final profit and loss analysis.

In my opinion, such intermediate data should not be included in the data warehouse as they are artificially created data. If we were to store such data in the data warehouse, they could be misinterpreted as real input data. Someone from another department may use these data as being equivalent to real input data, thus leading to unnecessary confusion. Anyone who does so cannot be blamed. The data warehouse is a central data store that may be used by everyone for reporting purposes. Once data are included in the data warehouse, one may assume that they represent a well analysed view of real business events.

Therefore, to maintain a good data quality, the data warehouse is not a good environment to store such intermediate allocated data.

It is also true that the business rules to create the allocated costs can be quite complex. The owner of these business rules is the accounting department (and not the IT department). If the IT department is lured into the trap of calculating such intermediate data, it is exposed to an immensely complicated situation, as it must rely on specifications from the accounting department while at the same time it may be held responsible for the quality of such data. We then have a reason not to agree to calculate such allocations within the data warehouse: it is simply too complicated and the IT department does not have the knowledge to do so.

An IT department is good at three things: retrieving data, storing data and propagating data. It is not good at programming complicated business rules that may be not fully known by the owners. Hence, the IT department should not attempt to do so.

 

Cost Allocation

In most organisations, cost data are collected at a different level of granularity, or attached to a different dimension, than the one at which we would like to see the cost data.

Two examples.

Cost data, such as advertising or publicity costs, may be collected at product group level, whereas we would like to present the cost data for individual products. This can be understood as follows: advertisements are made for, say, chocolate, whereas we would like to see the cost data attached to an individual bar. This would allow us to compare the sales price to the total cost price of a chocolate bar, to see whether it is sold at a profitable price.

Cost data may also be collected at cost centres (such as the HR department), whereas we would like to know how much cost is associated with the sales that are collected at the profit centres. In that case we would like to infer whether the profit centres sell their products at a price that covers all direct and indirect costs.

We are confronted with the question of how to spread costs over individual products. Or likewise: how to divide the costs of, say, HR over the sales departments.

I have read about three different methods to do this. The first method is almost trivial: do not do it. Do not spread the costs, do not allocate the costs. Just look at total turnover, direct costs and total indirect costs and see if a profit is made.

The second method is more adequate: a factor is taken that is used to divide the costs. One may think of sales turnover. For each product and every profit department, one may calculate its percentage of total turnover. The resulting set of factors can then be used in the division process. If a chocolate bar takes 2% of total chocolate sales, we may account 2% of the advertisement costs for chocolate as indirect advertisement costs attributed to chocolate bars. Likewise: if department A sells 10% of total sales, we take 10% of HR costs as being attributable to department A.

The third method assumes that we have different allocation factors for different cost groups. Hence advertisements may be allocated on the basis of turnover, HR on the basis of the number of employees, maintenance on capital outlays, etc. The idea is that such an allocation is closely related to the actual driver for which the costs are incurred. This is more acceptable for the manager of a profit centre, who would like to see a close relation between his activities and the internal costs for which he is held accountable. It is not a pleasant idea to receive a large internal cost invoice for warehouse activities if the department does not have activities for which a warehouse is necessary. A small sketch of the second and third methods is given below.
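The arithmetic of the second and third methods is simple enough to show in a few lines. The figures below are invented for the illustration: the second method spreads all indirect costs with one driver (turnover share), the third method uses a driver per cost group.

# Invented figures: turnover and head count per department, and two groups of indirect costs.
turnover  = {"dept_A": 500_000, "dept_B": 300_000, "dept_C": 200_000}
employees = {"dept_A": 10,      "dept_B": 25,      "dept_C": 15}
indirect_costs = {"advertising": 60_000, "HR": 90_000}

def allocate(total_cost, driver):
    """Spread total_cost over the keys of driver, proportional to the driver values."""
    grand_total = sum(driver.values())
    return {dept: total_cost * value / grand_total for dept, value in driver.items()}

# Second method: one factor (turnover share) for all indirect costs.
method2 = allocate(sum(indirect_costs.values()), turnover)

# Third method: a driver per cost group (advertising by turnover, HR by head count).
method3 = {dept: 0.0 for dept in turnover}
for cost_group, driver in (("advertising", turnover), ("HR", employees)):
    for dept, share in allocate(indirect_costs[cost_group], driver).items():
        method3[dept] += share

print("method 2:", method2)   # dept_A carries 50% of all indirect costs
print("method 3:", method3)   # dept_A carries 50% of advertising but only 20% of HR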

A good method for allocation can be immensely important for an organisation, as allocation methods have behavioural effects within the organisation. A department may try to avoid attributed indirect costs to improve its departmental profit figures. I once worked in an organisation where IT costs were allocated according to the number of working network sockets. The idea is good, as it may be assumed that IT costs are related to the number of network connections. However, the negative consequence was that network access was denied to people who needed it for their work, simply to avoid IT costs.

Hence the point is that cost allocation is important, but one should create the method such that it avoids unwanted behaviour. This is certainly not a trivial task, as the allocation may affect how activities are assessed. The consequence is that the allocation mechanism might become quite complex, as multiple goals are aimed for: a fair representation of the cost structure without unwanted behavioural consequences.

 

 

XML records in Oracle tables

It is possible to store XML records in an Oracle table. This is one (of the many) ways to store and retrieve XML information in an Oracle environment. It is actually quite simple. An Oracle table may contain columns that allow us to store XML records. An example is:

  CREATE TABLE "OE"."TPERSON_DATA" 
   (	"PERSON_ID" NUMBER(3,0), 
	"PERSON_DATA" "XMLTYPE"
   )

We see that the attribute “Person_Data” in table tPERSON_DATA may contain XML records.
This table can be loaded with standard insert statements like:

[screenshot: insertXML – an INSERT statement that stores an XML document with “book” as its root]

We have an XML record with “book” as its root. One branch is “title”, another is “price”. Likewise, we also see an author branch and a year branch. This can be queried with SQL statements that have an inline XMLQUERY expression.

The full statement is:

SELECT person_id, XMLQuery ('for $i in /book
                             where $i/price > 30
                             return $i/title'
                  passing by value person_data
                  RETURNING CONTENT) XMLData
FROM oe.tperson_data;

The SQL statement is Select person_id etc, where the XMLQUERY expression is:

for $i in /book
where $i/price > 30
return $i/title
passing by value person_data
RETURNING CONTENT

This XMLQuery is composed of three sections. One section is the quoted part (between apostrophes) that retrieves a part of the XML record. This is an XQuery expression. The second part reads “passing by value person_data”, where person_data is the name of the attribute in which the XML record is stored. The third and final part is RETURNING CONTENT, which are keywords.
The quoted XQuery expression can be understood as follows: “for $i in /book” binds $i to the content of the XML record, with /book as the root of the XML record. The second line, “where $i/price > 30”, refers to one branch that acts as a selection. The third line, “return $i/title”, refers to another branch that is returned to the query.

This explains the outcome where we will see:

[screenshot: insertXML2 – the query result]

The outcome is “Learning XML”, which is retrieved from the XML record with “book” as root and which complies with the restriction that the price is larger than 30.
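Outside the database, the same selection can be expressed with Python's standard XML library. The sketch below uses an inline document that mimics the stored record (the price of 39.95 is an assumption; only the title and the fact that the price exceeds 30 follow from the example above).

import xml.etree.ElementTree as ET

# An inline document that mimics the stored record: "book" as root with a
# "title" and a "price" branch (the real record also has an author and a year branch).
doc = ET.fromstring(
    "<book><title>Learning XML</title><price>39.95</price></book>")

# Equivalent of: for $i in /book where $i/price > 30 return $i/title
if float(doc.findtext("price")) > 30:
    print(doc.findtext("title"))          # prints: Learning XML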

Another expression is:

 SELECT EXTRACT(person_data, '/book/title') as "Titel" 
 FROM oe.tperson_data w;

that also provides us with the title:

[screenshot: insertXML3 – the result of the EXTRACT query]

Product Types

I recently encountered a nice model on product types that I would like to share.
A product type can be understood as a concept of a product, much like the descriptions you might see in a catalogue. An example is the Peugeot 206 Gentry, which is a concept of a car. It is not equivalent to actual cars – it only describes a type of car of which many actual cars may exist.

Then the model:

[diagram: productType – the product type data model]

 

In the upper left corner, we see the entity “Product Type”. This entity may contain a description such as the Peugeot 206 Gentry. Another example may be the hard disk WD3000EXDB, which stands for a type of hard disk.

As we may have hundreds, maybe thousands, of product types, we may classify these product types into several classes. We may have a class “cars” and a class “hard disks”.

Each of these classes may have (overlapping) sets of attributes that describe the product types. The product type class “hard disk” may have attributes like price, storage capacity, delivery term, etc. Likewise, the product type class “cars” may have attributes like price and number of seats. The attributes are listed in an entity “Product Type Attributes”. As we may have overlapping sets, an associative entity is added.

The combination of “Product Type Attributes” and “Product Type” identifies the values. We then know that the Peugeot 206 Gentry costs 20000 euros, whereas the WD3000EXDB costs 200 euros.
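A small data-structure sketch of this model is given below. The class names follow the entities in the diagram and the example values come from the text; the Python rendering itself is of course my own interpretation of the model.

from dataclasses import dataclass

@dataclass(frozen=True)
class ProductTypeClass:
    name: str                        # e.g. "cars" or "hard disks"

@dataclass(frozen=True)
class ProductType:
    description: str                 # e.g. "Peugeot 206 Gentry"
    type_class: ProductTypeClass

@dataclass(frozen=True)
class ProductTypeAttribute:
    name: str                        # e.g. "price", "number of seats"

# The associative part: a value is identified by (product type, attribute).
values = {}

cars       = ProductTypeClass("cars")
hard_disks = ProductTypeClass("hard disks")
price      = ProductTypeAttribute("price")

gentry = ProductType("Peugeot 206 Gentry", cars)
wd3000 = ProductType("WD3000EXDB", hard_disks)

values[(gentry, price)] = 20000      # euros
values[(wd3000, price)] = 200        # euros

print(values[(gentry, price)])       # 20000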

 

 

XML and XSLT

Several possibilities exist to read XML files. One such possibility is to have it read via XSLT. This procedure is composed of three elements:

  • the XML file that is to be read;
  • an XSLT file that provides us with instructions on how to handle the XML file;
  • an engine that translates the XML file, via the instructions as given in the XSLT file, into something that is readable.

[diagram: XMLXSLT – the XML file, the XSLT file and the transformation engine]

 

An example may clarify this. Suppose we have an XML file that starts with these lines:
[screenshot: XML – the first lines of the XML file]

 

For a full XML file see here. Within this XML file, it is indicated that an XSLT file “cdcatalog.xsl” is to be used as the file that contains the instructions on how to read this XML file. In this case, the set of instructions looks like this: XSLT.
If the XML file is then accessed with a programme that contains an engine to apply the instructions contained in the XSLT, we get the desired result. Every modern web browser contains such an engine. Hence if the web browser is used to read the XML file, we already have the desired result. This is how I saw the XML file when I read it with my web browser:

 

[screenshot: HTML – the XML file as rendered by the browser]
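The same transformation can also be applied outside the browser. Below a minimal sketch using the third-party lxml library; the file name cdcatalog.xml is an assumed name for the XML file (only cdcatalog.xsl is named in the example above), and the choice of library is my own.

from lxml import etree                        # third-party library: pip install lxml

# Load the XML document and the XSLT stylesheet it refers to.
xml_doc  = etree.parse("cdcatalog.xml")       # assumed file name
xslt_doc = etree.parse("cdcatalog.xsl")       # the stylesheet from the example

# Build the transformation engine and apply it, just as the browser does.
transform = etree.XSLT(xslt_doc)
html_result = transform(xml_doc)

print(str(html_result))                       # the generated HTML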

Some Javascript to read XML

[screenshot: javascript1]
I know that lots and lots of possibilities exist whereby an XML file can be read. One set of possibilities is to use JavaScript. One then stores a script on the webserver that is processed by the web browser. Upon processing, the XML file is read and can be presented in another format. A quick search provided me with three such scripts in which an XML file is read and subsequently presented as an HTML file.
Below, I have included a link where three scripts are executed.

click here

If the reader would like to see the scripts, he could look at the source code, which can be read from the web browser itself. After all, the script must be read by the web browser in order to execute it in the browser.
It is also nice to see that the scripts are interpreted differently by different web browsers. Internet Explorer will give a different result from, say, the Chrome browser. This shows that the scripts are interpreted by the web browser, and different web browsers give different results.