Network reaction from Python

I have a PHP script that runs as CGI on a webserver. The programme is quite simple. First it asks for a userid and a password. The userid and password are sent as parameters. If these values coincide with the expected values, the system returns a page where the user may click on a hyperlink to… Read More »
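
As a minimal sketch of how such a login could be exercised from Python — the URL, the parameter names, and the shape of the response are all assumptions, not details from the post:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical CGI endpoint and parameter names
params = urlencode({"userid": "demo", "password": "secret"})
url = "http://example.com/cgi-bin/login.php?" + params

# Send the credentials; on a match, the returned page would
# contain the hyperlink mentioned above
with urlopen(url) as response:
    print(response.read().decode("utf-8"))
```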

Another Pyspark script

In this note, I show yet another Pyspark script, with slightly different methods to filter. The idea is that a file is read into an RDD. Subsequently, it is cleaned. That cleaning process involves the removal of lines that are too long. The lines are then split on the character that is at the twentieth position. Then the… Read More »
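
A minimal sketch of that pipeline might look as follows; the input path, the length threshold, and the exact position handling are assumptions:

```python
from pyspark import SparkContext

sc = SparkContext(appName="filterExample")

# Read the file into an RDD of lines (hypothetical path)
lines = sc.textFile("hdfs:///user/demo/input.txt")

# Cleaning: remove lines that are too long (threshold of 80 is an assumption)
cleaned = lines.filter(lambda line: len(line) <= 80)

# Split each line on the character found at the twentieth position (index 19)
split_lines = cleaned.map(lambda line: line.split(line[19]))

print(split_lines.take(5))
```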

A Python script with many steps

Pyspark is the Python language applied to Spark. It therefore allows a wonderful merge between Spark, with its possibilities to circumvent the limitations set by the MapReduce framework, and Python, which is relatively simple. In the scheme below, some steps are shown that might be used. sc.textFile allows us to read a… Read More »
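
To give an idea of such a chain of steps, here is a hedged sketch; the file name and the particular transformations are illustrative, not the scheme from the post:

```python
from pyspark import SparkContext

sc = SparkContext(appName="manySteps")

# sc.textFile reads a file into an RDD, one element per line
rdd = sc.textFile("hdfs:///user/demo/input.txt")   # hypothetical path

# A few typical follow-up steps: split each line, filter, and collect
result = (rdd.map(lambda line: line.split(","))
             .filter(lambda fields: len(fields) > 1)
             .collect())
```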

The 1000th wordcount example

I just discovered the 1000th wordcount example. It is based on Pyspark. The idea is actually quite simple. One creates a script. This script can be written in any editor. The programme can then be run from the terminal with spark-submit [programme]. As an example, one may start the programme below with: spark-submit --master yarn-cluster… Read More »
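
The classic wordcount roughly looks like this in Pyspark; the paths are placeholders, as the post's own programme is truncated above:

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

counts = (sc.textFile("hdfs:///user/demo/input.txt")       # hypothetical input
            .flatMap(lambda line: line.split())            # line -> words
            .map(lambda word: (word, 1))                   # word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))              # sum per word

counts.saveAsTextFile("hdfs:///user/demo/wordcount_out")
```

Saved as, say, wordcount.py, it would then be started from the terminal with spark-submit --master yarn-cluster wordcount.py.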

Joining files with Pyspark

Pyspark allows us to process files in a big data / Hadoop environment. I showed in another post how Pyspark can be started and how it can be used. The concept of Pyspark is very interesting. It allows us to circumvent the limitations of the MapReduce framework. MapReduce is somewhat limiting as we have two steps:… Read More »
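
A small sketch of joining two files on a shared key; the file names and the comma-separated layout are assumptions:

```python
from pyspark import SparkContext

sc = SparkContext(appName="joinExample")

# Turn each file into a pair RDD keyed on the first field
left  = sc.textFile("file1.txt").map(lambda line: (line.split(",")[0], line))
right = sc.textFile("file2.txt").map(lambda line: (line.split(",")[0], line))

# join yields (key, (line_from_file1, line_from_file2)) for matching keys
joined = left.join(right)
print(joined.take(5))
```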

Flume: sending data via stream

It is possible to capture streaming data in HDFS files. A tool to do this is Flume. The idea is that we have three elements: a source that provides a stream, a channel that transports the stream, and a sink where the stream ends in a file. This can already be seen if we look at… Read More »
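
As an illustration of the source side only, a few lines can be pushed into a Flume netcat source from Python; the host and port are assumptions and would have to match the Flume configuration:

```python
import socket

# Hypothetical Flume netcat source listening on localhost:44444;
# every newline-terminated line becomes one Flume event on the channel
with socket.create_connection(("localhost", 44444)) as sock:
    for i in range(3):
        sock.sendall(f"event number {i}\n".encode("utf-8"))
```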

Partitioned Table in Hive

It is possible to partition tables in Hive. Remember that the data are stored in files. So we expect the files to be partitioned. This is accomplished by splitting the files over different directories. One directory serves one partition, a second serves another partition, etc. Let us take the example of 7 records that… Read More »
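
In HiveQL — issued here through Pyspark to stay in Python; the table and column names are made up — a partitioned table could be set up like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Each value of `country` gets its own directory under the table location,
# e.g. .../demo_sales/country=NL/
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_sales (id INT, amount DOUBLE)
    PARTITIONED BY (country STRING)
""")
spark.sql("INSERT INTO demo_sales PARTITION (country='NL') VALUES (1, 9.95)")
```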

Manipulating Avro

Avro files are binary files that contain both the data and a description of those data. That makes it a very interesting file format. One may send such a file to any application that is able to read Avro files. Just as an example: one may write the file in (say) PHP and send it to (say) Java.… Read More »
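
A small sketch of writing and reading back an Avro file in Python, assuming the third-party fastavro package; the schema and data are made up:

```python
from fastavro import writer, reader, parse_schema   # pip install fastavro

# The schema travels inside the file, next to the data
schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [{"name": "name", "type": "string"},
               {"name": "age",  "type": "int"}],
})

with open("people.avro", "wb") as out:
    writer(out, schema, [{"name": "Tom", "age": 40}])

# Any Avro-capable application (PHP, Java, ...) could read this file back
with open("people.avro", "rb") as inp:
    for record in reader(inp):
        print(record)
```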

Parquet format

As we know, we may store table definitions in the metastore. These table definitions then refer to a location where the data are stored. The format of the data might be an ordinary text file, or it might be an Avro file. Another possibility is a Parquet file. This Parquet format is an example of… Read More »
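
Writing and reading Parquet from Pyspark takes only a few lines; the data frame and path below are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Parquet stores the data column by column in a binary format
df.write.mode("overwrite").parquet("demo.parquet")
spark.read.parquet("demo.parquet").show()
```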

Avro format

In Hive, we see a situation where a table definition is stored in a metastore. This table definition is linked to a directory where the data are stored. It is possible to use different formats here. One may think of a text format. But other formats are possible too. One example is the Avro format.… Read More »
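
A Hive table backed by Avro files could be declared as follows — again through Pyspark, and the table is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The table definition lives in the metastore; the directory behind it
# will hold Avro files because of STORED AS AVRO
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_avro (name STRING, age INT)
    STORED AS AVRO
""")
```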