The BitSwan product consists of so-called data pumps, which collect data from various sources and technologies, e.g. Kafka, RabbitMQ or data files in storage. The pumps then dispatch the data to given destinations such as ElasticSearch or another database, on top of which visualisations and dashboards can be created.
Figure: BitSwan product architecture
The solution architecture consists of a set of components that form a compact platform ecosystem and allow the system to evolve naturally and without disruption. Components originating in the open-source world are chosen with respect to their stability and to ensure the long-term sustainability of system development.
BSPump is a key component of the BitSwan product. It is a stream analyzer implemented in Python that provides a platform for creating data pumps.
Data pumps consist of so-called pipelines, which determine how data are acquired, processed and stored; connections, which provide integration with other systems and platforms; and lookup objects. Any number of pump and pipeline instances can run at the same time, which provides a convenient tool for scaling and redundancy.
The BitSwan product uses BSPump to create and operate data pumps designed for specific use cases, e.g. in communications, logistics, or data protection and anonymization.
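The pipeline structure described above can be illustrated with a minimal, library-free sketch. It does not use the actual BSPump API: the names `Pipeline`, `parse_line` and `drop_debug`, as well as the list-based source and sink, are hypothetical and only show how a source, a chain of processors and a sink might fit together.

```python
from typing import Any, Callable, Iterable, List, Optional

# A processor takes an event and returns a (possibly changed) event,
# or None to drop the event from the stream.
Processor = Callable[[Any], Optional[Any]]


class Pipeline:
    """Hypothetical pipeline: source -> processors -> sink."""

    def __init__(self, source: Iterable[Any], processors: List[Processor],
                 sink: Callable[[Any], None]) -> None:
        self.source = source
        self.processors = processors
        self.sink = sink

    def run(self) -> None:
        for event in self.source:
            for process in self.processors:
                event = process(event)
                if event is None:
                    break  # the event was filtered out
            else:
                # Reached only when no processor dropped the event.
                self.sink(event)


def parse_line(line: str) -> dict:
    """Turn 'level;message' into a structured event."""
    level, message = line.split(";", 1)
    return {"level": level, "message": message}


def drop_debug(event: dict) -> Optional[dict]:
    """Filter out debug-level events."""
    return None if event["level"] == "debug" else event


stored: List[dict] = []  # stands in for a database sink
pipeline = Pipeline(
    source=["info;pump started", "debug;tick", "error;disk full"],
    processors=[parse_line, drop_debug],
    sink=stored.append,
)
pipeline.run()
```

After `pipeline.run()`, `stored` holds the two parsed events that were not filtered out. Running several such pipelines side by side is what the text refers to as scaling and redundancy.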
ElasticSearch is a schema-less database and search (indexing) engine that stores individual events and allows them to be searched quickly.
ElasticSearch complements BitSwan by adding a so-called permanent data storage that is highly flexible with regard to the structure and volume of the stored data. ElasticSearch keeps data in the form of so-called JSON documents stored within so-called indexes, including their versioning. To reach maximum scalability, the size of indexes can be adjusted automatically depending on the size of the data or on a time window.
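As an illustration of JSON documents and time-windowed indexes, the following sketch serializes one event and derives a daily index name from its timestamp. The `bs-events` prefix, the field names and the one-index-per-day scheme are assumptions chosen for the example, not settings taken from BitSwan.

```python
import json
from datetime import datetime


def daily_index_name(prefix: str, timestamp: datetime) -> str:
    """Return an index name with a date suffix (hypothetical naming scheme)."""
    return "{}-{}".format(prefix, timestamp.strftime("%Y.%m.%d"))


# A hypothetical event; field names are illustrative only.
event = {
    "@timestamp": "2024-01-31T12:00:00+00:00",
    "source": "kafka",
    "message": "delivery confirmed",
}

ts = datetime.fromisoformat(event["@timestamp"])
index = daily_index_name("bs-events", ts)  # index for that day's time window
document = json.dumps(event)               # the JSON document to be stored
```

With one index per time window, old data can be dropped or archived index by index instead of document by document.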
ElasticSearch uses a REST API to store data and to run queries over them. In the BSPump environment it is recommended to always run ElasticSearch in a so-called cluster. The individual nodes elect the master node among themselves. ElasticSearch supports retrieving information in real time as well as in batches.
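To make the REST-based storage more concrete, here is a sketch that builds a request body in the newline-delimited JSON format accepted by the ElasticSearch `_bulk` endpoint. It only constructs the body and does not send it; the index name and the events are hypothetical, and this is not necessarily how the BSPump connector builds its requests.

```python
import json
from typing import Iterable


def build_bulk_body(index: str, events: Iterable[dict]) -> str:
    """Build an NDJSON body: one action line followed by one document line per event."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(event))                         # document line
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "bs-events-2024.01.31",
    [{"message": "pump started"}, {"message": "disk full"}],
)
```

Such a body would typically be sent with `POST /_bulk` and the `Content-Type: application/x-ndjson` header to a node of the cluster.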
BSPump connects to an ElasticSearch instance by means of a connector. The connector is primarily used to store individual events. Because of the schema-less principle, events are stored ad hoc, which means that no database schema is required, unlike in RDBMS systems. This leads to dramatically faster adoption and higher flexibility of the stored data. Queries are written in a simple query language that combines structured queries (known e.g. from SQL) with full-text search principles.
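One way to combine a structured condition with a full-text condition is the ElasticSearch Query DSL `bool` query, sketched below. The field names (`message`, `level`, `@timestamp`) and the helper `build_query` are assumptions made for the example; the text may equally refer to the query-string syntax used in the Kibana search field.

```python
import json


def build_query(text: str, field: str, value: str, since: str) -> dict:
    """Combine a full-text match with exact and range filters (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                # Full-text part: analysed and scored by relevance.
                "must": [{"match": {"message": text}}],
                # Structured part: exact conditions, similar to a SQL WHERE clause.
                "filter": [
                    {"term": {field: value}},
                    {"range": {"@timestamp": {"gte": since}}},
                ],
            }
        }
    }


query = build_query("disk full", "level", "error", "2024-01-01T00:00:00Z")
request_body = json.dumps(query)  # would be sent to the index's _search endpoint
```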
Kibana is a visualisation tool (GUI) for searching data in an ElasticSearch database interactively. It is a web application, so it can be accessed via web browsers such as Google Chrome, Mozilla Firefox, Microsoft Internet Explorer etc. Kibana also supports access from mobile phone browsers.
Kibana displays real-time data. Users can easily set time windows as well as other filters. Kibana can also search by keywords entered into search fields. It makes it easy to create visualisations, dashboards and time graphs that show how more complex metrics and aggregations develop over time. Visualisations include tables, graphs, diagrams and maps, and thus give fast insight into the nature of the information contained in the data.
Kibana can be used to create reports, change analyses and statistical reports over current as well as historical data.
Example of a query entered into the search field:
(Individual queries can be nested within one another.)
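Nesting can be illustrated by composing a query string from smaller parts. This is a hypothetical illustration, not the example from the original documentation: the field names and values are invented, and the helpers `term`, `any_of` and `all_of` are not part of Kibana. The sketch relies only on the `field:"value"` form, the `AND`/`OR` operators and parentheses, and it does not escape special characters in values.

```python
def term(field: str, value: str) -> str:
    """A single field condition, e.g. level:"error"."""
    return '{}:"{}"'.format(field, value)


def any_of(*queries: str) -> str:
    """Nest sub-queries so that at least one must match."""
    return "(" + " OR ".join(queries) + ")"


def all_of(*queries: str) -> str:
    """Nest sub-queries so that all must match."""
    return "(" + " AND ".join(queries) + ")"


# An OR group nested inside an AND group.
search = all_of(
    term("level", "error"),
    any_of(term("host", "web-01"), term("host", "web-02")),
)
print(search)  # (level:"error" AND (host:"web-01" OR host:"web-02"))
```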
Query results can be exported from Kibana or ElasticSearch as:
- raw data
- CSV
- JSON
- XML
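For results that have already been retrieved as a list of records, conversion to two of the formats above can be sketched with the Python standard library. This is not Kibana's own export mechanism; the `rows` data and the function names are hypothetical.

```python
import csv
import io
import json
from typing import List


def to_json(rows: List[dict]) -> str:
    """Serialize the records as a JSON array."""
    return json.dumps(rows, indent=2, sort_keys=True)


def to_csv(rows: List[dict]) -> str:
    """Serialize the records as CSV; missing fields become empty cells."""
    # Schema-less records may have different keys, so collect all of them.
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"level": "error", "message": "disk full"},
    {"level": "info", "message": "pump started", "host": "web-01"},
]
csv_text = to_csv(rows)
json_text = to_json(rows)
```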
To export saved objects, choose “Management” in the menu and then “Saved Objects” in the upper menu. Select the objects you wish to export (queries, visualisations or dashboards) and click the “Export” button.
BSPump Monitor is a GUI tool for monitoring the state of individual data pumps and for fast detection and repair of possible error states. It is delivered as part of the BitSwan product.
The BitSwan product uses Docker containers to deploy data pumps, Kibana and other components easily and quickly. The containers are administered in bulk with the Docker Compose tool. The container system is also used to distribute new versions of the system. Upgrading consists of a simple, automated download of new container images and their deployment by means of Docker.
The Docker containers can be deployed into various IT infrastructures (on-premise, physical hardware, virtual hardware, public cloud, private cloud, hyperconverged stacks in any combination).
Two types of Docker containers (i.e. their images) are used in the BitSwan product: (a) generic ones with a specific pump implementation, and (b) specific ones with a configuration designed for a particular environment and a particular server. The generic container is a subset of the specific one, which differs only in the added configuration file(s). The application source code is identical in both container types.
Note: LXC containers can be used as an alternative.
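The relationship between the two image types can be pictured as configuration layering: the same code reads generic defaults first and then an environment-specific file that overrides some of them. The sketch below uses Python's `configparser`; the section names, option names and URLs are hypothetical and are not taken from a real BitSwan configuration.

```python
import configparser

# Defaults as they might ship in the generic image (hypothetical values).
GENERIC_DEFAULTS = """
[pump]
batch_size = 100

[elasticsearch]
url = http://localhost:9200/
"""

# Extra file as it might be added in the specific image (hypothetical values).
SITE_CONFIG = """
[elasticsearch]
url = http://es.example.internal:9200/
"""


def load_config(*layers: str) -> configparser.ConfigParser:
    """Read configuration layers in order; later layers override earlier ones."""
    config = configparser.ConfigParser()
    for layer in layers:
        config.read_string(layer)
    return config


generic = load_config(GENERIC_DEFAULTS)
specific = load_config(GENERIC_DEFAULTS, SITE_CONFIG)
```

Here `specific` keeps `batch_size` from the generic layer and replaces only the ElasticSearch URL, which mirrors the idea that the application code stays identical while the configuration differs.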
As an up-to-date Python platform for real-time data processing, BitSwan is designed to support current best practices in IT:
- continuous deployment of new versions
- unlimited scalability
- fast and easy implementation of new function blocks (so-called microservices).
BitSwan uses a microservice architecture in which individual function blocks (microservices) are loosely coupled as far as data flow is concerned. Each microservice is designed to focus on one specific task.
Moreover, microservices are easy to understand and can therefore be easily extended, tested and optimized for real-time data processing and analysis. A system based on them is highly scalable, from small deployments to big clusters, and resistant to security failures. A further advantage is the speed of deployment and of implementing new functions.
Microservices are an up-to-date alternative to a monolithic system architecture.
BitSwan can be integrated with other systems within the IT infrastructure.
The integration with some of them is described below:
Apache Kafka is a distributed streaming platform that gathers data within so-called topics and lets connected consumer processes read them in publish-subscribe or producer-consumer modes.
The BSPump project contains a connector (Connection) that can be configured to connect to a Kafka instance running within the corporate infrastructure. By means of this connector, BitSwan can use Kafka as a source or a target of events.
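The difference between the two reading modes can be shown with a small in-memory model. It is not a Kafka client and does not reflect how Kafka assigns partitions to consumers; the `Topic` class and the group names are hypothetical. The model only captures the idea that every consumer group sees each message, while consumers within one group share the messages among themselves.

```python
from collections import defaultdict
from typing import Dict, List


class Topic:
    """In-memory stand-in for a topic; not a Kafka client."""

    def __init__(self) -> None:
        # group name -> list of consumer inboxes in that group
        self.groups: Dict[str, List[List[str]]] = defaultdict(list)
        self.published = 0

    def subscribe(self, group: str) -> List[str]:
        """Register a consumer in a group and return its inbox."""
        inbox: List[str] = []
        self.groups[group].append(inbox)
        return inbox

    def publish(self, message: str) -> None:
        for consumers in self.groups.values():
            # Each group receives the message (publish-subscribe);
            # within a group exactly one consumer gets it (work sharing).
            consumers[self.published % len(consumers)].append(message)
        self.published += 1


topic = Topic()
audit = topic.subscribe("audit")       # a group with a single consumer
worker_a = topic.subscribe("workers")  # two consumers sharing one group
worker_b = topic.subscribe("workers")

for n in range(4):
    topic.publish("event-{}".format(n))
```

After the loop, `audit` has received all four events, while `worker_a` and `worker_b` have received two each.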
Apache Hadoop is a framework for distributed processing of large datasets (so-called Big Data) across clusters within a corporate infrastructure.
BitSwan can connect to a Hadoop solution (e.g. by means of Parquet files) and store processed data there, even from longer time periods, for later analysis.
Apache Parquet is a column-based data storage format with optional data type definitions. It allows files of varying size to be stored, depending on the amount of data, and to be read quickly.
Parquet files can, for instance, serve as the link between BitSwan and Hadoop. BSPump contains connectors for storing processed data in the Parquet format.
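The idea of column-based storage with declared data types can be sketched in plain Python. This is only a conceptual model: the real Parquet format additionally organizes data into row groups and applies encoding and compression, and in practice a Parquet library would be used. The schema and the sample rows are hypothetical.

```python
from typing import Any, Dict, List

# Declared column types (hypothetical schema).
SCHEMA = {"timestamp": int, "level": str, "latency_ms": float}


def to_columns(rows: List[Dict[str, Any]], schema: Dict[str, type]) -> Dict[str, list]:
    """Pivot row-oriented events into one typed list per column."""
    columns: Dict[str, list] = {name: [] for name in schema}
    for row in rows:
        for name, expected_type in schema.items():
            value = row[name]
            if not isinstance(value, expected_type):
                raise TypeError("column {!r} expects {}".format(name, expected_type.__name__))
            columns[name].append(value)
    return columns


rows = [
    {"timestamp": 1706702400, "level": "info", "latency_ms": 12.5},
    {"timestamp": 1706702401, "level": "error", "latency_ms": 87.5},
]
columns = to_columns(rows, SCHEMA)

# Reading a single column touches only that column's values.
average_latency = sum(columns["latency_ms"]) / len(columns["latency_ms"])
```

Keeping the values of one column together is what makes reading a few columns out of many fast.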
Another use of Parquet files is storing input data in their full original form, without any filtering or normalization. This means that no input information is ever lost and that it can be traced back in its original form throughout the storage period. Input events can thus be stored for an arbitrarily long time, provided disk capacity is available. The archive can be kept on external disk storage while the data remain immediately accessible. The granularity of Parquet files also enables more precise management of disk capacity utilization.
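One possible reading of capacity management at file granularity is sketched below: given a list of archive files and a disk budget, it selects the oldest files to move or remove until the archive fits. The file names, sizes, the budget and the assumption that names sort chronologically are all hypothetical; the text does not prescribe any particular retention policy.

```python
from typing import List, Tuple


def files_to_expire(files: List[Tuple[str, int]], budget_bytes: int) -> List[str]:
    """Pick the oldest files until the remaining total fits the budget.

    `files` holds (name, size_in_bytes) pairs; names are assumed to sort
    chronologically, e.g. because they contain an ISO date.
    """
    total = sum(size for _, size in files)
    expired: List[str] = []
    for name, size in sorted(files):
        if total <= budget_bytes:
            break
        expired.append(name)
        total -= size
    return expired


archive = [
    ("events-2024-01-03.parquet", 400),
    ("events-2024-01-01.parquet", 500),
    ("events-2024-01-02.parquet", 300),
]
expired = files_to_expire(archive, budget_bytes=800)
```

With these sample values only the oldest file, `events-2024-01-01.parquet`, is selected, because removing it already brings the archive under the budget.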