Big Data

Data science is an intricate craft, and a great deal of routine work happens before the real analysis begins. In the following articles, we focus on how to manage your data before the analysis itself, so that you get the maximum value from it.

Data collection and the need to transform data "on the fly"

At the beginning of our cooperation, we worked with data that the operator collects daily from automated tests of routers.

First, the operator tried to display the data as it came in. The data was generated by more than 20,000 modems and routers, and it was impossible to make sense of such a visualization.

Figure 1. Data displayed as it comes in

Moreover, this data source was not the only one. Data from many platforms and technologies had to be linked. And this is where our BitSwan product came to play a major role.

Consolidation of different data sources

BitSwan was connected to many data sources, including data from transmitter stations, data from the end devices (such as routers and modems), and other additional data from various products that are used to monitor the cell site.

Many data sources went hand in hand with many data formats. Some data was sent in real time in JSON format, while additional data was stored on file servers in CSV or TXT format. Other data was extracted from databases in AVRO format, and so on.
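In practice, each format gets its own parser, and all parsers emit records in one shared shape. The following is a minimal sketch of that idea for the JSON and CSV cases; the field names (`device_id`, `metric`, `value`) and the sample payloads are illustrative assumptions, not the operator's actual schema.

```python
import csv
import io
import json

def from_json(raw: str) -> dict:
    """Parse a real-time JSON message into the common record shape."""
    msg = json.loads(raw)
    # Field names are hypothetical; real messages would have their own schema.
    return {"device_id": msg["id"], "metric": msg["metric"], "value": float(msg["value"])}

def from_csv(raw: str) -> list[dict]:
    """Parse a CSV file export (header row assumed) into common records."""
    reader = csv.DictReader(io.StringIO(raw))
    return [
        {"device_id": row["device_id"], "metric": row["metric"], "value": float(row["value"])}
        for row in reader
    ]

# Two inputs in different formats, normalized to the same record shape:
json_record = from_json('{"id": "R-001", "metric": "rssi", "value": "-67"}')
csv_records = from_csv("device_id,metric,value\nR-002,rssi,-71")

print(json_record)   # {'device_id': 'R-001', 'metric': 'rssi', 'value': -67.0}
print(csv_records)   # [{'device_id': 'R-002', 'metric': 'rssi', 'value': -71.0}]
```

Once every source is funneled through a parser like these, downstream processing only ever sees one record shape, regardless of where the data originated.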

Unification of different data formats and types

This is a major topic we tackled in another blog (Data Unification), so feel free to check it out.

Data source enrichment

For a successful analysis, the data must be further enriched with all relevant information. For example, to analyse the signal quality of a transmitting station, it is necessary to know its unique identifier. However, we discovered that the identifier of a transmitting station is not always unique, and different data sources use different identifiers for the same station.

`2712342` (code) > `YALDFT3a` (name of BTS)

`16920421` (serial number) > `712309865` (customer ID)
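Enrichment of this kind boils down to joining each incoming event against reference tables that map source-specific identifiers to canonical ones. A minimal sketch, using the two example mappings above (the table contents, field names, and the `enrich` helper are illustrative, not BitSwan's actual implementation):

```python
# Hypothetical lookup tables; in production these would be loaded from
# the operator's reference data rather than hard-coded.
CODE_TO_BTS = {2712342: "YALDFT3a"}
SERIAL_TO_CUSTOMER = {16920421: 712309865}

def enrich(event: dict) -> dict:
    """Attach the canonical BTS name and customer ID to a raw event."""
    enriched = dict(event)  # keep the original event untouched
    if "bts_code" in event:
        enriched["bts_name"] = CODE_TO_BTS.get(event["bts_code"])
    if "serial_number" in event:
        enriched["customer_id"] = SERIAL_TO_CUSTOMER.get(event["serial_number"])
    return enriched

event = {"bts_code": 2712342, "serial_number": 16920421, "rssi": -67}
print(enrich(event))
# {'bts_code': 2712342, 'serial_number': 16920421, 'rssi': -67,
#  'bts_name': 'YALDFT3a', 'customer_id': 712309865}
```

Using `.get()` rather than direct indexing means an unknown identifier yields `None` instead of dropping the event, so gaps in the reference data can be spotted and fixed later.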

Final data consolidation

The result of the data consolidation was that we had data from all sources in the right formats, in the right data types, enriched with all relevant information. Only then could the analytical work begin.

Data science is not just browsing through code. It is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from noisy, structured, and unstructured data. And careful data preparation is where it all begins.