Beginner’s Guide to Big Data

As an industry grows, so does the language surrounding it. With constant advancement – both in technology and in the way we perceive it – comes the need for new ways to describe it. One such advancement is in the field of ‘Big Data’, a term that is impossible to avoid in today’s competitive business environment.

For those not working directly in the field itself, it can become daunting to keep up with the new terminology and abbreviations that pop up on a regular basis. This can make things difficult if keeping in contact with tech-savvy individuals is part of your job. You might be in a high-level managerial position, regularly asking your Data Analytics department for reporting, and you want to make the most out of your requests. Alternatively, you might be a salesperson for a SaaS [Software as a Service] company dealing with Big Data technology, and you need to properly understand the client’s needs and requests to close the deal.

This article will go through some of the most common Big Data terminology and describe it in a way that a layman can understand. It is not an encyclopedia of BI terms, but it can serve as a foundation for understanding the general terms you will come across.

In order to understand everything, we must start from the basics. What is Big Data? What differentiates it from any other data? In reality, the answer starts with sheer volume – Big Data is a largely self-explanatory term. What truly sets it apart, however, are the methods we use to store and analyse it. Smaller-scale data can be analysed with a few functions, perhaps even by hand, but the larger your data grows, the more difficult that becomes. We require specific tools and methodologies both to store the data properly and to get the most value out of it.

Big Data is often housed in a Data Warehouse. A Warehouse is a database that stores the data from your multiple sources in one unified location. For that to work, the data needs to adhere to certain specifications so that the end result is readable by analytical tools. This is where Data Modelling comes into play. In IT, conceptualising low-level information into a high-level representation – known as Modelling – is a universal practice. Data Models are used to formulate your database’s Schemas, which are sets of rules that decide how the tables in your data are read and processed.
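To make the idea of a Schema a little more concrete, here is a minimal sketch in Python using SQLite purely for illustration – the table and column names are hypothetical, and a real Warehouse would use its own database technology and modelling conventions.

```python
import sqlite3

# A minimal, hypothetical schema sketch: one "customers" table whose columns
# and types act as the rules describing how each record is stored and read.
connection = sqlite3.connect(":memory:")  # throwaway in-memory database
connection.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,  -- unique identifier for each record
        full_name   TEXT NOT NULL,        -- every customer must have a name
        country     TEXT,                 -- optional attribute
        signup_date TEXT                  -- stored as ISO-8601 text in SQLite
    )
    """
)
connection.commit()
```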

When it comes to warehousing, there are different approaches you can take. You may decide to take the same approach as traditional data storage and host your Warehouse locally on your own servers. However, as your data volume grows and the processing power needed to digest it goes up, scaling can become so expensive that it hinders your analysis. That is why companies often opt for Warehouse as a Service using Cloud technology. Let’s break down what that entails.

Warehouse as a Service is the same concept as Software as a Service. You purchase a subscription-based license from a service provider, such as Amazon Web Services (AWS). But where does your data go? It goes on the Cloud. If you have ever uploaded a file to Dropbox or Google Drive, you know that the file is no longer stored directly on your PC. The same concept applies to the data stored in the Warehouse. The service provider handles the database software and the servers needed to run it, and should you need to expand, they offer scalability upgrades. This way, you do not need to worry about purchasing and maintaining your own Warehouse hardware infrastructure.

Regardless of which model you pick, you will need to feed your Warehouse with data. We usually do this through a Pipeline, the process that sends data from your initial sources to the Warehouse. The first step is extracting the data, and there are generally two practices for doing so: Batches and Streams. Batches upload your data at regular intervals, usually on a daily basis during non-working hours so as not to affect any reporting you might be doing. Streams, on the other hand, upload your data in real time. In certain industries, such as the medical field, up-to-date reporting can be of the utmost urgency, which makes Streams far more appealing.
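As a rough illustration of the difference, here is a minimal Python sketch of the two extraction styles. The fetch_new_records_since and send_to_warehouse functions are hypothetical placeholders, not part of any real pipeline tool.

```python
import time

def fetch_new_records_since(timestamp):
    """Hypothetical placeholder for reading new rows from a source system."""
    return []  # in a real pipeline this would query a database or an API

def send_to_warehouse(records):
    """Hypothetical placeholder for loading records into the Warehouse."""
    print(f"Loaded {len(records)} records")

# Batch: run once per day (e.g. by a scheduler) outside working hours.
def nightly_batch(last_run_timestamp):
    records = fetch_new_records_since(last_run_timestamp)
    send_to_warehouse(records)

# Stream: keep pulling small chunks of data and forward them immediately.
def stream_forever():
    while True:
        records = fetch_new_records_since(time.time() - 5)
        if records:
            send_to_warehouse(records)
        time.sleep(5)  # near real time, polling every few seconds
```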

In most cases, you do not want your data to be uploaded as is. Data tends to be case sensitive, and different inputs of the same data might be read as different records (for example, John and john are the same person, but the different casing of the J would have them treated as separate records). Generally, we want to perform an ETL process. ETL stands for Extract, Transform, Load. You may have noticed that in the previous paragraph I used the wording “extracting the data” – that is the Extract part of ETL. Transforming consists of cleaning the data to fix user errors and make sure all the data is uniform, as in the John example. Loading is the final step of moving the cleaned data into the Warehouse. In many cases, ETL is handled in the Pipeline via Python, a common programming language in Data Science. For those unfamiliar with Python, however, there are also middleman tools such as Talend, a powerful tool that lets users set up their own ETL flows via drag-and-drop components. Talend can connect to your data sources directly and upload them in Batches, or connect to your Streams and handle the transformation, and it can also load your data into the Warehouse.
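Since the paragraph above mentions Python, here is a minimal, self-contained sketch of the ETL idea. The data is a made-up CSV export and a plain list stands in for the Warehouse; a real pipeline would read from your actual sources and load into a real database.

```python
import csv
import io

# Made-up CSV export standing in for a real data source.
RAW_EXPORT = "name,country\nJohn,Malta\njohn,Malta\nMARY,Italy\n"

def extract(raw_text):
    """Extract: read the raw rows from the source (a CSV export here, for simplicity)."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: make the data uniform so 'John' and 'john' are read as the same person."""
    for row in rows:
        row["name"] = row["name"].strip().title()  # john -> John, MARY -> Mary
    return rows

def load(rows, warehouse):
    """Load: write the cleaned rows into the Warehouse (a plain list stands in for it)."""
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(extract(RAW_EXPORT)), [])
print(warehouse)  # [{'name': 'John', ...}, {'name': 'John', ...}, {'name': 'Mary', ...}]
```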

With your data properly formatted in your Warehouse, it can then be analysed. This is where Data Science, Business Intelligence and Data Analytics come into play. The difficulty with explaining these terms is that they are often used interchangeably and, in reality, there tends to be a lot of overlap. Business Intelligence is the process of generating reports built upon past data. These reports act as a retrospective on the business’s past actions, so that the business can take them into consideration when moving forward with its decision making. BI teams often employ visualisation tools such as Qlik or Tableau to display the data in a readable manner. Data Analytics offers a more predictive approach, taking the data available and using algorithms to generate a model of what could happen in the future. Data Science is the most advanced of the data reporting disciplines, often digesting all sorts of data – even untransformed data – and using statistical algorithms and machine learning (a branch of Artificial Intelligence in which models learn from data) to generate reports.
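To illustrate the “predictive” part of Data Analytics in the simplest possible terms, here is a toy Python sketch that fits a straight-line trend to some invented monthly sales figures. It is not a real analytics workflow, just a picture of the idea.

```python
import numpy as np

# Invented monthly sales figures standing in for historical data in the Warehouse.
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([100, 120, 135, 160, 170, 195])

# Business Intelligence would report on these past figures; Data Analytics goes a
# step further and fits a simple model to project what might come next.
slope, intercept = np.polyfit(months, sales, deg=1)  # straight-line trend
next_month = 7
forecast = slope * next_month + intercept
print(f"Projected sales for month {next_month}: {forecast:.0f}")
```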

The industry is vast and ever-changing, and in a matter of months new technologies and terms may pop up and take root. This high-level explanation has hopefully taught you something new that will help you in your day-to-day work and act as a stepping stone for understanding or researching any related terms you come across in the future.

At iMovo, we work with best-in-class technologies to enable organisations to mine and visualise their information in a meaningful way that helps them to take the right decisions at the right time. As experts in the field of Business Intelligence and Big Data Analytics, we can help your company implement and sustain a product and guide you on how BI can complement your existing environment. For more information, contact us at [email protected].