Getting Started With Data Engineering

What is Data Engineering?

Data engineering is a branch of software engineering that focuses on designing, developing, testing, and maintaining architectures such as databases and large-scale processing systems. In practice, it is the discipline of storing, processing, and extracting information from huge datasets.

Responsibilities Of A Data Engineer!

A Data Engineer is responsible for building and maintaining the infrastructure required to store, process, and analyse data. This includes designing, building, and maintaining data pipelines, data warehouses, and data lakes, and making sure that data is properly stored and accessible to those who need it. Data Engineers often work with Big Data technologies such as Hadoop, Spark, and NoSQL databases, and are expected to have a strong grasp of programming, database management, and cloud computing.

6 V's Of Data Engineering

  • Volume - Since we are talking about big data, we need robust, scalable systems that can handle extremely large amounts of data without hiccups. Take simple case studies like PhonePe or Facebook: these companies generate petabytes of data in a very short period. PhonePe, for instance, publicly shares the number of transactions it handles every quarter, and the growth over the last few years shows just how humongous the data its engineers deal with is. This is the first pillar of data engineering: the volume of data is always going to be huge, and our pipelines and architecture should handle it.

  • Variety - The data we collect is not always in the same format. Different use cases produce data in different formats, and the sources are generally heterogeneous. We might be collecting data in one of the following ways (see the Python sketch after this list):

    • Structured - tabular data such as Excel sheets and SQL tables

    • Semi-Structured - JSON, XML

    • Unstructured - images, music, videos, PDFs

  • Velocity - Beyond volume and variety, the velocity of the data is another important factor to consider while designing the architecture of data processing pipelines. It represents the speed at which data accumulates and the speed at which the system must respond to users with that data. Because of this humongous flow, it becomes important to build a robust mechanism to handle the velocity of incoming data. In big data systems the throughput is very high and data gets collected from many sources at once, which further increases the velocity. Based on how data is generated and processed, there are two types of systems:

    • Real-Time - Google Maps is a good example: it keeps collecting real-time location data from millions of users at the same time.

    • Batch - Swiggy's food-ordering data is a good example: how many users ordered food in the last hour is computed periodically, in batches.

  • Value - If the collected data cannot be used to extract meaningful information or to build impactful business solutions, it is of no use. For example, Spotify tracks when you're working out and when you're relaxing, and then recommends music that best suits your mood based on that data.

  • Veracity - This refers to how accurate or truthful the dataset is, i.e. how much inconsistency or uncertainty your data carries.

  • Variability - This represents how often the data in your data collection mechanism changes.
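To make the Variety point concrete, here is a minimal Python sketch that touches each of the three kinds of data listed above. The file names (sales.csv, events.json, product_photo.jpg) are hypothetical placeholders, not part of any real dataset.

```python
import csv
import json
from pathlib import Path

# Structured: tabular rows with a fixed schema (e.g. an exported SQL table).
with open("sales.csv", newline="") as f:          # hypothetical file
    structured_rows = list(csv.DictReader(f))

# Semi-structured: nested key/value data whose shape can vary per record.
with open("events.json") as f:                    # hypothetical file
    semi_structured = json.load(f)

# Unstructured: raw bytes (images, audio, PDFs) with no schema at all;
# here we only record the size, since interpreting it needs specialised tools.
unstructured_size = Path("product_photo.jpg").stat().st_size  # hypothetical file

print(len(structured_rows), len(semi_structured), unstructured_size)
```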

OLTP - vs - OLAP

Apart from the 6 V's of Data Engineering, one more crucial concept we need to learn is the difference between OLTP and OLAP. This is a foundational idea and helps us understand the need to move to a Big Data architecture.

| OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
| --- | --- |
| Serves transaction-oriented applications | Serves analysis-oriented applications |
| Example: PhonePe stores payment-related data in RDBMS-based databases | Example: recommendation engines at giants like Facebook and YouTube read their data from OLAP systems |
| Technologies: MySQL, Oracle SQL, PostgreSQL, etc. | Technologies: BigQuery, Hive, Redshift |
| Write-heavy: more writes are done | Read-heavy: more reads are done |
| Users: DB engineers, software engineers | Users: data analysts, data scientists, data engineers |
| Focus: user-facing features and productivity | Focus: business analysis and predictive analytics |
| Updates happen quite often | Updates don't happen often |
| Query time is relatively fast | Query time is slow due to the high volume of data |
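To see the contrast in code, here is a rough sketch using an in-memory SQLite database purely for illustration; the table and column names are made up, and a real stack would use something like MySQL/PostgreSQL on the OLTP side and BigQuery/Redshift on the OLAP side.

```python
import sqlite3

# In-memory SQLite stands in for both kinds of systems, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (user_id INTEGER, amount REAL, ts TEXT)")

# OLTP-style work: many small, frequent writes, one row per transaction.
conn.execute("INSERT INTO payments VALUES (?, ?, ?)", (42, 199.0, "2024-01-05"))
conn.commit()

# OLAP-style work: a read-heavy aggregate that scans the whole table
# to answer an analytical question ("revenue per user").
for row in conn.execute(
    "SELECT user_id, SUM(amount) AS revenue FROM payments GROUP BY user_id"
):
    print(row)
```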

Use cases of Data Engineering 🎉

Real-time Data Processing

In many industries such as finance, transportation, and healthcare, it is critical to process data in real-time to make timely decisions. Data engineering is used to build systems that can process large amounts of data in real time and store it for further analysis. For example, a credit card company can use data engineering to process transactions in real time and detect fraudulent activity before it becomes a bigger problem.
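As a toy illustration of the fraud-detection idea, here is a minimal sketch: a generator stands in for a real event stream (Kafka, Kinesis, etc.), and a single made-up threshold rule stands in for what would really be a combination of many signals and models.

```python
import time
from typing import Dict, Iterator

def transaction_stream() -> Iterator[Dict]:
    """Toy stand-in for a real event stream; records are made up."""
    for txn in [
        {"card": "1111", "amount": 25.0},
        {"card": "1111", "amount": 9400.0},   # unusually large
        {"card": "2222", "amount": 60.0},
    ]:
        yield txn
        time.sleep(0.1)  # simulate events arriving over time

def flag_fraud(txn: Dict, threshold: float = 5000.0) -> bool:
    # A single-rule check; real systems combine many signals and models.
    return txn["amount"] > threshold

for txn in transaction_stream():
    if flag_fraud(txn):
        print("ALERT: suspicious transaction", txn)
```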

Data Warehousing

Data engineering is used to build and maintain data warehouses that are used to store and analyze large amounts of data. Data warehouses can be used to store data from multiple sources and can be used for business intelligence and analytics. For example, a retail company can use data engineering to build a data warehouse that stores information about customer transactions, which can be used to analyze customer behaviour and make better business decisions.
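A tiny sketch of the warehouse idea, again using in-memory SQLite as a stand-in and made-up table names: one fact table of sales joined against one customer dimension to answer a business-intelligence style question.

```python
import sqlite3

# Warehouse-style schema: one fact table, one dimension table (names are illustrative).
wh = sqlite3.connect(":memory:")
wh.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (customer_id INTEGER, amount REAL, sale_date TEXT);
""")
wh.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "North"), (2, "South")])
wh.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
               [(1, 120.0, "2024-03-01"), (1, 80.0, "2024-03-02"), (2, 45.0, "2024-03-02")])

# BI-style question: revenue per region.
for region, revenue in wh.execute("""
    SELECT c.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer c USING (customer_id)
    GROUP BY c.region
"""):
    print(region, revenue)
```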

Data Integration

Data engineering is used to integrate data from multiple sources into a unified format that can be used for analysis. For example, a healthcare company can use data engineering to integrate patient data from different sources, such as electronic health records, lab reports, and imaging data, into a single system that can be used to improve patient care.
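Here is a rough sketch of that integration step, with invented records: one flat EHR export and one nested lab-results payload get merged into a single per-patient view.

```python
# Two sources with different shapes; records and field names are illustrative only.
ehr_records = [
    {"patient_id": "P1", "name": "Asha", "dob": "1990-04-12"},
]
lab_payload = {"results": [{"pid": "P1", "test": "HbA1c", "value": 5.4}]}

# Integrate into one unified, per-patient view keyed by patient id.
unified = {r["patient_id"]: {**r, "labs": []} for r in ehr_records}
for result in lab_payload["results"]:
    unified.setdefault(result["pid"], {"patient_id": result["pid"], "labs": []})
    unified[result["pid"]]["labs"].append({"test": result["test"], "value": result["value"]})

print(unified["P1"])
```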

Data Transformation

Data engineering is used to transform data from one format to another, such as converting data from a legacy system to a new system or converting unstructured data into a structured format. For example, a manufacturing company can use data engineering to transform data from sensors and other sources into a format that can be used to optimize production processes.
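A small sketch of that kind of transformation, assuming a made-up legacy log format: free-form sensor lines are parsed into structured records with typed fields.

```python
import re

# Raw lines from a hypothetical legacy sensor system.
raw_lines = [
    "2024-03-01 10:00:03 machine=A7 temp=71.5C",
    "2024-03-01 10:00:08 machine=A7 temp=72.1C",
]

# Transform each line into a structured record with typed fields.
pattern = re.compile(r"(\S+ \S+) machine=(\w+) temp=([\d.]+)C")
records = []
for line in raw_lines:
    match = pattern.match(line)
    if match:
        ts, machine, temp = match.groups()
        records.append({"timestamp": ts, "machine": machine, "temp_c": float(temp)})

print(records)
```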

Data Pipeline Development

Data engineering is used to build and maintain data pipelines, which are used to move data from one system to another. For example, a media company can use data engineering to build a data pipeline that moves data from social media platforms, such as Twitter and Instagram, into a system that can be used to analyze social media trends and improve marketing campaigns.
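To show the shape of such a pipeline, here is a minimal extract-transform-load sketch; the "source" posts and the in-memory list standing in for a warehouse table are invented for illustration, not real API calls.

```python
# A minimal extract-transform-load pipeline; source and sink are stand-ins
# for real systems such as a social media API and a warehouse table.
def extract():
    # Pretend these came from social media APIs.
    return [{"platform": "twitter", "text": "Loving the new launch!", "likes": 120},
            {"platform": "instagram", "text": "meh", "likes": 3}]

def transform(posts):
    # Keep only engaging posts and normalise the fields downstream jobs need.
    return [{"platform": p["platform"], "likes": p["likes"]}
            for p in posts if p["likes"] >= 10]

def load(rows, sink):
    sink.extend(rows)  # a real pipeline would write to a warehouse table instead

warehouse_table = []
load(transform(extract()), warehouse_table)
print(warehouse_table)
```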

So that's a wrap, and with that you have completed your first step into the world of Data Engineering. Drop your suggestions in the comments below and let me know what improvements we can make to enhance your learning experience.

.

.

.

.

Lukewarm regards

Sanket Singh
