Data Engineering and Tools

What is Data?

Before jumping into Data Engineering, let's start with what data actually is. Well, in today's world data is everything and everything is data. While we scroll social media, play games online, or surf the internet, terabytes of data are being processed and stored. Data has become an asset, so it needs to be managed well to get as much benefit from it as we can.

Data Lifecycle

If we look at the impact of data in every gadget and piece of tech we use, we rarely stop to think about how that data reaches us in such a form, and from where. Data goes through many processes before it can be used in the real world: some of it is consumed as soon as it is created, as in real-time systems, while some is better used at certain time intervals.
First, data is collected from various sources, then it is ingested into a pipeline and stored. The data is then processed, i.e. converted into the form desired by its users, and used for analysis. The storing, processing, and analyzing steps are iterative, depending on the needs of the company and the business requirements.



fig. Diagram to visualize the data lifecycle.
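
To make the cycle concrete, here is a minimal Python sketch of the collect -> ingest -> store -> process -> analyze loop. It uses only the standard library's sqlite3 module, and the weather readings, table name, and file name are made up purely for illustration.

import sqlite3

def collect():
    # Collect: pretend these rows just arrived from a sensor or a public API.
    return [{"city": "Kathmandu", "temperature": 21.5},
            {"city": "Pokhara", "temperature": 24.0}]

def ingest_and_store(rows, db_path="weather.db"):
    # Ingest + store: land the raw rows in a SQLite table.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS readings (city TEXT, temperature REAL)")
    conn.executemany("INSERT INTO readings VALUES (:city, :temperature)", rows)
    conn.commit()
    return conn

def process_and_analyze(conn):
    # Process + analyze: derive a simple aggregate that an end user actually wants.
    return conn.execute("SELECT AVG(temperature) FROM readings").fetchone()[0]

conn = ingest_and_store(collect())
print("average temperature:", process_and_analyze(conn))

In a real pipeline each of these steps would be a separate, scheduled component, but the order of the stages stays the same.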

Data Engineering

Well, if we describe data engineering in simple terms, it is a discipline within data management that focuses on designing, constructing, and maintaining the systems and infrastructure necessary for the collection, storage, processing, and analysis of large volumes of data. As I already mentioned, this is the era of data, or should I say big data, so data engineers should be able to handle no less than terabytes of data, and depending on the requirements we have different tools to complete the task.
Some typical tasks involved in data engineering are listed below (a small data quality sketch follows the list):

  • Data Ingestion
  • Data Transformation
  • Data Storage
  • Data Processing
  • Data Quality
  • Performance optimization
  • Data Security
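
As a tiny illustration of the Data Quality task above, here is a minimal Python sketch that filters out bad rows before they reach storage. The sample records and the validation rules are assumptions made up for this example, not from the original post.

# Reject rows with a missing identifier or an out-of-range value.
records = [
    {"user_id": 1, "age": 29},
    {"user_id": None, "age": 35},   # missing identifier -> rejected
    {"user_id": 3, "age": -4},      # impossible value -> rejected
]

def is_valid(record):
    return record["user_id"] is not None and 0 <= record["age"] <= 120

clean = [r for r in records if is_valid(r)]
rejected = [r for r in records if not is_valid(r)]
print(f"kept {len(clean)} rows, rejected {len(rejected)} rows")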

It's quite amazing to see how data engineers play a vital role in deriving insights from data. Data engineers use various tools and techniques to develop everything from simple to complex data pipelines.

Tools

  • AWS Glue
  • AWS Kinesis
  • Apache Hadoop
  • Apache Airflow
  • Apache Kafka 
  • Apache Spark
  • Google Cloud
  • Snowflake
  • Microsoft Azure Data Factory
  • MySQL
Well, most of the tools I mentioned are free and open source; however, tools like Glue and Kinesis are paid and follow a pay-as-you-use policy. Each tool has overlapping as well as distinct purposes. For example, Airflow is used for scheduling and monitoring workflows or tasks, while Kafka is used to handle real-time streaming data. Tools like AWS Glue and AWS Kinesis are fully managed, serverless services provided by Amazon, where we can build data pipelines easily using other AWS services. Each tool could be a separate topic of discussion, as each has its own variety of use cases and working mechanisms.
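
To give a flavour of one of these tools, below is a minimal sketch of an Airflow DAG that schedules a daily extract-transform-load workflow. It assumes Apache Airflow 2.x is installed; the dag_id, task names, and print statements are illustrative placeholders, not from the original post.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from a source system")

def transform():
    print("clean and reshape the data")

def load():
    print("write the result to a warehouse table")

with DAG(
    dag_id="simple_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # Airflow triggers this workflow once a day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run order: extract -> transform -> load
    extract_task >> transform_task >> load_task

Airflow then takes care of triggering the tasks once a day, retrying failures, and showing the run history in its web UI.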


Conclusion

Overall, we can see how data influences our day-to-day life and how much we depend on it. Imagine getting the weather data of each and every country on your mobile screen versus getting only the filtered weather data for your particular area. Which one is more convenient? This is a small example of why handling data in the correct way is so important, and all of this generally comes under Data Engineering.

