Posts

Make Automation

We can create a Makefile for a Python project that helps automate tasks like running your Python scripts, testing, or building the project. A Makefile is a file used by the make build automation tool to define a set of tasks to be executed. Each task is written as a target, and it can depend on other tasks or files. In general, a Makefile rule is made up of:

Target
Dependencies
Commands

Let's understand this with the help of an example.

Makefile (the file has no extension)

    # Define variables for the system Python and the virtual environment's Python and pip
    SYSTEM_PYTHON = python3
    VENV_PYTHON = myvenv/bin/python
    VENV_PIP = myvenv/bin/pip

    # Target: venv
    # Creates a virtual environment in the 'myvenv' directory using the system Python.
    venv:
    	$(SYSTEM_PYTHON) -m venv myvenv

    # Target: install
    # Installs dependencies from 'requirements.txt' (depends on the venv target).
    install: venv
    	$(VENV_PIP) install -r requirements.txt

    # Target: all
    # Runs the venv and install targets.
    all: venv install

In the example above...
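Assuming this Makefile sits in the project root, the targets can then be invoked from the shell, for example:

    make venv      # create the virtual environment only
    make install   # build the venv first (dependency), then install requirements
    make all       # run both targets in order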
Recent posts

PySpark Transformations and Actions

RDD Transformation Examples: Some examples of RDD transformations are flatMap(), map(), reduceByKey(), filter(), and sortByKey(). RDD Action Examples: Some examples of RDD actions are count(), first(), max(), reduce(), take(), and collect(). Let us understand through code how an RDD is fault tolerant and immutable, how lazy evaluation works, and also understand its distributed nature. Let's define an RDD. When data is converted to an RDD, it is not loaded into memory directly; first, a lineage graph is built for the data, so that even if a node crashes in the future, the data can be recreated. This lineage graph enables fault tolerance by allowing Spark to recreate lost data partitions after a node failure. RDDs are distributed across multiple nodes in a cluster, allowing parallel processing of data records.

    rdd = sc.parallelize([1, 2, 3, 4, 5])

The immutable nature of RDDs can be understood as follows: notice that we didn't change...
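As a companion to the excerpt above, here is a minimal, self-contained PySpark sketch (not from the original post) showing that transformations are lazy, that actions trigger computation, and that the source RDD stays unchanged:

    # Minimal sketch: lazy transformations vs. actions on an immutable RDD.
    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-demo")

    rdd = sc.parallelize([1, 2, 3, 4, 5])     # distributed, immutable dataset

    # Transformations are lazy: only the lineage graph is recorded here.
    doubled = rdd.map(lambda x: x * 2)
    big = doubled.filter(lambda x: x > 4)

    # Actions trigger the actual computation across partitions.
    print(big.collect())   # [6, 8, 10]
    print(rdd.collect())   # [1, 2, 3, 4, 5] -- the original RDD is unchanged

    sc.stop()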

Docker With PostgreSQL

Brief Introduction to Docker Docker is a platform as a service (PaaS) for developing, shipping, and running applications in containers. By containers, I mean lightweight, portable, and self-sufficient units capable of running applications and their dependencies. A container is a runnable instance of a Docker image. A Docker container runs inside a host machine but is completely isolated from it. Importance of Docker Some of the important points about Docker are highlighted as: Dependency Management As data engineers we work on various pipelines and applications. So, to simplify dependencies and make sure we don't end up with conflicts across different environments, it is necessary to encapsulate an application's dependencies in a container and generate an image for it. Portability Docker containers are portable and can easily be moved between different environments, which makes it simpler to deploy and share pipelines and applications across teams or organizations. Envi...
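Since the post pairs Docker with PostgreSQL, a quick way to see a container in action is the official postgres image; a minimal sketch (the container name, password, and host port below are illustrative choices, not from the post):

    docker run --name my-postgres -e POSTGRES_PASSWORD=secret -p 5432:5432 -d postgres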

All About SQL (Structured Query Language) part one

Introduction SQL, i.e. Structured Query Language, is a querying language for relational databases. SQL has many flavors, some of which are MSSQL, MySQL, SQLite, PostgreSQL, etc. All of these are for the manipulation and management of relational databases. SQL provides a set of commands that help us interact with the database, such as creating, modifying, and querying data. Like any other language, SQL has levels of learning, i.e. beginner, intermediate, and advanced. We will talk about this in brief later on. With SQL we can manage our data in a more organized form, like in tables, making it easy to query the data whenever we need to, such as deleting, inserting, or updating. SQL does have several sublanguages of its own, each with its own purpose. They are:

Data Definition Language (DDL) --- Create, Alter, Drop
Data Manipulation Language (DML) --- Select, Insert, Update, Delete
Data Control Language (DCL) --- Grant, Revoke
Transaction Control Language (TCL) --- Commit, Rollback, Savepoint
Data...
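To make the categories concrete, here is a small sketch (not from the original post) using Python's built-in sqlite3 module; the table and column names are made up for illustration, and DCL statements like GRANT/REVOKE are omitted since SQLite does not support them:

    # DDL, DML, and TCL in action via Python's built-in sqlite3 module.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # DDL: define the schema.
    cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    # DML: insert, update, and query rows.
    cur.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))
    cur.execute("UPDATE users SET name = ? WHERE id = ?", ("Alicia", 1))
    print(cur.execute("SELECT id, name FROM users").fetchall())  # [(1, 'Alicia')]

    # TCL: commit makes the changes permanent; conn.rollback() would undo them.
    conn.commit()
    conn.close()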

Data Engineering and Tools

What is Data? Before jumping to Data Engineering, let's start with what data actually is. In today's world data is everything and everything is data. While scrolling social media, playing games online, surfing the internet, etc., terabytes of data are being processed and stored. Data nowadays is an asset to people, so these assets need to be well managed to take as much benefit from them as we can. Data Lifecycle If we look at the impact of data in every gadget and technology we use, we rarely think about how the data reaches us in such a form, and from where. Data goes through many processes before it can be used in the real world: some data is used as soon as it is created, as in real-time systems, whereas other data is better used at certain time intervals. First, data is collected from various sources, then it is ingested into the pipeline, and then it is stored. The data is then processed, i.e. converted into the form desired by the users, and used for analysis. The processing, storing, and analyzing steps are ...
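The lifecycle reads more concretely as a toy Python sketch (entirely illustrative; the functions and records here are hypothetical placeholders):

    # Toy sketch of the lifecycle: collect -> ingest -> store -> process/analyze.
    def collect():
        # Gather raw records from a source, e.g. an API or log files.
        return [{"user": "a", "clicks": 3}, {"user": "b", "clicks": None}]

    def ingest(records):
        # Validate records on their way into the pipeline.
        return [r for r in records if r["clicks"] is not None]

    def store(records, storage):
        # Persist the ingested records, e.g. to a database or object store.
        storage.extend(records)

    def process(storage):
        # Transform stored data into the form users want to analyze.
        return sum(r["clicks"] for r in storage)

    storage = []
    store(ingest(collect()), storage)
    print("total clicks:", process(storage))  # analysis on the processed data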

ETL using AWS Glue

Basic Idea ETL is the core task of Data Engineers: extracting data from various sources, whether streaming data or historical data, then transforming it into a suitable form so that it can be used as per business requirements. Finally, the data is loaded into a suitable storage space. In this blog we will see an ETL pipeline using AWS Glue. AWS GLUE AWS Glue is a fully managed, serverless ETL pipeline service. Serverless means that developers or users can build and run applications without having to manage servers. It is totally managed by AWS and follows a pay-as-you-use model. AWS Glue is used to prepare and transform data for analytics and other processing tasks. It simplifies the process of data cleaning, data transformation into the desired format, and so on. Let's understand the architecture of AWS Glue. AWS GLUE ARCHITECTURE ref: AWS Glue concepts - AWS Glue (amazon.com) Data stores can be anything depending upon the use case, i.e. S3, Redshift, etc. We load ...
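Glue jobs commonly run PySpark scripts, so as a rough illustration of the extract-transform-load flow, here is a plain PySpark sketch (the bucket paths and column names are assumptions, not from the post):

    # Extract -> Transform -> Load sketch in plain PySpark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    # Extract: read raw data from the source (hypothetical S3 path).
    raw = spark.read.csv("s3://my-bucket/raw/orders.csv", header=True, inferSchema=True)

    # Transform: clean and reshape to match the business requirement.
    cleaned = (raw.dropna(subset=["order_id"])
                  .withColumn("amount", F.col("amount").cast("double")))

    # Load: write the result to the target storage (hypothetical path).
    cleaned.write.mode("overwrite").parquet("s3://my-bucket/curated/orders/")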

Data Analysis and Manipulation of Dengue outbreak in Nepal in 2022 using Pandas

The dengue outbreak in Nepal in 2022 was really frightening. A lot of people were infected, including my friends and neighbours. In this blog we will analyze the dengue outbreak zones, the most infected zones, and the most infected months, and try to find the cause of the outbreak. Brief Background about Dengue Dengue is a disease caused by the bite of infected Aedes mosquitoes. Common symptoms of dengue include mild to severe fever, severe headache, joint and muscle pain, vomiting, skin rash, etc. The Aedes mosquitoes that transmit dengue breed in stagnant water, so we should remove stagnant water around the neighbourhood and keep our surroundings clean to prevent such diseases. Data source The data for this project is taken from the Government of Nepal, Ministry of Health and Population, Department of Health Services, Epidemiology and Disease Control Division. link 63c63d4f8257e.pdf (edcd.gov.np) Analysis based on District According to the provided data, all 76 districts of Nepal have reported the...
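A sketch of how the district and month breakdowns could be computed with pandas (the CSV file name and the 'district', 'month', and 'cases' columns are assumptions, not the post's actual code):

    # Group dengue case counts by district and by month with pandas.
    import pandas as pd

    df = pd.read_csv("dengue_2022.csv")  # assumed tabular extract of the EDCD report

    # Most infected districts: total cases per district, highest first.
    by_district = df.groupby("district")["cases"].sum().sort_values(ascending=False)
    print(by_district.head(10))

    # Most infected months: total cases per month, to locate the outbreak peak.
    by_month = df.groupby("month")["cases"].sum().sort_values(ascending=False)
    print(by_month.head(3))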