This API gateway was built to handle data manipulation without directly accessing the database, avoiding potential harm to database performance. Manipulation here means transforming key-value JSON payloads into column-value form so they can be applied to the database.
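As an illustration of that transformation (the gateway's actual code is not shown here), a minimal Python sketch might look like this; the table and field names are hypothetical:

```python
# Minimal sketch of a JSON-to-column transformation.
# Table and field names are hypothetical, not from the actual project.
def json_to_insert(table: str, payload: dict) -> tuple[str, list]:
    """Turn a key-value JSON payload into a parameterized INSERT statement."""
    columns = ", ".join(payload.keys())
    placeholders = ", ".join("%s" for _ in payload)
    sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return sql, list(payload.values())

sql, params = json_to_insert("users", {"name": "Ana", "age": 30})
# sql    -> "INSERT INTO users (name, age) VALUES (%s, %s)"
# params -> ["Ana", 30]
```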
ETL Cloud
Batch Processing
Tech companies used to need their own on-premise servers to run ETL jobs on their data, and many of them faced issues with scalability, data loss, hardware failure, and more. This changed with the arrival of cloud services from major tech companies, which provide shared computing resources in the cloud that solve most of the issues found with on-premise servers.
This repo sets up a Google Cloud Composer environment and solves several batch-processing cases by creating DAGs to run ETL jobs in the cloud. The data processing consists of ETL jobs with data moving in and out of GCS and BigQuery.
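A minimal sketch of such a DAG, assuming the Airflow Google provider package; the bucket and table names are hypothetical:

```python
# Minimal DAG sketch: load files from GCS into BigQuery on a daily schedule.
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="gcs_to_bigquery_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-bucket",                        # hypothetical bucket
        source_objects=["data/events_*.csv"],
        destination_project_dataset_table="my_project.my_dataset.events",
        source_format="CSV",
        write_disposition="WRITE_APPEND",
    )
```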
Spark Job on
Google Cloud Dataproc
The problem in this project is: “How can we process a huge amount of data automatically without repeatedly writing scripts?”
Given a huge amount of data on a local computer, we need to transform the data and store it in BigQuery as the Data Warehouse.
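A minimal PySpark sketch of the kind of job Dataproc would run, assuming the spark-bigquery connector; the paths, column, table, and bucket names are hypothetical:

```python
# Minimal Spark job sketch: read raw files, transform, write to BigQuery.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_to_bigquery").getOrCreate()

# Read raw CSVs (a GCS path here; a local path works the same way).
raw = spark.read.option("header", True).csv("gs://my-bucket/raw/*.csv")

# Example transformation: cast and filter an assumed "amount" column.
transformed = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount") > 0)
)

# Write to BigQuery via the spark-bigquery connector.
(transformed.write.format("bigquery")
    .option("table", "my_dataset.transactions")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("append")
    .save())
```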
HR Analytics
Classification
Currently, the processing and analysis of employee data to filter eligibility for promotion has been largely manual, which delays the transition to new roles after promotion. Being slow and manual, the processing and analysis may also be less accurate. This project develops a Machine Learning model to help the HR team filter employees' eligibility for the promotion process.
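A minimal sketch of such a classifier; the file name, feature columns, and model choice are hypothetical stand-ins, since the actual HR features are not listed here:

```python
# Minimal promotion-eligibility classifier sketch with hypothetical columns.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("hr_employees.csv")             # hypothetical dataset
X = df[["age", "training_score", "previous_rating"]]
y = df["is_promoted"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```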
Data can come from everywhere. In this project, I created a script that automatically scrapes information about jobs posted on LinkedIn. The automation is built with Selenium and Python. I also performed a simple Exploratory Data Analysis to find insights in the scraped data.
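A minimal Selenium sketch of this kind of scraping; the URL and CSS selectors are hypothetical placeholders, since LinkedIn's markup changes often:

```python
# Minimal scraping sketch: collect job titles and companies from a search page.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.linkedin.com/jobs/search/?keywords=data%20engineer")

jobs = []
for card in driver.find_elements(By.CSS_SELECTOR, "div.base-card"):
    title = card.find_element(By.CSS_SELECTOR, "h3").text
    company = card.find_element(By.CSS_SELECTOR, "h4").text
    jobs.append({"title": title, "company": company})

driver.quit()
print(jobs[:5])
```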
Customer Segmentation
with RFM
The data used in this project provides customers and transaction dates over a few years.
From this, I performed customer segmentation using RFM analysis.
Throughout the project, I used two common clustering algorithms: Agglomerative and K-Means.
In the end, the customers could be segmented into three types: Gold, Silver, and Bronze.
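A minimal sketch of the RFM computation followed by K-Means, with hypothetical column names; Agglomerative clustering would slot in the same way via sklearn's AgglomerativeClustering:

```python
# Minimal RFM + K-Means sketch with hypothetical transaction columns.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

tx = pd.read_csv("transactions.csv", parse_dates=["order_date"])
snapshot = tx["order_date"].max() + pd.Timedelta(days=1)

# Recency, Frequency, Monetary per customer.
rfm = tx.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

scaled = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(scaled)
# The three clusters can then be labeled Gold / Silver / Bronze
# by ranking their average monetary value.
```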
In tech companies, data comes in many forms: JSON, TXT, various databases, Google Spreadsheets, and many more.
In the end, though, the data is stored in a single location as the single source of truth.
In this project, I applied suitable transformations to each type of data and stored the information in a local Data Warehouse.
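A minimal sketch of per-source transformations into a local warehouse, using SQLite and hypothetical file names as stand-ins for the actual sources and target:

```python
# Minimal sketch: transform two source formats and load them into one store.
import json
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")           # local "Data Warehouse"

# JSON source: flatten nested records into columns.
with open("events.json") as f:
    events = pd.json_normalize(json.load(f))
events.to_sql("events", conn, if_exists="replace", index=False)

# TXT source: parse tab-delimited text.
users = pd.read_csv("users.txt", sep="\t")
users.to_sql("users", conn, if_exists="replace", index=False)

conn.close()
```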
Fetching Data from
Large JSON File
When dealing with large JSON files, the data may come as flat JSON or nested JSON.
This project was developed to extract and clean data from huge JSON files.
The function built in this project lets us fetch data from a JSON file according to specific field requirements as needed.
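A minimal sketch of such a field-based fetch; the dotted-path field syntax is an assumption for illustration, not necessarily the project's actual interface:

```python
# Minimal sketch: pull only requested (possibly nested) fields from a JSON array.
import json

def fetch_fields(path: str, fields: list[str]) -> list[dict]:
    """Return only the requested fields from each record in a JSON array."""
    with open(path) as f:
        records = json.load(f)
    result = []
    for rec in records:
        row = {}
        for field in fields:
            value = rec
            for key in field.split("."):         # walk nested keys
                value = value.get(key) if isinstance(value, dict) else None
            row[field] = value
        result.append(row)
    return result

rows = fetch_fields("data.json", ["id", "user.name", "user.address.city"])
```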