Photo by Lukas Blazek on Unsplash

Business scenario and technical implementation about world wide currency conversion with Redshift and Holistics

Hi there! When conducting a profit related analysis project and creating an visualisation to showcase your results across regions, having a currency filter is often required. Currency conversion is very common across industries such as e-commerce and financial works.

We will need a dimension table to get the universal currency and conversion rate in a daily granularity. The base currency is set to USD, every other currency conversion rate is converted to USD rate at its effective date.

CREATE TABLE dim_currency AS
(base_currency_code VARCHAR(3),
conversion_rate_to_usd NUMERIC,
effective_on DATE)…

Hive query small tips

Photo by Vivek Doshi on Unsplash

This article only talks about the optimisation on DML layer with an explanation on Hadoop MapReduce mechanisms.

Suppose we have two tables: fact_order, dim_customer. fact_order keeps all customer order transaction history, dim_customer is the main dimension table for customer attributes.

The table structures are as below:

hive> desc fact_order;OKorder_id                    string                  Nonecustomer_id                 string                  Noneproduct_name                string                  Nonetransaction_id              string                  Noneorder_status                string                  Nonetransaction_status          string                  Nonecreate_time                 timestamp               Noneupdate_time                 timestamp               Nonehive> desc dim_customer;OKid                    string                  Nonename                  string                  Nonegender                string                  Noneage…

The Elegant Way of Deletion In Database

Photo by Javier Allegue Barros on Unsplash

Manipulating with data is common for IT professionals including database administrators, developers, data analysts and scientists, and playing with data in database is simple and straight forward. SQL is a specialized language for database operations and is very easy to understand and use. Unlike C or any other object-oriented programming, we try not to write SQL logics or function blocks except for built-in logics working with applications. We usually don’t define variables, and we try to avoid looping at all costs. For example, if I want to retrieve some data, I simply type…

Redshift practical knowledge to speed up analytics process

Photo by noor Younis on Unsplash

Time zone is commonly used in Business Intelligence analytics to allow reports to convert different time zones. In this article, I will share about how Redshift handle the time zone conversions.

Get a list of time zone names and utc_offset:

Redshift provides a built-in timezone table with worldwide timezone.

SELECT tz.* FROM pg_timezone_names() tz(name text, abbrev text, utc_offset interval, is_dst boolean);

Demonstration on building a realtime data pipeline using Streamsets

from Google image

StreamSets ia a modern data streaming and integration platform build by company StreamSets, Inc. It is used by many multinational companies such as Shell and Dell.

StreamSets Data Collector (SDC) is an open source data ingestion pipeline as one part of StreamSets DataOps platform, you can download here to try out. Today we will use SDC to demonstrate the real time data transformation from Aurora Postgres to Kinese Firehose.

1. Create a new pipeline by clicking the blue button on the top left of the StreamSets UI, and type in pipeline…

In this case study, I will use an example of a financial MNC to illustrate the current challenges and business impacts, and on how building a seamless big data architecture can help solve the problems and continuously benefit.

Photo by Franki Chamaki on Unsplash

Table of Contents:

1. Introduction

2. Needs and Requirements

3. Data Sources

4. Key Questions To Think Through

4.1. Key question 1

4.2. Answer

4.3. Key question 2

4.4. Scenario & Answer

4.5. Key question 3

4.6. Scenario & Answer

4.7. Key question 4

4.8. Scenario & Answer


A multi-national company ABC provides financial services including investment services and insurance services using…

The right and wrong about data deletion in production databases

It has been a while since I noticed this question and started to think about the severity of it. I recalled about the past experiences on how I have dealt with data and on the reasons, and I realized that it was seldom brought onto the table and open for discussions. While, we all know that production data is critical, but how to keep it neat and tidy is a topic worth discussing. What I will discuss now are based on experiences from previous and current jobs, the mistakes I…

Model Engineering — random forest parameter tuning

By Angela Compagnone from unsplash

From last post, we can know that:

  • Random forest algorithm can be used for regression problems
  • It typically provides very high accuracy

But that is also the reason Random Forest can cause overfitting.

In this post, we will use GridSearchCV with Cross Validation to tune the parameters and improve the model performance.

# print out default parameters
from pprint import pprint


{'cv': 10,
'error_score': nan,
'estimator': RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',

Sweet treats that will make your analytics easier

1. ERD (Entity Relationship Diagrams) rendering

It provides a very convenient way to visualize the whole schema and relationships of every entity. It also allows to download the schema as image.

Model Engineering — select and evaluate linear / non-linear models

from unsplash

From previous post, we can know that:

  • Knowing that this dataset is non-linear, using linear regression will not return meaningful results. However, it is always good to have a baseline. Below linear regressions are tryout to give a baseline.
  • Use non-linear regression algorithm MLP Regressor and Random Forest Regressor.
  • Use GridSearchCV to split the dataset to 10 folds.

— — — — — —USE OF LINEAR MODELS— — — — — —

Use of Linear Regression, SGDRegressor and Support Vector Regression:

from sklearn.linear_model import LinearRegression, SGDRegressordef fit_model_cross_validate(algorithm, parameters, X_train, y_train, X_test, y_test):
gs = GridSearchCV(algorithm, parameters, cv=10, verbose=10), y_train)
print(‘Best parameters:’, gs.best_params_)…


Writing articles to test own knowledge depth

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store