Business scenario and technical implementation of worldwide currency conversion with Redshift and Holistics
Hi there! When working on a profit-related analysis project and creating a visualisation to showcase your results across regions, a currency filter is often required. Currency conversion is very common across industries such as e-commerce and finance.
We will need a dimension table that holds the universal currency list and conversion rates at daily granularity. The base currency is set to USD; every other currency's conversion rate is expressed against USD as of its effective date.
CREATE TABLE dim_currency AS
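-- (the excerpt truncates here; below is a hedged sketch of how the statement
--  might continue, assuming a hypothetical staging table stg_exchange_rates)
SELECT
    currency_code,     -- ISO 4217 code such as 'EUR'
    effective_date,    -- one row per currency per day
    rate_to_usd        -- conversion rate to the USD base as of effective_date
FROM stg_exchange_rates;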
Small tips for Hive queries
This article only covers optimisation at the DML layer, with an explanation of the underlying Hadoop MapReduce mechanisms.
Suppose we have two tables: fact_order and dim_customer. fact_order keeps the full customer order transaction history, and dim_customer is the main dimension table for customer attributes.
The table structures are as follows:
hive> desc fact_order;
OK
order_id            string     None
customer_id         string     None
product_name        string     None
transaction_id      string     None
order_status        string     None
transaction_status  string     None
create_time         timestamp  None
update_time         timestamp  None
hive> desc dim_customer;
OK
id                  string     None
name                string     None
gender              string     None
age…
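To make the kind of DML-level optimisation the article discusses concrete, here is a generic sketch (not the article's exact tip; the filter value is illustrative): the MAPJOIN hint asks Hive to broadcast the small dimension table to every mapper instead of running a shuffle-heavy reduce-side join.

-- broadcast dim_customer to the mappers; filter early to cut map output
SELECT /*+ MAPJOIN(c) */ o.order_id, o.product_name, c.name
FROM fact_order o
JOIN dim_customer c
  ON o.customer_id = c.id
WHERE o.order_status = 'COMPLETED';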
The Elegant Way of Deletion in Databases
Manipulating data is common for IT professionals, including database administrators, developers, data analysts and data scientists, and playing with data in a database is simple and straightforward. SQL is a specialized language for database operations and is very easy to understand and use. Unlike in C or other object-oriented programming languages, we try not to write SQL logic or function blocks except for built-in logic working with applications. We usually don't define variables, and we try to avoid looping at all costs. For example, if I want to retrieve some data, I simply type…
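The query the excerpt trails off into is not shown; as a generic illustration of that declarative style (the table and column names here are made up), a retrieval needs no variables and no loops:

SELECT *
FROM orders
WHERE order_status = 'cancelled';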
Practical Redshift knowledge to speed up the analytics process
Time zones come up constantly in Business Intelligence analytics, where reports often need to convert between them. In this article, I will share how Redshift handles time zone conversions.
Redshift provides a built-in timezone table covering time zones worldwide.
SELECT tz.* FROM pg_timezone_names() tz(name text, abbrev text, utc_offset interval, is_dst boolean);
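Once you know the timezone names, the conversion itself is usually done with Redshift's built-in CONVERT_TIMEZONE function. A minimal illustration, assuming a fact_order table with a UTC create_time column (both names are assumptions, not from the article):

-- convert a UTC timestamp to Singapore time
SELECT order_id,
       CONVERT_TIMEZONE('UTC', 'Asia/Singapore', create_time) AS create_time_sgt
FROM fact_order;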
Demonstration of building a real-time data pipeline using StreamSets
StreamSets is a modern data streaming and integration platform built by StreamSets, Inc. It is used by many multinational companies such as Shell and Dell.
StreamSets Data Collector (SDC) is an open-source data ingestion tool and one part of the StreamSets DataOps platform; you can download it here to try it out. Today we will use SDC to demonstrate real-time data transformation from Aurora Postgres to Kinesis Firehose.
1. Create a new pipeline by clicking the blue button on the top left of the StreamSets UI, and type in pipeline…
In this case study, I will use the example of a financial MNC to illustrate the current challenges and business impacts, and show how building a seamless big data architecture can help solve them and deliver ongoing benefits.
Table of Contents:
2. Needs and Requirements
3. Data Sources
4. Key Questions To Think Through
4.1. Key question 1
4.2. Scenario & Answer
4.3. Key question 2
4.4. Scenario & Answer
4.5. Key question 3
4.6. Scenario & Answer
4.7. Key question 4
4.8. Scenario & Answer
A multinational company, ABC, provides financial services, including investment and insurance services, using…
The rights and wrongs of data deletion in production databases
It has been a while since I noticed this question and started to think about how serious it is. Recalling how I have dealt with data in the past and why, I realized the topic is seldom brought to the table for open discussion. We all know that production data is critical, but how to keep it neat and tidy is a topic worth discussing. What I will discuss now is based on experiences from previous and current jobs, the mistakes I…
Model Engineering — random forest parameter tuning
From the last post, we know that:
But that is also the reason a Random Forest can overfit.
In this post, we will use GridSearchCV with cross-validation to tune the parameters and improve model performance.
# print out default parameters
from pprint import pprint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
pprint(GridSearchCV(RandomForestRegressor(), {}).get_params())
# output (truncated):
# 'estimator': RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
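As a rough sketch of the tuning step itself, assuming training data named X_train and y_train; the grid values below are illustrative, not the article's actual search space:

param_grid = {
    'n_estimators': [100, 300, 500],   # number of trees in the forest
    'max_depth': [None, 10, 30],       # cap tree depth to curb overfitting
    'min_samples_leaf': [1, 2, 4],     # larger leaves smooth predictions
}
gs = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
gs.fit(X_train, y_train)               # X_train / y_train assumed to exist
print(gs.best_params_)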
Sweet treats that will make your analytics easier
It provides a very convenient way to visualize the whole schema and the relationships between entities. It also lets you download the schema as an image.
Model Engineering — select and evaluate linear / non-linear models
From the previous post, we know that:
Use of Linear Regression, SGDRegressor and Support Vector Regression:
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import GridSearchCV

def fit_model_cross_validate(algorithm, parameters, X_train, y_train, X_test, y_test):
    gs = GridSearchCV(algorithm, parameters, cv=10, verbose=10)
    gs.fit(X_train, y_train)  # must fit before best_params_ is available
    print('Best parameters:', gs.best_params_)…
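An illustrative call, with a hypothetical parameter grid rather than the article's actual values:

fit_model_cross_validate(LinearRegression(),
                         {'fit_intercept': [True, False]},
                         X_train, y_train, X_test, y_test)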
Writing articles to test the depth of your own knowledge