Integrating Databricks with Azure DW, Cosmos DB & Azure SQL (part 1 of 2)

I tweeted a data flow earlier today that walks through an end-to-end ML scenario using the new Databricks on Azure service (currently in preview). It also includes the orchestration pattern for ETL (populating tables, transforming data, loading into Azure DW, etc.), as well as the creation of the SparkML model, which is stored in Cosmos DB along with the recommendations output. Here is a refresher:

[Image: Databricks data flow on Azure]

Some nuances that are really helpful to understand:

Reading data in as CSV but writing results as Parquet. This Parquet file is then the input for populating a SQL DB table as well as the normalized DIM table in SQL DW, both of which go by the same name.

Selecting the latest Databricks on Azure version (version 4.0 as of 2/10/18).

Using #ADLS (Data Lake Storage, my preference) and/or Blob storage.

Azure #ADFv2 (Data Factory v2) makes it incredibly easy to orchestrate data movement into Azure from third-party clouds like S3, or from on-premises data sources in a hybrid scenario, with the scheduled and tumbling-window triggers one needs for effective data pipelines in the cloud.

I love how easy it is to connect BI tools as well. Power BI Desktop can connect to any ODBC data source, and specifically to your Databricks clusters by using the Databricks ODBC driver. Power BI Service is a fully managed web application running in Azure. As of November 2017, it only supports Spark running on HDInsight. However, you can create a report using Power BI Desktop and upload it to an Azure service.

The next post will cover using @databricks on @Azure with #Event Hubs!

KPIs in Retail & Store Analytics

I like this post. While I added some KPIs to their list, I think it is a good list to get retailers on the right path…

KPIs in Retail and Store Analytics (continuation of a post made by Abhinav on
A) If it is a classic brick and mortar retailer:

Retail / Merchandising KPIs:

-Average Time on Shelf

-Item Staleness

-Shrinkage % (includes things like spoilage, shoplifting/theft and damaged merchandise)

Marketing KPIs:

-Coupon Breakage and Efficacy (which coupons drive desired purchase behavior vs. detract)

-Net Promoter Score (“How likely are you to recommend xx company to a friend or family member” – this is typically measured during customer satisfaction surveys and depending on your organization, it may fall under Customer Ops or Marketing departments in terms of responsibility).

-Number of trips (in person) vs. e-commerce site visits per month (tells you if your website is more effective than your physical store at generating shopping interest)
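For the Net Promoter Score item above, here is a minimal Python sketch of the conventional calculation: the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 through 6) on a 0 to 10 survey scale. The function name and the sample ratings are hypothetical and only there for illustration.

```python
from typing import Sequence


def net_promoter_score(ratings: Sequence[int]) -> float:
    """Return NPS on a -100..100 scale from 0-10 survey ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)    # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)   # ratings of 0 through 6
    # Passives (7 or 8) count toward the total but toward neither group.
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical survey responses, for illustration only.
sample_ratings = [10, 9, 8, 7, 6, 3, 10, 9, 5, 10]
print(net_promoter_score(sample_ratings))
```

Because passives dilute the score without moving it in either direction, two stores with the same share of promoters can still end up with quite different scores.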

B) If it is an e-retailer :

Marketing KPIs:

-Shopping Cart Abandonment %

-Page with the Highest Abandonment

-Dwell time per page (indicates interest)

-Clickstream path for purchasers (as Jamie mentioned, do they arrive via email, a promotion, or a flash-sale source like Groupon?) and, if so, which clickstream paths they take. This should look like an upside-down funnel, with the visitors / unique users who enter your site at the top, followed by the various paths (pages) they view en route to a purchase.

-Clickstream path for visitors (take Expedia for example…Many people use them as a travel search engine but then jump off the site to buy directly from the travel vendor – understanding this behavior can help you monetize the value of the content you provide as an alternate source of revenue).

-Visit to Buy %

-If direct email marketing is part of your strategy, analyzing click rate is a close second to measuring conversion rate. These are two different KPIs, one the king, the other the queen, and both are necessary to understand how effective your email campaign was and whether it warranted the associated campaign cost.
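Several of the e-retail marketing KPIs above are simple ratios, so a rough Python sketch may help pin them down. The definitions here are assumptions, since organizations define these differently: abandonment is measured against carts created, visit to buy against site visits, and both email rates against emails delivered. All function names are hypothetical.

```python
def pct(part: float, whole: float) -> float:
    """Percentage helper that returns 0.0 when the denominator is zero."""
    return 100.0 * part / whole if whole else 0.0


def cart_abandonment_pct(carts_created: int, orders_completed: int) -> float:
    # Assumed definition: carts that never became an order, over all carts.
    return pct(carts_created - orders_completed, carts_created)


def visit_to_buy_pct(visits: int, orders_completed: int) -> float:
    # Assumed definition: completed orders over total site visits.
    return pct(orders_completed, visits)


def email_click_rate_pct(delivered: int, clicks: int) -> float:
    # Assumed definition: clicks over emails delivered.
    return pct(clicks, delivered)


def email_conversion_rate_pct(delivered: int, orders: int) -> float:
    # Assumed definition: orders attributed to the campaign over emails delivered.
    return pct(orders, delivered)


# Hypothetical monthly numbers, for illustration only.
print(cart_abandonment_pct(carts_created=200, orders_completed=50))
print(visit_to_buy_pct(visits=1000, orders_completed=50))
```

Whatever definitions you settle on, keeping the denominators explicit like this makes it easier to compare click rate and conversion rate for the same campaign.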

Site Operations KPIs / Marketing KPIs:

-Error % Overall

-Error % by Page (this is highly correlated to the Pages that have the Highest Abandonment, which means you can fix something like the reason for the error, and have a direct path to measure the success of the change).

Financial KPIs:

-Average order size per transaction

-Average sales per transaction

-Average number of items per transaction

-Average profit per transaction

-Return on capital invested

-Margin %

-Markup %
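As a rough illustration of how a few of the financial KPIs above relate to one another, here is a small Python sketch. It assumes each transaction carries a revenue, a cost of goods sold and an item count (the `Transaction` class and its field names are hypothetical), and it uses the usual definitions: margin % is profit over revenue, while markup % is profit over cost.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transaction:
    revenue: float  # total sales value of the transaction
    cost: float     # cost of goods sold for the transaction
    items: int      # number of items in the basket


def financial_kpis(transactions: List[Transaction]) -> Dict[str, float]:
    """Aggregate per-transaction KPIs; assumes total revenue and cost are non-zero."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    n = len(transactions)
    revenue = sum(t.revenue for t in transactions)
    cost = sum(t.cost for t in transactions)
    items = sum(t.items for t in transactions)
    profit = revenue - cost
    return {
        "avg_sales_per_transaction": revenue / n,
        "avg_items_per_transaction": items / n,
        "avg_profit_per_transaction": profit / n,
        "margin_pct": 100.0 * profit / revenue,  # profit as a share of revenue
        "markup_pct": 100.0 * profit / cost,     # profit as a share of cost
    }


# Hypothetical transactions, for illustration only.
kpis = financial_kpis([Transaction(100.0, 60.0, 2), Transaction(50.0, 40.0, 1)])
for name, value in kpis.items():
    print(f"{name}: {value:.2f}")
```

Margin and markup are computed from the same profit figure, so the only difference between them is the denominator, which is why markup always comes out higher than margin when there is a profit.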

I hope this helps. Let me know if you have any questions.

You can reach me at mailto:// or you can visit my blog where I have many posts listing out various KPIs by industry and how to best aggregate them for reporting and executive presentation purposes ( ).

It was very likely that I would write on KPIs in Retail or Store Analytics since my last post on Marketing and Customer Analytics. The main motive behind retailers looking into BI is the customer: how they can quickly react to changes in customer demand (or rather, predict customer demand), remove wasteful spending through targeted marketing, exceed customer expectations, and hence improve customer retention.

I did some quick research on what companies have been using as measures of performance in the retail industry and compiled a list of KPIs that I would recommend for consideration.

Customer Analytics

With the customer being key to this industry, it is important to segment customers, especially for strategic campaigns, and to develop relationships for maximum customer retention. Understanding customer requirements and dealing with ever-changing market conditions is key for a retailer to survive the competition.

  • Average order size per transaction
  • Average sales per transaction


To go Logical or Not…That is the Question?!


Figure 1: Basic star schema for typical point of sale data-mart

In the current design, the database does not enforce the integrity of the data with respect to the definition of the fact tables. The basic fact table should contain rows only for the daily sales of a store, for all the products sold in a day. If we try to insert an aggregated record into the basic fact table (for week, district and brand), the database will not reject the rows. If we would like to separate the fact tables according to grain, then we have to rely on the ETL process to enforce the granularity of each fact table. Enforcing rules through the ETL process has the following drawbacks:




  • ETL processes need to persist the metadata for populating the fact table in the first figure and the aggregated fact from the 2nd figure.
  • Multiple ETL processes may be responsible for populating the fact table. In such cases, we would need to duplicate the rules in each process. By maintaining rules (business logic) in the database layer, we can minimize maintenance headaches because the logic is centrally located, making changes easier and less prone to mistakes.
  • ETL processes may not enforce rules efficiently if the process relies on SQL to enforce them. Database integrity constraints are more efficient than SQL.
  • If the ETL process relies on a programming language, then it would be very difficult to maintain the metadata in the ETL process.

Let us see how the design would change when we partition the dimension table to enforce the granularity of the fact and aggregate table. If we partition the dimension table based on the level of the data in the hierarchy, then we can use these partitioned dimension tables to join with the appropriate fact table. Figure 2 illustrates how the data model will look in the partitioned dimension table approach.

Figure 2: Constraint Enforced Point of Sale Mart

In the new design, the basic fact table joins with the basic dimension tables and the aggregate fact table joins with the higher-level dimension tables. The granularity of each fact table is thus enforced by the relational constraints defined in the database. We can use SQL UNION ALL to create a view over all the partitioned dimension tables, giving a logical design similar to the basic POS data mart described in Figure 1. We can also create views over the basic fact table and the aggregated fact table to have one fact view with multiple grains, but this depends on how the data modeler wants to present the logical model.
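To make the idea concrete, here is a minimal sketch of the partitioned dimension approach using Python's built-in sqlite3 module. It is heavily simplified and rests on several assumptions: only the product hierarchy is modeled (no store or date dimensions), there are just two levels (product and brand), and every table, column and view name is hypothetical rather than taken from the figures. Brand keys are deliberately placed in a different range from product keys so that a row at the wrong grain cannot match by accident.

```python
import sqlite3

SCHEMA = """
CREATE TABLE brand_dim (
    brand_key  INTEGER PRIMARY KEY,
    brand_name TEXT NOT NULL
);
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    brand_key    INTEGER NOT NULL REFERENCES brand_dim (brand_key)
);
-- Basic fact table: may only reference product-level dimension rows.
CREATE TABLE daily_sales_fact (
    product_key INTEGER NOT NULL REFERENCES product_dim (product_key),
    units_sold  INTEGER NOT NULL
);
-- Aggregate fact table: may only reference brand-level dimension rows.
CREATE TABLE brand_sales_agg (
    brand_key  INTEGER NOT NULL REFERENCES brand_dim (brand_key),
    units_sold INTEGER NOT NULL
);
-- Logical view that presents the partitioned dimensions as one dimension.
CREATE VIEW product_dim_all AS
    SELECT product_key AS dim_key, 'product' AS level, product_name AS name
    FROM product_dim
    UNION ALL
    SELECT brand_key, 'brand', brand_name
    FROM brand_dim;
INSERT INTO brand_dim VALUES (1000, 'Acme');
INSERT INTO product_dim VALUES (1, 'Acme Cola 12oz', 1000);
"""


def build_mart() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn: sqlite3.Connection, table: str, key: int, units: int) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (key, units))
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_mart()
print(try_insert(conn, "daily_sales_fact", 1, 5))      # product-level row
print(try_insert(conn, "daily_sales_fact", 1000, 50))  # brand-level row, wrong grain
print(try_insert(conn, "brand_sales_agg", 1000, 50))   # brand-level row, right table
print(conn.execute("SELECT * FROM product_dim_all ORDER BY dim_key").fetchall())
```

The brand-level row is rejected by the basic fact table because its key exists only in the brand dimension, so the database itself guards the grain, not the ETL process. The view then gives report writers the single logical dimension they would have seen in the basic design.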