Music has always been an integral part of human culture. From ancient times to modern-day, music has evolved and transformed in various ways. With the advent of technology, music creation and discovery have taken a wild and increasingly influential new turn. Enter Generative AI – For those living under a rock, Artificial Intelligence (AI) is a technology that has revolutionized the way we do almost everything, including how we create and discover music.

@OpenAI #Jukebox (https://www.openai.com/research/jukebox)is a prime example of how AI is bridging the gap between music and future technology. #OpenAI-Jukebox “produces a wide range of music and singing styles, and generalizes to lyrics not seen during training. All the lyrics below have been co-written by a language model and OpenAI researchers” that can generate original songs and tracks in various genres and styles similar to creativity powerhouse known simply as DALL-E for image creation. It uses deep learning algorithms to analyze existing songs’ patterns and structures to create new ones.

So does the future of music creation and discovery lie in generative AI tools like OpenAI Jukebox or design and art creation in tools like DALL-E? Either way, it’s the season of all things #OpenAI. These days you can’t escape a chatGPT meme or SNL skit powered by the little chatterbox, providing endless possibilities and countless hours of entertainment for experimentation with different words, phrases, images, sounds, styles, and genres. And did I mention they can also help you to work smarter not harder with e-mail responses and documentation creation or even building apps for you ( ie – code scripting )?

Generative AI tools like OpenAI Jukebox are not only limited to creating new songs but also have the ability to remix existing ones. This opens up a whole new world of possibilities for artists who want to experiment with their work or collaborate with other musicians.

The use of generative AI tools in music creation also raises questions about copyright laws and ownership rights. As these tools become more advanced, it will be interesting to see how they impact traditional copyright laws.
As a former EDM DJ, Im excited to see where OpenAI takes its Jukebox research – it is just one example of how AI can revolutionize the world of music creation and discovery IMHO. As technology continues to evolve at this crazy rapid pace, it’s exciting to think about what other possibilities lie ahead for the world of AI and ( fill in here ). The future looks bright for humans and  musicians and artists and fans alike as we march on this yellow brick journey towards what ? Emerald city or a more innovative musical landscape powered by artificial intelligence? Who knows ? Let’s ask ChatGPT! 

DevOps for Data Science – Part 1: ML Containers Becoming an Ops Friendly Citizen

So, for the longest time, I was in the typical Data Scientist mind space (or at least, what I personally thought was ‘typical’) when it came to CI/CD for my data science project implementations –> DevOps was for engineering / app dev projects; or lift and shift/Infrastructure projects. Not for the work I did – right? Dev vs. IT (DevOps) seemed to battle on while we Data Scientists quietly pursued or end results outside of this traditional argument:

1_lSkCi_qyxIeNtSF1o71NFQ

At the same time, the rise & subsequent domination of K8 (Kubernetes) and Docker et al <containerization> was the 1st introduction I had to DevOps for Data Science. And, in the beginning, I didn’t get it, if I am being honest. Frankly, I didn’t really get the allure of containers <that was my ignorance>. When most data scientists start working, they realize that the majority of data science work involve getting data into the format needed for the model to use. Even beyond that, the model being developed will need to be operationalized as part of some type of web/mobile/custom application for the end user.

Now most of us data scientists have the minimum required / viable processes to handle things like versioning / source control et al. Most of us have our model versions controlled on Git. But is that enough?

 

It was during an Image Recognition workshop that I was running for a customer that required several specific image pre-processing & deep learning libraries in order to effectively script out an end to end / complete image recognition + object detection solution – In the end, it was scripted using Keras on Tensorflow (on Azure) using the CoCo Stuff 2018 dataset + YOLO real-time object detection that I augmented with additional images/labels specific to my use case & industry (aka ‘Active Learning’):

Active Learning Workflow
Active Learning Workflow

Active Learning is an example of semi-supervised learning in which an algorithm interactively asks for more labeled data in order to affect model performance positively.

Labeling is often rushed because it doesn’t carry the cache of other steps in the typical data science workflow – And getting the data preprocessed (in this case , Images and Labels) is a necessary evil if you want to achieve better model performance in terms of accuracy, precision, recall, F1 – whichever, given a specific algorithm in play & its associated model evaluation metric(s) :

datascienceworkflows

And it was during the setup/installation of these libraries when it occurred to me so clearly the benefit of  Data Science containerization – If you have always scripted locally or on a VM, you will understand the pain of maintaining library / package versions whether using python or R or Julia or whatever the language du jour you use to script your model parameters / methods etc.

And when version conflicts come into play, you know how much time gets wasted searching /  Googling / Stack Overflowing a solution for a resolution (ooh, those version dependency error messages are my FAVE <not really, sigh…but I digress>…

Even when you use Anaconda or miniconda “conda” for environment management, you are cooking with gas until you are not: like when your project requirements demand you pip/conda install very rare/specific libraries/packages that have other pkg version dependencies / prerequisites only to hit an error during the last package install step advising that some other upstream pkg version that is required is incorrect / outdated thus causing your whole install to roll back. Fun times <and this is why Cloud infrastructure experts exist>; but it takes away from what Data Scientists are chartered with doing when working on a ML/DL project. <Sad but true: Most Data Scientists will understand / commiserate what I am describing as a necessary evil in today’s day and age.>

OK and now we are back: Enter containers – how simple is it to have a Dockerfile (for example) which contains all the commands a user could call via the CLI to assemble an image including all of the packages/libraries and their dependencies by version for a set python kernel <2 or 3> and version (2.6, 2.7, 3.4, 3.5, 3.6 etc ) for this specific project that I described above? Technically speaking, Docker can build images automatically by reading the instructions from this Dockerfile. Further, using docker,  build users can create an automated build that executes several command-line instructions in succession. –> Right there, DevOps comes clearly into picture where the benefits of environment management (for starters) and the subsequent time savings / headache avoidance becomes greater than the learning curve for this potentially new concept.

There are some other points to note to make this happen in the real world: Something like VSTS would need to be wrapped into a Docker Image, which would then be put on a Docker container registry on a cloud provider like Azure. Once on the registry, it would be orchestrated using Kubernetes.

Right about now, your mind is wanting to completely shut down. Most data scientists know how to provide a CSV file with predictions / or a scoring web service centered on image recognition/ classification handed off to  a member of your AppDev team to integrate / code into an existing app.

However, what about versioning / controlling the model version ? Each time you hyper tune parameters within a model you are potentially changing the model performance – How do you know which set of ‘tunes’ resulted in the highest evaluation post scoring? I think about this all of the time because even if you save your changes in distinct notebooks (using JupyterHub et al), you have to be very prescriptive on your naming conventions to reflect the changes made to compare side by side across all changes during each tuning session you conduct.

This doesn’t even take into account once you pick the best performing model, actually implementing version control for the model that has been operationalized in production and the subsequent code changes required to consume it via some business app/process. How does the typical end user interact with the operationalized scoring system once introduced to them via the app? How will it scale!? All this would involve confidence testing, checking against a set threshold, and triggering some type of closed loop action system when anomalies are detected. Plus, how do you get sign off from different parties and orchestration between different cloud & on-premise servers that support the business process (with all the corporate firewall / networking / data movement / storage / encryption requirements & rules)? Maybe you have others to think about this – But if you want to be a data scientist worth having the overly used moniker applied to your role, you should care enough to learn about DevOps and how you can be a better corporate citizen & not just the Rockstar Data Scientist who alienates everyone to get to the root cause. IMHO:

This should be part of your Data Scientist process. Period. Hard stop. Not only for you, but for others that come after you or are on your team. No need to reinvent the wheel-Plus, for organizations that have strict CI/CD / DevOps procedures and limited Ops staff, the automation that you can bring with your project deliverables will win you favor, at a minimum, for considering this vital aspect to all other appDev type projects / roles in your company.

Integrating Databricks with Azure DW, Cosmos DB & Azure SQL (part 1 of 2)

I tweeted a data flow earlier today that walks through an end-to-end ML scenario using the new Databricks on Azure service (currently in preview). It also includes the orchestration pattern for ETL (populating tables, transforming data, loading into Azure DW etc), as well as the SparkML model creation stored on CosmosDB along with the recommendations output. Here is a refresher:

Some ndatabricksDataflowonAzureuances that are really helpful to understand: Reading data in as CSV but writing results as parquet. This parquet file is then the input for populating a SQL DB table as well as the normalized DIM table in SQL DW both by the same name.

Selecting the latest Databricks on Azure version (4.0 version as of 2/10/18).

Using #ADLS (DataLake Storage , my pref) &/or blob.

Azure #ADFv2 (Data Factory v2) makes it incredibly easy to orchestrate the data movement from 3rd party clouds like S3 or on-premise data sources in a hybrid scenario to Azure with the scheduling / tumbling one needs for effective data pipelines in the cloud.

I love how easy it is to connect BI tools as well.  Power BI Desktop can connect to any ODBC data source and specifically to your Databricks clusters by using the Databricks ODBC driver. Power BI Service is a fully managed web application running in Azure. As of November 2017, it only supports Spark running on HDInsight. However, you can create a report using Power BI Desktop and upload it to an Azure service.

The next post will cover using @databricks on @Azure with #Event Hubs !

Wonderful World of Sports: Hey NFL, Got RFID?

As requested by some of my LinkedIn followers, here is the NFL Infographic about RFID tags I shared a while back:

nfl_tech_infographic-100612792-large.idge

I hope @NFL @XboxOne #rfid data becomes more easily accessible. I have been tweeting about the Zebra deal for 6 months now, and the awesome implications this would have on everything from sports betting to fantasy enthusiasts to coaching, drafting and what have you. Similarly, I have built a fantasy football (PPR) league bench/play #MachineLearning model using #PySpark which, as it turns out, is pretty good. But it could be great with the RFID stream.
nfl-tagged-shoulder-pads-100612790-large.idge

This is where the #IoT rubber really hits the road because there are so many more fans of the NFL than there are folks who really grok the “Connected Home” (not knocking it, but it doesn’t have the reach tentacles of the NFL). Imagine measuring the burn-rate output vs. performance degradation of these athletes mid game and one day, being able to stream that on the field or booth for game course corrections. Aah, a girl can only dream…

Utilizing #PredictiveAnalytics & #BigData To Improve Accuracy of #EPM Forecasting Process

I was amazed when I read the @TidemarkEPM awesome new white paper on the “4 Steps to a Big Data Finance Strategy.” This is an area I am very passionate about; some might say, it’s become my soap-box since my days as a Business Intelligence consultant. I saw the dawn of a world where EPM, specifically, the planning and budgeting process was elevated from gut feel analytics to embracing #machinelearning as a means of understanding which drivers are statistically significant from those that have no verifiable impact , and ultimately using those to feed a more accurate forecast model.

Big Data Finance

Traditionally (even still today), finance teams sit in a conference room with Excel spreadsheets from Marketing, Customer Service etc., and basically, define the current or future plans based on past performance mixed with a sprinkle of gut feel (sometimes, it was more like a gallon of gut feel to every tablespoon of historical data). In these same meetings just one quarter later, I would shake my head when the same people questioned why they missed their targets or achieved a variance that was greater/less than the anticipated or expected value.

The new world order of Big Data Finance leverages the power of machine learned algorithms to derive true forecasted analytics. And this was a primary driver for my switching from a pure BI focus into data science. And, I have seen so many companies embrace the power of true “advanced predictive analytics” and by doing so, harness the value and benefits of doing so; and doing so, with confidence, instead of fear of this unknown statistical realm, not to mention all of the unsettled glances when you say the nebulous “#BigData” or “#predictiveAnalytics” phrases.

But I wondered, exactlyBig Data Finance, Data Types, Process Use Cases, Forecasting, Budgeting, Planning, EPM, Predictive, Model how many companies are doing this vs. the old way? And I was very surprised to learn from the white-paper that  22.7% of people view predictive capabilities as “essential” to forecasting, with 52.2% claiming it nice to have.  Surprised is an understatement; in fact, I was floored.

We aren’t just talking about including weather data when predicting consumer buying behaviors. What about the major challenge for the telecommunications / network provider with customer churn? Wouldn’t it be nice to answer the question: Who are the most profitable customers WHO have the highest likelihood of churn? And wouldn’t it be nice to not have to assign 1 to several analysts xx number of days or weeks to be able to crunch through all of the relevant data to try to get to an answer to that question? And still probably not have all of the most important internal indicators or be including indicators that have no value or significance to driving an accurate churn outcome?

What about adding in 3rd party external benchmarking data to further classify and correlate these customer indicators before you run your churn prediction model? To manually do this is daunting and so many companies, I now hypothesize, revert to the old ways of doing the forecast. Plus, I bet they have a daunting impression of the cost of big data and the time to implement because of past experiences with things like building the uber “data warehouse” to get to that panacea of the “1 single source of truth”…On the island of Dr. Disparate Data that we all dreamt of in our past lives, right?

I mean we have all heard that before and yet, how many times was it actually done successfully, within budget or in the allocated time frame? And if it was, what kind of quantifiable return on investment did you really get before annual maintenance bills flowed in? Be honest…No one is judging you; well, that is, if you learned from your mistakes or can admit that your pet project perhaps bit off too much and failed.

And what about training your people or the company to utilize said investment as part of your implementation plan? What was your budget for this training and was it successful,  or did you have to hire outside folks like consultants to do the work for you? And by doing so, how long did it actually take the break the dependency on those external resources and still be successful?

Before the days of Apache Spark and other Open Source in-memory or streaming technologies, the world of Big Data was just blossoming into what it was going to grow into as a more mature flower. On top of which, it takes a while for someone to fully grok a new technology, even with the most specialized training, especially if they aren’t organically a programmer, like many Business Intelligence implementation specialists were/are. That is because those who have past experience with something like C++, can quickly apply the same techniques to newer technologies like Scala for Apache Spark or Python and be up and running much faster vs. someone who has no background in programming trying to learn what a loop is or how to call an API to get 3rd party benchmarking data. We programmers take that for granted when applying ourselves to learning something new.

And now that these tools are more enterprise ready and friendly with new integration modules with tools like R or MATLib for the statistical analysis coupled with all of the free training offered by places like University of Berkeley (via eDX online), now is the time to adopt Big Data Finance more than ever.

In a world where the machine learning algorithm can be paired with traditional classification modeling techniques automatically, and said algorithms have been made publicly available for your analysts to use as a starting point or in their entirety for your organization, one no longer needs to be daunted by thought of implementing Big Data Finance or testing out the waters of accuracy to see if you are comfortable with the margin of error between your former forecasting methodology and this new world order.

2015 Gartner Magic Quadrant – Boundaries Blur Between BI & Data Science

2015 Magic Quadrant Business intelligence

2015 Magic Quadrant Business intelligence

…a continuing trend which I gladly welcome… 

IT WAS the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us…

–Charles Dickens

Truer words were never spoken, whether about the current technological times or deep in our past (remember the good ole enterprise report books, aka the 120 page paper weight?)

And, this data gal couldn’t be happier with the final predictions made by Gartner in their 2015 Magic Quadrant Report for Business Intelligence. Two major trends / differentiators fall right into the sweet spot I adore:

New demands for advanced analytics 

Focus on predictive/prescriptive capabilities 

Whether you think this spells out doom for business intelligence as it exists today or not, you cannot deny that these trends in data science and big data can only force us to finally work smarter, and not harder (is that even possible??)

What are your thoughts…?