data engineering

Five steps to get you started with Machine Learning / Artificial Intelligence

If you have ever wondered “How do I get myself up to speed with Machine Learning/Artificial Intelligence?” or “Why the hype now for a term, artificial intelligence, that has existed since the 1950s?”, this post may be helpful for you. I will refer to Machine Learning/Artificial Intelligence as ML/AI for the rest of the post.

There are multiple business challenges in any organization. There could be use cases related to reducing manual human errors, automating a business process, providing better customer service, recommending a product, understanding sentiment, getting insights on trends, predicting natural disasters, estimating vehicle damage, summarizing multiple documents, processing huge volumes of documents, or predicting faults/anomalies. The list goes on. The challenges are endless and the technology is ever evolving.

We are seeing continuous advancements in various fields, but do you have to be hands-on to be an expert? Not necessarily. My colleague Steve Walker mentioned this once: “Do you have to be an expert in the design of an F35 aircraft to fly it or do you have to know just to fly?”

[Image from: https://xkcd.com/1838/]

None of us will become Data Scientists overnight; however, there is a lot we can learn. Job roles span a wide spectrum: some need hands-on experience, some need the ability to architect an enterprise solution, and some are leadership roles guiding a team through a strategy.

Mahatma Gandhi once said, “Live as if you were to die tomorrow. Learn as if you were to live forever.”

Here, I am planning to give you some quick tips on a step-by-step approach to learning ML/AI. In a follow-up post, I will also provide recommendations if you are looking to get hands-on experience.

Step 0: Understand the definition of ML/AI

The Machine Learning Glossary by Google provides the definition below:

Artificial Intelligence is a non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence. Formally, machine learning is a subfield of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

I often see the terms used as synonyms. I would differentiate them by usage. Netflix recently pitched the idea of using eye tracking to navigate screens. This would fall under Artificial Intelligence, whereas Netflix using a recommendation engine to predict your next video would fall under Machine Learning. Artificial Intelligence is a moving target: as technology advances in several fields, what counts as AI keeps evolving, whereas Machine Learning deals with predictive and/or reinforcement behavior.

Step 1: Understand the glossary

As you would expect, there are many items to know in ML/AI. I would like to highlight the terms below for you to get familiar with.

Supervised vs Unsupervised vs Reinforcement Learning
Training vs Evaluation vs Inference
Chatbots, Natural Language Processing/Understanding/Generation, Sentiment Analysis

Priyanka Vergadia walks you through the key things to learn in Machine Learning.

If you have time and would like to dig a little deeper, below is some other quick review material to help you get your arms around the topic.
Machine Learning is Fun
Making friends with machine learning
Machine Learning Crash Course
ML Glossary
towardsdatascience.com (This is a Medium link but has great content to continue following)

Optional Reading
Rules of Machine Learning
Machine Learning: The High Interest Credit Card of Technical Debt
Human Centered Approach to AI

Step 2: Understand three core pillars

For an AI-driven solution, there are three core pillars: Data, Algorithms and Compute. For most conversations, understanding the terminology and glossary should be adequate. However, I would like to highlight the most important of them all: Data.

Data fuels algorithms. Anyone who has worked with ML/AI will tell you it’s one of the prime examples of “garbage in, garbage out”. If your data fails, none of the sophisticated models will work. It’s important to understand what data exploration, data wrangling, data cleansing, data mining and data transformation are. These concepts are generic, and a quick search should help. I liked this article from VentureBeat, which explains the importance of data for ML/AI as one of the top reasons why enterprises fail at their strategy. It is also important to understand how enterprises choose to build their data lake/data mart/data pond/data river, or whatever they decide to call it.

Algorithms and Compute: though these are core pillars of Machine Learning, they generally come into play once the AI/ML project is kicked off. Most times these decisions fall upon the Data Scientists, Data Engineers and Architects, based on the use case, security concerns, familiarity with the tool stack, etc.

Step 3: Understand the players in this market

Every cloud provider has its own unique strengths in its ML/AI portfolio. But these cloud providers are not the only ones; there are a lot of niche players in the market to keep a watch on. Below are just some of the players offering products for customers to build on.
DataRobot, H2O.ai, Dataiku, Alteryx, Databricks

Besides these companies, there are SaaS providers offering AI solutions for most industries, such as Banking, Insurance, Health care, Retail and Manufacturing.

Symphony Retail AI - Grocery store with AI
Mitchell Intelligent Estimating - Vehicle Damage Estimating Platform for Insurance
PathAI - Accurate diagnosis of diseases

The list goes on, and it helps you understand how large this space really is and how every company focuses on making its customers’ lives easier.

Step 4: Follow technologists and leaders in this space

There are many technologists in this space; follow them on social media. Most of them post great content for you to follow and understand. From them I pick up recent trends, see what they are working on and understand how the technology evolves. I created a Twitter list. Do you have someone you follow? Send them to me so we can create a curated list.

Step 5: Understand the principles major technology companies have for their governance

In November 2019, claims surfaced that the Apple Card, issued by Goldman Sachs, gave men substantially higher credit limits than women due to bias in the system. As organizations accelerate their adoption journey, there needs to be an ethical process for what organizations can and cannot do. These ethical and responsible principles guide how end customers are best served: without bias, and with respect for cultural/social norms, data security and privacy.

Some major companies publicly discuss the AI principles behind their product strategy. I have highlighted two large AI players.

Google
Microsoft

As we move toward a more AI-centric world, if this fails, we as a community all fail.

To summarize, I have outlined how you could learn the keywords in ML/AI, the organizations you need to watch, how to keep yourself updated on recent trends, and Responsible AI for product strategy.
Keep learning, keep engaging, always be inquisitive and always be listening. If you have questions/comments/suggestions, please reach out to me @kanchpat
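
To make the “garbage in, garbage out” point from Step 2 a little more concrete, here is a minimal sketch of data cleansing in plain Python. The records, the field names and the 0 to 120 age range are all made up for illustration; a real project would more likely use a data-frame library and rules agreed with the business.

```python
from typing import Optional

# Hypothetical raw records; every field name and value here is illustrative.
RAW_RECORDS = [
    {"customer_id": "101", "age": "34", "country": "us"},
    {"customer_id": "102", "age": "", "country": "US "},    # missing age
    {"customer_id": "101", "age": "34", "country": "us"},   # duplicate customer
    {"customer_id": "103", "age": "-5", "country": "IN"},   # impossible age
    {"customer_id": "104", "age": "51", "country": " in"},  # untidy country code
]


def clean_record(record: dict) -> Optional[dict]:
    """Return a normalized copy of the record, or None if it cannot be trusted."""
    try:
        age = int(record["age"])
    except ValueError:
        return None  # empty or non-numeric age
    if not 0 <= age <= 120:  # assumed plausible range
        return None
    return {
        "customer_id": int(record["customer_id"]),
        "age": age,
        "country": record["country"].strip().upper(),
    }


def clean_dataset(records: list) -> list:
    """Drop untrustworthy rows and keep the first row seen per customer."""
    seen_ids = set()
    cleaned = []
    for record in records:
        row = clean_record(record)
        if row is None or row["customer_id"] in seen_ids:
            continue
        seen_ids.add(row["customer_id"])
        cleaned.append(row)
    return cleaned


CLEANED = clean_dataset(RAW_RECORDS)

if __name__ == "__main__":
    for row in CLEANED:
        print(row)
```

Only the rows for customers 101 and 104 survive here. Whether to drop a bad row, as this sketch does, or to repair it is a judgment call that depends on the use case.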

Key Roles in an AI driven organization

If you are interested in understanding how teams are formed in an AI organization or unit, this post is for you.

Most organizations have Artificial Intelligence as part of their key objectives. To help facilitate this, business units have their own version of what the key roles are and what the responsibilities in their teams look like. In this post, we will go over some of the most common key roles in an organization. We’ll look at their personas, responsibilities and the type of products most often used by each of these roles. This is by no means the entire list of personas and responsibilities in an org, just a generalization of what I have seen across enterprises.

As you can see above, each of these roles has several different pathways it could take based on its responsibilities. We will take some time to understand these roles, their backgrounds and what they normally care about.

Data Engineer

Data is the new oil. Data Engineers are responsible for making sure data makes sense to others.

Persona:
Most likely someone with a database / warehouse / data mart / data lake background
Understands the challenges related to data: data duplication, data silos, data governance issues
Has dealt with data transformation for business intelligence
Understands the difference between batch and real time
Can efficiently build a data pipeline
Sometimes involved with the infrastructure of the setup (management, provisioning, etc.)

Responsible for:
Data collection, cleanup, transformation, data pipeline

ML Ops Engineer

In some organizations, Data Engineers play this role.

Persona:
Most likely someone with exposure to Data Engineering, DevOps and Machine Learning
Understands version control for code, data and models
Can automate CI/CD/CT/CM (Continuous Integration/Deployment/Training/Monitoring)
Knows how to schedule and create workflows

Responsible for:
ML pipelines and monitoring

Data Scientist

They are the unicorns, with very few qualified data scientists available in the market. We will go over some of the reasons why in a future post.

Persona:
Has a deep understanding of the business problem
Solid grasp of statistics, data analytics, machine learning, deep learning, natural language processing, etc.
Works with programming languages to create models

Responsible for:
Building explainable models
Validating between several algorithms
Feature engineering
Detection of drift and skew

Citizen Data Scientist

This is an emerging set of roles. As most organizations look to redeploy their existing talent towards Data Science related jobs, this role becomes more prevalent, and its definitions differ.

Persona:
Has deep knowledge of the business problem
Typically a developer or a data engineer; can sometimes be a business analyst
Looking to build solutions with tools available from third parties and cloud providers

Responsible for:
Creating a solution for the identified business problem
Understanding all the options available and identifying the best of breed for accuracy, performance and cost

We saw above the key roles in an AI org and the relevant services available in Google Cloud that enable you to accelerate your learning and implementation. Though the diagram represents Google Cloud services, they could be substituted with any cloud provider or home-grown solutions. Irrespective of the options, the key path remains the same.

In future posts, we’ll look at some of these Google Cloud services in detail. In the meantime, you can review some of these resources to get further info.

AI with Google

Do you want to continue the discussion with me? Feel free to reach out at @kanchpat
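
As a rough illustration of the “detection of drift and skew” responsibility listed under Data Scientist, here is a minimal sketch that compares a feature’s values at training time with the values seen in production. It is a simple mean-shift check, not a full statistical test, and the feature values and the threshold of 1.0 standard deviation are assumptions made up for this example.

```python
import statistics


def mean_shift(training_values, serving_values):
    """Distance between the serving mean and the training mean,
    measured in training standard deviations."""
    train_mean = statistics.mean(training_values)
    train_std = statistics.stdev(training_values)
    if train_std == 0:
        raise ValueError("training values have no spread; shift is undefined")
    return abs(statistics.mean(serving_values) - train_mean) / train_std


def has_drifted(training_values, serving_values, threshold=1.0):
    """Flag drift when the shift exceeds the (assumed) threshold."""
    return mean_shift(training_values, serving_values) > threshold


# Hypothetical values of one feature (customer age) at training and serving time.
TRAINING_AGES = [30, 32, 34, 36, 38]
SERVING_SIMILAR = [31, 33, 35, 35, 36]
SERVING_SHIFTED = [50, 52, 54, 56, 58]

if __name__ == "__main__":
    print("similar batch drifted:", has_drifted(TRAINING_AGES, SERVING_SIMILAR))
    print("shifted batch drifted:", has_drifted(TRAINING_AGES, SERVING_SHIFTED))
```

In practice a check like this would run on a schedule inside the ML pipeline, which is where the ML Ops Engineer’s monitoring responsibility meets the Data Scientist’s.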

GCP Data Engineering - Round 2!

It’s been two years since my last post on the Professional Data Engineer certification, and it was time for renewal. Having successfully renewed the certification exactly two years later, I wanted to update some of my recommendations based on what I used.

Linux Academy
Google Cloud documentation for the services: BigQuery, Dataflow, Bigtable, Pub/Sub, Composer and other big data services
Solution approach: Migrating Apache Spark to Dataproc, Building your data lake, Data lifecycle
When to use Dataflow vs Dataproc, Bigtable vs Spanner vs Datastore, ML APIs vs AutoML, Composer vs Kubeflow, Transfer Service vs Transfer Appliance, Pub/Sub vs Kafka
IAM permissions for all the services
Background: Hadoop and its components

Wishing you the best of luck with your certification.