Been getting questions about "how data works" or "how to build data analytics/team when you don't know data". Love to get a discussion going to see how I can make it better.
This is how I do it, it's not necessarily the best way and definitely not the only way. There's frustrating cons which I'll go in to along with the pros. Be great to see how others do it along with their pros and cons as well.
My intended audience includes both people who don't have any experience that wants to learn the basics and those who can teach me how to do it better. I'll try my best to avoid unnecessary jargon or link to necessary ones.
This is how I do it, it's not necessarily the best way and definitely not the only way. There's frustrating cons which I'll go in to along with the pros. Be great to see how others do it along with their pros and cons as well.
My intended audience includes both people who don't have any experience that wants to learn the basics and those who can teach me how to do it better. I'll try my best to avoid unnecessary jargon or link to necessary ones.
Specifically, I'll go in to
- Overview of how data flows: how data get's captured, stored, and consumed
- What skillsets, mindsets, tools are necessary (future post)
- How to check for data integrity and centralize it (future post)
- How to get users comfortable using the tools (future post)
- How to prioritize 100s of data visuals, users, requests (future post)
General Flow
.Frontline Domain Experts: These are the people that execute actual work, feel the pain and reward of decisions made. I give their word a lot of weight because everything else is a representation of their real life experience. None of it matters if you don't actually add more profit, make some process faster, ... The problem though is their experience are in their heads and dispersed all over the place. We need to record their experience.
Operational Systems (usually Source of Truth): this is where Operational systems com in. In typical finance, it can be a popular program like Microsoft Great Plains Dynamics. In hospitals, its usually an Electronic Medical Records application like Epic. Given there are multiple applications, it then needs to be stored in a central location using a standard convention.
Data Warehouse
Data Warehousing: This is where Data Warehouses come in and the process is called Extract, Transform, Load.
There are different tools to do this. The majority of my experience is with Microsoft SQL Server which comes with Integration Services to handle the ETL part and Management Studio handle the server part.
There are different tools to do this. The majority of my experience is with Microsoft SQL Server which comes with Integration Services to handle the ETL part and Management Studio handle the server part.
SQL Server Pros: The biggest value is the job market demand. Run an indeed search for SQL Server and you'll get over 30k jobs. According to Gartner, Oracle is more mature and has a bigger marketshare but run the same search on Indeed and you'll get around 2k jobs. Searches for mongodb, 5k or mysql, 12k.
If there is an existing infrastructure like my experience, you don't have much of a choice. The other value is there's a mature and big community to find support.
If there is an existing infrastructure like my experience, you don't have much of a choice. The other value is there's a mature and big community to find support.
SQL Server Cons: I have a bias against SQL as I come from an object oriented programming background. There's plenty of hate for Microsoft SQL Server in general.
- Complicated and Messy query language: As I stated in the pros, if you come in to a SQL infrastructure, you're pretty much stuck with it. The option left is then to find extensions of SQL
- LINQ looked promising but the problem is it hit a limit when I tried write data back in to a database. Despite asking its community and product support, there was no solution nor confirmation it was even capable
- Visualizing, Debugging and Optimizing Queries: I haven't found an Integrated Development Environment for SQL like I'm used to with C# where I can put in breakpoints, step through lines of code, and watch variables.
- There's the SQL analyzer but interpreting its not the easiest thing to read
- Management Studio has a visual query builder but it is not able to interpret or the more complex queries such as correlated subqueries. This where visualization would be most helpful
- Seemingly basic tasks like granting security, making a copy of a database, importing data from csv or excel files are overly complicated or not directly possible. We have do work-arounds like backing up and restoring a databse to achieve a copy of a database
- Integration Services: too many to list
- If you are building it from scratch, consider NOSQL options as some predict the traditional SQL market is in decline.
Visualization
My Background
My experience comes from working at a large hospital network and a medium sized medical supplies company. Regardless of the size of these organizations, the data teams were small (5-2 people) enough where the team members weren't working on the same thing at the same time.
I'm not an expert of any of these technologies. I'm more a generalist that focuses on delivering actionable knowledge as opposed to reports. Many moons ago, I did a lot of management consultation with Excel. It was in the last 3 years that I picked up and got certified in Tableau. It was in the last year that I develop confidence in Data Warehouse.
At my current job as of writing this post, I am responsible for everything except administration of the servers (DBA). I wasn't there for the initial pulling of the data from the application to the SQL servers (ETL) but I am responsible for maintenance of existing and new ETL work now.
I'm not an expert of any of these technologies. I'm more a generalist that focuses on delivering actionable knowledge as opposed to reports. Many moons ago, I did a lot of management consultation with Excel. It was in the last 3 years that I picked up and got certified in Tableau. It was in the last year that I develop confidence in Data Warehouse.
At my current job as of writing this post, I am responsible for everything except administration of the servers (DBA). I wasn't there for the initial pulling of the data from the application to the SQL servers (ETL) but I am responsible for maintenance of existing and new ETL work now.