Jargon-Busting Guide to Data Terms

Single Source of Truth terminology explained

Creating a single source of truth is a prerequisite for today’s data-driven business. At PTR we specialise in advising you on what that might look like, as well as creating the trusted repository itself. As we explain in our comprehensive article Your Single Source of Truth – What is it and how do I get it?, there are many options.

The basic concepts outlined in the piece are straightforward, and we highly recommend you have a read if you are considering a data project, but the terminology can be baffling, so we have created this jargon-busting guide to data terms to help. You might also find the article AI Decoded - Beyond the Buzzwords useful!

 

Single Source of Truth (SSOT)

A Single Source of Truth (SSOT) is a centralised data repository that brings together information from many different systems across your business. It breaks down silos and gives everyone access to the same trusted data. Always up to date, it eliminates inconsistencies and acts as a backbone for both AI applications and an overarching data-driven strategy.

Data Lake

A data lake is a centralised storage system that holds raw, unprocessed data in its original format (like files, videos, or database dumps) until you need it. Think of it as a "catch-all bucket" for everything your organisation generates – sales records, social media posts, sensor data, etc.

While a data lake can store all your data in one place, it’s not automatically a trustworthy source. The raw data is like random items in a cupboard – you need governance (labels, expiration dates etc) and tools to make it reliable.

Adding structure and rules will help transform the raw data into a single source of truth.

Data Warehouse

A data warehouse is a structured storage system designed to hold processed, organised data ready for analysis – like a library where books are sorted by category. It transforms raw data from multiple sources (sales systems, apps, etc) into consistent formats for use in reports and business intelligence.

It consolidates data from multiple sources into a relational format, serving as an organisation’s "single source of truth" for reporting and decision-making.

A real-life example can be found at Devon County Council, whose School Finance Power BI Dashboard pulls near-real-time data from 400+ schools’ financial systems into a warehouse. This allows the council to track actual spending against budgets, carry out trend analysis across school types and keep strict control over public funds.

Data Lakehouse

A Lakehouse is a modern data storage and management system that combines the best features of data lakes and data warehouses. It aims to provide a single, unified platform for storing, managing and analysing both structured and unstructured data. This means you don’t need separate systems for different data workloads, potentially reducing costs and complexity while improving accessibility for analysis.

Semantic Model

A semantic model is a user-friendly framework for business data which defines what the data actually means to non-technical users, bridging technical data structures and business user needs. Like a shared dictionary, it aligns terms such as "patient wait time" or "school attendance" across departments, ensuring everyone calculates and interprets them the same way. This keeps data consistent and accurate across Business Intelligence (BI) tools and AI systems.

Example: a city council’s semantic model ensures "homelessness case" means the same thing in housing reports, budget dashboards, and grant applications.
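
To make the idea concrete, here is a minimal Python sketch (not tied to any particular BI tool, and with made-up names and figures) in which one shared definition of an attendance rate is reused by every report, so each department calculates it the same way:

    # A minimal sketch of a shared measure definition. Every report reuses this
    # one function instead of re-implementing it, so "attendance rate" always
    # means the same thing. The names and figures are purely illustrative.

    def attendance_rate(sessions_attended: int, sessions_possible: int) -> float:
        """Agreed definition: sessions attended as a percentage of sessions possible."""
        if sessions_possible == 0:
            return 0.0
        return round(100 * sessions_attended / sessions_possible, 1)

    # Two different "reports" reuse the single shared definition.
    print(attendance_rate(180, 190))      # a school-level dashboard -> 94.7
    print(attendance_rate(9500, 10000))   # a council-wide summary   -> 95.0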

ETL (Extract, Transform, Load)

ETL (Extract, Transform, Load) is a crucial stage in the processing of data. It involves moving and transforming information from various sources into a centralised storage system, such as a data warehouse or data lake.

In the initial extract phase, data is pulled from multiple sources. This step often includes basic data validation to ensure the extracted information meets expected formats and values.

During the middle transform stage, the raw data undergoes various operations to prepare it for analysis, including cleaning to remove errors or inconsistencies, standardising formats, aggregating or summarising data, joining data from different sources and filtering out unnecessary information. Transformation typically occurs in a staging area, separate from both the source and destination systems.

Load is the final step, which involves moving the transformed data into its target destination, usually a data warehouse or data lake. This step may be automated to take place during off-peak hours.
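
As an illustration only, the short Python sketch below walks through the three steps using a CSV file as the source and a SQLite database as a stand-in for the warehouse. The file, table and column names are assumptions made up for the example:

    import csv
    import sqlite3

    # Extract: pull rows from a source file (standing in for a source system).
    with open("sales_export.csv", newline="") as f:
        rows = list(csv.DictReader(f))  # expects columns: school, amount, date

    # Transform: clean and standardise in a staging area (here, a Python list).
    staged = []
    for row in rows:
        amount = row.get("amount", "").strip()
        if not amount:                       # basic validation: skip incomplete rows
            continue
        staged.append((
            row["school"].strip().title(),   # standardise the formatting
            float(amount),                   # enforce a numeric type
            row["date"],
        ))

    # Load: write the transformed rows into the target (SQLite standing in for a warehouse).
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS sales (school TEXT, amount REAL, date TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", staged)
    conn.commit()
    conn.close()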

ELT (Extract, Load, Transform)

A newer variation called ELT (Extract, Load, Transform) has gained popularity, especially with cloud-based systems:

  • Loads raw data directly into the target system
  • Performs transformations within the destination database
  • Better suited for handling large volumes of unstructured data
  • Leverages the processing power of modern data warehouses
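
By contrast, a minimal ELT sketch loads the raw rows first and then transforms them with SQL inside the destination itself (again with SQLite standing in for a cloud warehouse, and illustrative table and column names):

    import csv
    import sqlite3

    conn = sqlite3.connect("warehouse.db")

    # Extract and Load: land the raw data in the destination untouched.
    conn.execute("CREATE TABLE IF NOT EXISTS raw_sales (school TEXT, amount TEXT, date TEXT)")
    with open("sales_export.csv", newline="") as f:
        rows = [(r["school"], r["amount"], r["date"]) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)

    # Transform: use the destination's own engine to clean and reshape the data.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS sales_by_school AS
        SELECT TRIM(school) AS school, SUM(CAST(amount AS REAL)) AS total_spend
        FROM raw_sales
        WHERE amount <> ''
        GROUP BY TRIM(school)
    """)
    conn.commit()
    conn.close()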

Data Pipeline

A data pipeline is typically the ongoing, continuous movement of data into the repository from which it can be extracted for analysis. It includes ingestion, transformation and storage. Data pipelines can operate in two primary modes: batch processing, which handles large volumes of data at scheduled intervals, and stream processing, which processes data in real time as it is generated.
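
The difference between the two modes can be sketched in a few lines of Python: the batch function processes a whole set of records on a schedule, while the streaming function processes each record as it arrives. The record format and function names are made up for illustration:

    from typing import Iterable, Iterator

    def process(record: dict) -> dict:
        """The shared transformation applied in both modes."""
        return {"school": record["school"].strip().title(),
                "amount": float(record["amount"])}

    def run_batch(records: list) -> list:
        """Batch mode: handle a large set of records at a scheduled interval."""
        return [process(r) for r in records]

    def run_stream(records: Iterable) -> Iterator:
        """Stream mode: handle each record as soon as it is generated."""
        for record in records:
            yield process(record)

    sample = [{"school": " hillside primary ", "amount": "120.50"}]
    print(run_batch(sample))
    print(list(run_stream(sample)))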

Data Governance

Data governance is the set of policies you put in place to ensure data quality, security and compliance with regulations. It ensures that your data remains reliable and consistent, in line with what you expect from a single source of truth.

Data Dictionary

A data dictionary plays a crucial role when setting up a data warehouse or single source of truth (SSOT). It serves as a comprehensive guide that defines and describes all data elements in terms of their meaning, structure and rules, helping to ensure everyone understands and uses the data consistently.
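
A single entry might look something like the hypothetical Python snippet below, where each field’s meaning, type and rules are written down once for everyone to refer to:

    # One illustrative data dictionary entry, stored as plain data.
    data_dictionary = {
        "pupil_attendance_rate": {
            "description": "Sessions attended as a percentage of sessions possible",
            "data_type": "decimal (0-100)",
            "source_system": "School MIS export",
            "rules": "Refreshed nightly; excludes schools with no return",
            "owner": "Education Data Team",
        }
    }
    print(data_dictionary["pupil_attendance_rate"]["description"])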

Business Metrics

A business metric is a quantifiable measure that organisations use to track, monitor, and evaluate their performance against objectives. Examples of such metrics include liquidity indicators like the current ratio and profitability measures such as gross profit margin.
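
For example, the two metrics mentioned above can be calculated as follows (the figures are invented purely to show the arithmetic):

    # Current ratio = current assets / current liabilities
    current_assets = 250_000
    current_liabilities = 125_000
    current_ratio = current_assets / current_liabilities            # 2.0

    # Gross profit margin = (revenue - cost of goods sold) / revenue
    revenue = 400_000
    cost_of_goods_sold = 280_000
    gross_profit_margin = (revenue - cost_of_goods_sold) / revenue  # 0.30, i.e. 30%

    print(current_ratio, f"{gross_profit_margin:.0%}")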

Key Performance Indicators (KPI)

Key Performance Indicators (KPIs) are the metrics businesses use to evaluate success against goals over time. They are the tools a business uses to assess performance and identify areas for improvement.

Medallion Architecture

Medallion architecture is a data design pattern used to organise data in a Lakehouse. The aim is to progressively improve the structure and quality of the data as it flows through each layer, from Bronze to Silver to Gold. The Bronze layer holds the raw data, which is then cleansed in the Silver layer and finally curated in the Gold layer, ready to be analysed and consumed.
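
In a Spark-based lakehouse the three layers often end up as three tables or folders written one after another, roughly as in the PySpark sketch below (Parquet files stand in for the Delta tables usually used; the paths and column names are assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

    # Bronze: land the raw data exactly as it arrives.
    bronze = spark.read.option("header", True).csv("landing/sales_export.csv")
    bronze.write.mode("overwrite").parquet("bronze/sales")

    # Silver: cleanse - drop duplicates, fix types, remove incomplete rows.
    silver = (
        spark.read.parquet("bronze/sales")
        .dropDuplicates()
        .withColumn("amount", F.col("amount").cast("double"))
        .filter(F.col("amount").isNotNull())
    )
    silver.write.mode("overwrite").parquet("silver/sales")

    # Gold: curate - aggregate into a shape ready for reports and dashboards.
    gold = (
        spark.read.parquet("silver/sales")
        .groupBy("school")
        .agg(F.sum("amount").alias("total_spend"))
    )
    gold.write.mode("overwrite").parquet("gold/sales_by_school")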

Raw Data

Raw data is source or primary data in its unprocessed form which has not been altered or analysed in any way.

Cleansed Data

Cleansed data has been screened for errors, inconsistencies and duplication. Irrelevant information is removed, resulting in a reliable dataset suitable for analysis.

Curated Data

Curated data represents the processed, standardised layer of information, providing high-quality, structured data for business intelligence, reporting, and advanced analytics.

Structured Data

Structured data is highly organised information stored in predefined formats, like tables with rows and columns, making it easy to search, manage, and analyse.

Unstructured Data

Unstructured data does not conform to a predefined data model or format. It includes things like emails, social media posts, images, videos and free text, which can be a challenge to analyse.

Apache Spark

Apache Spark is open-source technology used to cleanse, prepare and transform large volumes of all types of data very quickly. It is suitable for on-demand processing, direct delivery to analytics tools, or loading into persistent structures like data warehouses, and it can be used from multiple programming languages, which makes it very versatile.
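
As a flavour of what that looks like in practice, here is a tiny PySpark snippet (assuming the pyspark package is installed and a file called events.json exists); the same few lines work whether the data is megabytes or terabytes because Spark distributes the work:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

    # Read semi-structured input, cleanse it, then query it with SQL.
    events = spark.read.json("events.json")
    cleaned = events.dropna(subset=["user_id"]).dropDuplicates()
    cleaned.createOrReplaceTempView("events")
    spark.sql("SELECT user_id, COUNT(*) AS actions FROM events GROUP BY user_id").show()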

Databricks

Databricks is a cloud-based platform comprising a suite of integrated services and tools that combines data, analytics, and artificial intelligence capabilities. Databricks pioneered the concept of a data Lakehouse, combining the benefits of data lakes and data warehouses to store and analyse different data types in a single platform. It leverages Apache Spark to process data at scale. From ingestion to insights, Databricks works alongside Azure in the Microsoft ecosystem to streamline data operations and extract value more efficiently.

Data Silo

A data silo is information controlled by one department, isolated and unseen by other business units. When silos are broken down, data can be shared across teams promoting collaboration based on one single source of truth.

Data Warehouse Snapshotting

Data warehouse snapshotting is a technique that captures a static view of the data at a specific point in time, allowing businesses to preserve historical information and analyse data trends.
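
One simple way to do this, sketched here with SQLite standing in for the warehouse, is to copy today’s rows into a cumulative snapshot table stamped with the snapshot date (the table and column names are illustrative):

    import sqlite3
    from datetime import date

    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS sales (school TEXT, amount REAL, date TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_snapshots "
        "(school TEXT, amount REAL, date TEXT, snapshot_date TEXT)"
    )

    # Append a static copy of the table as it stands today, stamped with the snapshot date.
    conn.execute(
        "INSERT INTO sales_snapshots SELECT school, amount, date, ? FROM sales",
        (date.today().isoformat(),),
    )
    conn.commit()
    conn.close()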

Slowly Changing Dimension (SCD)

A Slowly Changing Dimension (SCD) is a data management concept used in data warehouses that allows you to track and manage changes to an object’s properties over time. These are changes in the "who, what, where, when, why, and how" aspects of business information, such as product category, customer type, sales territory, employee manager or chart of accounts category. These values may change occasionally over time and impact how your data is grouped and reported.
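
The most common approach (often called SCD Type 2) keeps the old row and adds a new one with validity dates, so history is preserved. A rough Python sketch, with made-up field names:

    from datetime import date

    # A customer dimension with validity dates, so old values are kept (SCD Type 2).
    customer_dim = [
        {"customer_id": 42, "customer_type": "Small Business",
         "valid_from": "2023-01-01", "valid_to": None, "is_current": True},
    ]

    def apply_scd2_change(dim, customer_id, new_type):
        """Close off the current row and add a new one when an attribute changes."""
        today = date.today().isoformat()
        for row in dim:
            if row["customer_id"] == customer_id and row["is_current"]:
                row["valid_to"] = today        # expire the old value
                row["is_current"] = False
        dim.append({"customer_id": customer_id, "customer_type": new_type,
                    "valid_from": today, "valid_to": None, "is_current": True})

    apply_scd2_change(customer_dim, 42, "Enterprise")
    # Reports on past periods still see "Small Business"; current reports see "Enterprise".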
