
Data Cleansing with the Medallion Method
Many businesses and organisations are struggling to produce trustworthy, accurate and consistent reporting due to poor quality source data, often described as dirty data. Here we focus on how we go about turning dirty data into curated quality data.


Dirty Data to Clean, Quality and Integrated Data
Our article "What is Dirty Data?" provides further background on what dirty data is, along with the challenges and business impact of poor-quality, dirty data.
To make the right business decisions we need accurate and trustworthy data. In last month's newsletter we focused on the importance of building a Single Source of Truth (SSOT) repository that acts as the central hub for all business reporting, driving consistent reporting and well-informed business decisions. This month we focus on how to cleanse, prepare and transform our data from dirty data into curated, quality data.
So, Data Cleansing may be fairly self-explanatory - we clean up our data so that we can use it for good-quality reporting and data-driven decision making. But the Medallion Method? What is that? The Medallion Architecture, or Method, defines a layered approach to preparing your data and transforming it into a quality, integrated Single Source of Truth repository.
The Medallion Architecture
The Medallion method organises data at the various stages of ingestion, cleansing and transformation into three distinct layers:
Bronze
Silver
Gold
Bronze Layer
The Bronze layer hosts raw, often dirty, data. No cleansing or transformation has been carried out: data is stored "as is", in its original untampered form, along with a complete history of all data imports or extractions. Data from this layer is not suitable for general reporting and data analysis, but it can be used to support auditing of data or reloading of the Silver and Gold repositories. The Bronze layer acts as a Single Source of Truth for the data in its original state.
Consumers of Bronze data would include Data Engineers who are involved in the data preparation and storage processes for generating the Bronze, Silver and Gold data resources.
Auditors would also require access to Bronze data to ensure authenticity and compliance.
Silver Layer
The Silver layer hosts clean data. Our dirty/raw data has been passed through processing to clean, validate, integrate and transform it, preparing it for analysis and reporting. Data exposed by this layer would be suitable for Data Scientists working on predictive analysis models, but not for general business reporting/analysis users. Our supporting article "What is Dirty Data?" provides further detail on the tasks typically carried out to prepare data for the Silver layer:
Data de-duplication
Data quality transformations and error corrections
Normalisation and joining of entities
Handling of nested semi-structured data
Schema enforcement with data typing
The Silver data would be accessible to Data Engineers involved in the cleansing and preparation of Bronze data, and could be made accessible to data scientists who are building complex models and require access to the detailed information held in the Silver layer.
Data De-Duplication
Data duplication can arise within a single source system - for example, multiple records in a CRM such as Dynamics or Salesforce with the same name and identical field values, or multiple records with the same name but different field values. It can also arise across independent, disparate systems - for example, customer records held in the CRM, in a separate order processing system and in a separate accounting system.
Decisions will need to be made to determine whether records really are duplicates and should be de-duplicated, or whether they are in fact all part of the same entity and should be merged into a single record.
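To make this concrete, here is a minimal PySpark sketch covering both cases. The table and column names (bronze.crm_customers, silver.customers, customer_id, modified_date) are illustrative assumptions, and "keep the most recently modified record" is just one possible survivorship rule.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Read the raw customer extract from the Bronze layer (hypothetical table).
bronze_customers = spark.read.table("bronze.crm_customers")

# Case 1: exact duplicates - same name and identical field values.
exact_deduped = bronze_customers.dropDuplicates()

# Case 2: same business key but differing field values - keep the most
# recently modified record for each customer.
latest_per_customer = Window.partitionBy("customer_id").orderBy(
    F.col("modified_date").desc()
)
survivors = (
    exact_deduped
    .withColumn("rn", F.row_number().over(latest_per_customer))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

survivors.write.mode("overwrite").saveAsTable("silver.customers")
```

Exact duplicates can be dropped mechanically; near-duplicates with differing field values usually need a business-agreed survivorship rule like the one sketched above.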
Data quality transformations and error corrections
Data quality issues can be anything from missing values, late-arriving values, wrong values and NULL values through to inconsistent data typing within a single field and mixed upper and lower case values.
All of these issues need to be addressed if a quality, complete data set is going to be presented to the business for reporting and analytics.
Some issues may simply have to be identified and reconciliation reports produced to enable corrections to be made to the data at source. Other issues might be easily handled within the Silver layer of your solution through coding in SQL or Python, or within ELT pipelines.
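As an illustration of handling both kinds of issue in code, here is a hedged PySpark sketch; the table, column names and placeholder values are assumptions, not a prescription.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

customers = spark.read.table("silver.customers_staging")  # hypothetical

cleaned = (
    customers
    # Standardise mixed upper and lower case values.
    .withColumn("country", F.upper(F.trim(F.col("country"))))
    # Replace known placeholder junk with proper NULLs.
    .withColumn(
        "phone",
        F.when(F.col("phone").isin("N/A", "UNKNOWN", ""), F.lit(None))
         .otherwise(F.col("phone")),
    )
    # Default missing values where a sensible default exists.
    .fillna({"credit_limit": 0})
)

# Rows that still fail validation go to a reconciliation table so they
# can be corrected at source.
invalid = cleaned.filter(F.col("email").isNull())
invalid.write.mode("append").saveAsTable("silver.customer_reconciliation")
```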
Normalisation and joining of entities
No aggregation or rolling up of data should be carried out in the Silver layer. A record in the Silver layer should represent a single instance of an object, such as a single customer or a single line on an order; it should not typically be something such as the total order amount for a month. Aggregations generally belong in the Gold layer, but there are always exceptions to every rule and there may be a good business case for storing aggregate representations of data in the Silver layer - perhaps for performance reasons in budget- or technology-restricted solutions.
Entities might, however, be joined together in the Silver layer, such as joining an order header record to an order line record to form a single table that is easier to work with at the Gold layer and beyond.
Entities may also need to be joined to conform a number of records into a single record, such as a conformed customer record built by joining CRM, order and financial records.
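A minimal PySpark sketch of that header-to-line join; the table and column names (silver.order_headers, silver.order_lines and so on) are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

order_headers = spark.read.table("silver.order_headers")
order_lines = spark.read.table("silver.order_lines")

# Join header attributes onto each order line so downstream layers work
# with one flattened entity; the grain stays at one row per order line.
order_detail = order_lines.join(
    order_headers.select("order_id", "customer_id", "order_date", "currency"),
    on="order_id",
    how="inner",
)

order_detail.write.mode("overwrite").saveAsTable("silver.order_detail")
```

Note that this keeps the grain at one row per order line, in keeping with the "no aggregation in Silver" guideline above.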
Handling of nested semi-structured data
Many modern applications expose their data via APIs that output JSON or XML extracts. At a high level the extract is a single string, but within that string are elements, fields and metadata that enable a structured view of the data to be derived.
A decision will need to be made on how this semi-structured data is to be used and what should be stored or represented in the Bronze and Silver layers.
It is common to store both the semi-structured data itself in a VARIANT (generic data type) field and, alongside it, individual structured fields derived from that data. In some cases, such as when there are nested JSON elements within the JSON data, the semi-structured data may need to be normalised into separate entities within the Silver layer.
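For example, assuming PySpark and a hypothetical bronze.api_orders table whose payload column holds the raw JSON string, the nested order lines can be normalised into their own Silver entity:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Bronze holds the raw API payload as a single string column.
raw = spark.read.table("bronze.api_orders")

# Illustrative schema for the JSON payload.
payload_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer", StringType()),
    StructField("lines", ArrayType(StructType([
        StructField("sku", StringType()),
        StructField("qty", StringType()),
    ]))),
])

parsed = raw.withColumn("doc", F.from_json("payload", payload_schema))

# Normalise the nested line array into one row per order line -
# a separate entity in the Silver layer.
order_lines = (
    parsed
    .select("doc.order_id", F.explode("doc.lines").alias("line"))
    .select("order_id", "line.sku", "line.qty")
)

order_lines.write.mode("overwrite").saveAsTable("silver.api_order_lines")
```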
Schema enforcement with data typing
The Silver layer needs to present structured data that can be passed to the Gold layer in a useable format for reporting and data analytics.
In the Bronze layer it is most likely that all source data is ingested into VARIANT or long text fields that are insensitive to source structural or data type changes. It is also likely that Bronze objects are created every time a load is run, so that structural changes are picked up and do not cause loads to fail due to differing formats. Applications go through continual development and new releases, and our Medallion Architecture solution needs to be resilient to these changes and ensure no data loss as a result of change.
In the Silver layer the generic data types will be cast to data types suited to the field content - text, date, numeric. Entities within the Silver layer will enforce a fixed schema to ensure that data can be manipulated and used in the desired way in the Gold layer.
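A minimal PySpark sketch of that casting step, with illustrative table and column names (bronze.orders_raw, silver.orders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

staged = spark.read.table("bronze.orders_raw")

# Cast the generic text fields to the types the Silver schema enforces.
typed = staged.select(
    F.col("order_id").cast("bigint").alias("order_id"),
    F.to_date("order_date", "yyyy-MM-dd").alias("order_date"),
    F.col("amount").cast("decimal(18,2)").alias("amount"),
    F.trim(F.col("status")).alias("status"),
)

# Values that fail the cast become NULL - quarantine those rows rather
# than passing silently-broken data on to the Gold layer.
good = typed.filter(
    F.col("order_id").isNotNull() & F.col("order_date").isNotNull()
)
good.write.mode("overwrite").saveAsTable("silver.orders")
```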
Gold Layer
The Gold layer hosts our Single Source of Truth quality data that is exposed to and consumed by business users, data analysts, data scientists and machine learning engineers.
In the Gold layer we ensure that all entities are presented in a form that aligns with business logic and reporting/analysis requirements. The Gold layer is optimised so that models, reports and dashboards perform well when they refresh and execute queries against Gold objects.
Entities within the Gold layer may represent detailed/transactional data, to cater for modelled/report aggregations and drill-through to lower grains of data, or aggregated data, to provide more performant higher-level summary views and to avoid exposing lower-level detail that may be of a more sensitive or confidential nature.
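As an illustration, a Gold summary entity might be built from the Silver order detail like this - a PySpark sketch where the table and column names (silver.order_detail, gold.monthly_customer_sales, amount) are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

order_detail = spark.read.table("silver.order_detail")

# A Gold summary entity: monthly sales per customer, with the more
# sensitive line-level detail rolled away.
monthly_sales = (
    order_detail
    .groupBy("customer_id", F.date_trunc("month", "order_date").alias("month"))
    .agg(
        F.sum("amount").alias("total_amount"),
        F.countDistinct("order_id").alias("order_count"),
    )
)

monthly_sales.write.mode("overwrite").saveAsTable("gold.monthly_customer_sales")
```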
Building a Medallion Architecture Solution
The term Medallion Architecture is probably best known in the Lakehouse world, but this layered approach can be followed in more traditional data warehouse solutions, or hybrid Lakehouse and Data Warehouse solutions.
As we work on gathering all of our valuable data assets from the far corners of our business, it is important that we can always trace gold seal data consumed by reports and dashboards back to its original source and form. Doing this requires that we are organised and methodical, with inspection points that allow us to validate authenticity and see an audit trail of where our data has come from. For this reason a single programmed solution that cleanses, transforms, joins and manipulates data in memory, and then simply outputs the final result to a stored location, is not enough. How will you know where the hundreds of fields and measures (KPIs) exposed to the reporting and analytics world came from if you suspect a quality issue with the data? We need inspection points along the way.
Another requirement of your data solution may be the ability to identify dependencies on specific sets of data or core data elements. Let's suppose you are planning on implementing a new CRM system and want to assess the impact on your business reporting and analytics. Which reports, dashboards and AI/ML models will be affected by the change?
One of the most important considerations when building a data solution is your security policy. Who should have access to the raw, clean and curated data? Are you able to easily restrict access or audit access to more sensitive data sets? Security is important to prevent people from picking up unsuitable (dirty) data for reporting and analysis, as well as to prevent business users or data analysts/data engineers/data scientists from accessing confidential data they are not cleared to view.
Without an organised approach to moving your data through the various preparation stages from dirty/raw to clean and "Gold Seal" you will have great difficulty in making provision for:
Data Authenticity, Lineage and Auditing
Data Dependencies
Point in Time/Version Control view of data
Robust security model
Data Authenticity, Lineage and Auditing
Traceability, and the ability to inspect data at every part of its journey, is key to gaining trust from business users. It is also extremely important where regulation and compliance rules require the ability to demonstrate data authenticity and provide data lineage analysis.
Where did the data originally come from and what did it look like in its original form?
The Medallion architecture lends itself to being able to inspect data at every stage of its journey from raw to ready to use.
Data Dependencies
Passing some form of origin metadata through to the final Gold entities will make it easier to determine which models, reports or dashboards will be impacted by a change to a data source, business application or service.
Silver and Gold entities can be designed to include metadata that identifies the original source or sources.
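As a sketch of what that metadata might look like, assuming PySpark and hypothetical table names, each row can be stamped with its origin as it is promoted from Bronze:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

crm = spark.read.table("bronze.crm_customers")  # hypothetical table

# Stamp each row with its origin so downstream consumers can trace lineage.
with_lineage = (
    crm
    .withColumn("_source_system", F.lit("Dynamics CRM"))
    # input_file_name() is only populated for file-based reads.
    .withColumn("_source_extract", F.input_file_name())
    .withColumn("_loaded_at", F.current_timestamp())
)

with_lineage.write.mode("append").saveAsTable("silver.customers")
```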
Point in Time/Version Control view of data
If a business user questions the validity or quality of data exposed in a report or dashboard, you may need to be able to view the data as it was at the particular point in time they are referring to, as well as the current view of the data.
The Bronze layer should hold a complete history of imported data so that it is always possible to go back and see what data looked like at a specific point in time rather than just today.
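A minimal PySpark sketch of that point-in-time view, assuming the Bronze history carries the _loaded_at stamp from the previous example and a customer_id business key:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

history = spark.read.table("bronze.crm_customers")  # full load history

# Reconstruct the data as it stood at a given point in time: for each
# record, keep the newest version loaded on or before that date.
as_of = F.lit("2025-01-31").cast("timestamp")
latest = Window.partitionBy("customer_id").orderBy(F.col("_loaded_at").desc())

snapshot = (
    history
    .filter(F.col("_loaded_at") <= as_of)
    .withColumn("rn", F.row_number().over(latest))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```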
Robust Security Model
All three layers of the Medallion Architecture solution should provide an appropriate security policy to ensure that access is given only to entities that are relevant to an individual and that they are authorised to view.
Location of Bronze, Silver and Gold Data
Whilst more commonly associated with cloud-based Lakehouse repositories, the Medallion layered approach is a good one to take regardless of whether your data will be physically located in a Lakehouse or a Data Warehouse.
In both Lakehouses and Data Warehouses, schemas can typically be created, and these provide a perfect way of separating data at each stage of its journey from raw and dirty to quality and clean (from Bronze to Silver to Gold), all within a single catalog for ease of management.
Here is an example of a single Microsoft Fabric Lakehouse with 3 separate schemas (a feature in preview at the time of writing in March 2025):
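As a minimal sketch, assuming a Lakehouse (or any Spark catalog) with schema support enabled, the three layers can be created from a notebook:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create the three Medallion layers as schemas in the current catalog.
for layer in ("bronze", "silver", "gold"):
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {layer}")
```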
It may be preferable to separate each layer out into completely separate catalogs to provide a simpler method of implementing a robust security separation at each layer.
Here is an example of a Microsoft Fabric set of Lakehouses, one for each layer:
A solely Data Warehouse approach could be taken rather than a Lakehouse, again with either a single database containing separate schemas for each layer, or completely separate databases for each layer.
Medallion Architecture Workflows
ELT toolsets often provide workflow templates that provide an easy means for you to follow the Medallion Architecture for your data solutions.
Microsoft Fabric Dataflows
For example, Microsoft Fabric workspaces enable dataflows to be created from a Medallion Architecture template:
The green represents the extraction of the raw data
The blue represents a storage layer of Bronze, Silver or Gold
The purple represents the processes for cleansing, preparing and transforming the data at the Silver and Gold stages.
The gold and red represent consuming the Gold data for reporting and analytics.
Selecting the New Work Item option will give a selection of new object types relevant to the type of operation - get data, store data, process data:
Get Data (High Volume)
Get Data (Low Volume)
Store Data (Bronze, Silver, Gold)
Prepare Data (Initial Process, Further Transformation)
Visualise Data
Analyse and Training Data - ML Serving
Conclusion
As with all things data, there is no one right or wrong way to go about implementing your Data Analytics solution, and the choices you make about storage and processing methods along the way will be influenced by many factors:
Internal skill set and resource
Budget
Timeline
Data source support
Data Governance and compliance requirements
Quality of source data
You would be very well advised to approach a data analytics solution with careful planning based on a solid data discovery and data strategy exercise before launching into an actual implementation.
Whether or not you follow a rigid Medallion Architecture approach to your data journey, you must be methodical and organised, and above all ensure that you can meet all of your data governance and security requirements, whilst also delivering high-quality, trusted single source of truth data that will drive better business decisions and generate a good return on investment.
Our Data Analytics consultancy and Data Strategy consultancy services give you access to a team of experts, from solution architects through to data engineers, data scientists and data analysts, all of whom can be complemented by our tailored and blended training services to give your team the mentoring and support they need throughout your whole data journey.
Mandy Doward
Managing Director
PTR’s owner and Managing Director is a Microsoft MCSE certified Business Intelligence (BI) Consultant, with over 30 years of experience working with data analytics and BI.