Dodgy data in a declarative language

Working within the health sector in the UK we have lots of very old systems; some built 20 years + ago. Many have questionable database forms and data validation built into their front end application layers. This can lead to interesting data transformations needed when creating data warehouse data marts.

One of these systems stores everything in questionnaires. People can create their own questionnaires made up of a number of custom questions. Unfortunately all of the answers in this system get stored as strings. The system also lacks the ability to add robust data validation to the user front end which results in unexpected data being input!

Lets create a temporary table holding 15 records that demonstrates this. We are expecting this column to describe a numeric attribute but some people didn’t get the memo and have added non-numeric data.

IF OBJECT_ID('tempdb..#dodgyData') IS NOT NULL
	DROP TABLE #dodgyData;

CREATE TABLE #dodgyData (

	Id				int			NOT NULL,
	SomeNumbers		varchar(20)	NOT NULL

);

INSERT INTO #dodgyData (Id, SomeNumbers)

VALUES	(1, '1'),
		(2, '10'),
		(3, '100'),
		(4, '1.23'),
		(5, 'Not A number'),
		(6, '1.4What'),
		(7, 'No weight'),
		(8, '919'),
		(9, '27.1234'),
		(10, 'Sixteen'),
		(11, '5'),
		(12, 'Unknown'),
		(13, '19'),
		(14, '12'),
		(15, '0.1.2');


SELECT	*
FROM	#dodgyData;

We have 9 records which are valid numeric numbers and 6 that aren’t.

If we tried to CAST this column to a numeric then we will get a data type conversion error.

SELECT

	Id, SomeNumbers,
	CAST(SomeNumbers AS decimal(10,2)) AS SomeNumbersFormatted

FROM

	#dodgyData;

Let’s just filter out the 9 valid records, selecting only the Id’s which have valid numeric values.

SELECT

	Id, SomeNumbers,
	CAST(SomeNumbers AS decimal(10,2)) AS SomeNumbersFormatted

FROM

	#dodgyData

WHERE

	Id IN (1, 2, 3, 4, 8, 9, 11, 13, 14);

Great that works, but is not particularly useful as we are not likely to know which Id’s will / won’t contain valid data in a real word scenario. However let’s roll with it for the example.

We have been told we need to include some further logic to filter the numeric values. We only want to show records where the value divided by 2 is greater than 5. Let’s amend our WHERE predicate to handle this.

SELECT

	Id, SomeNumbers,
	CAST(SomeNumbers AS decimal(10,2)) AS SomeNumbersFormatted

FROM

	#dodgyData

WHERE

	Id IN (1, 2, 3, 4, 8, 9, 11, 13, 14)
        -- only records where the result divided by 2 is greater than 5
	AND (CAST(SomeNumbers AS decimal(10,2))  / 2) > 5;

Doh! The query failed with another data conversion error. What the hell happened?

Declarative Languages

SQL is part of a branch of languages called declarative languages. These are programming languages where the program specifies what has to be done rather than exactly how to do it. This means it is up to the engine optimiser to work that bit out; it is possible for the optimiser to make a number of different decisions to get to the same end result. The decisions the optimiser makes can be seen by looking at the execution plan.

If we take a look at the estimated execution plan (as the query will not execute without error we need to look at the estimated plan rather than the actual plan), the optimiser is going to scan the table. This means it will read the entire table row by row. It will attempt to run (CAST(SomeNumbers AS decimal(10,2)) / 2) > 5 against each row which will result in a conversion error when it encounters a non-numeric value.

Let’s see if we can change the path the optimiser takes to our favour by adding a clustered index against the Id column.

CREATE CLUSTERED INDEX CIX_TEST ON #dodgyData (Id);

If we run the same query again the optimiser has made a different decision and runs without an error.

The table scan has been replaced with a table seek. The clustered index has sorted the data by the Id column so the optimiser knows exactly where the rows for the relevant Id’s are located so can “seek” directly to them. This means we are not reading the other rows which contain the non-numeric data.

CTE Tricks

Let’s remove the clustered index and rewrite the query with a CTE. We will set the CTE part to filter out the non-numeric records and then add the additional filter on the output of the CTE.

IF OBJECT_ID('tempdb..#dodgyData') IS NOT NULL
	DROP TABLE #dodgyData;

CREATE TABLE #dodgyData (

	Id				int			NOT NULL,
	SomeNumbers		varchar(20)	NOT NULL

);

INSERT INTO #dodgyData (Id, SomeNumbers)

VALUES	(1, '1'),
		(2, '10'),
		(3, '100'),
		(4, '1.23'),
		(5, 'Not A number'),
		(6, '1.4What'),
		(7, 'No weight'),
		(8, '919'),
		(9, '27.1234'),
		(10, 'Sixteen'),
		(11, '5'),
		(12, 'Unknown'),
		(13, '19'),
		(14, '12'),
		(15, '0.1.2');

WITH cteExample AS (

	SELECT

		Id, SomeNumbers,
		CAST(SomeNumbers AS decimal(10,2)) AS SomeNumbersFormatted
	
	FROM
	
		#dodgyData
	
	WHERE
	
		Id IN (1, 2, 3, 4, 8, 9, 11, 13, 14)
	
)

SELECT

	*

FROM

	cteExample

WHERE

	(CAST(SomeNumbers AS decimal(10,2))  / 2) > 5;

This results in the same data conversion error we had originally. Even though we have structured the statement differently the optimiser has rewritten the query to make the same decisions as before with a table scan. The execution plan shown is exactly the same as the first execution plan we had even though the query is written in a different way. This shows the declarative nature of SQL at work.

It is possible to get the optimiser to do something different if we incorporate the TOP operator. Let’s add the TOP operator into the query within the CTE. We will give it a extremely high number which we know will never be reached.

WITH cteExample AS (
        -- We will never reach this number!
	SELECT TOP 9999999

		Id, SomeNumbers,
		CAST(SomeNumbers AS decimal(10,2)) AS SomeNumbersFormatted
	
	FROM
	
		#dodgyData
	
	WHERE
	
		Id IN (1, 2, 3, 4, 8, 9, 11, 13, 14)
	
)

SELECT

	*

FROM

	cteExample

WHERE

	(CAST(SomeNumbers AS decimal(10,2))  / 2) > 5;

This time the query runs. The TOP operator in the CTE has tricked the optimiser into thinking it needs to execute the CTE query first before applying the filter on the outer query giving us a different execution plan. On the previous attempt the optimiser folded them together.

If we look at the execution plan, the first table scan only checks for the Id’s we wanted. This equates to the WHERE clause Id IN (1, 2, 3, 4, 8, 9, 11, 13, 14).

We have an additional operator later on in the plan which filters these records. This equates to the WHERE clause (CAST(SomeNumbers AS decimal(10,2)) / 2) > 5. Since we have already filtered out the non-numeric Id’s the query runs without the conversion error.

Persisting Intermediate Results

Another way we can work round this problem is to persist an intermediate dataset. If we persist the records which are just numeric’s then when we come to perform the second part we do not need to worry about having a casting error as the optimiser has no opportunity to fold the queries together. We can achieve this with either a table variable or a temporary table. Below is an example using a temporary table. When we create the temporary table it gets written to disk within the tempdb database.

IF OBJECT_ID('tempdb..#intermediateResults') IS NOT NULL
	DROP TABLE #intermediateResults;

SELECT

	Id, SomeNumbers,
	CAST(SomeNumbers AS decimal(10,2)) AS SomeNumbersFormatted

INTO

	#intermediateResults

FROM

	#dodgyData

WHERE

	Id IN (1, 2, 3, 4, 8, 9, 11, 13, 14);


SELECT

	*

FROM
	
	#intermediateResults

WHERE

	(SomeNumbersFormatted / 2) > 5;	

The execution plan is split into two distinct sections. The first is the creation of the temporary table with only numeric numbers. The second part is the filter of this table to only include values that satisfy (SomeNumbersFormatted / 2) > 5.

Final Thoughts

The declarative nature of the SQL is something newcomers often miss as they learn the language which can result in misunderstanding in how queries get executed and why sometimes things to do not happen as we expect. Hopefully the above has given a quick insight into how the database engine can behave in certain scenarios.