Data Cleaning

While browsing through the internet to pull in some data to clean, I came across some housing data. I had a quick glance through the data and noticed a few things that can be cleaned or transformed.

SELECT *
FROm SQLPortfolio.dbo.housing;
  1. I noticed that the property address itself was NULL in many records. I digged deeper into the data and found that property address and parcelID are linked. The parcelID is never NULL. So, I can use that as the identifier to populate the missing property address. I have taken one sample to cross check the link between parcelID and property address.
SELECT *
FROm SQLPortfolio.dbo.housing
WHERE ParcelID = '025 07 0 031.00';
WHERE ParcelID IN (
SELECT ParcelID
FROM SQLPortfolio.dbo.housing
WHERE PropertyAddress IS NULL);

Next, I wrote a SELECT query to validate part of the query I would put into the UPDATE query to fill in values to the records where the property address was NULL

SELECT h1.ParcelID, h1.PropertyAddress, h2.ParcelID, h2.PropertyAddress
FROM SQLPortfolio.dbo.housing AS h1
JOIN SQLPortfolio.dbo.housing AS h2
ON h1.ParcelID = h2.ParcelID
AND h1.[UniqueID ] <> h2.[UniqueID ]
WHERE h1.PropertyAddress IS NULL;

UPDATE h1
SET h1.PropertyAddress = ISNULL(h1.PropertyAddress, h2.PropertyAddress)
FROM SQLPortfolio.dbo.housing AS h1
JOIN SQLPortfolio.dbo.housing AS h2
ON h1.ParcelID = h2.ParcelID
AND h1.[UniqueID ] <> h2.[UniqueID ]
WHERE h1.PropertyAddress IS NULL;

2. I wanted to split the property address into seperate columns – Address_Only and City. This will be helped to do analysis later if one needs to group by the city name. I started off by writing a SELECT query to substring function to ensure that I am splitting the address correctly.

SELECT SUBSTRING(PropertyAddress,1,CHARINDEX(',',PropertyAddress)-1) 
AS "Address_Only",
SUBSTRING(PropertyAddress,CHARINDEX(',',PropertyAddress)+1,LEN(PropertyAddress)) 
AS "City"
FROM SQLPortfolio.dbo.housing;

Next, I modify the definition of the table by adding the two new columns using the ALTER TABLE query. Then, I add the details into the table using the UPDATE query.

ALTER TABLE SQLPortfolio.dbo.housing
ADD Address_Only nvarchar(255),
City nvarchar(255);

UPDATE SQLPortfolio.dbo.housing
SET Address_Only = SUBSTRING(PropertyAddress,1,CHARINDEX(',',PropertyAddress)-1),
City = SUBSTRING(PropertyAddress,CHARINDEX(',',PropertyAddress)+1,
LEN(PropertyAddress));

3. The date column is having 00:00:00 in all the rows. It is kind of irrelevant and redundant data as it doesn’t have any significance. So, I wrote a few simple queries to update it.

SELECT SaleDate, CONVERT(DATE, SaleDate) AS 'Only_Date'
FROM SQLPortfolio.dbo.housing;

ALTER TABLE SQLPortfolio.dbo.housing
ADD SaleDateOnly DATE;

UPDATE SQLPortfolio.dbo.housing
SET SaleDateOnly = CONVERT(DATE, SaleDate);

4. There is a column named as SoldAsVacant. It is a Boolean column having both Y/N and Yes/No. Let’s jump in and standardize it. I wrote a SELECT query to find out all possible entries in that column, then wrote an UPDATE query to standardize it.

SELECT COUNT(*),SoldAsVacant
FROm SQLPortfolio.dbo.housing
GROUP BY SoldAsVacant;

UPDATE SQLPortfolio.dbo.housing
SET SoldAsVacant = (
CASE
WHEN SoldAsVacant = 'Yes' THEN 'Y'
WHEN SoldAsVacant = 'No' THEN 'N'
END
);

5. I again see another address column i.e. owner address. The entire address is present in the column similar to the property address column. So, let’s split it to OwnerAddress_Only, OwnerCity and OwnerState

SELECT SUBSTRING(OwnerAddress,1,CHARINDEX(',',OwnerAddress)-1) 
AS "OwnerAddress_Only",
SUBSTRING(SUBSTRING(OwnerAddress,CHARINDEX(',',OwnerAddress)+1,
LEN(OwnerAddress)),1,
CHARINDEX(',',SUBSTRING(OwnerAddress,CHARINDEX(',',OwnerAddress)+1,
LEN(OwnerAddress)))-1) AS "OwnerCity",
SUBSTRING(SUBSTRING(OwnerAddress,CHARINDEX(',',OwnerAddress)+1,LEN(OwnerAddress)),
CHARINDEX(',',SUBSTRING(OwnerAddress,CHARINDEX(',',OwnerAddress)+1,
LEN(OwnerAddress)))+1,LEN(SUBSTRING(OwnerAddress,CHARINDEX(',',OwnerAddress)+1,
LEN(OwnerAddress)))) AS 'OwnerState'
FROM SQLPortfolio.dbo.housing;

ALTER TABLE SQLPortfolio.dbo.housing
ADD OwnerAddress_Only nvarchar(255),
OwnerCity nvarchar(255),
OwnerState nvarchar(255);

UPDATE SQLPortfolio.dbo.housing
SET OwnerAddress_Only = SUBSTRING(OwnerAddress,1,CHARINDEX(',',OwnerAddress)-1),
OwnerCity = SUBSTRING(SUBSTRING(OwnerAddress,CHARINDEX(',',OwnerAddress)+1,
LEN(OwnerAddress)),1,CHARINDEX(',',SUBSTRING(OwnerAddress,CHARINDEX(',',OwnerAddress)+1,
LEN(OwnerAddress)))-1),
OwnerState = SUBSTRING(SUBSTRING(OwnerAddress,CHARINDEX(',',OwnerAddress)+1,LEN(OwnerAddress)),
CHARINDEX(',',SUBSTRING(OwnerAddress,CHARINDEX(',',OwnerAddress)+1,
LEN(OwnerAddress)))+1,LEN(SUBSTRING(OwnerAddress,CHARINDEX(',',OwnerAddress)+1,
LEN(OwnerAddress))));

In the above queries, the substring function was getting a bit complex. In order to reduce the complexity, I was researching online and came across the PARSENAME function. This function splits the data considering ‘.’ as the delimiter, so I used REPLACE function to convert ‘,’ as ‘.’ in the owner address field. We can use this as an alternative approach to the above method.

SELECT PARSENAME(REPLACE(OwnerAddress,',','.'),3) AS 'Address',
PARSENAME(REPLACE(OwnerAddress,',','.'),2) AS 'City',
PARSENAME(REPLACE(OwnerAddress,',','.'),1) AS 'State'
FROM SQLPortfolio.dbo.housing;

6. Let’s remove the duplicate entries in the table. I grouped by parcelID, legalReference and SaleDate. The first time a record has the same three attributes the count will be 1 and it will be 2 when another record has the same attributes. That is how we find the duplicate records. It is good practice to write a SELECT query every time before you want to delete anything using a DELETE query. This helps you to validate that you would delete the correct records or not.

WITH countCTE AS(
SELECT *,
ROW_NUMBER() OVER( PARTITION BY ParcelID, LegalReference, SaleDate ORDER BY UniqueID) AS 'count'
FROM SQLPortfolio.dbo.housing
)
SELECT COUNT(*)
FROM countCTE
WHERE count > 1;

WITH countCTE AS(
SELECT *,
ROW_NUMBER() OVER( PARTITION BY ParcelID, LegalReference, SaleDate ORDER BY UniqueID) AS 'count'
FROM SQLPortfolio.dbo.housing
)
DELETE
FROM countCTE
WHERE count > 1;

Finally, let’s DROP the columns that are not needed.

ALTER TABLE SQLPortfolio.dbo.housing
DROP COLUMN PropertyAddress, SaleDate, OwnerAddress;

Let’s list the data to validate that the required details are present.

SELECT *
FROm SQLPortfolio.dbo.housing;

That’s the end of data cleaning on this dataset.

Liked my work?