Nashville Housing Data Cleaning
Nashville Housing Data Cleaning using Microsoft SQL Server. Like our house, we love our data clean.
The Data
Home value data for the booming Nashville market with 56,000+ rows altogether. The dataset can be found here.
Standardize Date Format
The SaleDate column contains date and time stamps. The time stamp could be removed because it only contains 00:00:00.000.
ALTER TABLE NashvilleHousing
ADD SaleDateConverted DATEUPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(DATE, SaleDate)SELECT SaleDateConverted
FROM PortfolioProject..NashvilleHousing
Populate Property Address data
There are 29 rows with null property address.
SELECT *
FROM PortfolioProject..NashvilleHousing
WHERE PropertyAddress IS NULL
ORDER BY ParcelID
If we look closer, the unique unit [UniqueID] can be in the same parcel [ParcelID]. Each parcel only had one address, so more than one unit could have the same property address if they are in the same parcel.
SELECT *
FROM PortfolioProject..NashvilleHousing
-- WHERE PropertyAddress IS NULL
ORDER BY ParcelID
Using self-join, we could populate the null property address with a property address that had the same ParcelID.
SELECT a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress
FROM PortfolioProject..NashvilleHousing AS a
JOIN PortfolioProject..NashvilleHousing AS b
ON a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
WHERE a.PropertyAddress IS NULL
ORDER BY a.ParcelID
UPDATE a
SET PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM PortfolioProject..NashvilleHousing AS a
JOIN PortfolioProject..NashvilleHousing AS b
ON a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
WHERE a.PropertyAddress IS NULL
After the update, we could check if there are any null rows left & if the updated rows are filled with the correct address.
SELECT a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress
FROM PortfolioProject..NashvilleHousing AS a
JOIN PortfolioProject..NashvilleHousing AS b
ON a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
WHERE a.PropertyAddress IS NULL
ORDER BY a.ParcelID
SELECT a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress
FROM PortfolioProject..NashvilleHousing AS a
JOIN PortfolioProject..NashvilleHousing AS b
ON a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
-- WHERE a.PropertyAddress IS NULL
ORDER BY a.ParcelID
Breaking out Address into Individual Columns (Address, City, State)
The PropertyAddress column contains the address and the city the property is located. We could separate the address and the city into different columns for future analysis purposes.
SELECT
SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) - 1),
SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1, LEN(PropertyAddress))
FROM PortfolioProject..NashvilleHousing
ALTER TABLE NashvilleHousing
ADD PropertySplitAddress NVARCHAR(255)UPDATE NashvilleHousing
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) - 1)ALTER TABLE NashvilleHousing
ADD PropertySplitCity NVARCHAR(255)UPDATE NashvilleHousing
SET PropertySplitCity = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1, LEN(PropertyAddress))SELECT * FROM PortfolioProject..NashvilleHousing
We could do the same with the OwnerAddress column which consists of address, city, and state and could be split into new columns that are OwnerSplitAddress, OwnerSplitCity, and OwnerSplitState.
SELECT
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3),
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2),
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1)
FROM PortfolioProject..NashvilleHousing
ALTER TABLE NashvilleHousing
ADD OwnerSplitAddress NVARCHAR(255)UPDATE NashvilleHousing
SET OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 3)ALTER TABLE NashvilleHousing
ADD OwnerSplitCity NVARCHAR(255)UPDATE NashvilleHousing
SET OwnerSplitCity = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 2)ALTER TABLE NashvilleHousing
ADD OwnerSplitState NVARCHAR(255)UPDATE NashvilleHousing
SET OwnerSplitState = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 1)
Change Y and N to Yes and No in “Sold as Vacant” field
There are some inconsistencies in the SoldAsVacant column. We could standardize it to only contain ‘Yes’ and ‘No’ categories.
SELECT DISTINCT(SoldAsVacant), COUNT(SoldAsVacant)
FROM PortfolioProject..NashvilleHousing
GROUP BY SoldAsVacant
ORDER BY 2
UPDATE NashvilleHousing
SET SoldAsVacant =
CASE WHEN SoldAsVacant = 'Y' THEN 'Yes'
WHEN SoldAsVacant = 'N' THEN 'No'
ELSE SoldAsVacant
ENDSELECT DISTINCT(SoldAsVacant), COUNT(SoldAsVacant)
FROM PortfolioProject..NashvilleHousing
GROUP BY SoldAsVacant
ORDER BY 2
What We Did
- Standardizing date format
- Populating null property address
- Breaking property and owner address into new columns
- Standardizing ‘Yes’ and ‘No’ category
Having clean data will ultimately increase overall productivity and allow for the highest quality information in our decision-making.