What is PolyBase?
PolyBase enables your SQL Server instance to process Transact-SQL queries that read data from external data sources. The same query can also access relational tables in your instance of SQL Server, so a single query can join data from external sources with data in SQL Server.
Azure integration
With PolyBase, T-SQL queries can also import and export data from Azure Blob Storage. In addition, PolyBase enables Azure Synapse Analytics to import and export data from Azure Data Lake Store and from Azure Blob Storage.
Why use PolyBase?
PolyBase allows you to join data from a SQL Server instance with external data. Before PolyBase, joining data with external data sources required you to either:
Transfer half your data so that all the data was in one location.
Query both sources of data, then write custom query logic to join and integrate the data at the client level.
PolyBase allows you to simply use Transact-SQL to join the data.
PolyBase does not require you to install additional software to your Hadoop environment. You query external data by using the same T-SQL syntax used to query a database table. The support actions implemented by PolyBase all happen transparently. The query author does not need any knowledge about the external source.
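For example, once an external table has been created (the setup steps are covered later in this post), external data can be joined to local tables with ordinary T-SQL. A minimal sketch, assuming a hypothetical external table [ext].[WebClicks] over files in Hadoop or Blob Storage and a local table [dbo].[Customer]:
-- Hypothetical example: join an external table with a local SQL Server table
SELECT c.CustomerName
     , COUNT(*) AS ClickCount
FROM [dbo].[Customer] AS c
JOIN [ext].[WebClicks] AS w   -- external table backed by files in Hadoop/Blob Storage
    ON w.CustomerID = c.CustomerID
GROUP BY c.CustomerName;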
PolyBase uses
PolyBase enables the following scenarios in SQL Server:
Query data stored in Hadoop from a SQL Server instance or PDW. Users are storing data in cost-effective distributed and scalable systems, such as Hadoop. PolyBase makes it easy to query the data by using T-SQL.
Query data stored in Azure Blob Storage. Azure blob storage is a convenient place to store data for use by Azure services. PolyBase makes it easy to access the data by using T-SQL.
Import data from Hadoop, Azure Blob Storage, or Azure Data Lake Store. Leverage the speed of Microsoft SQL's columnstore technology and analysis capabilities by importing data from Hadoop, Azure Blob Storage, or Azure Data Lake Store into relational tables. There is no need for a separate ETL or import tool.
Export data to Hadoop, Azure Blob Storage, or Azure Data Lake Store. Archive data to Hadoop, Azure Blob Storage, or Azure Data Lake Store to achieve cost-effective storage and keep it online for easy access.
Integrate with BI tools. Use PolyBase with Microsoft's business intelligence and analysis stack, or use any third party tools that are compatible with SQL Server.
PolyBase Setup
Create a master key
Create a database scoped credential with the storage key
Create an external data source
Create an external file format
Create an external table
Load from the external table
Let's start migrating data to Blob Storage using PolyBase.
Step 1: Export the table to a flat file
First, I will export the FactTransactionHistory data from the AdventureWorksDW2017 database to a flat file, because PolyBase works with flat files.
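The export can be done with bcp (queryout) or the SQL Server Import and Export Wizard. Whatever tool you use, the column list in the flat file must line up with the external table definition created in Step 4. A sketch of the export query, assuming the source table is in the dbo schema with these column names:
-- Columns to export to the flat file; order must match the external table in Step 4
SELECT [TransactionID]
     , [ProductKey]
     , [OrderDate]
     , [Quantity]
     , [ActualCost]
FROM [dbo].[FactTransactionHistory];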
Step 2: Create Blob Storage Account
After exporting the flat file, we need somewhere to store the data. We can use Azure Blob Storage for that. Create an Azure Blob Storage account to store the FactTransactionHistory data file.
Step 3: Upload flat file to Blob Storage Account
There are several ways to upload the data to Blob Storage. I will use Microsoft Azure Storage Explorer to upload the file.
Open the Microsoft Azure Storage Explorer application and sign in with your Microsoft account.
Then go to the storage account under your subscription and open Blob Containers. After that, click the Upload button.
![](https://static.wixstatic.com/media/fc3403_1994d328ccfc4c05b180643910eb3c5c~mv2.png/v1/fill/w_980,h_427,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/fc3403_1994d328ccfc4c05b180643910eb3c5c~mv2.png)
In the popup, choose the file to upload by clicking the browse button, and then upload the file.
![](https://static.wixstatic.com/media/fc3403_f116a59c14a440ef8823ff02e99c8f18~mv2.png/v1/fill/w_500,h_486,al_c,q_85,enc_auto/fc3403_f116a59c14a440ef8823ff02e99c8f18~mv2.png)
Step 4: Run the PolyBase 6-step process
1. Create a Database Master Key. This is only necessary if one does not already exist; it is required to encrypt the credential secret in the next step.
You then create a Database Scoped Credential to store your secret. When authenticating with a service principal (an Azure Active Directory application), the Database Scoped Credential stores the service principal credentials set up in Azure AD.
Here, we use the Database Scoped Credential to store the storage account key for Blob Storage.
CREATE MASTER KEY;
GO
2 (for blob storage key authentication): Create a database scoped credential
IDENTITY: Provide any string; it is not used for authentication to Azure storage.
SECRET: Provide your Azure storage account key.
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH
IDENTITY = 'blobuser',
SECRET = 'blobstorage secret key'
;
GO
3 (for blob): Create an external data source
TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure Blob Storage.
LOCATION: Provide the Blob Storage container and storage account name as a wasbs:// URI.
CREDENTIAL: Provide the credential created in the previous step.
CREATE EXTERNAL DATA SOURCE AzureBlobStorage
WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://demofiles@<Storage account name>.blob.core.windows.net',
CREDENTIAL = BlobStorageCredential
);
GO
4: Create an external file format
FIELD_TERMINATOR: Marks the end of each field (column) in a delimited text file
STRING_DELIMITER: Specifies the delimiter for data of type string in the text-delimited file.
DATE_FORMAT: Specifies a custom format for all date and time data that might appear in a delimited text file.
USE_TYPE_DEFAULT: Specifies whether to store missing values as the default value for their data type.
CREATE EXTERNAL FILE FORMAT CSVFileFormat
WITH
( FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS ( FIELD_TERMINATOR = ','
, STRING_DELIMITER = ''
, DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss'
, USE_TYPE_DEFAULT = FALSE
)
);
GO
5: Create an External Table
LOCATION: Path to the file or folder, relative to the root of the container defined in the external data source.
DATA_SOURCE: Specifies which Data Source Object to use.
FILE_FORMAT: Specifies which File Format Object to use
REJECT_TYPE: Specifies how you want to deal with rejected rows: either an absolute value or a percentage of the total.
REJECT_VALUE: Sets the Reject value based on the reject type.
Important Note: External Tables are strongly typed.
This means that each row of the data being ingested must satisfy the table schema definition.
If a row does not match the schema definition, the row is rejected from the load.
CREATE SCHEMA [stage];
GO
CREATE EXTERNAL TABLE [stage].FactTransactionHistory
(
[TransactionID] [int] NOT NULL,
[ProductKey] [int] NOT NULL,
[OrderDate] [datetime] NULL,
[Quantity] [int] NULL,
[ActualCost] [money] NULL
)
WITH
(
LOCATION='/<file Path>'
, DATA_SOURCE = AzureBlobStorage
, FILE_FORMAT = CSVFileFormat
, REJECT_TYPE = VALUE
, REJECT_VALUE = 0
)
GO
6: CREATE TABLE AS SELECT (CTAS)
CTAS creates a new table and populates it with the results of a select statement.
CTAS defines the new table to have the same columns and data types as the results of the select statement.
If you select all the columns from an external table, the new table is a replica of the columns and data types in the external table.
CREATE SCHEMA [prod];
GO
CREATE TABLE [prod].[FactTransactionHistory]
WITH (DISTRIBUTION = HASH([ProductKey] ) )
AS
SELECT * FROM [stage].[FactTransactionHistory]
OPTION (LABEL = 'Load [prod].[FactTransactionHistory1]');
Step 5: Monitor and confirm successful migration
While the load is running, you can monitor the number of threads (Writer, External_Reader, Hash_converter).
-- check the progress of the load
SELECT * FROM sys.dm_pdw_exec_requests r
JOIN sys.dm_pdw_dms_workers w ON r.request_id = w.request_id
WHERE r.[label] = 'Load [prod].[FactTransactionHistory1]'
ORDER BY w.start_time DESC;
Step 6: Confirm the 60 distributions in the destination table
First, take a count of the records and compare it with the source table.
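A minimal check, assuming both tables are queried from the same Synapse database: count the rows in the external (stage) table, which reads straight from the flat file in Blob Storage, and in the newly loaded prod table.
-- Row counts should match each other and the source table the flat file was exported from
SELECT COUNT_BIG(*) AS StageRows FROM [stage].[FactTransactionHistory];
SELECT COUNT_BIG(*) AS ProdRows  FROM [prod].[FactTransactionHistory];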
Then check whether the data is spread (approximately) equally across the 60 distributions.
DBCC PDW_SHOWSPACEUSED('prod.FactTransactionHistory');
Good Luck!