Power BI Experiences

I’ve spent quite a bit of time over the last few days trying to get the Power BI preview to work for accessing a tabular model, so I am sharing my experience. Based on that experience, I am making some assertions that are not in the Microsoft documentation, although most of this can be found on various other blogs from users who have tried to get this working.

Below is a screen snapshot of a demo that my adult son, Blake, built in his spare time. My thanks to Blake for his work on the tabular model and report.

image

My conclusion is that the Power BI configuration is extremely complex, difficult to troubleshoot and requires Active Directory infrastructure configuration. However, once it is working it is seamless to develop reports from a single Power BI site that access tabular models exposed by other organizations without the need to store the data locally.

Here are the steps that I had to do – your mileage may vary and I’m happy if you didn’t have to go through all of this:
1)    Power BI can only access a tabular model on a machine joined to a domain.
2)    The service account registered with Power BI must be defined in Active Directory with a full UPN exactly matching the Power BI logon. For example, if the logon account is john.doe@company.onmicrosoft.com, then there must be an account in Active Directory registered as john.doe@company.onmicrosoft.com. This is because Power BI sends all requests for tabular data using the “effectiveusername” technique (see the profiler trace at the end of this post). This requires the following steps:
a.    Use the Active Directory Domains and Trusts tool and select Properties at the root level to get the UPN Suffixes dialog shown below, then add the UPN suffix that matches the Power BI domain name (i.e. company.onmicrosoft.com).
 image
Note that multiple UPNs may be added, but only one can be registered to a specific A/D account.
b.    Use Active Directory Users and Computers to identify or create users mapped to the UPN suffix. Use caution when creating Power BI accounts that are the same as the company domain accounts unless the Power BI domain is the same as the standard UPN suffix. Best practice is to register for Power BI using the company’s email domain rather than relying on company-name.onmicrosoft.com. Below is an example of using a different UPN than the corporate domain:

 

image
 
The only article I’ve found on this is https://www.artisconsulting.com/blogs/greggalloway/Lists/Posts/Post.aspx?ID=27
Power BI uses a semantic model connection technique with effectiveusername – see https://msdn.microsoft.com/en-us/library/gg471575.aspx for an explanation.
c.    This needs to be done for all the Power BI user accounts that consume the model, not just the one under which the Analysis Services Connector is configured.

3)    Depending on the firewall restrictions, port 8051 (HTTP transport) or 8050 (HTTPS) may need to be opened. The documentation states that this is only needed for OData, but I have found it necessary in my environment even for tabular only (http://blogs.technet.com/b/powerbisupport/archive/2014/12/19/failed-to-register-the-gateway-key.aspx). For Azure, use the endpoint functionality to open either 8051 for HTTP or 8050 for HTTPS so that the Analysis Services Connector will work. If using 8051, all data will be transmitted in plain text. However, the credentials themselves appear to be tokenized between the connector and the Power BI site, so username/password theft does not appear to be a risk. I’ve not yet done extensive network sniffing to evaluate the risks of not using SSL.

Only after the above steps are done can a connector be defined that actually allows the tabular data to be piped back to Power BI.

4)    Configure the Analysis Services Connector using the instructions at http://support.powerbi.com/knowledgebase/articles/471577-configure-a-power-bi-analysis-services-connector. When configuring the connector, you are asked to log in to Power BI again. Use the Power BI account configured in step 2 that has the required UPN suffix.
5)    At this point, Power BI should be able to connect directly to the tabular model, and Excel reports that use the tabular model can be uploaded and should function.

Note the profiler trace of the SSAS tabular connection via the gateway – the gateway was actually configured using serviceaccount@authintel.com, yet the connection is made on behalf of the specific Power BI user. This means that all users must be granted read access on the model using the full UPN that maps to their Power BI account.

…<PropertyList xmlns="urn:schemas-microsoft-com:xml-analysis" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><Catalog>DemoSales_Tabular</Catalog><LocaleIdentifier>1033</LocaleIdentifier>
<EffectiveUserName>john.doe@yourcompany.com</EffectiveUserName><sspropinitappname>PowerBI</sspropinitappname></PropertyList>
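
To double-check which identity SSAS actually sees for gateway connections, the standard DISCOVER_SESSIONS schema rowset can be queried against the tabular instance from an MDX window in SSMS; this is just a quick sketch:

-- Lists current sessions and the user identity each one is running under,
-- which should show the EffectiveUserName-mapped Power BI accounts
SELECT SESSION_ID,
       SESSION_USER_NAME,
       SESSION_LAST_COMMAND
FROM $SYSTEM.DISCOVER_SESSIONS;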

Posted in Uncategorized

Data Security and the Lack thereof

Americans are increasingly becoming aware of the vulnerability of their private data, with breaches now affecting millions of records that have disclosed not only credit card information but even more dangerous information such as social security numbers linked to birthdates, names, and addresses. A name and address along with a birth date and social security number is a virtual ticket to complete identity theft, creating the risk of massive financial damage. There are multiple pieces of data that are private to an individual, but the social security number (SSN) is the most critical and is the focus of this post; the same principles apply to securing any other private data.

How did we get to this point? Is this just the inevitable consequence of Internet access, distributed systems, and evolving technology? Or is much of this due to the irresponsibility of those that hold the most confidential aspects of your data? In this post, I make the argument that it is due to the carelessness of the organizations that hold such data, and that there are ways to safeguard data not only against external threats but also against the more powerful threats from within an organization.

This post first articulates the drivers behind the situation, which fall into two main areas: wrong assumptions and lack of diligence. Next, it identifies the architectural principles of software and infrastructure that mitigate the risk of a security breach and contrasts them with what most organizations believe is adequate. Finally, it offers practical strategies to remediate the situation.

Wrong Assumptions

The most common wrong assumption is that security threats come primarily from outside an organization rather than from within. Studies have shown this is not the case. Even in the case of an external attack, there is often an insider involved, such as a distraught employee who gives away or even sells a username and password that provides access to confidential data. Organizations spend large amounts of money on firewalls and external network monitoring, but how much do they invest to protect the confidentiality of the data within their own network?

Many organizations operate on the assumption that their employees’ usernames and passwords could never be compromised and do not consider the impact of an outsider using a stolen username/password to gain access to the network. Many allow simple username and password access to a VPN without any other physical validation such as a smart card or token device. But even those that have restrictions to minimize the risk of an external user entering the network with corporate credentials are often careless about private data, including the SSN and birthdate, with overly loose restrictions on access and often no auditing. A disgruntled user who turns dark has the potential to create just as much damage as any hacker from North Korea.

I find it interesting that virtually every break-in is blamed on sophisticated attacks by foreign entities or terrorists. In reality, most break-ins require very little sophistication and are probably mostly home-grown, even when foreign entities are involved.

I have had the opportunity to work on database, architecture, and application development projects over the years. Many of those customers stored SSNs unencrypted and lacked effective safeguards to audit access to that data. If I were a criminal, I could have retired a long time ago. Ten million SSNs, along with birthdates and complete identifying information, can be dumped from a database into a .CSV file in a matter of seconds. Such a file can be zipped up, encrypted, renamed to .txt, and is small enough to be emailed as an attachment from a web mail client with virtually no detection. That is how easy it is to completely confiscate social security numbers within many organizations. At $25.00 per complete private identity record (street value), 10 million records works out to a $250 million payoff.

Lack of Regulation

It amazes me that our government, which is supposedly so concerned about protecting the welfare of its citizens, has created virtually no regulatory standards for the storage and retrieval of private data. Yes, there are requirements for secure external interfaces such as SSL, but that is only one piece of a secure infrastructure. I speak from experience in working with state and federal entities and have found that the vast majority of databases that hold SSNs do not even store them encrypted. Worse, the SSN is stored right alongside all of the information needed to identify the person linked to the number, including complete name, address, and often birthdate. The complacency with regard to social security numbers is startling given that a person cannot change one; once compromised, that person is at risk of identity theft for the rest of their life. At least when a password is compromised, it can be changed. There is no excuse for any organization to store social security numbers unencrypted in a database. This is a worse violation of security than storing a password in plain text.

Lack of Diligence

The fact that the government has not imposed regulations requiring companies to store private data encrypted is no excuse for organizations. The technology exists not only to encrypt private data but to ensure that it remains encrypted throughout the lifecycle of its access and use. Even in organizations that have made attempts to secure this data, huge gaps are usually left. Consider the following ineffective strategies that are often employed:

1) Data at rest security: This is the concept that data as it exists on any file system should be encrypted so that even if a database is stolen, it cannot simply be moved to another system and read. Microsoft SQL Server Transparent Data Encryption (TDE) is an example of this approach (see the sketch after this list). While this is useful to protect against physical data theft, it is useless for protection from internal threats unless a framework of auditing and security has been placed around access to the system containing the data. Secure data at rest systems still reveal private data to those that are authorized through the interfaces that access the data.

2) Reversible encryption: The problem with this is in the name – it is reversible. SQL Server also provides the capability to encrypt particular data columns such as an SSN (see the sketch after this list). I have seen this utilized, but the fact that the encryption is reversible implies that it will be decrypted at some point downstream. In systems that use this, the private key information is often included in the views or stored procedures, so the data is only secure against direct table access. Even where the decryption is deferred to the application layer, the data is still decrypted and available to anybody who has the key. Worse, the security of the data is totally dependent on a supposedly private key that has to be defined somewhere in code. While reversible encryption provides some protection, it is only as good as the secrecy of the encryption key, and anybody with access to that key can go in and exploit the data.

3) Partial obfuscation: Even the last 4 digits commonly used for validation along with other PI data should not be stored in plain text or with reversible encryption. A partial number such as the last 4 digits is still a ticket to identity theft when combined with other data. Any portion of the SSN that is involved in identification must be non-reversibly encrypted to minimize identity theft risk.
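
For reference, here is what the first two approaches look like in SQL Server, as minimal sketches only; the database, certificate, key, and column names are hypothetical placeholders. First, enabling TDE (data at rest):

USE master;
GO
-- TDE prerequisites: a database master key and a certificate to protect the encryption key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'StrongPasswordHere!1';
CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate';
GO
USE CustomerDb;   -- hypothetical database
GO
CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_256
    ENCRYPTION BY SERVER CERTIFICATE TdeCert;
ALTER DATABASE CustomerDb SET ENCRYPTION ON;
GO

And second, reversible column encryption with a symmetric key, which illustrates the weakness described above: anybody who is able to open the key sees the plain text again.

USE CustomerDb;   -- hypothetical database and column names
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'AnotherStrongPassword!1';
CREATE CERTIFICATE SsnCert WITH SUBJECT = 'SSN column encryption';
CREATE SYMMETRIC KEY SsnKey
    WITH ALGORITHM = AES_256
    ENCRYPTION BY CERTIFICATE SsnCert;
GO
OPEN SYMMETRIC KEY SsnKey DECRYPTION BY CERTIFICATE SsnCert;
-- Encrypt on write (SsnEncrypted is assumed to be a VARBINARY column)...
UPDATE dbo.Customer SET SsnEncrypted = ENCRYPTBYKEY(KEY_GUID('SsnKey'), Ssn);
-- ...but anyone who can open the key can decrypt on read
SELECT CustomerId, CONVERT(VARCHAR(11), DECRYPTBYKEY(SsnEncrypted)) AS Ssn
FROM dbo.Customer;
CLOSE SYMMETRIC KEY SsnKey;
GO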

What is really needed

Earlier, I suggested that the only real solution to protecting private data such as social security numbers is to treat them with the same respect as passwords. This implies using non-reversible encryption, where there is no way to decrypt the stored value. The only way to match against a non-reversibly encrypted value is to actually know the original value: the known value is encrypted and compared against the stored encrypted value. But is this practical? There are many arguments for why social security numbers cannot be encrypted this way. Below is a list of them and why none of them are valid excuses:

1) The SSN is used to link data together. This is easily resolved through the use of surrogate keys.

2) Users need to search on an SSN or partial SSN to verify identity – i.e. a credit agency will need the user’s social security number as one piece of data to verify that the person is authorized for the information requested. This is not a problem for a non-reversible encryption scheme. The full SSN as well as the last 4 digits are each stored in the database using non-reversible encryption. The user enters the last 4 digits in a web form over a secure connection, and the web client then applies the same encryption scheme and passes the result along with other identifying information to find a match. Note that with non-reversible encryption, whether it is an MD5 hash or some other technique, the encryption scheme does not even need to be kept private: since there is no way to reverse the encrypted values, knowing the scheme does not circumvent the security.

The bottom line is that even organizations that need to utilize SSNs for validation purposes have no excuse to store the data in plain text or even with a reversible encryption scheme. At no point should such data be unencrypted. Instead, such data should be permanently one-way encrypted (non-reversible). Validation using the SSN should always be done with partial strings – the last 4 digits rather than the whole number. The full SSN does not need to be used, since the last four digits can be encrypted separately. In this approach, the last part of the SSN presented by the user is immediately encrypted and used to look up a matching record in the database, using the encrypted last 4 digits along with some other personal identification, rather than the plain text value.
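
Here is a minimal sketch of that pattern in T-SQL, assuming a hypothetical dbo.Customer table; the per-row salt is an extra precaution not mentioned above, but it keeps identical SSNs from producing identical hashes:

-- Store only one-way hashes of the full SSN and of the last 4 digits
ALTER TABLE dbo.Customer ADD
    SsnSalt   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    SsnHash   VARBINARY(32) NULL,
    Last4Hash VARBINARY(32) NULL;
GO
-- One-time population from the existing plain-text column (which is then dropped)
UPDATE dbo.Customer
SET SsnHash   = HASHBYTES('SHA2_256', CONVERT(VARCHAR(36), SsnSalt) + Ssn),
    Last4Hash = HASHBYTES('SHA2_256', CONVERT(VARCHAR(36), SsnSalt) + RIGHT(Ssn, 4));
GO
-- Validation: hash what the caller typed and compare; nothing is ever decrypted
DECLARE @Last4 CHAR(4) = '1234', @LastName VARCHAR(50) = 'Doe';   -- hypothetical inputs
SELECT CustomerId
FROM dbo.Customer
WHERE LastName = @LastName
  AND Last4Hash = HASHBYTES('SHA2_256', CONVERT(VARCHAR(36), SsnSalt) + @Last4);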

Strategies

What is an organization to do if it finds itself in the position of having SSNs in plain text or with a reversible encryption scheme, especially when they are linked to other private data and there are applications dependent on these structures? The first step is to start to remediate the situation immediately. This needs to be a top priority for any business, even given the government’s failure to make it a priority. Consider the risks, the costs, and the litigation, not to mention the tarnished reputation, when a breach occurs. And it will occur; or worse, it may already have occurred and the business isn’t even aware of it. Few organizations audit activity at the level of detail needed to even know whether private data has been breached; they usually only find out after the data has already been used fraudulently.

It may take months or years to remediate all the systems within an organization in a way that still allows the organization to do business. In keeping with the principle of due diligence, what can an organization do until it reaches the ideal state of securing the data? The key is to go after the low-hanging fruit and put safeguards in place while the core issue is being addressed. One way to do this is through a gatekeeper mentality. If you have to send an important package, you request tracking from the mail service. In the same way, confidential data must be tracked from end to end. That is the first step toward protection. It is not adequate, but it is a start.

The Gatekeeper Approach

Below are 5 steps to implementing a gatekeeper approach:

1) Ensure that all access to private data is captured. This includes the source from which the data was retrieved, the account used to access the data, and everything that happens to the data afterwards. All aspects of data movement must be tracked, including not only the database access but also any user manipulation of the data after it has been retrieved. This means that both database auditing and network auditing must be implemented.

2) Define a repository for storing the complete lifecycle of data access corresponding to the actions captured in #1. For every piece of sensitive data, every access to the information must be recorded. There are ways to implement this framework in SQL Server through the use of custom profiler traces as well as extended events and logon triggers (see the sketch after this list). I cannot speak to how to do this at the network level, but I know that it is possible.

3) Define a dictionary of rules that identifies the types of access allowed to the data. As data movements are captured, exceptions will be encountered. Safeguards can be put in place to block movements that are not defined in the dictionary. A major part of this development is training, so that the system learns from previously sanctioned activities and can recognize unauthorized access.

4) Stop movements that are not allowed by the dictionary. This may result in some pain as individuals with legitimate requests are blocked.

5) Put a learning framework in place with continuous feedback so that the systems can learn from mistakes and improve.
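
As one concrete way to capture the database side of step 1 and feed the repository in step 2, a SQL Server Audit (which is built on extended events) can record every read and write against a table holding SSNs. This is only a sketch: the audit names, database, table, and file path are hypothetical.

USE master;
GO
-- Server-level audit writing to a file target
CREATE SERVER AUDIT PrivateDataAudit
    TO FILE (FILEPATH = 'D:\Audit\');             -- hypothetical path
ALTER SERVER AUDIT PrivateDataAudit WITH (STATE = ON);
GO
USE CustomerDb;                                    -- hypothetical database
GO
-- Capture every SELECT/INSERT/UPDATE/DELETE against the sensitive table, by anyone
CREATE DATABASE AUDIT SPECIFICATION CustomerSsnAccess
    FOR SERVER AUDIT PrivateDataAudit
    ADD (SELECT, INSERT, UPDATE, DELETE ON dbo.Customer BY public)
    WITH (STATE = ON);
GO
-- Review the captured access trail
SELECT event_time, server_principal_name, statement
FROM sys.fn_get_audit_file('D:\Audit\*.sqlaudit', DEFAULT, DEFAULT);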

Architecture Remediation

Databases: Any organization that is storing sensitive data unencrypted, or even with reversible encryption, has been following a flawed data architecture. All databases and applications should be reviewed for compliance with the principles of non-reversible encryption as well as complete tracking of all such data. Systems that are in violation need to be prioritized for conversion to a secure approach, weighing the business impact of the downtime required to change them.

Interfaces: In addition to databases, interfaces need to be analyzed and modified to preserve the security of SSNs and other private data in flight. The service accounts associated with all processes must be bound to the interfaces using techniques such as IPsec, so that compromise of the service account password does not create a risk. It is not enough to simply have secure interfaces; the data within the secure interface must also be obfuscated so that if the interface is compromised, there is little to gain from the data being transmitted.

User interfaces: User applications that simply retrieve data, including unencrypted SSNs, must be modified to work with non-reversibly encrypted SSNs, and components need to be added that validate an SSN by re-encrypting the portion needed to check for a match rather than by matching plain text in the database.

In Summary

This post just scratches the surface of the data security issue; I hope to write more in the upcoming months. It takes a focused team of database experts, network experts, application developers, architects, and user-interface experts working together to remediate this issue. The time is now to insist that organizations become accountable for protecting their customer data, and this includes government agencies. Partial obfuscation with just the last 4 digits unencrypted is almost as bad as having the whole string unencrypted, since it provides the means for impersonation to acquire the full SSN.

As a citizen, I encourage you to lobby your congressman for regulations that require protection of private data. Developers, I encourage you not to even consider transmitting any piece of private PI without using non-reversible encryption schemes. CEOs, I encourage you to place private data security at the top of the risk list and treat it with the urgency of the Y2K crisis. DBAs, I encourage you to educate your organization and insist on not storing private data without non-reversible encryption. Architects, do not even consider a design that does not include complete auditing of all sensitive data flows or that allows transmission of an SSN in plain text, even within the context of an SSL interface.

Posted in Uncategorized

Home-grown PCIE SSD VM vs. Azure (Update 12/7)

More testing today with a larger database: I multiplied the number of rows by 10x, up to 1 billion rows in the test dataset. The latest results indicate I/O about 3 times slower on an Azure D13 server than on a dedicated VM with similar CPU/RAM specs backed by a Gen-1 Fusion-io Duo RAID. The time to insert the test record set was also about 2.5 times slower. This aligns with the performance metrics outlined at http://azure.microsoft.com/blog/2014/10/06/d-series-performance-expectations/. The D series scales upward in terms of SSD capability, so the only way to achieve the I/O of the VM would be to upgrade to a D14 configuration with an 800 GB SSD. Along with that come more cores and memory (16 cores and 112 GB RAM), but that should not significantly affect the I/O comparison.

Here are the bottom line backup logs:

From the dedicated VM (48 GB RAM, 6 cores @ 2.66 GHz, Dell-branded Fusion-io 640 GB MLC card):

Processed 1931504 pages for database ‘TestDb’, file ‘TestDb’ on file 1.
100 percent processed.
Processed 9 pages for database ‘TestDb’, file ‘TestDb_log’ on file 1.
BACKUP DATABASE successfully processed 1931513 pages in 26.102 seconds (578.114 MB/sec).

From the Azure D13 server (56 GB RAM, 8 cores @ 2.20 GHz, 400 GB SSD):

Processed 1931648 pages for database ‘TestDb’, file ‘TestDb’ on file 1.
100 percent processed.
Processed 9 pages for database ‘TestDb’, file ‘TestDb_log’ on file 1.
BACKUP DATABASE successfully processed 1931657 pages in 81.958 seconds (184.131 MB/sec).

I’m going to come up with some more tests to perform, such as how long it takes to provision a VM from scratch, how long to provision from an image, boot-up times, and other metrics that are significant for somebody who wants to be able to quickly provision VMs. I’ll compare these with the times from the colocation environment and post the results in a new article.

Here is the configuration I am running on Azure. Running the server hasn’t been too expensive yet (about $5.00 in the last 24 hours), but I have only put a load on it for a few minutes.

image

These are the main findings to this point:

1) For all but a basic/small SQL database in the Azure environment, customers will probably want to utilize the ‘D’ type configuration with the SSD option as this improves performance by a factor of at least 5.

2) Even with the SSD option, the I/O is still 2 – 3 times slower than simply storing the VM, or at least the drive with the SQL data, on a Gen-1 Fusion-io card.

As these offerings mature to allow SSD storage to be enabled for fault tolerance with larger sizes and higher performance, the standard D offerings will become more appealing for larger customers. It would be helpful if more robust I/O capability could be had without the expense of additional cores, as databases tend to hit I/O bottlenecks well before they hit CPU bottlenecks.

Azure is a great environment for creating virtual machines quickly and testing things out. But the ‘A’ series is hard to justify as a database solution, even for a virtual machine provisioned with 8 cores and 56 GB RAM at a cost of over $1,000 per month. I also have a virtual machine configured on one of my servers at a colocation facility running SQL Server on PCIe SSD (Fusion-io) storage. That VM is configured with a similar number of cores and 64 GB of RAM. I created two identical databases, each with two tables of 10 million rows. After building out the data, I tested the SQL backup using the data compression option on both.
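
The backup test on each machine was essentially a compressed backup along these lines (the database name and path are placeholders):

BACKUP DATABASE TestDb
TO DISK = 'F:\Backup\TestDb.bak'       -- local SSD/Fusion-io target; path is hypothetical
WITH COMPRESSION, INIT, STATS = 10;    -- compression enabled, progress reported every 10 percent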

Here are the results for the ‘A’ type Azure VM side by side with the results from my VM on PCIe SSD. My VM beat the Azure Enterprise SQL VM with comparable cores and memory by 16x (685 MB/s vs. 39 MB/s and 4 seconds versus 64 seconds). SQL is all about the I/O.

imageimage

Posted in Azure, Cloud, SQL Server

Using INSTEAD OF Triggers to enable updatable views

A common misconception with SQL Server views is that only indexed views can provide update capability. Actually, any view that is schema-bound can be enabled for update simply by using an INSTEAD OF trigger. Although an indexed view may also be set up with a trigger, it is not required. The view does not need to obey the constraints of an indexed view other than being schema-bound, since the trigger writes to the base tables.

There are a few scenarios where this is useful. One is the case of tables that utilize a master-secondary design pattern. In many scenarios, there are central objects that manifest themselves in different forms requiring different attributes. Often the central objects need to be accessed singularly for their core attributes, so splitting the objects into multiple tables makes queries that need objects of both types together awkward and forces the use of UNION queries. The other complexity involved with separating objects by type into separate tables is that they often share relationships to common child objects. To support the different types of interfaces for the different object types, multiple views can be configured.

One example of this is a name-and-address system I worked on a long time ago for a non-profit organization. It turns out that donors may be organizations, individuals, or married couples, and different attributes are associated with each. To further complicate matters, a donor could also be a member – that is, a member could also be a donor. The member might also be married, and the organization needed to track the household unit associated with married members.

The solution was to create a generic entity object and then satellite objects for individual, household, and organization. This allowed the entity to be linked to outside systems without having to maintain multiple relationships, and allowed the relationships between the entities themselves to be maintained in another table. In this example, an organization could donate as well as an individual, and both would show up on a donation report, while the organization attributes were not muddled up with individual attributes.

Below is the database diagram representing the common entity approach for name/address. Note that in addition to the master-secondary relationships, there are relationships between the objects, such as the organizational contact for an organization entity or the spouse for a household entity. The diagram does not make the direction of the 1-to-1 relationships clear, but it is from Entity – that is, a row in Org, Household, or Person must first have a row in Entity with the same EntityId.

image

In order to support interfaces that need to insert/update different types of entities, views can be created for each specific type of entity. Each view is coupled with an INSTEAD OF trigger so that adding data through the view inserts into the underlying tables. Only the main Entity table uses the identity property; the other tables’ primary keys are all based on the Entity table key. The triggers below provide a couple of interfaces for adding new entities. The first inserts a single person and generates rows in both the Entity and Person tables.

The second trigger adds a new organization, with the assumption that the contact person has already been added as a person entity.

The third trigger shows how cascading can be used for even more elaborate scenarios, including inserting into other views. This version takes the columns for the contact person as well as the organization (rather than requiring the person entity key) and inserts into the Person table as well as the Org and main Entity tables. It leverages the v_OrgEntity trigger by inserting directly into that view for the organization rows.

CREATE VIEW v_PersonEntity WITH SCHEMABINDING
AS SELECT FirstName, LastName, Suffix, WorkPhone, CellPhone, HomePhone, PrimaryPhone
FROM dbo.Person p
GO

CREATE VIEW v_OrgEntity WITH SCHEMABINDING
AS SELECT ContactPersonId, FederalId, EntityName
FROM dbo.Org o
INNER JOIN dbo.Entity e
    ON o.EntityId = e.EntityId
GO

CREATE VIEW v_OrgEntity_WithContact WITH SCHEMABINDING
AS SELECT FederalId, EntityName, FirstName, LastName, WorkPhone
FROM dbo.Org o
INNER JOIN dbo.Entity e
    ON o.EntityId = e.EntityId
INNER JOIN dbo.Person p
    ON p.EntityId = o.ContactPersonId
GO

CREATE TRIGGER v_Person_ITrig ON dbo.v_PersonEntity
   INSTEAD OF INSERT
AS BEGIN
    IF (SELECT COUNT(*) FROM inserted) > 1 BEGIN
        RAISERROR ('View only supports adding 1 person at a time',16,1)
        RETURN
    END
    DECLARE @EntityID INT
    -- Create the main entity row
    INSERT INTO dbo.Entity (EntityName, EntityType, MainPhoneNumber)
    SELECT
        i.FirstName + ' ' + i.LastName, 'P'
        ,CASE
            WHEN PrimaryPhone = 'H' THEN HomePhone
            WHEN PrimaryPhone = 'C' THEN CellPhone
            WHEN PrimaryPhone = 'W' THEN WorkPhone
        END
    FROM inserted i

    -- Create the secondary person row
    SELECT @EntityId = SCOPE_IDENTITY()
    INSERT into dbo.Person (EntityId, FirstName, LastName, Suffix,
    WorkPhone, CellPhone, HomePhone, PrimaryPhone)
    SELECT @EntityId, FirstName, LastName, Suffix,
    WorkPhone, CellPhone, HomePhone, PrimaryPhone
    FROM inserted
END
GO

CREATE TRIGGER v_OrgEntity_ITrig on dbo.v_OrgEntity
   INSTEAD OF INSERT
AS BEGIN
    IF (SELECT COUNT(*) FROM inserted) > 1 BEGIN
        RAISERROR ('View only supports adding 1 organization at a time',16,1)
        RETURN
    END
    DECLARE @EntityID INT

    -- Create the main entity row
    INSERT INTO dbo.Entity (EntityName, EntityType)
    SELECT EntityName, 'O'
    FROM inserted i

    -- Create secondary org row
    SELECT @EntityId = SCOPE_IDENTITY()
    INSERT into dbo.Org
    (EntityId, ContactPersonId, FederalId)
    SELECT @EntityId, ContactPersonId, FederalId
    FROM inserted

END
GO

CREATE TRIGGER v_OrgEntity_WithContact_ITrig on dbo.v_OrgEntity_WithContact
   INSTEAD OF INSERT
AS BEGIN
    IF (SELECT COUNT(*) FROM inserted) > 1 BEGIN
        RAISERROR ('View only supports adding 1 organization at a time',16,1)
        RETURN
    END
    DECLARE @EntityID INT

    -- Could use the view to create the person and entity rows, but
    -- then there is a challenge with getting the identity inserted
    -- since SCOPE_IDENTITY is nulled out after the last add to the
    -- person table
    INSERT INTO dbo.Entity (EntityName, EntityType, MainPhoneNumber)
    SELECT
        i.FirstName + ' ' + i.LastName, 'P', WorkPhone
    FROM inserted i

    -- Create the secondary person row
    SELECT @EntityId = SCOPE_IDENTITY()
    INSERT into dbo.Person (EntityId, FirstName, LastName,
    WorkPhone, PrimaryPhone)
    SELECT @EntityId, FirstName, LastName,
    WorkPhone, 'W'
    FROM inserted

    -- Create the organization and entity row using the org view
    INSERT into dbo.v_OrgEntity
    (ContactPersonId, FederalId, EntityName)
    SELECT @EntityId, FederalId, EntityName
    FROM inserted
END
GO
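
As a quick usage illustration, a single insert through the combined view creates the contact person, the organization, and both Entity rows; the values below are made up:

-- One statement populates Entity, Person, and Org via the cascading triggers
INSERT INTO dbo.v_OrgEntity_WithContact
    (FederalId, EntityName, FirstName, LastName, WorkPhone)
VALUES ('12-3456789', 'Acme Charities', 'Jane', 'Doe', '555-0100');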

Below shows how all of the tables are affected by an insert into the single Org-with-contact view:

image

These triggers illustrate inserts; however, update and delete actions may also be handled with INSTEAD OF triggers. Below is an example that does both an insert and an update for one of the tables I use in the universal problem resolution framework from my dissertation. Also, the prior triggers only work for one-row inserts because of the need to capture the identity associated with the new entity, since triggers operate against the entire row set rather than on each row individually. In the example below, a temporary table and a pre-defined sequencing approach are used to support multiple simultaneous insertions. The COALESCE in the update causes omitted items to remain as they were before.

ALTER TRIGGER [Data].[view_Problem_Item_ITrig]
   ON  [Data].[view_Problem_Item]
   INSTEAD OF INSERT
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;
    BEGIN TRY
        BEGIN TRANSACTION
        --SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
        DECLARE @EntityTypeId MONEY = Data.udf_GetItemTypeId('Data.Entity')
        DECLARE @NextEntityId MONEY -- = Data.udf_GetNextItemId(@EntityTypeId)

        DECLARE @ProblemTypeId MONEY = Data.udf_GetItemTypeId('Data.Problem')
        DECLARE @NextProblemId MONEY -- = Data.udf_GetNextItemId(@ProblemTypeId)
        SELECT @NextProblemId = MAX(ItemId) FROM Data.Item  WITH (HOLDLOCK, UPDLOCK)

        SET XACT_ABORT ON

        SELECT
            (CONVERT(MONEY,ROW_NUMBER() OVER (ORDER BY ItemName)) / 10000.00) -- - .0001
             + @NextProblemId AS ItemId,
            ActivationTaskDetailId,
            DefineActionTreeId,
            SolveActionTreeId,
            ProblemDefinitionState,
            ProblemSolutionState,
            CurrentStepNumber,
            ItemName,
            TypeId,
            VersionNumber,
            IsSystemItem,
            ItemPath,
            AutoGenerated,
            RootProblemId,
            BranchProblemId,
            IsRoot,
            ActivationValue,
            ProblemCheckSum,
            BranchStepNumber,
            OutcomeReason
        INTO #Problem FROM inserted
        WHERE (ItemId IS NULL OR ItemId = 0)

        INSERT Data.Item  WITH (ROWLOCK)
            (
            ItemId,
            ItemName,
            TypeId,
            VersionNumber,
            IsSystemItem,
            ItemPath,
            AutoGenerated,
            RootProblemId
            )
        SELECT
            ItemId,
            ItemName,
            COALESCE(TypeId,@ProblemTypeId),
            COALESCE(VersionNumber,1.0),
            COALESCE(IsSystemItem, 0),
            COALESCE(ItemPath, ItemName),
            COALESCE(AutoGenerated,0),
            RootProblemId
        FROM #Problem

        INSERT Data.Problem  WITH (ROWLOCK)
        (
            ProblemId,
            ActivationTaskDetailId,
            DefineActionTreeId,
            SolveActionTreeId,
            ProblemDefinitionState,
            ProblemSolutionState,
            CurrentStepNumber,
            BranchProblemId,
            IsRoot,
            ActivationValue,
            ProblemCheckSum,
            BranchStepNumber,
            OutcomeReason
        )
        SELECT
            ItemId,
            ActivationTaskDetailId,
            DefineActionTreeId,
            SolveActionTreeId,
            ProblemDefinitionState,
            ProblemSolutionState,
            COALESCE(CurrentStepNumber,0.0),
            BranchProblemId,
            COALESCE(IsRoot,0),
            ActivationValue,
            ProblemCheckSum,
            BranchStepNumber,
            OutcomeReason
        FROM #Problem

        UPDATE item  WITH (ROWLOCK)
        SET ItemName = COALESCE(i.ItemName, item.ItemName),
            TypeId = COALESCE(i.TypeId, item.TypeId),
            VersionNumber = COALESCE(i.VersionNumber,item.VersionNumber),
            IsSystemItem = COALESCE(i.IsSystemItem,item.IsSystemItem),
            ItemPath = COALESCE(i.ItemPath,item.ItemName),
            AutoGenerated = COALESCE(i.AutoGenerated, item.AutoGenerated),
            RootProblemId = COALESCE(i.RootProblemId, item.RootProblemId)
        FROM inserted i
        INNER JOIN Data.Item item 
            ON i.ItemId = item.ItemId 
        WHERE (i.ItemId IS NOT NULL AND i.ItemId <> 0)

        UPDATE p WITH (ROWLOCK)
        SET ActivationTaskDetailId = COALESCE(i.ActivationTaskDetailId, P.ActivationTaskDetailId),
            DefineActionTreeId = COALESCE(i.DefineActionTreeId, p.DefineActionTreeId),
            SolveActionTreeId = COALESCE(i.SolveActionTreeId, p.SolveActionTreeId),
            ProblemDefinitionState = COALESCE(i.ProblemDefinitionState, p.ProblemDefinitionState),
            ProblemSolutionState = COALESCE(i.ProblemSolutionState, p.ProblemSolutionState),
            CurrentStepNumber = COALESCE(i.CurrentStepNumber, p.CurrentStepNumber),
            BranchProblemId = COALESCE(i.BranchProblemId, p.BranchProblemId),
            ActivationValue = COALESCE(i.ActivationValue, p.ActivationValue),
            ProblemCheckSum = COALESCE(i.ProblemCheckSum, p.ProblemCheckSum),
            OutcomeReason = COALESCE(i.OutcomeReason, p.OutcomeReason),
            Complexity = COALESCE(i.Complexity, p.Complexity)
        FROM inserted i
        INNER JOIN Data.Problem p
            ON i.ItemId = p.ProblemId 
        WHERE (i.ItemId IS NOT NULL AND i.ItemId <> 0)

        DROP TABLE #Problem
       
        COMMIT
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK
        EXEC dbo.usp_RethrowError
    END CATCH

END

Posted in Uncategorized

Solving Sudoku with T-SQL

Note: This is an update to the earlier post to include the query that actually displays the solved puzzle (for the skeptics…).

Recently I had to make a short trip and took a look at a Sudoku puzzle in the airline magazine. In the past, I’ve posted some about automating problem solving, and I just finished a PhD in this area that outlined an automated technique for algorithm discovery through analyzing solution sequences from simulations.

What intrigued me about the Sudoku scenario is that it lends itself to a Cartesian product solution approach – i.e. rows, columns, and groupings must all contain specific values. Based on that, it should be solvable with relational algebra. I modeled the data structure on the plane to see how hard it would be to arrive at a solution for a puzzle using T-SQL. Before spending too much time on this, I searched the Internet and found that somebody else had already modeled and solved this. Samuel Aina undertook this back in 2010 – the link to his approach is http://www.developerfusion.com/article/84374/solving-sudoku-with-sql/.

In keeping with my generic problem model approach, I still thought it would be worthwhile to do this via a less procedural approach than Aina’s. My approach to generic problem solving relies on a schema that defines the start, transition, and final states for any problem-solving scenario. In keeping with that approach, I decided to try to solve the problem based on a view that presents the possible steps to a solver procedure that enumerates through them.

For testing purposes, I went to http://www.websudoku.com/ and pulled in the first puzzle that came up, shown below:

image

The schema and execution script is at the end of this post. Below are the general steps for accomplishing the solution.

1) Define a table for the Sudoku cell with the row, column, and section row/column. I also added an absolute row/column/section number to simplify the queries, allowing the checks to go across the sections (a section being one of the 3 x 3 areas that must contain all numbers from 1 to 9) and providing the values associated with a row or column across the whole puzzle. I also created a table to record the solution steps and a trigger on the cell table to write to it, so that the solution generated by a solver procedure can be replayed and verified.

2) Create a generic numbers table with values 1 to 9. This is used to generate possible values as well as to populate the initial rows in the Sudoku cell table.

3) Populate the Cell table – I wrote a stored procedure that receives the absolute row, column, and value and updates the appropriate row in the table.

4) Create the views that return the values not yet used in either a row, column, or section

5) Create a view that returns all possible moves by joining the candidate values from the row, column, and section views via absolute row and column. This provides all the legal moves.

6) Create a view that returns the required next move – this uses a HAVING clause to look for a distinct row/column that has only one valid move. If there is no cell with only one possible value, multiple instances are required to pursue alternative solution paths. This adds some complexity, but not a huge amount; the issue is explained further down in the post (see the Note).

7) Create a stored procedure that finds the row/column combinations with only one solution path and updates their values. This causes further solutions to materialize for the next step – as each number is filled into the puzzle, it leads to other rows/columns that can be filled in based on the result.

8) Execute the stored procedure to solve the puzzle – this results in the moves being recorded into a solver table for verification purposes using the trigger outlined in step 1.

Note: In my test case I was fortunate that there was always at least one cell meeting the criterion of having only one possible value, as I was using an easy puzzle. This simplified the solution so that I did not need to try multiple paths. For a more complex puzzle, I would have had to generate multiple instances for whichever row/column had the smallest number of candidate paths. This is the design pattern of my generic simulator that I used to solve Tower of Hanoi, where there were multiple solution paths. Aina’s solution does handle this scenario, whereas mine does not yet.

Here are the detailed steps in SQL:

1) Create tables and trigger (for more complex scenario, an instance id would be needed to generate multiple solution paths for the same game configuration)

CREATE TABLE [Sudoku].[Cell](
    [CellId] [int] IDENTITY(1,1) NOT NULL,
    [GameId] [int] NOT NULL,
    [Row] [int] NOT NULL,
    [Col] [int] NOT NULL,
    [Val] [int] NULL,
    [SectionRow] [int] NOT NULL,
    [SectionCol] [int] NOT NULL,
    [AbsoluteRow]  AS ([Row]+([SectionRow]-(1))*(3)),
    [AbsoluteCol]  AS ([Col]+([SectionCol]-(1))*(3)),
    [SectionNumber]  AS (([SectionRow]-(1))*(3)+[SectionCol]),
    [MoveNumber] [int] NULL,
CONSTRAINT [PK__Cell__EA424A08D48B6731] PRIMARY KEY CLUSTERED
(
    [CellId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

ALTER TABLE [Sudoku].[Cell] ADD  CONSTRAINT [DF__Cell__GameId__24927208]  DEFAULT ((0)) FOR [GameId]
GO

ALTER TABLE [Sudoku].[Cell] ADD  CONSTRAINT [DF_Cell_MoveNumber]  DEFAULT ((0)) FOR [MoveNumber]
GO

CREATE TABLE [Sudoku].[SolveLog](
    [StepId] [int] IDENTITY(1,1) NOT NULL,
    [CellId] [int] NOT NULL,
    [MoveDateTime] [datetime] NULL,
    [Val] [int] NOT NULL,
    [MoveNumber] [int] NOT NULL,
CONSTRAINT [PK__SolveLog__2434335717D3913E] PRIMARY KEY CLUSTERED
(
    [StepId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

CREATE TRIGGER [Sudoku].[Cell_Utrig]
   ON  [Sudoku].[Cell]
   AFTER UPDATE
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;
    INSERT [Sudoku].[SolveLog]
    (CellId, Val, MoveNumber)
    SELECT CellId, Val, MoveNumber
    FROM inserted
    -- Insert statements for trigger here
END

GO

2) Create generic number table and populate it

CREATE TABLE [dbo].[Numbers](
    [Number] [int] NULL
) ON [PRIMARY]

GO

insert dbo.Numbers select 1
insert dbo.Numbers select 2
insert dbo.Numbers select 3
insert dbo.Numbers select 4
insert dbo.Numbers select 5
insert dbo.Numbers select 6
insert dbo.Numbers select 7
insert dbo.Numbers select 8
insert dbo.Numbers select 9

3) Insert rows for a game for all rows/columns and create stored procedure to populate puzzle data and execute

insert into Sudoku.cell
(GameId, Row, Col, SectionRow, SectionCol)

select 1, r.Number, c.Number, gc.Number, gr.Number
from dbo.Numbers r
cross join dbo.Numbers c
cross join dbo.Numbers gr
cross join dbo.Numbers gc
where r.Number <= 3
and c.Number <= 3
and gr.Number <= 3
and gc.Number <= 3

CREATE PROCEDURE [Sudoku].[UpdateCell_ByAbsolutePosition]
    @GameId int,
    @AbsRow int,
    @AbsCol int,
    @Val int,
    @MoveNumber int = 0
AS BEGIN
    UPDATE  c
        SET Val = @Val,
            MoveNumber = @MoveNumber
    FROM [Sudoku].[Cell] c
    WHERE c.GameId = @GameId
    AND c.AbsoluteRow = @AbsRow
    AND c.AbsoluteCol = @AbsCol
END

GO

EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 1, @AbsCol = 6, @Val = 8
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 1, @AbsCol = 7, @Val = 4
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 1, @AbsCol = 8, @Val = 7
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 2, @AbsCol = 1, @Val = 9
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 2, @AbsCol = 3, @Val = 8
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 2, @AbsCol = 4, @Val = 4
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 2, @AbsCol = 5, @Val = 5
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 3, @AbsCol = 1, @Val = 6
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 3, @AbsCol = 5, @Val = 9
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 3, @AbsCol = 6, @Val = 1
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 3, @AbsCol = 7, @Val = 5
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 3, @AbsCol = 8, @Val = 3
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 4, @AbsCol = 1, @Val = 2
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 4, @AbsCol = 2, @Val = 8
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 4, @AbsCol = 7, @Val = 3
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 4, @AbsCol = 9, @Val = 4
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 5, @AbsCol = 3, @Val = 9
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 5, @AbsCol = 7, @Val = 7
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 6, @AbsCol = 1, @Val = 7
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 6, @AbsCol = 3, @Val = 5
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 6, @AbsCol = 8, @Val = 2
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 6, @AbsCol = 9, @Val = 6
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 7, @AbsCol = 2, @Val = 1
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 7, @AbsCol = 3, @Val = 4
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 7, @AbsCol = 4, @Val = 5
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 7, @AbsCol = 5, @Val = 6
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 7, @AbsCol = 9, @Val = 7
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 8, @AbsCol = 5, @Val = 4
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 8, @AbsCol = 6, @Val = 7
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 8, @AbsCol = 7, @Val = 2
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 8, @AbsCol = 9, @Val = 5
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 9, @AbsCol = 2, @Val = 2
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 9, @AbsCol = 3, @Val = 7
EXEC [Sudoku].[UpdateCell_ByAbsolutePosition] @GameId = 1, @AbsRow = 9, @AbsCol = 4, @Val = 1

4) Create the views that return the candidate values for row, column, and section

CREATE VIEW Sudoku.v_Candidate_RowVal
AS SELECT r.Number AS Row, v.Number as Val -- select *
from dbo.Numbers r
INNER JOIN dbo.Numbers v
ON v.Number NOT IN (SELECT c2.Val from Sudoku.Cell c2 WHERE c2.AbsoluteRow = r.Number AND c2.Val IS NOT NULL)
GO

CREATE VIEW Sudoku.v_Candidate_ColVal
AS SELECT r.Number AS Col, v.Number as Val -- select *
from dbo.Numbers r
INNER JOIN dbo.Numbers v
ON v.Number NOT IN (SELECT c2.Val from Sudoku.Cell c2 WHERE c2.AbsoluteCol = r.Number AND c2.Val IS NOT NULL)
GO

CREATE VIEW Sudoku.v_Candidate_SectionVal
AS SELECT r.Number AS Section, v.Number as Val -- select *
from dbo.Numbers r
INNER JOIN dbo.Numbers v
ON v.Number NOT IN (SELECT c2.Val from Sudoku.Cell c2 WHERE c2.SectionNumber = r.Number AND c2.Val IS NOT NULL)
GO

5) Possible move view

CREATE VIEW Sudoku.v_PossibleMove AS
SELECT c.AbsoluteRow, c.AbsoluteCol, r.Val
FROM Sudoku.Cell c
INNER JOIN Sudoku.v_Candidate_RowVal r on r.Row = c.AbsoluteRow
INNER JOIN Sudoku.v_Candidate_ColVal cv on cv.Col = c.AbsoluteCol and cv.Val = r.Val
INNER JOIN Sudoku.v_Candidate_SectionVal s on s.Section = c.SectionNumber and s.Val = cv.Val
WHERE c.Val IS NULL

6) Next definite-move view (for more complex games, this would need to be modified to select the cell with the smallest candidate count rather than count = 1; a sketch of that variation follows the view below)

CREATE VIEW Sudoku.v_NextMove AS
SELECT c.AbsoluteRow, c.AbsoluteCol, max(c.Val) AS Val FROM Sudoku.v_PossibleMove c
GROUP BY c.AbsoluteRow, c.AbsoluteCol having count(*) = 1
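
As a sketch of the modification mentioned in the note for step 6 (not needed for this easy puzzle), a view that surfaces the cell with the fewest candidates might look like the following; the view name is hypothetical, and a solver would then have to branch across that cell's candidate values:

CREATE VIEW Sudoku.v_FewestCandidateCell AS
SELECT TOP (1) c.AbsoluteRow, c.AbsoluteCol, COUNT(*) AS CandidateCount
FROM Sudoku.v_PossibleMove c
GROUP BY c.AbsoluteRow, c.AbsoluteCol
ORDER BY COUNT(*)
GO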

7) Stored procedure that solves the puzzle

CREATE PROCEDURE [Sudoku].[Solve] AS
BEGIN
    DECLARE @MoveNumber INT = 1
    WHILE EXISTS (SELECT 0 FROM Sudoku.Cell c WHERE c.val is null)
    BEGIN
        UPDATE c
            SET c.Val = m.Val,
                MoveNumber = @MoveNumber
        FROM Sudoku.Cell c
        INNER JOIN Sudoku.v_NextMove m
            ON m.AbsoluteRow = c.AbsoluteRow
            AND m.AbsoluteCol = c.AbsoluteCol
        WHERE c.Val IS NULL
        SET @MoveNumber = @MoveNumber + 1
    END
END
GO

8) Execute stored procedure and verify results – Note that multiple cells are solved simultaneously on some of the intermediate steps. Even though 47 cells need to be solved, the total number of steps required is actually only 13

EXEC Sudoku.Solve

(1 row(s) affected)

(2 row(s) affected)

(2 row(s) affected)

(2 row(s) affected)

(5 row(s) affected)

(3 row(s) affected)

(4 row(s) affected)

(8 row(s) affected)

(6 row(s) affected)

(6 row(s) affected)

(4 row(s) affected)

(3 row(s) affected)

(1 row(s) affected)

SELECT c.[Val]
      ,c.AbsoluteRow
      ,c.AbsoluteCol
      ,c.[MoveNumber]
  FROM [Games].[Sudoku].[SolveLog] s
  inner join Sudoku.cell c
    on c.CellId = s.CellId
  where s.MoveNumber > 0
  order by movenumber

Val  AbsoluteRow  AbsoluteCol  MoveNumber
---  -----------  -----------  ----------
 2        3            3           1
 8        3            9           2
 7        3            4           2
 1        5            9           3
 4        3            2           3
 2        2            9           4
 3        6            2           4
 9        1            9           5
 6        5            2           5
 4        5            1           5
 7        2            2           5
 5        1            2           5
 3        9            9           6
 9        8            2           6
 1        4            3           6
 9        9            6           7
 8        9            5           7
 7        4            5           7
 3        1            3           7
 6        9            7           8
 3        8            4           8
 4        6            6           8
 1        6            5           8
 2        1            5           8
 5        9            1           8
 6        8            3           8
 1        1            1           8
 4        9            8           9
 1        2            7           9
 2        7            6           9
 3        5            5           9
 6        1            4           9
 8        8            1           9
 1        8            8          10
 6        2            8          10
 5        5            6          10
 9        4            4          10
 3        2            6          10
 3        7            1          10
 8        5            8          11
 5        4            8          11
 8        6            4          11
 6        4            6          11
 9        7            8          12
 9        6            7          12
 2        5            4          12
 8        7            7          13

This query outputs the solved puzzle results:

select [1],[2],[3],[4],[5],[6],[7],[8],[9]
from (select AbsoluteRow, AbsoluteCol, val from Sudoku.Cell) as s
pivot(Avg(Val) for AbsoluteCol in ([1],[2],[3],[4],[5],[6],[7],[8],[9]))
as pvt
order by AbsoluteRow

And here are the results:

[1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9]
 1    5    3    6    2    8    4    7    9
 9    7    8    4    5    3    1    6    2
 6    4    2    7    9    1    5    3    8
 2    8    1    9    7    6    3    5    4
 4    6    9    2    3    5    7    8    1
 7    3    5    8    1    4    9    2    6
 3    1    4    5    6    2    8    9    7
 8    9    6    3    4    7    2    1    5
 5    2    7    1    8    9    6    4    3

Voila!

Posted in Problem Solving, SQL Server, SQL Tips and Techniques

Creating a 3-way 10 Gbe Cluster without a switch– Part 1

After finishing the PhD, I’m back in the lab to test out some new high-speed computing experiments. Recently, I retrieved one of my servers from co-location and put it back into my home-office lab, giving me 3 servers with Fusion-io cards all in the same spot. I’m trying to move around some virtual machines and update the trading optimizer (CapGen) investing databases, so I thought it would be useful to get all of the servers talking at 10 Gb or better. The Fusion-io cards easily overwhelm a 1 Gb connection, since even one of the Duos provides 1.5 GB/s, which is about 12 Gb/s.

A few months back I managed to get 2 servers working using Mellanox ConnectX cards over QSFP connections at 20 Gb/s, but that stopped working after a driver update (the original ConnectX is not supported on Windows Server 2012 R2), so I went to a better supported, although slower, connection using 10 Gb/s. To do this, I got Mellanox ConnectX-3 EN cards for 2 of the servers and bought the add-on dual 10 GbE adapter for one of the HP DL370 G6 servers. One advantage of using the HP add-on adapter is that it doesn’t require an additional slot, although you do trade off 2 of the 1 Gb connectors.

This approach allows the maximum number of Fusion-io (HP IO Accelerator) cards in the server (currently 9, with 8 of them being Duos), as shown below.

fio-cards

In this arrangement, each server has a dedicated high-speed connection to the other two servers via the dual interface, as shown in the table below, without the need for a switch. Basically, it is just 3 cables connecting each server to the other two servers via the 6 total ports (2 on each server).

image

Source server / port / IP address    Destination server / port / IP address
Server 1, port 1: 10.0.0.11          Server 2, port 1: 10.0.0.21
Server 1, port 2: 10.0.0.12          Server 3, port 1: 10.0.0.31
Server 2, port 1: 10.0.0.21          Server 1, port 1: 10.0.0.11
Server 2, port 2: 10.0.0.22          Server 3, port 2: 10.0.0.32
Server 3, port 1: 10.0.0.31          Server 1, port 2: 10.0.0.12
Server 3, port 2: 10.0.0.32          Server 2, port 2: 10.0.0.22

Below are pictures of the rear of a couple of the servers.

server2

server1b

One of the pain points with setting up a separate private network is that the adapters by default end up in the public firewall profile. Some articles have been written about how to fix this with a script, but for my testing I am taking the lazy way out and just turning off the Windows firewall on these servers. After having done that, I can map drives directly over the high-speed link, and I verified the ability to achieve the 10 Gb/s throughput by copying files using Fusion-io drives as the source and target. A 60 GB file now copies from one server to the other in roughly 40 – 80 seconds, which should make a live migration of a VM possible in around a minute.

Now that the infrastructure is in place, I will start experimenting with clustering in the next post and will look into some alternatives beyond the Ethernet/iSCSI approach, including RDMA. I will also do some experimentation with log shipping and AlwaysOn capability in SQL Server, and try out the live migration features in Hyper-V to test the practicality of this on the 10 Gb backbone. Lastly, I will test out my idea for a distributed SQL Server database that sits on top of iSCSI on the high-speed network, wherein the database instance is effectively scaled out beyond one server by having the other servers host the storage through file server roles.

I hope to finish some testing in the next couple of weeks in my spare time.

Posted in Uncategorized

Generating a Type-2 Dimension from Multiple Tables with Effective Dates

In slowly changing dimensions (SCD), a change to a type-2 attribute involves end-dating the current dimension row with the current date and creating a new row effective from the current date. While this works great to capture history once a warehouse/cube is implemented, it does not address the situation of historical data where such tracking was already occurring. For most scenarios, the history can simply be loaded with the SCD start/end dates pre-populated, without the cube processing having to apply the type-2 rules. However, in some businesses such as health care, history may sometimes need to be “re-written”. For example, a claim may not get into the system until several days after it occurs, and the claim would need to be effective for a point in time in history that is already loaded.

This challenge is highlighted in the examples below. A real-world example might be tracking which certifications were held by an individual at a given point in time. For example, a software professional may have been an Oracle OCP from 6/1/2010 to 8/31/2010, and a PMP from 7/1/2010 to 12/31/2010. To track the history of all certifications held by date range, 3 rows are needed:

- 6/1/2010 – 6/30/2010 : OCP only

- 7/1/2010 – 8/31/2010  : OCP, PMP

- 9/1/2010 – 12/31/2010 : PMP only

The spreadsheet illustrates the scenario where the sets of data for a dimension row come from 2 different tables with values effective for different historical periods – the dates associated with the 2 sets of values could be overlapping, equivalent, or covering (i.e. the ‘A’ start date before the ‘B’ start date with the ‘A’ end date also greater than the ‘B’ end date).

This is of interest for the scenario where a customer already has SCD type-2 tracking in place for historical data across more than one table, each controlled by different effective date ranges, and wants to consolidate to a single dimension. The goal is a single dimension built from multiple source tables, with type-2 tracking using a single start/end date pair that captures the history of all values from the source tables, including support for regeneration if the individual historical values or date ranges change.

Below is the SQL that demonstrates the solution approach I found for taking multiple histories, each with its own start/end dates, and producing a consolidated history with type-2 SCD behavior that emulates what would have happened if SCD tracking had been in place incrementally over the date range of the historical data. Here are the steps:

1) Unpivot start/end dates through unions of the start/end dates from all historical tables into 2 columns

2) Create distinct ranges from the unpivoted dates by joining the distinct start dates to the next available end date

3) Join back to the source tables for all values effective for the segmented date ranges.

The SQL accomplishes the 3 items through views (unpivoted, range, history) to generate the required rows for the type-2 SCD dimension.

Although only 2 tables are involved, this should work the same way for additional tables; it just means adding additional unions in the Unpivoted view and additional left joins to the History view (a sketch of this extension follows the results below).

CREATE TABLE [dbo].[t1](

[k] [int] NOT NULL,

[v] [int] NULL,

[s] [date] NULL,

[e] [date] NULL)

GO

CREATE TABLE [dbo].[t2](

[k] [int] NOT NULL,

[v] [int] NULL,

[s] [date] NULL,

[e] [date] NULL)
GO

CREATE VIEW [dbo].[Unpivoted] AS

SELECT s, e FROM T1 UNION SELECT s, e FROM T2

GO

CREATE VIEW [dbo].[Ranges] AS

SELECT s, (SELECT MIN(e) FROM Unpivoted WHERE e > VS.s) AS e FROM (SELECT DISTINCT s FROM Unpivoted) AS VS

GO

CREATE VIEW [dbo].[History] AS

SELECT R.s, R.e, t1.k as T1_Key, t2.k as T2_Key, T1.v as T1_Value, T2.V as T2_Value FROM Ranges R

LEFT JOIN T1

ON T1.s BETWEEN R.s AND R.e

OR T1.e BETWEEN R.S and R.e

OR T1.s < R.s AND T1.e > R.e

LEFT JOIN T2

ON T2.s BETWEEN R.s AND R.e

OR T2.e BETWEEN R.s AND R.e

OR T2.s < R.s AND T2.e > R.e
GO

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (1, 0, CAST(0x01380B00 AS Date), CAST(0x78380B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (2, 1, CAST(0x79380B00 AS Date), CAST(0xF3380B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (3, 0, CAST(0xF4380B00 AS Date), CAST(0x11390B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (4, 1, CAST(0x12390B00 AS Date), CAST(0x20390B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (5, 0, CAST(0x21390B00 AS Date), CAST(0x30390B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (6, 1, CAST(0x31390B00 AS Date), CAST(0x4E390B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (1, 0, CAST(0x01380B00 AS Date), CAST(0x4A380B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (2, 1, CAST(0x4B380B00 AS Date), CAST(0x87380B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (3, 0, CAST(0x88380B00 AS Date), CAST(0xB5380B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (4, 1, CAST(0xB6380B00 AS Date), CAST(0x11390B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (5, 0, CAST(0x12390B00 AS Date), CAST(0x20390B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (6, 1, CAST(0x21390B00 AS Date), CAST(0x4E390B00 AS Date))

Below are the results of running the SQL

/*------------------------
SELECT * FROM T1        -- Test table 1
SELECT * FROM T2        -- Test table 2
SELECT * FROM UnPivoted -- All distinct start/end dates from all rows extracted into 2 columns
SELECT * FROM Ranges    -- All distinct start dates with the next end date
SELECT * FROM History   -- Join back to the source tables to get the historical keys and values
------------------------*/

k           v           s          e

———– ———– ———- ———-

1           0           2014-01-01 2014-04-30

2           1           2014-05-01 2014-08-31

3           0           2014-09-01 2014-09-30

4           1           2014-10-01 2014-10-15

5           0           2014-10-16 2014-10-31

6           1           2014-11-01 2014-11-30

(6 row(s) affected)

k           v           s          e

———– ———– ———- ———-

1           0           2014-01-01 2014-03-15

2           1           2014-03-16 2014-05-15

3           0           2014-05-16 2014-06-30

4           1           2014-07-01 2014-09-30

5           0           2014-10-01 2014-10-15

6           1           2014-10-16 2014-11-30

(6 row(s) affected)

s          e

———- ———-

2014-01-01 2014-03-15

2014-01-01 2014-04-30

2014-03-16 2014-05-15

2014-05-01 2014-08-31

2014-05-16 2014-06-30

2014-07-01 2014-09-30

2014-09-01 2014-09-30

2014-10-01 2014-10-15

2014-10-16 2014-10-31

2014-10-16 2014-11-30

2014-11-01 2014-11-30

(11 row(s) affected)

s          e

———- ———-

2014-01-01 2014-03-15

2014-03-16 2014-04-30

2014-05-01 2014-05-15

2014-05-16 2014-06-30

2014-07-01 2014-08-31

2014-09-01 2014-09-30

2014-10-01 2014-10-15

2014-10-16 2014-10-31

2014-11-01 2014-11-30

(9 row(s) affected)

s          e          T1_Key      T2_Key      T1_Value    T2_Value

———- ———- ———– ———– ———– ———–

2014-01-01 2014-03-15 1           1           0           0

2014-03-16 2014-04-30 1           2           0           1

2014-05-01 2014-05-15 2           2           1           1

2014-05-16 2014-06-30 2           3           1           0

2014-07-01 2014-08-31 2           4           1           1

2014-09-01 2014-09-30 3           4           0           1

2014-10-01 2014-10-15 4           5           1           0

2014-10-16 2014-10-31 5           6           0           1

2014-11-01 2014-11-30 6           6           1           1

(9 row(s) affected)
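
To make the point about additional tables concrete, here is a sketch that extends the approach with a hypothetical third table t3 (same shape as t1/t2) by adding one more UNION to the Unpivoted view and one more LEFT JOIN to the History view:

CREATE TABLE [dbo].[t3](
    [k] [int] NOT NULL,
    [v] [int] NULL,
    [s] [date] NULL,
    [e] [date] NULL)
GO

ALTER VIEW [dbo].[Unpivoted] AS
SELECT s, e FROM T1 UNION SELECT s, e FROM T2 UNION SELECT s, e FROM T3
GO

ALTER VIEW [dbo].[History] AS
SELECT R.s, R.e,
       t1.k AS T1_Key, t2.k AS T2_Key, t3.k AS T3_Key,
       T1.v AS T1_Value, T2.v AS T2_Value, T3.v AS T3_Value
FROM Ranges R
LEFT JOIN T1 ON T1.s BETWEEN R.s AND R.e OR T1.e BETWEEN R.s AND R.e OR T1.s < R.s AND T1.e > R.e
LEFT JOIN T2 ON T2.s BETWEEN R.s AND R.e OR T2.e BETWEEN R.s AND R.e OR T2.s < R.s AND T2.e > R.e
LEFT JOIN T3 ON T3.s BETWEEN R.s AND R.e OR T3.e BETWEEN R.s AND R.e OR T3.s < R.s AND T3.e > R.e
GO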

Posted in Uncategorized