A couple of factoids on data security to think about

Here’s a factoid that most people, even many at the IT management level, don’t realize:

A 128 GB thumb drive, which can be had for under $40.00, can store enough information to accomplish identity theft for the population of the entire world (7 billion people). 128 GB is approximately 137 billion bytes, which works out to roughly 19 bytes per person. Name and address data can normally be compressed by a factor of 3, and birth dates and social security numbers or other national identifiers take only 6 bytes in packed format. So, figuring about 40 bytes for name and address compressed to 13 bytes, plus the 6 packed bytes, the total comes to about 19 bytes per person.

Another factoid: it takes less than 90 seconds to download 10 million records containing a person’s name, address, spouse, birth date, and social security number. This is based on a cable modem connection of 50 Mb/s, which equates to about 6 MB/s or 360 MB per minute. A person’s complete identity record is normally less than 50 bytes, so the entire data set for 10 million people is only about 500 MB (50 × 10,000,000) uncompressed. With compression, it takes less than 30 seconds.
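
As a quick sanity check on both figures, here is a back-of-the-envelope calculation expressed in T-SQL (the constants are simply the numbers quoted above):

SELECT
    -- 128 GB spread across 7 billion people: roughly 19.6 bytes each
    128.0 * 1024 * 1024 * 1024 / 7000000000.0 AS BytesPerPerson,
    -- 10 million 50-byte records over a 50 Mb/s (~6.25 MB/s) link: roughly 76 seconds
    (50.0 * 10000000.0) / (50.0 / 8.0 * 1024.0 * 1024.0) AS SecondsToDownload;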

Unfortunately, my experience has been that the government and most companies do not make protecting data much of a priority. They throw money at sophisticated products that do not actually address the problem. They pay for expensive audits from companies that do not have the technical expertise to spot real vulnerabilities. The regulatory audits seldom find actual problems because they focus on outdated security mechanisms that have little applicability to the most common scenario: data being taken from inside the network to the outside, rather than somebody breaking through a firewall from the outside.

In some cases, the approaches taken actually make it easier to steal data. For example, data thieves love encryption, which only protects against physical data theft and does nothing to protect data once it reaches the user decrypted. Encryption also allows thieves to encrypt the data they steal so that systems have no way of knowing what is leaving the network. Even worse, companies rarely audit access to sensitive data and have no idea there is a breach until all of the data is exposed. Many IT departments fail to implement simple controls such as locking down files stored on the network and ensuring point-to-point security for service accounts. We are not just talking about PII (Personally Identifiable Information) or Protected Health Information (PHI), which is bad enough, but now large chunks of intellectual property (IP) are being stolen. The number of instances of data theft will only continue to multiply as long as organizations do not try to solve the root problems.

Our company is focused on data security from the inside out. We will come onsite for a day for free and call out vulnerabilities. We can provide a rapid assessment over a couple of weeks that generates a score sheet identifying specific vulnerabilities and remedies. We have a complete tool set to automate identification and resolution of security issues at the database, application, and file system levels where data theft originates. We are especially focused on healthcare, with experience in HIPAA regulations. Most healthcare providers are not actually meeting HIPAA requirements; one of those requirements is that a record of access for all individuals who have looked at a person’s healthcare data can be produced on demand.

We are experts at analyzing for vulnerabilities at the database and file system level where data theft originates. By the time the data goes out through the firewall, it is already encrypted and undetectable by firewalls. The only way to stop data theft is to implement safeguards at the data and application level, which requires a unique combination of data security, database, and application development skills. We are also experts at working with huge amounts of data – one of the products we developed for financial risk management has a database of over 3 billion records and supports near-instantaneous queries of complex information requests.

I am one of fewer than 150 certified Microsoft SQL Server masters in the world, and probably one of fewer than 15 or 20 who also hold a top-tier (ISC)² CISSP certification. My recent PhD is in the area of automated learning, whereby problems can be modeled, simulated, and used to learn heuristics for solving them. I have over 30 years of experience in application development, including 10 years working with classified systems. My network of resources includes top people and companies with expertise in machine learning related to data security, as well as all aspects of data security including the network level.

Do you want to do something proactive to stop data theft and implement truly end-to-end security that prevents inside-out theft, or wait until after a breach occurs? Do you have a way to detect that a breach has even happened if the person carries it out using trusted credentials? Most data theft is carried out by an unauthorized person misusing authorized credentials. Do your systems really detect this situation? It can only be done by implementing controls at the database and application levels.

Contact us at sales@authintel.com if you really want to practice due diligence to prevent and stop data theft. Give us an opportunity to help you before it is too late. We host a large, secure colocated environment that can provide a sandbox area where we can stage your entire IT infrastructure as virtual machines. Through the use of over 20 Fusion-io high-speed SSD drives, we can provision virtual machines in seconds. We have an automated data obfuscation tool, including verification, that allows you to create a realistic testing environment without risk of theft of meaningful data. Using our sandbox also helps evaluate your level of data preparedness and disaster recovery capability.

Why wait until after a breach to take action? Does your company really want the liability of not only having its data stolen, but also failing to meet regulatory requirements such as those mandated by HIPAA?


High-performance Statistical Queries using Self-joins

In my pursuit of understanding asset markets, I’ve maintained a SQL Server database with a lot of information about the stock market and other indexes. Using some data services along with SSIS, this database has been kept current to the point that it now has over 3 billion records in total, including 2.3 billion records in an intraday table. Most of my queries and interest concern cumulative equity and index end-of-day history, which is only 75 million rows, and the options data since 2003, which is now up to 175 million rows.

To be able to query this volume of data, I utilize Fusion-io PCIe SSD storage for the SQL Server database. Using self-joins can produce some very interesting analysis. For example, the query below outlines the performance of a few global indexes where there have been large bounces close to market tops and the ensuing performance afterwards. This query completes in just a few seconds. There are a couple of tricks that make this run faster. One is storing a relative day number to avoid the performance issues of working around weekends when querying prior-dated history; the day number is sequential across holidays and weekends, so a direct join can be done without a range test. The other trick is that the table is partitioned on the date, which allows a good deal of parallelism.
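
For illustration, here is a minimal sketch of what such a table layout could look like (illustrative DDL and boundary dates, not the actual schema), showing the sequential DayNumber column and the date-based partitioning:

CREATE PARTITION FUNCTION pfMarketDate (DATE)
    AS RANGE RIGHT FOR VALUES ('2005-01-01', '2010-01-01', '2015-01-01');
CREATE PARTITION SCHEME psMarketDate
    AS PARTITION pfMarketDate ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.EquityHistory
(
    TradingSymbol VARCHAR(12)  NOT NULL,
    MarketDate    DATE         NOT NULL,
    DayNumber     INT          NULL,   -- sequential across weekends and holidays; populated after load
    PriceAtClose  MONEY        NOT NULL,
    ChgPct        DECIMAL(9,4) NOT NULL,
    CONSTRAINT PK_EquityHistory PRIMARY KEY CLUSTERED (TradingSymbol, MarketDate)
) ON psMarketDate (MarketDate);
GO
-- DayNumber can be assigned once from the distinct trading dates
UPDATE h
SET DayNumber = d.rn
FROM dbo.EquityHistory h
INNER JOIN (SELECT MarketDate, ROW_NUMBER() OVER (ORDER BY MarketDate) AS rn
            FROM dbo.EquityHistory GROUP BY MarketDate) d
    ON d.MarketDate = h.MarketDate;

With DayNumber in place, "20 trading days later" is simply DayNumber + 20, so the self-joins in the query below never have to reason about weekends or holidays.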

Here is the query:

select h.TradingSymbol, h.MarketDate, h.ChgPct + hprev.ChgPct as TwoDayChg, hprev.PriceAtClose as Price,
(h20.PriceAtClose - h.PriceAtClose) / h.PriceAtClose as Chg1Month,
(h60.PriceAtClose - h.PriceAtClose) / h.PriceAtClose as Chg3Month,
(h180.PriceAtClose - h.PriceAtClose) / h.PriceAtClose as Chg9Month,
(h360.PriceAtClose - h.PriceAtClose) / h.PriceAtClose as Chg18Month,
(h540.PriceAtClose - h.PriceAtClose) / h.PriceAtClose as Chg27Month,

((select MAX(hpast.PriceAtClose) from dbo.EquityHistory hpast
where hpast.TradingSymbol = h.TradingSymbol
and hpast.MarketDate < hprev.MarketDate) - hprev.PriceAtClose) / hprev.PriceAtClose
as PctFromTop

from dbo.EquityHistory h
inner join dbo.EquityHistory hprev
    on hprev.DayNumber = h.DayNumber – 1
    and hprev.TradingSymbol = h.TradingSymbol
inner join dbo.EquityHistory h20
    on h20.TradingSymbol = hprev.TradingSymbol
    and h20.DayNumber = hprev.DayNumber + 20
inner join dbo.EquityHistory h60
    on h60.TradingSymbol = hprev.TradingSymbol
    and h60.DayNumber = hprev.DayNumber + 60
inner join dbo.EquityHistory h180
    on h180.TradingSymbol = hprev.TradingSymbol
    and h180.DayNumber = hprev.DayNumber + 180
inner join dbo.EquityHistory h360
    on h360.TradingSymbol = hprev.TradingSymbol
    and h360.DayNumber = hprev.DayNumber + 360
inner join dbo.EquityHistory h540
    on h540.TradingSymbol = hprev.TradingSymbol
    and h540.DayNumber = hprev.DayNumber + 540
where h.TradingSymbol in ('^dji','^dax','^ixic','^n225','^ftse','^djt','^gspc','^rut','^ssec')
and h.ChgPct + hprev.ChgPct > 6.0
and  (select MAX(hpast.PriceAtClose) from dbo.EquityHistory hpast
where hpast.TradingSymbol = h.TradingSymbol
and hpast.MarketDate < hprev.MarketDate) between 1.0 * h.PriceAtClose and 1.15 * h.PriceAtClose
order by h.MarketDate

And here are the results formatted in Excel:

This is not a good omen for the stock market for the next couple of years based on history.

There is probably not enough data here to draw firm conclusions, but since the query is against indexes rather than individual stocks, the pattern does seem fairly convincing. A query against the entire equity history would likely yield similar bottom-line averages, but that would take several minutes to complete.

Here is the query again as it appears with SQL syntax highlighting:

[image: the query shown with SQL syntax highlighting]


Increased data security focus for Authintel

We’ve all seen the news about the latest data security breaches. While bureaucrats blame them on sophisticated hacks from China, the reality is that most are due to negligence and are so simple that a child with basic computer knowledge could pull many of them off. http://www.darkreading.com/attacks-breaches/the-eight-most-common-causes-of-data-breaches/d/d-id/1139795?

The problem is that technology has focused on encryption and firewalls while neglecting security at the basic data and application tiers, with very little concept of proactive monitoring of actual user behavior. Most data theft occurs through compromised employees or stolen credentials, where the perpetrator appears to the system as a trusted user and is not monitored. Our company holds credentials that include a PhD in automated learning, the highest-level (ISC)² security certification (CISSP), SQL Server master, and certified .NET application developers. We are uniquely qualified to resolve the use cases that lead to security breaches at the application and data levels. We have produced a video that shows how millions of PII records can be stolen without a trace in less than 5 minutes, using an ordinary user account, at most companies. We are focused on resolving the actual use cases that lead to data theft rather than on elaborate technologies that are difficult to configure and mostly ineffective. Contact us and we can perform an audit as well as provide remediation, including deployment of automated scripts and tools.


Power BI Experiences

I’ve spent quite a bit of time over the last few days trying to get the Power BI preview to work for accessing a tabular model, so I am sharing my experience. Based on that experience, I am making some assertions that are not even in the Microsoft documentation, although most of this can be found on various other blogs by users who have tried to get this working.

Below is a screen snapshot of a demo that my adult son, Blake, built in his spare time. My thanks to Blake for his work on the tabular model and report.

[screenshot: the Power BI demo report]

My conclusion is that the Power BI configuration is extremely complex, difficult to troubleshoot, and requires Active Directory infrastructure configuration. However, once it is working, it is seamless to develop reports from a single Power BI site that access tabular models exposed by other organizations, without the need to store the data locally.

Here are the steps that I had to do – your mileage may vary and I’m happy if you didn’t have to go through all of this:
1)    Power BI can only access a tabular model on a machine joined to a domain.
2)    The service account registered with Power BI must be defined in Active Directory with a full UPN exactly matching the Power BI logon. For example, if the logon account is john.doe@company.onmicrosoft.com, then there must be an account in Active Directory registered with john.doe@company.onmicrosoft.com. This is because Power BI sends all requests for tabular data using the “effectiveusername” technique (see the profiler trace at the end of this post). This requires the following steps:
a.    Use the Active Directory Domains and Trusts tool and select properties at the root level to get the UPN Suffixes dialog shown below, and add the UPN suffix that matches the Power BI domain name (i.e. company.onmicrosoft.com).
[screenshot: the UPN Suffixes dialog]
Note that multiple UPNs may be added, but only one can be registered to a specific A/D account.
b.    Use Active Directory Users and Computers to identify or create the users mapped to the UPN suffix. Use caution when creating Power BI accounts that are the same as company domain accounts unless the Power BI domain is the same as the standard UPN suffix. Best practice is to register for Power BI using the company’s email domain rather than relying on company-name.onmicrosoft.com. Below is an example of using a different UPN than the corporate domain:

 

[screenshot: an Active Directory account using a UPN suffix different from the corporate domain]
 
The only article I’ve found on this is https://www.artisconsulting.com/blogs/greggalloway/Lists/Posts/Post.aspx?ID=27
Power BI uses a semantic model connection technique with effectiveusername – see https://msdn.microsoft.com/en-us/library/gg471575.aspx for an explanation.
c.    This needs to be done for all of the Power BI user accounts that consume the model, not just the one under which the Analysis Services Connector is configured.

3)    Depending on the firewall restrictions, port 8051 (HTTP transport) or 8050 (HTTPS) may need to be opened. The documentation states that this is only needed for OData, but I have found it necessary, at least in my environment, even for tabular-only access. (http://blogs.technet.com/b/powerbisupport/archive/2014/12/19/failed-to-register-the-gateway-key.aspx) For Azure, use the endpoint functionality with either 8051 for HTTP or 8050 for HTTPS so that the Analysis Services Connector will work. If using 8051, all data will be transmitted in plain text. However, the credentials themselves appear to be tokenized between the connector and the Power BI site, so username/password theft does not appear to be a risk. I have not yet done extensive network sniffing to evaluate the risks of not using SSL.

Only after the above steps are done can a connector be defined that actually allows the tabular data to be piped back to Power BI.

4)    Configure the Analysis Services Connector using the instructions from http://support.powerbi.com/knowledgebase/articles/471577-configure-a-power-bi-analysis-services-connector. When configuring the connector, you are asked to log in to Power BI again. Use the Power BI account configured in step 2 that has the required UPN suffix.
5)    At this point, Power BI should be able to connect directly to the tabular model, and Excel reports that use the tabular model can be uploaded and should function.

Note the profiler trace of the SSAS tabular connection via the gateway – the gateway was actually configured using serviceaccount@authintel.com, yet the connection is made on behalf of the specific Power BI user. This means that all users must be granted read access on the model using the full UPN that maps to their Power BI account.

…<PropertyList xmlns="urn:schemas-microsoft-com:xml-analysis" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><Catalog>DemoSales_Tabular</Catalog><LocaleIdentifier>1033</LocaleIdentifier>
<EffectiveUserName>john.doe@yourcompany.com</EffectiveUserName><sspropinitappname>PowerBI</sspropinitappname></PropertyList>


Data Security and the Lack thereof

Americans are increasingly becoming aware of the vulnerability of their private data, with breaches now counted in the millions of records that have disclosed not only credit card information, but even more potentially dangerous information such as social security numbers linked to birth dates, names, and addresses. Your name and address, along with a birth date and social security number, are a virtual ticket to complete identity theft, creating risks of massive financial destruction. There are multiple pieces of data that are private to an individual, but the social security number (SSN) is the most critical and is the focus of this post. The same principles apply to securing the SSN as to any other private data.

How did we get to this point? Is this just the inevitable consequence of Internet access, distributed systems, and evolving technology? Or is much of this due to the irresponsibility of those who hold the most confidential aspects of your data? In this post, I make the argument that it is due to the carelessness of the organizations that hold such data, and that there are ways to safeguard data not only against external threats, but against the more powerful threats from within an organization.

This post first articulates the drivers behind the situation, which fall into two main areas: wrong assumptions and lack of diligence. Next, we identify the architectural principles of the software and infrastructure that mitigate the risks of a security breach and contrast them with what most organizations believe is adequate. Finally, we provide practical strategies to remediate the situation.

Wrong Assumptions

The most common wrong assumption is that security threats come primarily from outside an organization rather than from within. Studies have shown this is not the case. Even in the case of an external attack, there is often an insider involved, such as a disgruntled employee who gives away or even sells a username and password that provides access to confidential data. Organizations spend large amounts of money on firewalls and external network monitoring, but how much do they invest to protect the confidentiality of the data within their own network?

Many organizations run on the assumption that their employees’ usernames and passwords could never be compromised and do not consider the impact of an external user using a username and password to gain access to the network. Many organizations allow simple username/password access to a VPN without any other physical validation such as a smart card or token device. But even those that have restrictions to minimize the risk of an external user entering the network with corporate credentials are often careless about private data, including SSNs and birth dates, with overly loose restrictions on access and often no auditing at all. A disgruntled user who turns dark has the potential to create just as much damage as any hacker from North Korea.

I find it interesting that virtually every break-in is blamed on sophisticated attacks from foreign entities or terrorists. However, the reality is that most of the break-ins that occur require very little sophistication and are probably mostly home-grown even if foreign entities are involved.

I have had the opportunity to work on database, architecture, and application development projects over the years. Many of those customers stored SSNs unencrypted and lacked effective safeguards to audit access to that data. If I were a criminal, I could have retired a long time ago. Ten million SSNs, along with birth dates and complete private identifying information, can be dumped from a database into a .CSV file in a matter of seconds. Such a file can be zipped up and encrypted, renamed to .txt, and is small enough to be emailed as an attachment using a web mail client with virtually no detection. That is how easy it is to completely confiscate social security numbers within many organizations. At $25.00 per complete private identity record (street value), 10 million records works out to a $250 million payoff.

Lack of Regulation

It amazes me that our government, which is supposedly so concerned about protecting the welfare of its citizens, has created virtually no regulatory standards for the storage and retrieval of private data. Yes, there are requirements for secure external interfaces such as SSL, but that is only one piece of a secure infrastructure. I speak from experience in working with state and federal entities and have found that the vast majority of the databases that hold SSNs do not even store them encrypted. Worse, the SSN is stored right alongside all of the information needed to identify the person linked to the number, including complete name, address, and many times even birth date. The complacency regarding social security numbers is startling given that an SSN cannot be changed; once it is compromised, a person is at risk of identity theft for the rest of their life. At least when a password is compromised, it can be changed. There is no excuse for any organization to store social security numbers unencrypted in a database. This is a worse violation of security than storing a password in plain text.

Lack of Diligence

The fact that the government has not imposed regulations that require companies to store private data encrypted is no excuse for organizations. The technology exists not only to encrypt private data, but to ensure that it remains encrypted through the lifecycle of its access and use. Even in organizations that have made attempts to secure this data, huge gaps are usually left. Consider the following ineffective strategies that are often employed:

1) Data-at-rest security: This is the concept that data as it exists on any file system should be encrypted so that even if a database is stolen, it cannot be moved to another system where the data can be used. Microsoft SQL Server Transparent Data Encryption (TDE) is an example of this approach. While this is useful for protecting against physical data theft, it is useless for protection from internal threats unless a framework of auditing and security has been placed around access to the system containing the data. Data-at-rest systems still reveal private data to anyone who is authorized through the interfaces that access the data.
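
As a concrete illustration of the limitation, here is a minimal TDE sketch (the database, certificate, and table names are illustrative). The files and backups are encrypted, yet any authorized connection still reads the data in plain text:

USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Str0ng!MasterKeyPassw0rd';
CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate';
GO
USE CustomerDb;
CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_256
    ENCRYPTION BY SERVER CERTIFICATE TdeCert;
ALTER DATABASE CustomerDb SET ENCRYPTION ON;
GO
-- The physical files and backups are now encrypted, but this query still returns
-- unencrypted SSNs to any login that has SELECT permission:
SELECT LastName, Ssn FROM dbo.Customer;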

2) Reversible encryption: The problem with this is in the name – it is reversible. SQL Server again provides the capability to encrypt particular data columns such as the SSN. I have seen this utilized, but the fact that the encryption is reversible implies that it will be decrypted at some point downstream. In systems that use this, the private key information is often included in the views or stored procedures, so the data is only secure against direct table access. Even where the decryption is not performed until further down in the application layer, the data is still decrypted and available to anybody who has the key. Worse, the security of the data is totally dependent on some supposedly private key that has to be defined somewhere in code. While reversible encryption provides some protection, it is only as good as the secrecy of the encryption key, and anybody with access to that information has the keys to go in and exploit the data.
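
For illustration, here is a minimal sketch of reversible column encryption in SQL Server (the key, password, table, and column names are illustrative). The weakness is visible in the last query: anyone who can open the key can read every SSN back out:

CREATE SYMMETRIC KEY SsnKey
    WITH ALGORITHM = AES_256
    ENCRYPTION BY PASSWORD = 'Str0ng!KeyPassw0rd';
GO
OPEN SYMMETRIC KEY SsnKey DECRYPTION BY PASSWORD = 'Str0ng!KeyPassw0rd';

-- Encrypt the column
UPDATE dbo.Customer
SET SsnEncrypted = ENCRYPTBYKEY(KEY_GUID('SsnKey'), Ssn);

-- Anyone with access to the key (often embedded in a view or stored procedure)
-- simply reverses it:
SELECT CustomerId, CONVERT(VARCHAR(11), DECRYPTBYKEY(SsnEncrypted)) AS Ssn
FROM dbo.Customer;

CLOSE SYMMETRIC KEY SsnKey;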

3) Partial obfuscation: Even the last 4 digits, commonly used for validation along with other PII, should not be stored in plain text or with reversible encryption. A partial number such as the last 4 digits is still a ticket to identity theft when combined with other data. Any portion of the SSN that is involved in identification must be non-reversibly encrypted to minimize identity theft risk.

What is really needed

Earlier, I suggested that the only real solution to protecting private data such as social security numbers is to treat them with the same respect as a password. This implies using non-reversible encryption, with which there is no way to decrypt the value. The only way to match a value stored with non-reversible encryption is to actually know the value: when one knows the value, it can be encrypted again and compared against the stored encrypted value. But is this practical? There are many arguments for why social security numbers cannot be encrypted this way. Below is a list of them and why none of them are valid excuses:

1) The SSN is used to link data together. This is easily resolved through the use of surrogate keys.

2) Users need to search on a SSN or partial SSN to verify identity – i.e. a credit agency will need the user’s social security number as one piece of data to verify the person is authorized for the information requested. This is not a problem for a non-reversible encryption scheme. The full SSN, as well as the last 4 digits, is stored in the database using a non-reversible scheme. The user enters the last 4 digits in a web form over a secure connection, and the application then applies the same encryption scheme and passes the result, along with other identifying information, to find a match. Note that with non-reversible encryption, whether it is an MD5 hash or some other technique, the scheme does not even need to be kept private: since there is no way to reverse the encrypted values, knowing the scheme does not circumvent the security.

The bottom line is that even organizations that need to utilize SSNs for validation purposes have no excuse to store the data in plain text or even with a reversible encryption scheme. At no point should such data be unencrypted; instead it should be permanently one-way encrypted (non-reversible). Validation using the SSN should always be done with partial strings – the last 4 digits rather than the whole number – and the last four digits can be encrypted separately so the full SSN never needs to be used. In this approach, the last part of the SSN presented by the user is immediately encrypted and used, along with other personal identification, to look up a matching record in the database, rather than comparing plain text values.
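
Here is a minimal sketch of what this could look like in SQL Server using a salted one-way hash (the table, column, and parameter names are illustrative, and the salt-per-row design is one of several reasonable choices). The plain-text SSN column is dropped after conversion, and verification works by re-hashing whatever the caller supplies:

-- Add one-way columns: a per-row salt plus hashes of the full SSN and the last 4 digits
ALTER TABLE dbo.Customer ADD
    SsnSalt      UNIQUEIDENTIFIER NULL,
    SsnHash      VARBINARY(32)    NULL,
    SsnLast4Hash VARBINARY(32)    NULL;
GO
-- One-time conversion, then remove the plain-text column entirely
UPDATE dbo.Customer SET SsnSalt = NEWID();

UPDATE dbo.Customer
SET SsnHash      = HASHBYTES('SHA2_256', CONVERT(CHAR(36), SsnSalt) + Ssn),
    SsnLast4Hash = HASHBYTES('SHA2_256', CONVERT(CHAR(36), SsnSalt) + RIGHT(Ssn, 4));

ALTER TABLE dbo.Customer DROP COLUMN Ssn;
GO
-- Verification: re-hash what the caller supplies and compare; nothing is ever decrypted
DECLARE @LastName  NVARCHAR(100) = N'Doe',
        @BirthDate DATE          = '1970-01-01',
        @SsnLast4  CHAR(4)       = '1234';

SELECT c.CustomerId
FROM dbo.Customer c
WHERE c.LastName  = @LastName
  AND c.BirthDate = @BirthDate
  AND c.SsnLast4Hash = HASHBYTES('SHA2_256', CONVERT(CHAR(36), c.SsnSalt) + @SsnLast4);

A salt per row keeps identical SSNs from producing identical hashes; given the small keyspace of SSNs, the hashing should ideally also be iterated or keyed, but that detail is beyond this sketch.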

Strategies

What is an organization to do if it finds itself in the position of having SSNs in plain text or under a reversible encryption scheme, especially when they are linked to other private data and applications depend on these structures? The first step is to start remediating the situation immediately. This needs to be a top priority for any business, even given the government’s failure to make it one. Consider the risks, costs, and litigation, not to mention the tarnished reputation, when a breach occurs. And it will occur – or worse, it may already have occurred and the business is not even aware of it. Few organizations audit activity at the level of detail needed to even know whether private data has been breached; they usually find out only after the data has already been used fraudulently.

It may take months or years to remediate all the systems within an organization in a way that still allows the organization to do business. In keeping with the principle of due diligence, what can an organization do until it reaches the ideal state in securing data? The key is to go after the low-hanging fruit and put safeguards in place even while mitigating the core issue. One way to do this is through a gatekeeper mentality. If you have to send an important package, you request tracking from the mail service. In the same way, any confidential data must be tracked from end to end. That is the first step toward protection. It is not adequate, but it is a start.

The Gatekeeper Approach

Below are 5 steps to implementing a gatekeeper approach:

1) Ensure that all access to private data is captured. This includes the source from which the data was retrieved, the account used to access the data, and everything that happens to the data afterwards. All aspects of data movement must be tracked, including not only the database access but any user manipulation of the data after it has been retrieved, which means that both database auditing and network auditing must be implemented.

2) Define a repository for storing the complete lifecycle of data access corresponding to the actions captured in step 1. For every piece of sensitive data, every access to the information must be recorded. There are ways to implement this framework in SQL Server through the use of custom profiler traces as well as extended events and logon triggers (a minimal sketch follows this list). I cannot speak to how to do this at the network level, but I know that it is possible.

3) Define a dictionary of rules that identify the types of access allowed to the data. As data movements are captured, exceptions will be encountered. Safeguards can be put in place to block movements that are not defined in the dictionary. A major part of this development is training, such that the system learns from previously sanctioned activities and can recognize unauthorized access.

4) Stop movements that are not allowed by the dictionary. This may result in some pain as individuals with legitimate requests are blocked.

5) Put a learning framework in place with continuous feedback so that the systems can learn from mistakes and improve.
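
As a starting point for step 2, here is a minimal sketch of database-level access capture using SQL Server Audit (the audit name, file path, and table are illustrative; extended events or profiler traces can feed the same repository):

USE master;
CREATE SERVER AUDIT SensitiveDataAudit
    TO FILE (FILEPATH = 'D:\Audit\');
ALTER SERVER AUDIT SensitiveDataAudit WITH (STATE = ON);
GO
USE CustomerDb;
CREATE DATABASE AUDIT SPECIFICATION CustomerSsnAccess
    FOR SERVER AUDIT SensitiveDataAudit
    ADD (SELECT, INSERT, UPDATE, DELETE ON OBJECT::dbo.Customer BY public)
    WITH (STATE = ON);
GO
-- Review the captured access trail (this feeds the repository and the rules dictionary)
SELECT event_time, server_principal_name, database_name, object_name, statement
FROM sys.fn_get_audit_file('D:\Audit\*.sqlaudit', DEFAULT, DEFAULT);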

Architecture Remediation

Databases: Any organization that is storing sensitive data unencrypted, or even with reversible encryption, has been following a flawed data architecture. All databases and applications should be reviewed for compliance with the principles of non-reversible encryption as well as complete tracking of all such data. Systems that are in violation need to be prioritized, weighing the business impact of the downtime required to change them over to a secure approach.

Interfaces: In addition to databases, interfaces also need to be analyzed and modified to preserve the security of SSNs and other private data in flight. The service accounts associated with all processes must be bound to their interfaces using techniques such as IPsec, so that compromise of a service account password does not create a risk. It is not enough to simply have secure interfaces; the data within the secure interface must be obfuscated so that, if the interface is compromised, there is little to gain from the data being transmitted.

User interfaces: User applications that simply retrieve data, including unencrypted SSNs, must be modified to work with non-reversibly encrypted SSNs, and components need to be added that validate an SSN by re-encrypting the portion needed to check for a match rather than by matching plain text in the database.

In Summary

This post just scratches the surface of the data security issue; I hope to write more in the upcoming months. It takes a focused team of database experts, network experts, application developers, architects, and user-interface experts working together to remediate this issue. The time is now to insist that organizations, including government agencies, become accountable for protecting their customer data. Partial obfuscation with just the last 4 digits unencrypted is almost as bad as having the whole string unencrypted, since it provides the means to impersonate someone and acquire the full SSN.

As a citizen, I encourage you to lobby your congressman for regulations that require protection of private data. Developers, I encourage you to never even consider transmitting any piece of private PII without using non-reversible encryption schemes. CEOs, I encourage you to place private data security at the top of your risk list and treat it with the urgency of the Y2K crisis. DBAs, I encourage you to educate your organization and insist on not storing private data without non-reversible encryption. Architects, do not even consider a design that does not include complete auditing of all sensitive data flows, or that allows transmission of the SSN in plain text, even within the context of an SSL interface.


Home-grown PCIE SSD VM vs. Azure (Update 12/7)

More testing today with a larger database – I multiplied the number of rows by 10x, up to 1 billion rows in the test dataset. The latest results indicate I/O about 3 times slower on an Azure D13 server than on a dedicated VM with similar CPU/RAM specs backed by a Gen-1 RAID Fusion-io Duo drive. The time to insert the test record set was also about 2.5 times slower. This aligns with the performance metrics outlined at http://azure.microsoft.com/blog/2014/10/06/d-series-performance-expectations/. The D series scales upward in terms of SSD capability, so the only way to approach the I/O of the VM would be to upgrade to a D14 configuration with an 800 GB SSD. Along with that comes more cores and memory (16 cores and 112 GB RAM), but that should not significantly affect the I/O comparison.

Here are the bottom line backup logs:

From the dedicated VM (48 GB RAM, 6 cores @ 2.66 GHz, Dell-branded Fusion-io 640 GB MLC card):

Processed 1931504 pages for database 'TestDb', file 'TestDb' on file 1.
100 percent processed.
Processed 9 pages for database 'TestDb', file 'TestDb_log' on file 1.
BACKUP DATABASE successfully processed 1931513 pages in 26.102 seconds (578.114 MB/sec).

From the Azure D13 server (56 GB RAM, 8 cores @ 2.20 GHz, 400 GB SSD):

Processed 1931648 pages for database 'TestDb', file 'TestDb' on file 1.
100 percent processed.
Processed 9 pages for database 'TestDb', file 'TestDb_log' on file 1.
BACKUP DATABASE successfully processed 1931657 pages in 81.958 seconds (184.131 MB/sec).

I’m going to come up with some more tests to perform such as how long it takes to provision a VM from scratch, how long to provision from an image, boot up times, and other metrics that are significant for somebody who wants to be able to quickly provision VMs. I’ll compare these with the times from the colocation environment. This will be posted in a new article.

Here is the configuration I am running on Azure. Running the server hasn’t been too expensive yet – about $5.00 in the last 24 hours – though I only put a load on it for a few minutes.

[screenshot: Azure VM configuration]

These are the main findings to this point:

1) For anything beyond a basic/small SQL database in the Azure environment, customers will probably want to utilize the 'D' type configuration with the SSD option, as this improves performance by a factor of at least 5.

2) Even with the SSD option, the I/O is still 2 – 3 times slower than simply storing the VM, or at least the drive holding the SQL data, on a Gen-1 Fusion-io card.

As these offerings mature to allow SSD storage with fault tolerance, larger sizes, and higher performance, the standard D offerings will become more appealing for larger customers. It would be helpful if more robust I/O capability could be had without the expense of additional cores, as databases tend to become bottlenecked by I/O well before they hit CPU bottlenecks.

Azure is a great environment for creating virtual machines quickly and testing things out, but the 'A' series is hard to justify as a database solution, even for a virtual machine provisioned with 8 cores and 56 GB RAM at a cost of over $1,000 per month. I also have a virtual machine configured on one of my servers at a colocation facility running SQL Server on PCIe SSD Fusion-io storage. That VM is configured with a similar number of cores and 64 GB of RAM. I created two identical databases, each with two tables of 10 million rows. After building out the data, I tested a SQL backup using the data compression option on both.
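
For reference, the comparison boils down to a plain compressed backup of the test database on each machine, along the lines of the following (the target path is illustrative):

BACKUP DATABASE TestDb
TO DISK = N'D:\Backups\TestDb.bak'
WITH COMPRESSION, INIT, STATS = 10;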

Here are the results for the 'A' type Azure VM side by side with the results from my VM on PCIe SSD. My VM beat the Azure Enterprise SQL VM with comparable cores and memory by 16x (685 MB/s vs. 39 MB/s, and 4 seconds versus 64 seconds). SQL is all about the I/O.

[screenshots: backup output from the Azure 'A' series VM and from the PCIe SSD VM]


Using INSTEAD OF Triggers to enable updatable views

A common misconception with SQL Server views is that only indexed views can provide update capability. Actually, a view can be enabled for update by simply adding an “INSTEAD OF” trigger. Although an indexed view may also be set up with a trigger, indexing is not required, and the view does not need to obey the other constraints of an indexed view since the trigger writes to the base tables. The examples below use schema-bound views, which is good practice because it keeps the base tables from being changed out from under the view.

There are a few scenarios where this is useful. One is the case of tables that follow a master-secondary design pattern. In many scenarios, there are central objects that manifest themselves in different forms requiring different attributes. The central objects often need to be accessed singularly for their core attributes, so splitting the objects into multiple tables makes queries that need objects of both types together awkward and forces the use of UNION queries. The other complexity with separating objects by type into separate tables is that they often share relationships to common child objects. To support the different interfaces for the different object types, multiple views can be configured.

One example of this is a name and address system I worked on a long time ago for a non-profit organization. It turns out that donors may be organizations, individuals, or married couples, and different attributes are associated with each. To further complicate matters, a donor could also be a member – that is, a member could donate and thus also be a donor. The member might also be married, and the organization needed to track the household unit associated with married members.

The solution was to create a generic entity object and then satellite objects for individual, household, or organization. This allowed a name to be linked to outside systems without having to maintain multiple relationships, while the relationships between the entities are maintained in another table. In this design, an organization can donate just as an individual can, and both show up on a donation report, while organization attributes are not muddled up with individual attributes.

Below is the database diagram representing the common entity approach for name/address. Note that in addition to the master-secondary relationships, there are relationships between the objects, such as the organizational contact for an organization entity or the spouse for a household entity. From the diagram it is not clear which direction the 1-to-1 relationships run, but they originate from Entity – that is, a row in Org, Household, or Person must first have a row in Entity with the same EntityId.

[diagram: the common entity schema with Entity, Person, Household, and Org tables]

In order to support interfaces that need to insert or update different types of entities, views can be created for each specific type of entity. Each view is coupled with an INSTEAD OF trigger so that adding data through the view executes inserts against the base tables. Only the main Entity table uses the identity property; the other tables’ primary keys are all based on the Entity table key. The triggers below provide a couple of interfaces for adding new entities. The first inserts a single person, generating rows in both the Entity and Person tables.

The second trigger could add a new organization with the assumption that the contact person was already added as a person entity.

The third trigger shows how cascading can be used for even more elaborate scenarios, including inserting into other views. This version takes the columns for the contact person as well as the organization (rather than taking a person entity key) and inserts into the Person table as well as the Org and main Entity tables. It leverages the organization view’s trigger by inserting the organization row directly through that view.

CREATE VIEW v_PersonEntity WITH SCHEMABINDING
AS SELECT FirstName, LastName, Suffix, WorkPhone, CellPhone, HomePhone, PrimaryPhone
FROM dbo.Person p
GO

CREATE VIEW v_OrgEntity WITH SCHEMABINDING
AS SELECT ContactPersonId, FederalId, EntityName
FROM dbo.Org o
INNER JOIN dbo.Entity e
    ON o.EntityId = e.EntityId
GO

CREATE VIEW v_OrgEntity_WithContact WITH SCHEMABINDING
AS SELECT FederalId, EntityName, FirstName, LastName, WorkPhone
FROM dbo.Org o
INNER JOIN dbo.Entity e
    ON o.EntityId = e.EntityId
INNER JOIN dbo.Person p
    ON p.EntityId = o.ContactPersonId
GO

CREATE TRIGGER v_Person_ITrig ON dbo.v_PersonEntity
   INSTEAD OF INSERT
AS BEGIN
    IF (SELECT COUNT(*) FROM inserted) > 1 BEGIN
        RAISERROR ('View only supports adding 1 person at a time',16,1)
        RETURN  -- stop so the inserts below do not run for a multi-row insert
    END
    DECLARE @EntityID INT
    -- Create the main entity row
    INSERT INTO dbo.Entity (EntityName, EntityType, MainPhoneNumber)
    SELECT
        i.FirstName + ' ' + i.LastName, 'P'
        ,CASE
            WHEN PrimaryPhone = 'H' THEN HomePhone
            WHEN PrimaryPhone = 'C' THEN CellPhone
            WHEN PrimaryPhone = 'W' THEN WorkPhone
        END
    FROM inserted i

    -- Create the secondary person row
    SELECT @EntityId = SCOPE_IDENTITY()
    INSERT into dbo.Person (EntityId, FirstName, LastName, Suffix,
    WorkPhone, CellPhone, HomePhone, PrimaryPhone)
    SELECT @EntityId, FirstName, LastName, Suffix,
    WorkPhone, CellPhone, HomePhone, PrimaryPhone
    FROM inserted
END
GO

CREATE TRIGGER v_OrgEntity_ITrig on dbo.v_OrgEntity
   INSTEAD OF INSERT
AS BEGIN
    IF (SELECT COUNT(*) FROM inserted) > 1 BEGIN
        RAISERROR ('View only supports adding 1 organization at a time',16,1)
        RETURN
    END
    DECLARE @EntityID INT
   
    -- Create the main entity row
    INSERT INTO dbo.Entity (EntityName, EntityType)
    SELECT EntityName, 'O'
    FROM inserted i

    -- Create secondary org row
    SELECT @EntityId = SCOPE_IDENTITY()
    INSERT into dbo.Org
    (EntityId, ContactPersonId, FederalId)
    SELECT @EntityId, ContactPersonId, FederalId
    FROM inserted

END
GO

CREATE TRIGGER v_OrgEntity_WithContact_ITrig on dbo.v_OrgEntity_WithContact
   INSTEAD OF INSERT
AS BEGIN
    IF (SELECT COUNT(*) FROM inserted) > 1 BEGIN
        RAISERROR ('View only supports adding 1 organization at a time',16,1)
        RETURN
    END
    DECLARE @EntityID INT
   
    -- Could use the view to create the person and entity rows, but
    -- then there is a challenge with getting the identity inserted
    -- since SCOPE_IDENTITY is nulled out after the last add to the
    -- person table
    INSERT INTO dbo.Entity (EntityName, EntityType, MainPhoneNumber)
    SELECT
        i.FirstName + ' ' + i.LastName, 'P', WorkPhone
    FROM inserted i

    -- Create the secondary person row
    SELECT @EntityId = SCOPE_IDENTITY()
    INSERT into dbo.Person (EntityId, FirstName, LastName,
    WorkPhone, PrimaryPhone)
    SELECT @EntityId, FirstName, LastName,
    WorkPhone, 'W'
    FROM inserted

    -- Create the organization and entity row using the org view
    INSERT into dbo.v_OrgEntity
    (ContactPersonId, FederalId, EntityName)
    SELECT @EntityId, FederalId, EntityName
    FROM inserted
END
GO
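
For example (with illustrative values), a single insert into the contact view cascades through the triggers and creates the person entity, the person row, the organization entity, and the org row:

INSERT INTO dbo.v_OrgEntity_WithContact
    (FederalId, EntityName, FirstName, LastName, WorkPhone)
VALUES ('12-3456789', 'Acme Charities', 'John', 'Doe', '555-0100');

-- Inspect the rows created by the cascading triggers
SELECT * FROM dbo.Entity;
SELECT * FROM dbo.Person;
SELECT * FROM dbo.Org;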

The screenshot below shows how all of the tables are affected by a single insert into the Org-with-contact view:

[screenshot: rows created in the Entity, Person, and Org tables by a single insert into v_OrgEntity_WithContact]

These triggers illustrate insert, but update and delete actions may also be handled by INSTEAD OF triggers. Below is an example that does both an insert and an update for one of the tables in the universal problem resolution framework used in my dissertation. In the prior example, the triggers only work for one row at a time because of the need to capture the identity value for the new entity; triggers operate against the entire row set rather than on each row individually. In the example below, a temporary table and a pre-defined sequencing approach are used to support multiple simultaneous insertions. The COALESCE in the update causes omitted columns to remain as they were.

ALTER TRIGGER [Data].[view_Problem_Item_ITrig]
   ON  [Data].[view_Problem_Item]
   INSTEAD OF INSERT
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;
    BEGIN TRY
        BEGIN TRANSACTION
        --SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
        DECLARE @EntityTypeId MONEY = Data.udf_GetItemTypeId('Data.Entity')
        DECLARE @NextEntityId MONEY -- = Data.udf_GetNextItemId(@EntityTypeId)

        DECLARE @ProblemTypeId MONEY = Data.udf_GetItemTypeId('Data.Problem')
        DECLARE @NextProblemId MONEY -- = Data.udf_GetNextItemId(@ProblemTypeId)
        SELECT @NextProblemId = MAX(ItemId) FROM Data.Item  WITH (HOLDLOCK, UPDLOCK)

        SET XACT_ABORT ON

        SELECT  
            (CONVERT(MONEY,ROW_NUMBER() OVER (ORDER BY ItemName)) / 10000.00) -- - .0001
             + @NextProblemId AS ItemId,
            ActivationTaskDetailId,
            DefineActionTreeId,
            SolveActionTreeId,
            ProblemDefinitionState,
            ProblemSolutionState,
            CurrentStepNumber,
            ItemName,
            TypeId,
            VersionNumber,
            IsSystemItem,
            ItemPath,
            AutoGenerated,
            RootProblemId,
            BranchProblemId,
            IsRoot,
            ActivationValue,
            ProblemCheckSum,
            BranchStepNumber,
            OutcomeReason
        INTO #Problem FROM inserted
        WHERE (ItemId IS NULL OR ItemId = 0)

        INSERT Data.Item  WITH (ROWLOCK)
            (
            ItemId,
            ItemName,
            TypeId,
            VersionNumber,
            IsSystemItem,
            ItemPath,
            AutoGenerated,
            RootProblemId
            )
        SELECT
            ItemId,
            ItemName,
            COALESCE(TypeId,@ProblemTypeId),
            COALESCE(VersionNumber,1.0),
            COALESCE(IsSystemItem, 0),
            COALESCE(ItemPath, ItemName),
            COALESCE(AutoGenerated,0),
            RootProblemId
        FROM #Problem

        INSERT Data.Problem  WITH (ROWLOCK)
        (
            ProblemId,
            ActivationTaskDetailId,
            DefineActionTreeId,
            SolveActionTreeId,
            ProblemDefinitionState,
            ProblemSolutionState,
            CurrentStepNumber,
            BranchProblemId,
            IsRoot,
            ActivationValue,
            ProblemCheckSum,
            BranchStepNumber,
            OutcomeReason
        )
        SELECT
            ItemId,
            ActivationTaskDetailId,
            DefineActionTreeId,
            SolveActionTreeId,
            ProblemDefinitionState,
            ProblemSolutionState,
            COALESCE(CurrentStepNumber,0.0),
            BranchProblemId,
            COALESCE(IsRoot,0),
            ActivationValue,
            ProblemCheckSum,
            BranchStepNumber,
            OutcomeReason
        FROM #Problem

        UPDATE item  WITH (ROWLOCK)
        SET ItemName = COALESCE(i.ItemName, item.ItemName),
            TypeId = COALESCE(i.TypeId, item.TypeId),
            VersionNumber = COALESCE(i.VersionNumber,item.VersionNumber),
            IsSystemItem = COALESCE(i.IsSystemItem,item.IsSystemItem),
            ItemPath = COALESCE(i.ItemPath,item.ItemName),
            AutoGenerated = COALESCE(i.AutoGenerated, item.AutoGenerated),
            RootProblemId = COALESCE(i.RootProblemId, item.RootProblemId)
        FROM inserted i
        INNER JOIN Data.Item item 
            ON i.ItemId = item.ItemId 
        WHERE (i.ItemId IS NOT NULL AND i.ItemId <> 0)

        UPDATE p WITH (ROWLOCK)
        SET ActivationTaskDetailId = COALESCE(i.ActivationTaskDetailId, P.ActivationTaskDetailId),
            DefineActionTreeId = COALESCE(i.DefineActionTreeId, p.DefineActionTreeId),
            SolveActionTreeId = COALESCE(i.SolveActionTreeId, p.SolveActionTreeId),
            ProblemDefinitionState = COALESCE(i.ProblemDefinitionState, p.ProblemDefinitionState),
            ProblemSolutionState = COALESCE(i.ProblemSolutionState, p.ProblemSolutionState),
            CurrentStepNumber = COALESCE(i.CurrentStepNumber, p.CurrentStepNumber),
            BranchProblemId = COALESCE(i.BranchProblemId, p.BranchProblemId),
            ActivationValue = COALESCE(i.ActivationValue, p.ActivationValue),
            ProblemCheckSum = COALESCE(i.ProblemCheckSum, p.ProblemCheckSum),
            OutcomeReason = COALESCE(i.OutcomeReason, p.OutcomeReason)
            -- (a Complexity assignment was truncated in the original post; omitted here so the trigger compiles)
        FROM inserted i
        INNER JOIN Data.Problem p
            ON i.ItemId = p.ProblemId 
        WHERE (i.ItemId IS NOT NULL AND i.ItemId <> 0)

        DROP TABLE #Problem
       
        COMMIT
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK
        EXEC dbo.usp_RethrowError
    END CATCH

END
