[sql] Are there any disadvantages to always using nvarchar(MAX)?

In SQL Server 2005, are there any disadvantages to making all character fields nvarchar(MAX) rather than specifying a length explicitly, e.g. nvarchar(255)? (Apart from the obvious one that you aren't able to limit the field length at the database level)

This question is related to sql sql-server sql-server-2005

The answer is


One problem is that if you are having to work with multiple versions of SQL Server, the MAX will not always work. So if you are working with legacy DB's or any other situation that involves multiple versions, you better be very careful.


Think of it as just another safety level. You can design your table without foreign key relationships - perfectly valid - and ensure existence of associated entities entirely on the business layer. However, foreign keys are considered good design practice because they add another constraint level in case something messes up on the business layer. Same goes for field size limitation and not using varchar MAX.


This will cause a performance problem, although it may never cause any actual issues if your database is small. Each record will take up more space on the hard drive and the database will need to read more sectors of the disk if you're searching through a lot of records at once. For example, a small record could fit 50 to a sector and a large record could fit 5. You'd need to read 10 times as much data from the disk using the large record.


As of SQL Server 2019, NVARCHAR(MAX) still does not support SCSU “Unicode compression” — even when stored using In-Row data storage. SCSU was added in SQL Server 2008 and applies to any ROW/PAGE-compressed tables and indices.

As such, NVARCHAR(MAX) can take up to twice as much physical disk space as a NVARCHAR(1..4000) field with the same text content+ — even when not stored in the LOB. The non-SCSU waste depends on data and language represented.

Unicode Compression Implementation:

SQL Server uses an implementation of the Standard Compression Scheme for Unicode (SCSU) algorithm to compress Unicode values that are stored in row or page compressed objects. For these compressed objects, Unicode compression is automatic for nchar(n) and nvarchar(n) columns [and is never used with nvarchar(max)].

On the other hand, PAGE compression (since 2014) still applies to NVARCHAR(MAX) columns if they are written as In-Row data.. so lack of SCSU feels like a “missing optimization”. Unlike SCSU, page compression results can vary dramatically based on shared leading prefixes (ie. duplicate values).

However, it may still be “faster” to use NVARCHAR(MAX) even with the higher IO costs with functions like OPENJSON due to avoiding the implicit conversion. This is implicit conversion overhead depends on the relative cost of usage and if the field is touched before or after filtering. This same conversion issue exists when using 2019’s UTF-8 collation in a VARCHAR(MAX) column.

Using NVARCHAR(1-4000) also requires N*2 bytes of the ~8000 byte row quota, while NVARCHAR(MAX) only requires 24 bytes. Overall design and usage need to be considered together to account for specific implementation details.

+In my database / data / schema, by using two columns (coalesced on read) it was possible to reduce disk space usage by ~40% while still supporting overflowing text values. SCSU, while with its flaws, is an amazingly clever and underutilized method of storing Unicode more space-efficiently.


If all of the data in a row (for all the columns) would never reasonably take 8000 or fewer characters then the design at the data layer should enforce this.

The database engine is much more efficient keeping everything out of blob storage. The smaller you can restrict a row the better. The more rows you can cram in a page the better. The database just performs better when it has to access fewer pages.


Interesting link: Why use a VARCHAR when you can use TEXT?

It's about PostgreSQL and MySQL, so the performance analysis is different, but the logic for "explicitness" still holds: Why force yourself to always worry about something that's relevant a small percentage of the time? If you saved an email address to a variable, you'd use a 'string' not a 'string limited to 80 chars'.


Bad idea when you know the field will be in a set range- 5 to 10 character for example. I think I'd only use max if I wasn't sure what the length would be. For example a telephone number would never be more than a certain number of characters.

Can you honestly say you are that uncertain about the approximate length requirements for every field in your table?

I do get your point though- there are some fields I'd certainly consider using varchar(max).

Interestingly the MSDN docs sum it up pretty well:

Use varchar when the sizes of the column data entries vary considerably. Use varchar(max) when the sizes of the column data entries vary considerably, and the size might exceed 8,000 bytes.

There's an interesting discussion on the issue here.


Based on the link provided in the accepted answer it appears that:

  1. 100 characters stored in an nvarchar(MAX) field will be stored no different to 100 characters in an nvarchar(100) field - the data will be stored inline and you will not have the overhead of reading and writing data 'out of row'. So no worries there.

  2. If the size is greater than 4000 the data would be stored 'out of row' automatically, which is what you would want. So no worries there either.

However...

  1. You cannot create an index on an nvarchar(MAX) column. You can use full-text indexing, but you cannot create an index on the column to improve query performance. For me, this seals the deal...it is a definite disadvantage to always use nvarchar(MAX).

Conclusion:

If you want a kind of "universal string length" throughout your whole database, which can be indexed and which will not waste space and access time, then you could use nvarchar(4000).


Sometimes you want the data type to enforce some sense on the data in it.

Say for example you have a column that really shouldn't be longer than, say, 20 characters. If you define that column as VARCHAR(MAX), some rogue application could insert a long string into it and you'd never know, or have any way of preventing it.

The next time your application uses that string, under the assumption that the length of the string is modest and reasonable for the domain it represents, you will experience an unpredictable and confusing result.


I checked some articles and find useful test script from this: http://www.sqlservercentral.com/Forums/Topic1480639-1292-1.aspx Then changed it to compare between NVARCHAR(10) vs NVARCHAR(4000) vs NVARCHAR(MAX) and I don't find speed difference when using specified numbers but when using MAX. You can test by yourself. Hope This help.

SET NOCOUNT ON;

--===== Test Variable Assignment 1,000,000 times using NVARCHAR(10)
DECLARE @SomeString NVARCHAR(10),
        @StartTime DATETIME;
--=====         
 SELECT @startTime = GETDATE();
 SELECT TOP 1000000
        @SomeString = 'ABC'
   FROM master.sys.all_columns ac1,
        master.sys.all_columns ac2;
 SELECT testTime='10', Duration = DATEDIFF(ms,@StartTime,GETDATE());
GO
--===== Test Variable Assignment 1,000,000 times using NVARCHAR(4000)
DECLARE @SomeString NVARCHAR(4000),
        @StartTime DATETIME;
 SELECT @startTime = GETDATE();
 SELECT TOP 1000000
        @SomeString = 'ABC'
   FROM master.sys.all_columns ac1,
        master.sys.all_columns ac2;
 SELECT testTime='4000', Duration = DATEDIFF(ms,@StartTime,GETDATE());
GO
--===== Test Variable Assignment 1,000,000 times using NVARCHAR(MAX)
DECLARE @SomeString NVARCHAR(MAX),
        @StartTime DATETIME;
 SELECT @startTime = GETDATE();
 SELECT TOP 1000000
        @SomeString = 'ABC'
   FROM master.sys.all_columns ac1,
        master.sys.all_columns ac2;
 SELECT testTime='MAX', Duration = DATEDIFF(ms,@StartTime,GETDATE());
GO

legacy system support. If you have a system that is using the data and it is expected to be a certain length then the database is a good place to enforce the length. This is not ideal but legacy systems are sometime not ideal. =P


One disadvantage is that you will be designing around an unpredictable variable, and you will probably ignore instead of take advantage of the internal SQL Server data structure, progressively made up of Row(s), Page(s), and Extent(s).

Which makes me think about data structure alignment in C, and that being aware of the alignment is generally considered to be a Good Thing (TM). Similar idea, different context.

MSDN page for Pages and Extents

MSDN page for Row-Overflow Data


Interesting link: Why use a VARCHAR when you can use TEXT?

It's about PostgreSQL and MySQL, so the performance analysis is different, but the logic for "explicitness" still holds: Why force yourself to always worry about something that's relevant a small percentage of the time? If you saved an email address to a variable, you'd use a 'string' not a 'string limited to 80 chars'.


A reason NOT to use max or text fields is that you cannot perform online index rebuilds i.e. REBUILD WITH ONLINE= ON even with SQL Server Enterprise Edition.


The main disadvantage I can see is that let's say you have this:

Which one gives you the most information about the data needed for the UI?

This

            CREATE TABLE [dbo].[BusData](
                [ID] [int] IDENTITY(1,1) NOT NULL,
                [RecordId] [nvarchar](MAX) NULL,
                [CompanyName] [nvarchar](MAX) NOT NULL,
                [FirstName] [nvarchar](MAX) NOT NULL,
                [LastName] [nvarchar](MAX) NOT NULL,
                [ADDRESS] [nvarchar](MAX) NOT NULL,
                [CITY] [nvarchar](MAX) NOT NULL,
                [County] [nvarchar](MAX) NOT NULL,
                [STATE] [nvarchar](MAX) NOT NULL,
                [ZIP] [nvarchar](MAX) NOT NULL,
                [PHONE] [nvarchar](MAX) NOT NULL,
                [COUNTRY] [nvarchar](MAX) NOT NULL,
                [NPA] [nvarchar](MAX) NULL,
                [NXX] [nvarchar](MAX) NULL,
                [XXXX] [nvarchar](MAX) NULL,
                [CurrentRecord] [nvarchar](MAX) NULL,
                [TotalCount] [nvarchar](MAX) NULL,
                [Status] [int] NOT NULL,
                [ChangeDate] [datetime] NOT NULL
            ) ON [PRIMARY]

Or This?

            CREATE TABLE [dbo].[BusData](
                [ID] [int] IDENTITY(1,1) NOT NULL,
                [RecordId] [nvarchar](50) NULL,
                [CompanyName] [nvarchar](50) NOT NULL,
                [FirstName] [nvarchar](50) NOT NULL,
                [LastName] [nvarchar](50) NOT NULL,
                [ADDRESS] [nvarchar](50) NOT NULL,
                [CITY] [nvarchar](50) NOT NULL,
                [County] [nvarchar](50) NOT NULL,
                [STATE] [nvarchar](2) NOT NULL,
                [ZIP] [nvarchar](16) NOT NULL,
                [PHONE] [nvarchar](18) NOT NULL,
                [COUNTRY] [nvarchar](50) NOT NULL,
                [NPA] [nvarchar](3) NULL,
                [NXX] [nvarchar](3) NULL,
                [XXXX] [nvarchar](4) NULL,
                [CurrentRecord] [nvarchar](50) NULL,
                [TotalCount] [nvarchar](50) NULL,
                [Status] [int] NOT NULL,
                [ChangeDate] [datetime] NOT NULL
            ) ON [PRIMARY]

1) The SQL server will have to utilize more resources (allocated memory and cpu time) when dealing with nvarchar(max) vs nvarchar(n) where n is a number specific to the field.

2) What does this mean in regards to performance?

On SQL Server 2005, I queried 13,000 rows of data from a table with 15 nvarchar(max) columns. I timed the queries repeatedly and then changed the columns to nvarchar(255) or less.

The queries prior to the optimization averaged at 2.0858 seconds. The queries after the change returned in an average of 1.90 seconds. That was about 184 milliseconds of improvement to the basic select * query. That is an 8.8% improvement.

3) My results are in concurrence with a few other articles that indicated that there was a performance difference. Depending on your database and the query, the percentage of improvement can vary. If you don't have a lot of concurrent users or very many records, then the performance difference won't be an issue for you. However, the performance difference will increase as more records and concurrent users increase.


It's a fair question and he did state apart from the obvious…

Disadvantages could include:

Performance implications Query optimizer uses field size to determine most efficent exectution plan

"1. The space alloction in extends and pages of the database are flexible. Thus when adding information to the field using update, your database would have to create a pointer if the new data is longer than the previous inserted. This the database files would become fragmented = lower performance in almost everything, from index to delete, update and inserts. " http://sqlblogcasts.com/blogs/simons/archive/2006/02/28/Why-use-anything-but-varchar_2800_max_2900_.aspx

Integration implications - hard for other systems to know how to integrate with your database Unpredictable growth of data Possible security issues e.g. you could crash a system by taking up all disk space

There is good article here: http://searchsqlserver.techtarget.com/tip/1,289483,sid87_gci1098157,00.html


As was pointed out above, it is primarily a tradeoff between storage and performance. At least in most cases.

However, there is at least one other factor that should be considered when choosing n/varchar(Max) over n/varchar(n). Is the data going to be indexed (such as, say, a last name)? Since the MAX definition is considered a LOB, then anything defined as MAX is not available for indexing. and without an index, any lookup involving the data as predicate in a WHERE clause is going to be forced into a Full Table scan, which is the worst performance you can get for data lookups.


firstly I thought about this, but then thought again. There are performance implications, but equally it does serve as a form of documentation to have an idea what size the fields really are. And it does enforce when that database sits in a larger ecosystem. In my opinion the key is to be permissive but only within reason.

ok, here's my feelings simply on the issue of business and data layer logic. It depends, if your DB is a shared resource between systems that share business logic then of course it seems a natural place to enforce such logic, but its not the BEST way to do it, the BEST way is to provide an API, this allows the interaction to be tested and keeps business logic where it belongs, it keeps systems decoupled, it keeps your tiers within a system decoupled. If however your database is supposed to be serving only one application, then lets get AGILE in thinking, what's true now? design for now. If and when such access is needed, provide an API to that data.

obviously though, this is just the ideal, if you are working with an existing system the likelyhood is that you will need to do it differently at least in the short term.


It will make screen design harder as you will no longer be able to predict how wide your controls should be.


I had a udf which padded strings and put the output to varchar(max). If this was used directly instead of casting back to the appropriate size for the column being adjusted, the performance was very poor. I ended up putting the udf to an arbitrary length with a big note instead of relying on all the callers of the udf to re-cast the string to a smaller size.


My tests have shown that there are differences when selecting.

CREATE TABLE t4000 (a NVARCHAR(4000) NULL);

CREATE TABLE tmax (a NVARCHAR(MAX) NULL);

DECLARE @abc4 NVARCHAR(4000) = N'ABC';

INSERT INTO t4000
SELECT TOP 1000000 @abc4
    FROM
    master.sys.all_columns ac1,
    master.sys.all_columns ac2;

DECLARE @abc NVARCHAR(MAX) = N'ABC';

INSERT INTO tmax
SELECT TOP 1000000 @abc
    FROM
    master.sys.all_columns ac1,
    master.sys.all_columns ac2;

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT * FROM dbo.t4000;
SELECT * FROM dbo.tmax;

The only problem I found was that we develop our applications on SQL Server 2005, and in one instance, we have to support SQL Server 2000. I just learned, the hard way that SQL Server 2000 doesn't like the MAX option for varchar or nvarchar.


The job of the database is to store data so that it can be used by the enterprise. Part of making that data useful is ensuring that it is meaningful. Allowing someone to enter an unlimited number of characters for their first name isn't ensuring meaningful data.

Building these constraints into the business layer is a good idea, but that doesn't ensure that the database will remain intact. The only way to guarantee that the data rules are not violated is to enforce them at the lowest level possible in the database.


Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to sql-server

Passing multiple values for same variable in stored procedure SQL permissions for roles Count the Number of Tables in a SQL Server Database Visual Studio 2017 does not have Business Intelligence Integration Services/Projects ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database How to create temp table using Create statement in SQL Server? SQL Query Where Date = Today Minus 7 Days How do I pass a list as a parameter in a stored procedure? SQL Server date format yyyymmdd

Examples related to sql-server-2005

Add a row number to result set of a SQL query SQL Server : Transpose rows to columns Select info from table where row has max date How to query for Xml values and attributes from table in SQL Server? How to restore SQL Server 2014 backup in SQL Server 2008 SQL Server 2005 Using CHARINDEX() To split a string Is it necessary to use # for creating temp tables in SQL server? SQL Query to find the last day of the month JDBC connection to MSSQL server in windows authentication mode How to convert the system date format to dd/mm/yy in SQL Server 2008 R2?