> The values in the non-clustered index are GUIDs and thus distributed
> throughout the index.
As Tom said, random GUID values are a real bulk insert performance killer.
Buffer efficiency with these random decreases proportionally with table size
until you get to the point of requiring at least one I/Os for every row
inserted.
One workaround I've used with SQL 2000 is to use a composite key instead of
GUID alone. I don't know of the details of your application but in my case,
I used a unique index consisting of FileId (a sequential integer) and
RecordId (a GUID value). This greatly improved bulk insert performance with
very large tables.
Hope this helps.
Dan Guzman
SQL Server MVP
"Scott Nichol" <snicholnews@.scottnichol.com> wrote in message
news:%23OYnHQQHIHA.2268@.TK2MSFTNGP02.phx.gbl...
> Good questions. Fill factor is 80% (the default?). The write pattern in
> general is no writes, then bursts every two minutes or so at checkpoint.
> During bursts, write queue goes to 100 or more. Each burst is about 8
> seconds. The read pattern is more consistent and lighter, with read queue
> never higher than 6, but normally 0 (and that's with other concurrent
> activity in the database). The reads are from the RAID 10 set with the
> non-clustered index. There is very little reading from the set with the
> clustered index.
> On a development machine where I can run the BULK INSERT without other
> activity, the disk activity on the RAID set with the clustered index
> consists of one write burst of 4 seconds. The disk activity on the RAID
> set with the non-clustered index shows consistent reads for the rest of
> the load, with intermiittent write bursts.
> The feeling we get from the reads is that SQL Server is reading through
> the existing index nodes to find leaves that need to be updated (splitting
> when necessary, etc.).
> The values in the non-clustered index are GUIDs and thus distributed
> throughout the index.
> --
> Scott Nichol
> "Tom Moreau" <tom@.dont.spam.me.cips.ca> wrote in message
> news:%23HGP4$LHIHA.1316@.TK2MSFTNGP02.phx.gbl...
>
Thanks. You confirm our suspicions about the GUID: we seem to have lots of single I/Os when the index gets updated. There's
virtually no I/O queuing and the machine does not seem taxed at all.
If we use the sequential integer approach, we only need the GUID in the index when we hit 4G rows and start duplicating the integer
values we generate (assuming we use negative as well as positive values), right?
Since the value of the index is used as a pseudo-FK from another table, we'd have some work there as well. Not to mention, of
course, adding the new sequential column to the existing half billion rows. Too bad we did not attack the problem earlier!
It sounds like NEWSEQUENTIALID() would help us tremendously if we moved to SQL Server 2005. Is that correct? We want to migrate
during 2008 anyway. If it might do at least nearly as well, it would certainly be easier than retrofitting with the sequential
integer.
Scott Nichol
"Dan Guzman" <guzmanda@.nospam-online.sbcglobal.net> wrote in message news:8187C560-FBFB-40E0-BE14-AB089E2936B3@.microsoft.com...
> As Tom said, random GUID values are a real bulk insert performance killer. Buffer efficiency with these random decreases
> proportionally with table size until you get to the point of requiring at least one I/Os for every row inserted.
> One workaround I've used with SQL 2000 is to use a composite key instead of GUID alone. I don't know of the details of your
> application but in my case, I used a unique index consisting of FileId (a sequential integer) and RecordId (a GUID value). This
> greatly improved bulk insert performance with very large tables.
> --
> Hope this helps.
> Dan Guzman
> SQL Server MVP
> "Scott Nichol" <snicholnews@.scottnichol.com> wrote in message news:%23OYnHQQHIHA.2268@.TK2MSFTNGP02.phx.gbl...
>
|||> If we use the sequential integer approach, we only need the GUID in the
> index when we hit 4G rows and start duplicating the integer values we
> generate (assuming we use negative as well as positive values), right?
You don't need the GUID in the index at all if you go with a bigint and let
SQL Server assign the values with IDENTITY. The application I mentioned in
my response provided both the integer and GUID values to SQL Server. This
allowed the application to assign the surrogate key and bulk insert into
related tables.
> Since the value of the index is used as a pseudo-FK from another table,
> we'd have some work there as well. Not to mention, of course, adding the
> new sequential column to the existing half billion rows. Too bad we did
> not attack the problem earlier!
I learned this same lesson the same lesson the hard way. I started with
only GUID clustered PK and found that bulk performance dropped at about 5M
rows. I then changed to a non-clustered index to mitigate the problem but
performance still dropped off with larger tables. Once I added the integer,
I was able to use the clustered primary key with good insert performance.
My experience was also a good lesson on PMON. The only way to really know
if you have a disk bottleneck with a batch application is to know the
maximum throughput the disk subsystem can sustain. Once the limit is hit,
the only ways to get more throughput is to either add hardware or revise the
indexing strategy/application design.
> It sounds like NEWSEQUENTIALID() would help us tremendously if we moved to
> SQL Server 2005. Is that correct? We want to migrate during 2008 anyway.
> If it might do at least nearly as well, it would certainly be easier than
> retrofitting with the sequential integer.
Yes, a NEWSEQUENTIALID() default constraint (available in SQL 2005 and
2008), a bigint identity, or the composite key method will certainly improve
insert performance.
Hope this helps.
Dan Guzman
SQL Server MVP
"Scott Nichol" <snicholnews@.scottnichol.com> wrote in message
news:Ovyu9OWHIHA.280@.TK2MSFTNGP03.phx.gbl...
> Thanks. You confirm our suspicions about the GUID: we seem to have lots
> of single I/Os when the index gets updated. There's virtually no I/O
> queuing and the machine does not seem taxed at all.
> If we use the sequential integer approach, we only need the GUID in the
> index when we hit 4G rows and start duplicating the integer values we
> generate (assuming we use negative as well as positive values), right?
> Since the value of the index is used as a pseudo-FK from another table,
> we'd have some work there as well. Not to mention, of course, adding the
> new sequential column to the existing half billion rows. Too bad we did
> not attack the problem earlier!
> It sounds like NEWSEQUENTIALID() would help us tremendously if we moved to
> SQL Server 2005. Is that correct? We want to migrate during 2008 anyway.
> If it might do at least nearly as well, it would certainly be easier than
> retrofitting with the sequential integer.
> --
> Scott Nichol
> "Dan Guzman" <guzmanda@.nospam-online.sbcglobal.net> wrote in message
> news:8187C560-FBFB-40E0-BE14-AB089E2936B3@.microsoft.com...
>
|||Trying to avoid the pain of making any "real" change, I implemented my own alternative to NEWSEQUENTIALID(), and it ROCKS.
Using the function to generate GUID values (during data massaging) to load, BULK INSERT on my development machine took just 23
seconds to load each of two files, one 110,766 rows, the other 118,829. That's 4991 rows/sec on average. For comparison, 117,755
rows loaded in 14 seconds (8411 rows/sec) with the non-clustered index dropped. With "normal" GUIDs, I was loading just 90 rows/sec
on that rig. I look forward to seeing this on our production box.
The caveats are
1. I had dropped the non-clustered index to run the test without it, so the index that subsequent tests ran against is a fresh
re-build, presumably as unfragmented as possible.
2. The function I used to generate GUIDs was taken from an article on the web at
http://www.informit.com/articles/article.aspx?p=25862&seqNum=7. It combines 10 bytes of a normally generated GUID with 6 bytes of
the current time. The time portion provides the sequential-ness; the GUID part the random-ness. The algorithm seems very unlikely
to produce a duplicate, but I will review it to confirm before trying in production.
Thanks for all the great feedback.
Scott Nichol
"Dan Guzman" <guzmanda@.nospam-online.sbcglobal.net> wrote in message news:B0C99286-2CBB-4C26-AC61-B09D02965D41@.microsoft.com...
> You don't need the GUID in the index at all if you go with a bigint and let SQL Server assign the values with IDENTITY. The
> application I mentioned in my response provided both the integer and GUID values to SQL Server. This allowed the application to
> assign the surrogate key and bulk insert into related tables.
>
> I learned this same lesson the same lesson the hard way. I started with only GUID clustered PK and found that bulk performance
> dropped at about 5M rows. I then changed to a non-clustered index to mitigate the problem but performance still dropped off with
> larger tables. Once I added the integer, I was able to use the clustered primary key with good insert performance.
> My experience was also a good lesson on PMON. The only way to really know if you have a disk bottleneck with a batch application
> is to know the maximum throughput the disk subsystem can sustain. Once the limit is hit, the only ways to get more throughput is
> to either add hardware or revise the indexing strategy/application design.
>
> Yes, a NEWSEQUENTIALID() default constraint (available in SQL 2005 and 2008), a bigint identity, or the composite key method will
> certainly improve insert performance.
> --
> Hope this helps.
> Dan Guzman
> SQL Server MVP
> "Scott Nichol" <snicholnews@.scottnichol.com> wrote in message news:Ovyu9OWHIHA.280@.TK2MSFTNGP03.phx.gbl...
>
|||> The algorithm seems very unlikely to produce a duplicate, but I will
> review it to confirm before trying in production.
I'm glad you found an easy solution. As long as you use a primary key to
guarantee unique values, I don't see a problem using it.
Hope this helps.
Dan Guzman
SQL Server MVP
"Scott Nichol" <snicholnews@.scottnichol.com> wrote in message
news:eYlG65bHIHA.4272@.TK2MSFTNGP06.phx.gbl...
> Trying to avoid the pain of making any "real" change, I implemented my own
> alternative to NEWSEQUENTIALID(), and it ROCKS.
> Using the function to generate GUID values (during data massaging) to
> load, BULK INSERT on my development machine took just 23 seconds to load
> each of two files, one 110,766 rows, the other 118,829. That's 4991
> rows/sec on average. For comparison, 117,755 rows loaded in 14 seconds
> (8411 rows/sec) with the non-clustered index dropped. With "normal"
> GUIDs, I was loading just 90 rows/sec on that rig. I look forward to
> seeing this on our production box.
> The caveats are
> 1. I had dropped the non-clustered index to run the test without it, so
> the index that subsequent tests ran against is a fresh re-build,
> presumably as unfragmented as possible.
> 2. The function I used to generate GUIDs was taken from an article on the
> web at http://www.informit.com/articles/article.aspx?p=25862&seqNum=7. It
> combines 10 bytes of a normally generated GUID with 6 bytes of the current
> time. The time portion provides the sequential-ness; the GUID part the
> random-ness. The algorithm seems very unlikely to produce a duplicate,
> but I will review it to confirm before trying in production.
> Thanks for all the great feedback.
> --
> Scott Nichol
> "Dan Guzman" <guzmanda@.nospam-online.sbcglobal.net> wrote in message
> news:B0C99286-2CBB-4C26-AC61-B09D02965D41@.microsoft.com...
>
Tuesday, March 20, 2012
BULK INSERT performance
Labels:
bulk,
database,
distributedgt,
guid,
guids,
index,
insert,
microsoft,
mysql,
non-clustered,
oracle,
performance,
random,
real,
server,
sql,
throughout,
thus,
tom,
values
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment