CRC-32C (Castagnoli) for C++ and .NET
This is a hardware-accelerated implementation of CRC-32C (Castagnoli, polynomial 0x1EDC6F41 / 0x82F63B78) for Windows (C++ and .NET),
opensourced under zlib license
and BSD license.
CRC32 instruction is used if available. Otherwise this library uses super fast software fallback.
.NET Framework doesn't provide any CRC algorithms and existing opensource libraries contain suboptimal algorithms. Creating really fast CRC algorithm is hard. This library encapsulates all that complexity for you and offers you a neat, simple API, so that every application can now afford fast CRC algorithm.
Note on CRC variants
Don't confuse CRC-32C (Castagnoli) with the older CRC-32 (also called CRC-32-IEEE). As you might know, CRC algorithm is defined by its so-called polynomial, essentially a 32-bit number. CRC-32C's polynomial can be seen in the box to the right.
Advantages of CRC-32C
The polynomial in CRC-32C was shown to have better error detection properties, which is the reason for its adoption in newer standards (iSCSI, SCTP, ext4).
Aside from higher reliability, CRC-32C now has the advantage of dedicated instruction on newer Intel processors. That's why it is being chosen for high-performance applications, for example Snappy compression algorithm.
Tutorial for C++
C++ NuGet package contains source code that is compiled together with your project. Your project will therefore have no DLL dependencies and there will be no C++ runtime issues. It however means that Debug build of your project will contain slower Debug build of CRC-32C.
PM> Install-Package Crc32C
Alternatively, you can download plain 7-Zip archive of the DLLs and associated LIBs and headers. Sources are available via Bitbucket repository. C++ code and binaries are distributed under zlib license.
After downloading the library, your first step is to include the header:
After that, you will get access to the following function:
/* Computes CRC-32C (Castagnoli) checksum. Uses Intel's CRC32 instruction if it is available. Otherwise it uses a very fast software fallback. */ extern "C" uint32_t crc32c_append( uint32_t crc, // Initial CRC value. Typically it's 0. // You can supply non-trivial initial // value here. Initial value can be used // to chain CRC from multiple buffers. const uint8_t *input, // Data to be put through the CRC // algorithm. size_t length); // Length of the data in the input // buffer.
You can call the function like this:
uint32_t crc = crc32c_append(0, input, 10000);
As the comments in above declaration suggest, you can compute crc from multiple chained buffers:
uint32_t intermediate = crc32c_append(0, input, 3000); uint32_t crc = crc32c_append(intermediate, input + 3000, 7000);
And that's it. You can add any pre/post-processing of the CRC value if your application requires it.
Tutorial for .NET
.NET DLL is AnyCPU, but it automatically forwards all calls to one of the two native DLLs depending on whether the current process is 32-bit or 64-bit. The two native DLLs are embedded as resources and unpacked into temporary location before first use.
PM> Install-Package Crc32C.NET
Alternatively, you can download plain 7-Zip archive of the DLLs. Sources are available via Bitbucket repository. .NET code is distributed under BSD license. Underlying C++ code is distributed under zlib license.
After downloading the library, your can use the CRC-32C namespace:
You will get access to class
Crc32CAlgorithm that implements
with all its features:
/// <summary> /// Implementation of CRC-32C (Castagnoli) with polynomial 0x1EDC6F41. /// It can detect errors more reliably than the older CRC-32-IEEE. /// This class will use CRC32 instruction on recent Intel processors /// if it is available. Otherwise it will transparently fall back to /// a very fast software implementation. Besides standard /// HashAlgorithm methods, this class supports several convenient /// static methods returning the CRC as UInt32. /// </summary> public class Crc32CAlgorithm : HashAlgorithm
And here are the handy static methods mentioned in the class documentation. They map directly to the underlying native library.
public static uint Compute(byte input); public static uint Compute(byte input, int offset, int length); public static uint Append(uint initial, byte input); public static uint Append(uint initial, byte input, int offset, int length);
You can then compute CRC-32C of a buffer with the following C# code:
uint crc = Crc32CAlgorithm.Compute(array);
You can also submit smaller section of a buffer:
uint crc = Crc32CAlgorithm.Compute(array, 0, 10000);
If you would like to chain multiple buffers, you can do it like this:
uint intermediate = Crc32CAlgorithm.Compute(array, 0, 3000); uint crc = Crc32CAlgorithm.Append(intermediate, array, 3000, 7000);
You can of course call Append as many times as you wish. You can add pre/post-processing of the CRC value depending on requirements of your application.
Recent Intel processors have specialized instruction for calculating CRC-32C: the CRC32 instruction. Fast software fallback is used on all other hardware, i.e. AMD processors and some older Intel processors.
Hardware performance (Core i7 3.4GHz):
- 64-bit mode: 21 GB/s
- 32-bit mode: 10 GB/s
Software performance (Core i7 3.4GHz with hardware acceleration disabled):
- 64-bit mode: 2 GB/s
- 32-bit mode: 1.8 GB/s
The library is optimized for larger buffers of several dozens of kilobytes. Above results have been measured on buffers ranging between 0KB and 64KB with 32KB average.
Recent AMD processors contain PCLMULQDQ instruction, which could be used to accelerage CRC-32C among other things. This library doesn't use the instruction, because there was no code at the time that could be easily incorporated.
I've nevertheless performed benchmarks and PCLMULQDQ-based code for CRC-32-IEEE performed at 2.8 GB/s or 40% faster than my software implementation for CRC-32C.
Intel has submitted PCLMULQDQ CRC code to zlib under permissive BSD license. Unfortunately, Intel's code is full of strange looking constants that have been generated from CRC-32-IEEE polynomial. In order to use Intel's code in my project, I would have to regenerate all these constants for CRC-32C polynomial.
I managed to do that for all constants but one. The constant-generating code is in the Hg repo in “constants” project. The one elusive constant is 0x9db42487. Jim Kukunas of Intel, who submitted the code to zlib, has described it as “magic” constant that produces FFFFFF00000…0000 when folded 4 times. My attempts to simulate this folding process yielded all kinds of strange constants but no FFFFFF00000…0000.
Anyway, even if I managed to generate this last constant, it still wouldn't be over. Firstly, it might turn out that the “magic” constant cannot be generated for CRC-32C polynomial. Secondly, CRC has fancy bit-flipped variants and I am not sure, maybe the zlib code is one such strange variant. Removing the flipping would be tricky without devoting time to fully understand Intel's algorithm. Maybe I will get back to it someday.
Feedback & resources
This implementation of CRC-32C was written by Robert Važan. At its core is a Windows port of Mark Adler's CRC-32C.
How to implement your own CRC algorithm:
- CRC primer by Ross Williams
- Mark Adler's CRC-32C code using CRC32 instruction with software fallback
- Intel's guide to computing CRC-32C using the CRC32 instruction
- Intel's guide to computing CRC for any polynomial using PCLMULQDQ instruction
- Zlib patch submitted by Intel utilizing PCLMULQDQ
Known bugs and issues:
- There's no managed .NET implementation. The library will fail under Mono, on ARM processors, and in various restricted contexts (ASP.NET, ClickOnce).
- PCLMULQDQ implementation would be desirable to get hardware acceleration on AMD processors.
- PCLMULQDQ can be also used to further optimize code that uses the CRC32 instruction.
- C++/CLI, mixed mode assembly, or something similar should be used to avoid unpacking the native DLLs in temporary folder.
- It would be of course better to create one great library that implements all the commonly used CRC variants instead of this specialized CRC-32C library.
- There are some fancy constants in the code that influence performance that I have taken unchanged from Mark Adler. It might be better to benchmarks various constants to see what works best.
- The code is optimized for long buffers. While performance for short buffers isn't bad, perhaps there are some interesting optimizations that can be done for short buffers.