In this case, the difference is that dd reads and writes only 4096 bytes at a time, because you specified bs=4096. Every block costs one read and one write system call, so the likely effect is that dd will be much slower than cp. Try a larger block size (10M, 50M?).
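One way to experiment with block sizes is to run the same copy several times and compare the summary line that dd prints. The sketch below assumes GNU dd (for size suffixes such as 1M) and uses two hypothetical scratch files, src and dst, instead of real devices. On a small cached file the numbers mostly reflect the page cache, so treat them as an illustration of the method rather than a measurement of your disks.

```shell
#!/bin/sh
# Sketch: compare dd throughput for a few block sizes on a scratch file.
# src and dst are hypothetical temporary files, not real devices.
src=$(mktemp)
dst=$(mktemp)

# Create 16 MiB of test data.
dd if=/dev/zero of="$src" bs=1M count=16 2>/dev/null

for bs in 4096 64K 1M 4M; do
    printf 'bs=%s: ' "$bs"
    # dd reports its statistics on stderr; keep only the final summary line.
    dd if="$src" of="$dst" bs="$bs" 2>&1 | tail -n 1
done

# Confirm that the last copy matches the source.
cmp "$src" "$dst" && copy_ok=yes
rm -f "$src" "$dst"
```

To try this on real hardware, replace src with the device you want to read, and double-check of= before running anything, since dd overwrites its target without asking.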
The particular buffer size that's best suited for the current devices might be different from cp's (or cat's). You can't easily control cp's buffering. dd's utility shines when:
- you have very large devices to copy, so that experimenting to determine the best block-size is worthwhile.
- you have to copy only part of a disk. You can specify count to limit how many blocks are copied.
- you want to resume an interrupted copy. You can't do so with cp, but you can try with dd, by using the seek and skip options.
- you want to pipe it to the standard input of something (admittedly, cat will work here too):

    dd if=/dev/sda bs=10M | ssh host dd of=/dev/sdb
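The count, skip and seek options from the list above can be tried safely on ordinary files. This is a minimal sketch, assuming GNU dd and two hypothetical scratch files, src and dst. It also assumes you know how many blocks were already copied (3 here) and that you reuse the same bs for the resumed run. After a real interruption you would take that number from dd's own status output.

```shell
#!/bin/sh
# Sketch: copy part of a file with count, then resume with skip and seek.
# src and dst are hypothetical temporary files standing in for disks.
src=$(mktemp)
dst=$(mktemp)

# 8 MiB of random data plays the role of the source disk.
head -c 8388608 /dev/urandom > "$src"

# Copy only the first 3 blocks of 1 MiB, as if the copy had been interrupted.
dd if="$src" of="$dst" bs=1M count=3 2>/dev/null

# Resume: skip 3 blocks of input and seek past 3 blocks of output.
# conv=notrunc tells dd not to truncate the output file before writing.
dd if="$src" of="$dst" bs=1M skip=3 seek=3 conv=notrunc 2>/dev/null

# The resumed copy should be identical to the source.
cmp "$src" "$dst" && resume_ok=yes
rm -f "$src" "$dst"
```

Note that skip counts blocks on the input side and seek counts blocks on the output side, so both have to be given to continue from the same offset.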
 
dd's usefulness is discussed in detail in this Unix & Linux post:
dd vs cat — is dd still relevant these days?