Tuesday, 14 May 2013

Back to skule: One Pad, Two Pad, Me Pad, You Pad - Cryptanalysis for beginners

A couple of weeks ago, Kev Sheldrake from Head Hacking gave a fascinating talk on NLP and Social Engineering at London's DEFCON group, DC4420 (called "Social Engineering Lies!"). Afterwards, over drinks, he told me about a free cryptography course that Stanford was running, and how much fun he and his workmates were having competing with each other to solve the 'homework' problems that were set each week...

Now, Cryptography is a subject close to my heart, and I spend a large proportion of my working life mucking about with it, or, to put it another way, breaking it. However, I've never formally studied it, and, until now, I've firmly believed I didn't need to. In fact, I felt that not knowing what every cryptographer knows actually gives me a slight advantage as I won't fall into the trap of making certain assumptions when faced with a cryptographic problem. Since I'm usually attacking an implementation, and not the crypto algorithm itself, knowledge of how the underlying maths works is not necessarily beneficial, as long as I understand the mechanics of it's implementation.

But my curiosity was piqued, so I decided to take a look...

Spoiler alert! If you are taking or thinking of taking Dan Boneh's excellent class, then step away from the blog now! I will be revealing details of the problems he sets, and my own solutions to them, and I don't want to spoil anything for anyone.

Unfortunately, the six week course already started several weeks ago, and I'm too impatient to wait for a new session to begin, so I'm already doomed to fail before I've even begun, due to late filing penalties for the homework assignments, but I wouldn't let that put you off - it's not all about the score you achieve. There will be plenty of reward in the learning itself (at least that was true for me).

OK, so the first week concentrates on Stream Ciphers. I'm not going to go over everything here, just the specifics that I think are of particular interest. One of these was One Time Pads. In the lectures, Dan proves that these are 'perfectly secure' as they have a unique key that is the same size as the message and are therefore impervious to cryptanalysis. The cryptographic operation itself is very simple: XOR the message with the key and you have the ciphertext. He goes into a lot of detail to prove the security of a message enciphered in this way, and the mathematics can get pretty daunting if, like me, you left school more years ago than you care to mention, and it's easy to get lost in the terminology of the worked through algorithms. To be fair to Dan, he does say he's going to blast through it as it's all on video and you can simply pause and rewind if you're not keeping up. But let's forget about the maths and focus instead on the outcome of the XOR, which is the only thing that really matters. There are three things we need to know:

Firstly, if you XOR two values together, either value can be recovered. In other words, if XORing A with B gives you X, then XORing X with A or B will give you the reciprocal value. If we represent XOR with '^', it looks like this:

  A ^ B = X
  X ^ A = B
  X ^ B = A

Secondly, if you XOR any value with itself, you get NULL:

  A ^ A = NULL
  B ^ B = NULL
  X ^ X = NULL

Finally, if you XOR something with NULL, nothing happens:

  A ^ NULL = A
  B ^ NULL = B
  X ^ NULL = X

That's it. The rest is just logic, and if we really do need to convince ourselves of any of the proofs, we can do so with a little bit of python instead of trying to understand the underlying maths. I'll give an example of that in a minute.

Right, so back to One Time Pads. The problem with such a pad comes if you re-use the key. We all know from best practice that you should never re-use these keys, and Dan stresses this several times during the lectures, but do we know exactly why? What is it about re-using the key that compromises it?

Well, the answer is pretty simple: since XORing anything with itself cancels out, XORing two messages that were originally XORed with the same key actually removes the key and leaves you with the XOR of the two messages. This is clear if we work it through:

If our first message is ABCD, and our key is 1234, our ciphertext will be:

  A^1 B^2 C^3 D^4

and if our second message is WXYZ, our ciphertext is:

  W^1 X^2 Y^3 Z^4

so now if we XOR pairs of bytes from the two messages, the keys cancel out:

  (A^1) ^ (W^1) = (A^W) ^ (1^1) = (A^W) ^ NULL = A^W

and we end up with:

  A^W B^X C^Y D^Z

OK, so we've now got a new ciphertext that is the product of the two messages. So what? At this point it's tempting to think we're close to an answer - we've got two messages so all we need to do is extract one and the other will reveal itself, but the truth is we still don't know anything about either of them.  In fact, all we've done is swapped one secret key for another - message A is now acting as the key for message B and vice-versa.

Since the ciphertext bytes reveal nothing about themselves, we are no further forward. We can prove this if we actually perform the XOR: A^W gives the same result as B^T, and, indeed, 255 other possible combinations:

  >>> count=0
  ... for x in range(256):
  ...   for y in range(256):
  ...     if x^y == ord('A')^ord('W'):
  ...       count += 1
  ... print "There are %d combinations that are the same as A^W" % count
  There are 256 combinations that are the same as A^W

So we can invent anything meaningful we like as message A, and we'll still get something out for message B.

Now what? 

In the problem set, Dan says:

"Hint: XOR the ciphertexts together, and consider what happens when a space is XORed with a character in [a-zA-Z]."

And here's another hint: all the above would be true if we didn't know anything about the messages (i.e. they were comprised of completely arbitrary characters in the range 0-255), but in this case we do know something about them: they are text messages, comprised only of readable characters. This changes everything. Now we can apply some rules to the XOR outcomes to learn something about the messages. Removing the key from the equation has done something magical - now we are no longer dealing with any arbitrary component (as we can assume the key was entirely random), but only with bytes in a specific range.

So what does happen when you XOR a character in the range [a-zA-Z] with a space? Well, let's knock up some python and see:

  >>> for x in range(26):
  ...   print '%s' % chr((ord('a')+x) ^ ord(' ')),
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

  >>> for x in range(26):
  ...   print '%s' % chr((ord('A')+x) ^ ord(' ')),
  a b c d e f g h i j k l m n o p q r s t u v w x y z

Well, look at that! They swap case. So if I XOR an 'A' with a space, I get an 'a', and vice-versa. Excellent. That means that wherever I see a character in the range [a-zA-Z] in my new ciphertext, it is likely to be the product of a space and another character (this is also true if we see a NULL, as that means both messages had a space at that location). This is what is known as a 'crib', and, once we have a crib, we can use it to reveal something about the key. In this case, as it's a simple XOR, our crib gives us direct access to the key byte: simply XOR a space with the corresponding byte in the original ciphertext and you will reveal the secret key byte. 

However, we have a small problem. Although we now know that a space was used at a particular location in our newly created ciphertext, we don't know whether it was in message A or message B. But that's no big deal, right? If we try it against both message A and message B we'll get two key bytes, one of which will be correct. We can then simply try them both against all our messages and if they make sense in all cases then we know we've got the right one. True, but it's going to be hard work. The average word in English is five letters long, so we can expect to see a space every sixth letter. This means that if you take a correctly decoded message you'll reveal every sixth letter and it's going to look something like:


and (for example) an incorrectly decoded one:


On top of that, the correct and incorrect key bytes will be all mixed together, so some of the characters may be correct, and others not. So how do I tell them apart? How do I know which are the correct ones?

Well, once the messages start filling out and making a bit more sense, it will become pretty obvious when a character is wrong, but when we are this stage looking at single letters spaced so far apart it is a real problem. However, in the same way we know that the average word length in the English language is five, we also know which are the most common letters and should therefore appear the most times in our deciphered messages. We could, therefore, do some frequency analysis on the results of decrypting with each of our keys and work out which is likely to be the correct one. But on short messages or pads for which we only have a few samples, this is unlikely to bear a whole lot of fruit.

Another method is to 'expand' our crib. We are currently working with a single space, but what if we expand that to a common word and see if we get a good result for all the letters in that combination. If we have a crib that consists of two spaces three letters apart, it would be easy enough to try some common three letter words at that location, such as " and " or " man " or " day " etc. this will give you five consecutive key bytes instead of just one, so now our correctly decoded message will look like:

   _______sing ___________

and an incorrect one will be obviously incorrect:

In this case we can immediately see what looks like a 'hit' on our crib, which decodes as "sing ", which could be the last four characters of several common words, so it looks like we've correctly recovered five of our key bytes. You can actually try something similar across the whole message, just moving your crib one character at a time along the length of the message and seeing if anything sensible pops out. This is known as 'crib dragging'. You can also try common two-letter combinations (Bigrams), such as "th" or "he" or "in" and do the same thing.

Fine. So we can use crib dragging to find key bytes hidden amongst our XORed messages, and the more cribs we try against the more XOR combinations of messages, the more key bytes we're likely to find. Great! Progress!

But hang on a minute. This all seems a bit imprecise, and an awful lot of work. I was expecting this to be an exact science, not, what appears to the non-cryptanalyst layman to be, erm, how shall I put it?... "guessing".

Surely there must be a better way?

At this point I had something of a minor moral dilemma... The course is conducted under an 'Honour Code', in which you agree that all your answers are 'all your own work'. I wanted to go and do some research and see if anyone else had come to the same conclusion, but didn't want to taint my 'work' with someone else's ideas. The crib-dragging stuff is all discussed in the video lectures so there's no problem with that, but what if I come across some other technique? In the end, I decided to spend a couple of hours working on a solution, and then only after submitting my results going and taking a look to see if anyone had done the same.  I even came to the conclusion that once I'd decided on a specific methodology I could also go and search to see if someone had already coded it as it could be argued that I don't need to write every line of code as long as I independently came up with the concept (after all, I didn't write the python interpreter, and I don't feel guilty for using that bit of code). However, in the end I didn't find any examples of either discussion or implementation of the following method, so the question simply didn't arise. All the discussions I found pointed to 'crib dragging' as the solution, and I couldn't find any useful code at all for actually implementing a one time pad attack of any type (of course, this is not to say they're not out there. I just didn't find them).

I've previously discussed the trap of following a path that appears to be giving quick results and not bothering to look any further. This is a common thing, and I call it a trap because although it often leads to the desired outcome, it can mean you spend a lot more time and effort than was actually necessary, and which could have been avoided had you taken a step back and asked some more questions before diving in and going for the prize.

In this case, we got a quick answer by finding some possible key bytes through locating spaces in the messages, and it would be tempting to then dive in and start crib-dragging and/or guessing at the pairs of potential keys to see if we can start to recover all of the message.

I often find it helpful at this point to think about the original question that revealed something to us and ask the secondary question: "what didn't it reveal?". This will give us another goal, which, if we solve it, may make our life a whole lot easier.

So in this case, the thing that was revealed was that we could locate a space in message A or message B. The thing that wasn't revealed was which of the two messages the space was in. Was it message A or message B? Our goal, then, is to determine which message the space was in. If we can do that, we can go right back to extracting single key bytes without any guess work.

So how do we determine which message the space was in? Let's look at the process in detail. For this illustration, I don't care what the actual message is, just where the spaces are, so I'll represent a character with 'X' and a space with '_'. If we detect a space, we'll mark that with a 'Y'.

Say message A looks  like this:


and message B looks like this:


then if we XOR them together and check for the presence of a space, we'll get:

  S: ---YY----Y--Y-Y----

As expected we've found some spaces, but we don't know whether they were in A or B. So what if we now do the same thing again, only this time we XOR message A with message C: 

  S: -Y-Y---Y-Y-YY------

Now all we need to do is compare our 'hits', and since the only constant in both tests was message A, that should mean that any hits that are in the same location in both test results must have come from message A. We'll mark these with an 'M':

  S: ---YY----Y--Y-Y----
  S: -Y-Y---Y-Y-YY------
  M: ---M-----M--M------

and if we compare our M line with our original message A:

  M: ---M-----M--M------

we can see we were spot on. Bingo! We now have an accurate method for recovering every key byte wherever there was a space in message A. Of course, we can repeat the process for every message we've got, replacing message A with message B and so on, until we've tried all possible combinations, and then we will have recovered every key byte wherever there was a space in ANY message. Hopefully, that should be quite a lot of the key bytes.

This is all very well in theory, but what happens in practice? It's a very simple algorithm, so it was easy to knock up a test program in python (have I mentioned that python rocks? :) This program takes the ciphertexts as hex value arguments and then performs the above process using each message as the 'master' and building up the key. It then outputs all the messages as plaintext, decrypting only those bytes that it thinks it's recovered the key for. If we feed it the eleven ciphertexts in the course example, we get:

manypad.py 315c4eeaa8b5f8aaf9174145bf43e1784b8fa00dc71d885a804e5ee9fa40b16349c146fb778cdf2d3aff021dfff5b403b510d0d0455468aeb98622b137dae857553ccd8883a7bc37520e06e515d22c954eba5025b8cc57ee59418ce7dc6bc41556bdb36bbca3e8774301fbcaa3b83b220809560987815f65286764703de0f3d524400a19b159610b11ef3e 234c02ecbbfbafa3ed18510abd11fa724fcda2018a1a8342cf064bbde548b12b07df44ba7191d9606ef4081ffde5ad46a5069d9f7f543bedb9c861bf29c7e205132eda9382b0bc2c5c4b45f919cf3a9f1cb74151f6d551f4480c82b2cb24cc5b028aa76eb7b4ab24171ab3cdadb8356f 32510ba9a7b2bba9b8005d43a304b5714cc0bb0c8a34884dd91304b8ad40b62b07df44ba6e9d8a2368e51d04e0e7b207b70b9b8261112bacb6c866a232dfe257527dc29398f5f3251a0d47e503c66e935de81230b59b7afb5f41afa8d661cb 32510ba9aab2a8a4fd06414fb517b5605cc0aa0dc91a8908c2064ba8ad5ea06a029056f47a8ad3306ef5021eafe1ac01a81197847a5c68a1b78769a37bc8f4575432c198ccb4ef63590256e305cd3a9544ee4160ead45aef520489e7da7d835402bca670bda8eb775200b8dabbba246b130f040d8ec6447e2c767f3d30ed81ea2e4c1404e1315a1010e7229be6636aaa 3f561ba9adb4b6ebec54424ba317b564418fac0dd35f8c08d31a1fe9e24fe56808c213f17c81d9607cee021dafe1e001b21ade877a5e68bea88d61b93ac5ee0d562e8e9582f5ef375f0a4ae20ed86e935de81230b59b73fb4302cd95d770c65b40aaa065f2a5e33a5a0bb5dcaba43722130f042f8ec85b7c2070 32510bfbacfbb9befd54415da243e1695ecabd58c519cd4bd2061bbde24eb76a19d84aba34d8de287be84d07e7e9a30ee714979c7e1123a8bd9822a33ecaf512472e8e8f8db3f9635c1949e640c621854eba0d79eccf52ff111284b4cc61d11902aebc66f2b2e436434eacc0aba938220b084800c2ca4e693522643573b2c4ce35050b0cf774201f0fe52ac9f26d71b6cf61a711cc229f77ace7aa88a2f19983122b11be87a59c355d25f8e4 32510bfbacfbb9befd54415da243e1695ecabd58c519cd4bd90f1fa6ea5ba47b01c909ba7696cf606ef40c04afe1ac0aa8148dd066592ded9f8774b529c7ea125d298e8883f5e9305f4b44f915cb2bd05af51373fd9b4af511039fa2d96f83414aaaf261bda2e97b170fb5cce2a53e675c154c0d9681596934777e2275b381ce2e40582afe67650b13e72287ff2270abcf73bb028932836fbdecfecee0a3b894473c1bbeb6b4913a536ce4f9b13f1efff71ea313c8661dd9a4ce 315c4eeaa8b5f8bffd11155ea506b56041c6a00c8a08854dd21a4bbde54ce56801d943ba708b8a3574f40c00fff9e00fa1439fd0654327a3bfc860b92f89ee04132ecb9298f5fd2d5e4b45e40ecc3b9d59e9417df7c95bba410e9aa2ca24c5474da2f276baa3ac325918b2daada43d6712150441c2e04f6565517f317da9d3 271946f9bbb2aeadec111841a81abc300ecaa01bd8069d5cc91005e9fe4aad6e04d513e96d99de2569bc5e50eeeca709b50a8a987f4264edb6896fb537d0a716132ddc938fb0f836480e06ed0fcd6e9759f40462f9cf57f4564186a2c1778f1543efa270bda5e933421cbe88a4a52222190f471e9bd15f652b653b7071aec59a2705081ffe72651d08f822c9ed6d76e48b63ab15d0208573a7eef027 466d06ece998b7a2fb1d464fed2ced7641ddaa3cc31c9941cf110abbf409ed39598005b3399ccfafb61d0315fca0a314be138a9f32503bedac8067f03adbf3575c3b8edc9ba7f537530541ab0f9f3cd04ff50d66f1d559ba520e89a2cb2a83 32510ba9babebbbefd001547a810e67149caee11d945cd7fc81a05e9f85aac650e9052ba6a8cd8257bf14d13e6f0a803b54fde9e77472dbff89d71b57bddef121336cb85ccb8f3315f4b52e301d16e9f52f904

building masterkey:



We_can aac_or _he num_er __ wi_______tu____mputer__ We_can also factor the number____w_th a ____tra_n___to_ba__ t_r_e __me_ __R___r________
Eu_er whul_ pr_bably _njo__tha_______is____orem b__ome_ a corner stone of crypto ____n_nymou____ Eu_e___ t_eo__m
Th_ nicb t_ing_about _eey__q i_______e ____tograp__rs _an drive a lot of fancy ca____ _an Bo___
Th_ cipoer_ext_produc_d b__a w_______ry____n algo__thm_looks as good as ciphertex____o_uced ____ st_o___en_ry__io_ _lg__it_m__ ___l__________a__
Yo_ don t _ant_to buy_a s__ of_______ys____m a gu__who_specializes in stealing ca____ _arc R____ber_ ___me_ti__ o_ _li__er
Th_re aue _wo _ypes o_ cr__tog_______ t____which __ll _eep secrets safe from your____t_e sis____ an_ ___t _hi__ w_l_ k__p _e__e___s__________o__ ___e_______br__u__ _____i__
Th_re aue _wo _ypes o_ cy__ogr_______ne____t allo__ th_ Government to use brute f____ _o bre____he _o___ a_d __e _h_t __qu_r__ ___ __________ __ ___ _______  __ __ _____ ________________
We_can tee_the_point _her__the_______s ____ppy if__ wr_ng bit is sent and consume____r_ powe____om _h___nv_ro__en_ _ A__ S_a__r
A _privfte_key_  encr_pti__ sc_______at____ algor__hms_ namely a procedure for ge____t_ng ke____a p_o___ur_ f__ e_c_yp__ng_ __d___p__________o__d___y_______
 T_e Coici_e O_fordDi_tio__ry _______de____es cry__o a_ the art of  writing o r s____n_ code___
Th_ secuet_mes_age is_ Wh__ us_______tr____cipher__nev_r use the key more than on__

which, if I do say so myself, is not bad for a first pass. Of the 83 bytes in the final 'challenge' message, we've decoded 60. That's 72%. Sweet! :)

But it's not perfect. There are large chunks missing where there clearly would have been at least one space, and, indeed, there are smaller chunks with what was clearly a space, such as "We_can" that, according to our theory, should have been detected. There are also very obviously incorrect bytes: "privfte" is almost certainly actually "private", and "secuet" is probably "secret". So where did we go wrong? Well, it turns out that it's not just XORing with a space that will produce '[a-zA-Z]'. Some other combinations of punctuation and/or numerical characters will do the same. This means we get some false positives that are messing up our results. However, not to worry. We now have enough plaintext that we are no longer blindly guessing, but rather filling in the blanks and correcting the errors. To do this, all we need to do is pick an obviously wrong byte, XOR the original message with the correct value and that will reveal the correct key byte. We then re-process all the messages with the updated key byte and all of them will be one character more correct. Each time we do this, the remaining errors or omissions should become a little more obvious and the true texts will quickly reveal themselves. A bit like filling in a crossword, as opposed to, erm... guessing.

Although this bit is 'easy', it's also tedious, fiddly and time-consuming. It involves a lot of looking up of hex values, recalculating XORs and counting byte positions etc., so I looked for a tool that would help me do it more efficiently. Again, I didn't find one. Accordingly, I give you 'cribtastic'. A tool for manually tweaking cribs and generally playing with ciphers. It also implements the above automated attack and I hope will evolve to do any others I come across during the course of the class.

It works like this:

You invoke it in the same was as my test program above, with one additional argument: the key (or partial key) if you know it. If you don't, just pass an empty string. This will then pop up a GUI...

The upper section shows the key value, followed by the ciphertexts and the decrypted plaintext. If a key value is 'known', it will be highlighted in green, and if not, in red.

Since we're starting with a blank key, we can hit the 'OTP' button in the commands window and it will perform the 'manypad' attack as above. Discovered key bytes and plaintexts will then be populated:

However, we can see that some of the bytes have not decrypted, and some of them are incorrect, so we can now edit them. Just click on an obviously incorrect value and type in the corrected version. For example, the first plaintext starts "We_can" which should probably be "We can", so we can simply replace the unknown key value (byte 3 of the key is highlighted in red) with a space. The program will recalculate the key value based on this new entry, reprocess all the ciphertexts, and we can then see how the rest of the plaintexts look using this new value:

Similarly, if we edit the obviously wrong "privfte" to "private" in the ninth plaintext:

and so on... As discussed, the more we correct the easier the rest of the errors are to spot as there will always be at least one word or space that stands out.

There are two other features worth mentioning. As you can see, when we edited the plaintext values, the corresponding key bytes were automatically 'locked'. This simply protects them from being recalculated if you were to hit the 'OTP' button again. The other feature is the set of tick-boxes to the left of the Cipher and Plaintext rows. These allow you to include or exclude them from the attack calculations. By experimenting with which ciphers are XORed together during the attack we may find that we can remove some of the false positives and reveal more of the key bytes.

Here we can see two chunks that have not decrypted at all:

Through experimentation, I found that if I lock all the key values I'm happy with and exclude the 10th cipher from the attack profile, I get this when I repeat the OTP attack:

and we've revealed a further 8 bytes of the key.

The tool was pretty trivial (if a little time-consuming) to write - it's just a bunch of text boxes, after all - but was definitely worth it with regards to the time and effort it saved once working on the actual problem. When put to use, recovery of the challenge text took less than three minutes, and it was possible to recover all messages excluding the last 14 bytes of the longest one with as good as absolute certainty that all solutions were correct - i.e. we recovered all bytes where there was any overlap. Fun stuff!

For what it's worth, the code is published here. As always, it's very much a work in progress (i.e. a filthy hack), and may turn into something more elegant and well written if it proves useful (and I may even implement crib-dragging :)...