Level 6: Break repeating-key XOR
Task
It is officially on, now.
This challenge isn't conceptually hard, but it involves actual error-prone coding. The other challenges in this set are there to bring you up to speed. This one is there to qualify you. If you can do this one, you're probably just fine up to Set 6.
There's a file here. It's been base64'd after being encrypted with repeating-key XOR.
Decrypt it.
Here's how:
Let KEYSIZE be the guessed length of the key; try values from 2 to (say) 40.
Write a function to compute the edit distance/Hamming distance between two strings. The Hamming distance is just the number of differing bits. The distance between:
this is a test
and
wokka wokka!!!
is 37. Make sure your code agrees before you proceed.
For each KEYSIZE, take the first KEYSIZE worth of bytes, and the second KEYSIZE worth of bytes, and find the edit distance between them. Normalize this result by dividing by KEYSIZE.
The KEYSIZE with the smallest normalized edit distance is probably the key. You could proceed perhaps with the smallest 2-3 KEYSIZE values. Or take 4 KEYSIZE blocks instead of 2 and average the distances.
Now that you probably know the KEYSIZE: break the ciphertext into blocks of KEYSIZE length.
Now transpose the blocks: make a block that is the first byte of every block, and a block that is the second byte of every block, and so on.
Solve each block as if it was single-character XOR. You already have code to do this.
For each block, the single-byte XOR key that produces the best looking histogram is the repeating-key XOR key byte for that block. Put them together and you have the key.
This code is going to turn out to be surprisingly useful later on. Breaking repeating-key XOR ("Vigenere") statistically is obviously an academic exercise, a "Crypto 101" thing. But more people "know how" to break it than can actually break it, and a similar technique breaks something much more important.
No, that's not a mistake.
We get more tech support questions for this challenge than any of the other ones. We promise, there aren't any blatant errors in this text. In particular: the "wokka wokka!!!" edit distance really is 37.
Explanation
This challenge involves decrypting a file that has been encrypted using repeating-key XOR and then encoded in Base64. The goal is to reverse the process by:
Decoding the Base64-encoded file to get the ciphertext.
Guessing the key length (KEYSIZE) by analyzing the Hamming distance between blocks.
Breaking the ciphertext into blocks based on the guessed key length.
Solving each block as a single-byte XOR problem to recover the key.
Reassembling the key and decrypting the ciphertext.
What is Hamming distance?
Hamming distance is a metric that measures the number of differing bits between two strings of equal length. In other words, it counts how many bits need to be changed (from 0 to 1 or 1 to 0) to turn one string into the other.
How It works:
Convert both strings to binary.
XOR the two binaty strings bit by bit.
Count the number of 1s in the XOR result.
The final count represents de Hamming distance.
Resolution
First, we are going to create a file named task6.py
Understanding the code
This script contains six main functions:
fixed_xor(a,b)
This function performs a bitwise XOR operation between two byte strings (
a
andb
).If the strings have different lenghts, it returns an error. Otherwise, it returns the XOR result.
max_frequency_score(plaintext_list)
This function receives a list of possible deciphered texts and calculates a "frequency score" for each one based on the frequency of common English letters (using a predefined frequency table at the beginning of the script).
It returns the text with the highest score, which is most likely to be readable English.
single_byte_xor_bruteforce(ciphertext)
This function performs a brute-force attack to decipher a ciphertext encrypted with a single-byte XOR key.
It tries all possible keys from 0 to 255, decrypting the ciphertext with each one.
It calculates the frequency score for each result.
Finally, it selects and returns the highest-scoring text, which is most likely to be the original plaintext.
find_hamming_distance(a,b)
This function calculates the Hamming distance between two byte strings (a and
b
).The Hamming distance measures how many bits differ between the two strings.
It's useful for finding how similar two fragments of text are, which helps determine the key size in the Vigenère cipher.
find_vigenere_keysize(ciphertext)
This function estimates the key size used to encrypt the ciphertext with the Vigenère cipher.
It calculates the Hamming distance between consecutive ciphertext blocks for various key sizes (typically 2 to 40).
It normalizes the distance by dividing it by the key size.
The key size that minimizes the average normalized distance is most likely the correct one.
A smaller value usually indicates a repetitive pattern that is easier to decipher.
break_vigerene(ciphertext)
This is the main function to break the Vigenère cipher.
It first finds the key size by calling
find_vigenere_keysize(ciphertext)
.Then, it splits the ciphertext into blocks where each block contains the bytes corresponding to one key byte.
It performs a brute-force attack on each block using
single_byte_xor_bruteforce(ciphertext)
.Finally, it reconstructs the original plaintext by combining the deciphered blocks.
For the last, it downloads the .txt provided by Cryptopals, decodes in base64, calls to the break_vigerene(ciphertext)
and prints the decipher text.
Result
The result is a very popular song by Wild Cherry — the perfect way to relax after completing this challenge.

Last updated