Machine learning is one of the most exciting technologies in use today. However, it is undoubtedly controversial too. At the moment, that controversy does not really stem from the prospect of ultra-intelligent robots taking over the human race; rather, it stems from the fact that technology this powerful can be used just as effectively by criminals as it can by those with good intentions. In this piece, I would like to explore the darker side: how criminals use AI illegally.
Before computers could solve heuristic problems, many security systems were designed on the assumption that this would never change. The idea that a computer could guess a password, read a graphical captcha, or learn how real traffic behaves was simply not considered. Now we are surrounded by security that AI has pushed out of date.
Captchas and image classification
There are many, many occasions on which a system needs to confirm that a user is actually a human, because any functionality a system offers a human can also be driven or simulated by a computer program. If you attempt to log in to Facebook more than three times, you'll notice Facebook asking you to confirm you're a human, and not a program attempting to input millions of passwords a second. The way Facebook and many other services do this is via a captcha:
For years, these successfully separated programs from humans, until AI came along. Now a basic convolutional neural network, trained on a huge dataset of captcha images where each captcha has a specified target, can work out new captchas it is shown in the future. This is a fairly trivial example: the basic principles of neural networks are all that's required. And once captchas can be bypassed, brute-force attacks become far more feasible. You may also have come across the "select all pictures containing a bus" type of captcha, which is just as easy for AI to bypass. We all know how good object detection has become; Google even has it integrated into its search engine as a fundamental and successful feature.
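To make the idea concrete, here is a minimal sketch of such a convnet in PyTorch. The alphabet, captcha length, and image size are all assumptions for illustration; a real attack would train this on thousands of labelled captcha images, which this sketch omits.

```python
# Minimal sketch: a convolutional network that reads a fixed-length text
# captcha. Alphabet, length, and image size are hypothetical assumptions.
import torch
import torch.nn as nn

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # assumed character set
CAPTCHA_LEN = 5                                    # assumed captcha length

class CaptchaNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # One classification head per character position.
        self.head = nn.Linear(64 * 15 * 40, CAPTCHA_LEN * len(ALPHABET))

    def forward(self, x):                   # x: (batch, 1, 60, 160) grayscale
        z = self.features(x).flatten(1)
        return self.head(z).view(-1, CAPTCHA_LEN, len(ALPHABET))

net = CaptchaNet()
logits = net(torch.randn(2, 1, 60, 160))    # two stand-in captcha images
guess = "".join(ALPHABET[i] for i in logits[0].argmax(dim=1))
```

After training, `guess` would be the network's reading of the captcha; untrained, it is of course noise, but the shape of the attack is all here.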
Passwords with generative adversarial networks
Few of us have passwords that look like this: 5f2#V0”P?oz3
More of us have passwords that look like this: Kronenbourg1664
And the rest of us even have passwords that look like this: password
It is still the case that those who follow my first example are very safe from having their passwords guessed, whether by a human or by a GTX 1080 GPU. Everyone else, however, is vulnerable. So how can these passwords be guessed? Most simply, we could take a dictionary and try each word against a password input. We would succeed with a very small percentage of our attempts, thanks to those people who follow my last example. If you are one of them, I have complete confidence you'll change your password by the end of this article.
Now let's look at the more modern, and even more sinister, approach (using AI, obviously). Instead of using a dictionary, neural networks are used to produce a huge list of likely passwords, and it is this list that is applied to an authentication form. Taken from PassGAN: A Deep Learning Approach for Password Guessing, here is how that list can be produced:
If you are familiar with neural networks (which, I should mention, is quite important for the following few paragraphs), this may still look unusual. Instead of simply predicting an item from an input, we learn from data and then teach a generator to produce further examples of its own. This is known as a generative adversarial network: two neural networks are used, one to differentiate correct inputs from incorrect ones, and one that learns from this feedback to produce new, realistic data from random noise.
Firstly, we use an existing dataset that contains real human passwords, perhaps from a historic password leak that has since been made available. These collectively demonstrate what human passwords look like (a few capital letters, a date, a random number, a name, and so on).
Secondly, we use a noise generator (G) that, at first, outputs random data. These two possible inputs, fake and real passwords, are fed to the neural network (the discriminator, D). The targets are simple binary outputs: during training, the discriminator is told whether the inputted password is fake or real. On each forward pass, the produced output is compared to the target value (the truth), and the error is backpropagated to adjust the weights. The generator is also affected by this, as its random input noise starts to be shaped into password-like outputs.
Once the generator has been trained, any further noise fed into it will produce strings that look like passwords. So, if we leave it running for a few hours, we can compile a huge list of intelligently generated passwords.
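The steps above can be sketched as follows. Note this is a bare-bones toy, not PassGAN itself: the real system uses residual 1-D convolutions and a Wasserstein loss, while this sketch keeps only the generator/discriminator structure and the "noise in, password-like string out" sampling step. The alphabet, lengths, and layer sizes are assumptions.

```python
# Toy GAN over fixed-length character strings: D learns real vs fake,
# G learns to turn noise into password-like output. Sizes are illustrative.
import torch
import torch.nn as nn

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
MAX_LEN, NOISE_DIM = 10, 64

G = nn.Sequential(                      # noise -> per-position char logits
    nn.Linear(NOISE_DIM, 128), nn.ReLU(),
    nn.Linear(128, MAX_LEN * len(CHARS)),
)
D = nn.Sequential(                      # one-hot password -> real/fake score
    nn.Linear(MAX_LEN * len(CHARS), 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid(),
)
bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

def encode(pw):                         # pad and one-hot a real password
    x = torch.zeros(MAX_LEN, len(CHARS))
    for i, c in enumerate(pw[:MAX_LEN]):
        x[i, CHARS.index(c)] = 1.0
    return x.flatten()

real = torch.stack([encode("kronenbourg1664"), encode("password1")])
noise = torch.randn(2, NOISE_DIM)
fake = torch.softmax(G(noise).view(2, MAX_LEN, len(CHARS)), -1).flatten(1)

# One discriminator step: push real toward 1, fake toward 0.
d_loss = bce(D(real), torch.ones(2, 1)) + bce(D(fake.detach()), torch.zeros(2, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# One generator step: try to fool D into scoring fakes as real.
g_loss = bce(D(fake), torch.ones(2, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# After (much more) training, fresh noise yields candidate passwords:
sample = G(torch.randn(1, NOISE_DIM)).view(MAX_LEN, len(CHARS)).argmax(-1)
guess = "".join(CHARS[i] for i in sample)
```

Looping the last two lines with new noise is what produces the "huge list" described above; each draw is a fresh candidate password.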
Phishing with machine learning
Phishing is a very common form of hacking. Have you ever received an email that doesn't look quite right, but claims to be from your bank, phone provider, or a social media platform? Any novice programmer who knows a little HTML, combined with just a touch of backend code such as PHP, can pull this one off. It involves sending an email that is visually designed to look like, say, Facebook, and uses similarly formal language. It will claim you need to update, view, or change something, and ask for your login details to do so. Whatever you type in is sent to the criminal's server. So how does AI come into this?
Machine learning can improve phishing by crawling a platform, learning how it looks and how it communicates, and then mass-producing fake emails based on those observations, to be sent out automatically on a large scale. But this is not the only way. Hackers can also use the same principles described earlier for guessing passwords to guess email addresses. Millions of plausible email addresses can be produced, which increases the chance of reaching technically gullible people.
Many email services, notably Gmail, have advanced systems in place to detect phishing emails; however, machine learning can be used to create emails that evade these systems. The training set would be a compilation of emails, some of which failed to reach a user due to phishing detection, and others that got through. A neural network can learn how phishing is being detected by understanding which emails were caught and which were not. Future emails could then be generated following rules that the phishing detection does not catch.
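The learning step can be illustrated with something far simpler than a neural network: a tiny bag-of-words Naive Bayes model over "caught" versus "delivered" emails. The emails and labels below are invented for the example; the point is only that a model trained on delivery outcomes can score a draft before it is sent.

```python
# Sketch: learn which wording gets caught by a phishing filter, using a toy
# Naive Bayes over bag-of-words. All emails/labels here are invented.
from collections import Counter
import math

caught = ["verify your account now click here",
          "urgent your password expires click this link"]
delivered = ["hi just checking in about the meeting",
             "attached is the report you asked for"]

def counts(docs):
    c = Counter()
    for d in docs:
        c.update(d.split())
    return c

c_caught, c_deliv = counts(caught), counts(delivered)
vocab = set(c_caught) | set(c_deliv)

def log_odds_delivered(text):
    """Log-odds (with add-one smoothing) that a draft slips past the filter."""
    score = 0.0
    for w in text.split():
        p_d = (c_deliv[w] + 1) / (sum(c_deliv.values()) + len(vocab))
        p_c = (c_caught[w] + 1) / (sum(c_caught.values()) + len(vocab))
        score += math.log(p_d / p_c)
    return score
```

A negative score for a phrase like "click here to verify your password" flags it as likely to be caught, so the generator would rephrase and rescore until the draft scores positively.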
Finally, these are only three cases. Worryingly, there are many more in other areas, such as fraudulent advertising and the simulation of fake traffic. However, I like to think the use of AI in the legal world very much outweighs its use in the criminal world. Ironically, AI is being used to detect criminal activity in many impressive ways, from street policing to online fraud. To conclude: please change your password if a generative adversarial network could guess it; do not follow any links sent to you unless you have double-checked the sender's identity; and finally, do not use any of these techniques yourself to break the law!