Elliot Alderson hides secret information in audio CD files. However, the technique used by the fictional hacker protagonist of “Mr Robot” is far from being a TV whimsy. This is just one of the many steganography techniques used by hackers and cybercriminals to evade security systems.

From the Greek steganos (hidden) and graphos (writing), steganography is a method of hiding data. To analyze how to best handle this surreptitious threat, we spoke with Daniel Lerch, who has a PhD in Computer Science from the Universitat Oberta de Catalunya (UOC), and is one of the top steganography experts in Spain.

Panda Security: How would you define steganography? How is it different from cryptography?

Daniel Lerch: Steganography studies how to hide information in a carrier object (an image, an audio file, a text or a network protocol). While in cryptography the intention is that the message sent cannot be read by an attacker, in steganography the goal is to hide even the fact that any communication is taking place.

The two sciences are not mutually exclusive. In fact, steganography usually uses cryptography to encrypt the message before hiding it. But their objectives are different: not everyone who needs to protect information also needs to hide it. Steganography is therefore an additional layer of security.

PS: Who would benefit more from steganography: cybercriminals or security providers?

DL: Without a doubt, cybercriminals. Those responsible for the security of companies and institutions do not need to hide their communications. To keep them safe, cryptography is enough.

Steganography is a tool of great interest for different types of criminals, since it allows communication without being detected. Typical examples are communications between terrorist cells, the dissemination of illegal material, the extraction of business secrets, or its use as a tool to hide malware or the commands that control that malware remotely.

PS: How has this technique evolved in recent times?

DL: The evolution has varied depending on the medium in which steganography is applied.

The medium that has evolved the most is steganography in images. Images are so difficult to model statistically that it is very easy to make changes to them without anyone noticing. For example, the value of a pixel in a grayscale image can be represented by one byte, that is, a number between 0 and 255. If that value is changed by one unit (hiding one bit), the human eye cannot perceive it. More importantly, statistical analysis of the image cannot easily detect the alteration either. Images, like video and audio, are an excellent medium for hiding data.
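
The pixel example can be illustrated with a minimal sketch of least-significant-bit (LSB) embedding. It is a deliberately naive illustration, not a technique attributed to the interviewee: the function names and the synthetic `cover` list are assumptions made for this example, and a real tool would read pixels from an actual image file and choose the modified pixels far more carefully.

```python
def bytes_to_bits(data):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def bits_to_bytes(bits):
    """Pack a list of bits (length a multiple of 8) back into bytes."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


def embed_bits(pixels, bits):
    """Return a copy of pixels with one message bit stored in each pixel's LSB."""
    if len(bits) > len(pixels):
        raise ValueError("message is longer than the cover can hold")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the message bit.
        stego[i] = (stego[i] & 0xFE) | bit
    return stego


def extract_bits(pixels, count):
    """Read back the first `count` hidden bits."""
    return [p & 1 for p in pixels[:count]]


# Synthetic 8-bit grayscale "image": 64 pixel values between 0 and 255 (assumed data).
cover = [(37 * i + 11) % 256 for i in range(64)]
secret = b"hi"

stego = embed_bits(cover, bytes_to_bits(secret))
recovered = bits_to_bytes(extract_bits(stego, 8 * len(secret)))

# No pixel moves by more than one gray level.
max_change = max(abs(a - b) for a, b in zip(cover, stego))
```

Each modified pixel differs from the original by at most one unit, which is why the change is invisible to the eye. Encrypting `secret` before embedding, as described earlier in the interview, would be a separate step applied to the bytes before they are turned into bits.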

Another medium that has received a lot of attention is steganography in network protocols. However, unlike images, network protocols are well defined. If we change information in a packet it is noticeable, so there is less wiggle room when it comes to hiding data. Although these techniques may seem easy to detect at first glance, they can be effective because of how difficult it is to analyze the large volume of traffic in existing networks.

One of the oldest carriers, and the one that has evolved least in the digital age, is text. However, steganography in text could make a significant leap thanks to machine learning. In the techniques developed in recent years, the process of hiding information is tedious and requires manual input from the user to produce a harmless-looking text that makes sense and carries a hidden message. However, current advances in deep learning applied to natural language processing (NLP) allow us to generate increasingly realistic texts, so it is possible that we will soon see steganography in text that is really difficult to detect.

PS: What applications does steganalysis have in the field of computer security? What techniques are usually used?

DL: From the point of view of business security, the main applications are the detection of malware that uses steganography to hide itself and the detection of malicious users trying to extract confidential information.

From the point of view of national security agencies, the main applications of steganalysis are the detection of terrorist or espionage communications.

Most of the steganography tools that can be found on the Internet are unsophisticated and could be detected with simple, well-known attacks. Even so, there are no quality public tools that automate the process of detecting steganography in network protocols, images, video, audio, text and so on.

Maybe this is not possible yet. For example, in the field of image steganography, the advanced techniques currently being researched can barely be detected, even using machine learning. If, in addition, the information is distributed across several carrier objects, significantly reducing the amount of information per carrier, its detection with current technology becomes practically impossible.
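
The effect of spreading a message over several carriers can be seen with a little arithmetic. The sketch below is only an illustration under assumed numbers (a 4096-byte payload and 512x512 grayscale images); `split_payload` and `bits_per_pixel` are hypothetical helper names, not part of any existing tool.

```python
def split_payload(payload, carriers):
    """Cut the payload into at most `carriers` consecutive chunks of similar size."""
    if carriers < 1:
        raise ValueError("at least one carrier is required")
    if not payload:
        return []
    size = -(-len(payload) // carriers)  # ceiling division
    return [payload[i:i + size] for i in range(0, len(payload), size)]


def bits_per_pixel(payload_bytes, carriers, pixels_per_carrier):
    """Average number of hidden bits per pixel when the payload is spread evenly."""
    return (payload_bytes * 8) / (carriers * pixels_per_carrier)


payload = bytes(4096)          # assumed payload: 4096 bytes
pixels = 512 * 512             # assumed carrier: 512x512 grayscale image

chunks = split_payload(payload, 16)
rate_single = bits_per_pixel(len(payload), 1, pixels)    # everything in one image
rate_spread = bits_per_pixel(len(payload), 16, pixels)   # spread over 16 images
```

With these assumed figures, one image would carry 0.125 bits per pixel, while each of 16 images would carry 16 times less. The fewer pixels touched per carrier, the weaker the statistical trace a detector has to work with.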

PS: What role do you believe that steganography will play in the coming years? Will it be used more as an attack weapon, or a defense tool?

DL: Steganography as a defense tool would be unusual, although there are examples, such as the extraction of information by activists in a totalitarian country.

The main role of steganography in the next few years will be seen in its application as a tool to hide malware and to send control commands to the malware. This is already being done, although with fairly rudimentary techniques. The use of modern steganography techniques to hide malicious code will greatly hinder detection, forcing security tools to use advanced steganalysis techniques.

PS: What advice would you give to a computer security professional who is thinking of using steganalysis?

DL: They would probably be interested in detecting malware or data exfiltration. The first thing is to stay up to date: to know what tools exist and when and how to use them. Then it comes down to practice: test and validate the technologies you implement using a wealth of data.

If you use machine learning to perform steganalysis, you must be careful about the data you use to train the system. The model has to be able to make predictions on data it has never seen. It would be a mistake to validate the model with data that was used to train it. In machine learning, it is often said that a model is only as good as its training data. So if our training data are not complete, the predictions that our model makes will not be reliable. The more data we use to train the model, the less likely it is that the data will be incomplete. Otherwise, we run the risk of developing tools that only work well in the laboratory, with our own test data.
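
The point about never validating on training data can be sketched as a simple hold-out split. This is a minimal illustration using only the Python standard library; the function name, the 25% validation fraction and the file names are assumptions made for the example, and a real project might rely on an established library for this step.

```python
import random


def split_train_validation(samples, validation_fraction=0.25, seed=0):
    """Shuffle the samples and split them into disjoint training and validation sets."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset: 20 image file names, some of which would contain hidden data.
samples = [f"image_{i:03d}.png" for i in range(20)]

train, validation = split_train_validation(samples)

# The validation set must share nothing with the training set.
overlap = set(train) & set(validation)
```

The model would be fitted on `train` only, and its accuracy measured on `validation`. An empty `overlap` is the property that matters: any sample present in both sets would make the model look better than it really is.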

PS: What role will artificial intelligence and machine learning play in business cybersecurity strategies?

DL: One example would be the automatic detection of security flaws in software. Another would be replacing antivirus software that detects the signatures of known viruses with an artificial intelligence system that identifies viruses based on common characteristics and behavior.

PS: In an environment in which there are more and more connected devices, what security measures should be adopted to protect the privacy of data at the enterprise level?

DL: Security measures in IoT devices have to be the same as those applied to other devices connected to the same network. It may seem strange to have to manage the security of the office thermostat at the same level as a PC, but from the point of view of an attacker, this is as good a point of access to the network as any other.