
If you work with virtual infrastructures, it's only a matter of time before you ask yourself the same question: what do I do with all those snapshots that keep accumulating on the machines? In some IT teams, the philosophy is to keep them "just in case," while in others the norm is to delete them as soon as possible. The reality is that improper use of snapshots not only fails to improve security, it can also increase storage consumption, degrade performance, and even corrupt a VM.
To make good decisions you need to understand what a snapshot is, what it isn't, how it works under the hood, and which best practices keep snapshots from corrupting machines or filling up your storage array. Let's look at it calmly, but with technical rigor and our feet on the ground, as is done in real data centers.
What a snapshot really is (and what it isn't)
A snapshot, whether in VMware, Hyper-V, or a storage system, is a point-in-time capture of the state of a virtual machine or volume. It freezes the situation (disk, and optionally memory and application state) so you can return to that point if something goes wrong.
The key is that it is not a complete, standalone copy of the data. The snapshot acts as a "logical photo" of the base disk and then records the changes that occur from that point onward. This allows for very fast restoration, because it only requires redirecting pointers, without moving terabytes of data.
In many environments, snapshots have mistakenly been treated as if they were a backup system. This confusion is dangerous. The snapshot depends on the original disk, the storage array, and the entire chain of differential disks. If something fails in that chain, you might find that neither the snapshot nor the VM will boot.
On platforms like VMware, a distinction is made between memory snapshots (which also save the contents of RAM and the exact execution state) and quiesced or application-consistent snapshots, which ensure that the file system and certain applications are in a consistent state using tools such as VMware Tools.
How a snapshot works at the block and disk level
To understand why overusing snapshots can corrupt machines or degrade performance, it's helpful to visualize what happens to the disk blocks. Imagine a VM with a single virtual disk made up of blocks A, B, C, D, and E. As long as there are no snapshots, any changes are written directly to that disk. If block A is modified, it becomes A', and if we add a new block F, the disk grows.
When you create the first snapshot, the original disk is "frozen" as read-only. From that moment on, all changes are redirected to a new difference disk (an additional VMDK in VMware, for example). The snapshot does not duplicate all the data; it only starts storing the modified blocks on that child disk.
If, after that first snapshot, you change A' and B and add a new block G, the base disk still holds A', B, C, D, E, and F, while A'' (the new A), B', and G are stored on the differential disk. The logical size of the VM is now 7 blocks, but in the datastore you are already occupying 9 blocks between base and differential. That's where the silent growth begins.
If you create another snapshot, the current differential disk is also frozen, and a second difference disk is created where the new changes (A''', D', E', and H) will be stored. The result is a chain of linked disks where each one depends on the previous one. Continuing the example, the base disk occupies 6 blocks and the two differentials already total 7 (3 in the first, 4 in the second), so the VM consumes 13 blocks in the datastore for a logical size of 8: the snapshots take up more space than the original disk.
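The block arithmetic in this example can be checked with a minimal sketch. It models each disk in the chain as a plain dictionary; the names `base`, `delta1`, `delta2`, and both helper functions are illustrative assumptions, not how any hypervisor stores its data.

```python
# Hypothetical model: each disk maps a block name to the version it holds.
base = {"A": "A'", "B": "B", "C": "C", "D": "D", "E": "E", "F": "F"}
delta1 = {"A": "A''", "B": "B'", "G": "G"}               # after snapshot 1
delta2 = {"A": "A'''", "D": "D'", "E": "E'", "H": "H"}   # after snapshot 2
chain = [base, delta1, delta2]                           # oldest first


def datastore_blocks(chain):
    """Blocks physically stored: every disk keeps its own copies."""
    return sum(len(disk) for disk in chain)


def logical_blocks(chain):
    """Blocks the guest sees: one current version per block name."""
    visible = {}
    for disk in chain:          # newer disks override older ones
        visible.update(disk)
    return len(visible)


print(datastore_blocks(chain[:2]), logical_blocks(chain[:2]))  # 9 7
print(datastore_blocks(chain), logical_blocks(chain))          # 13 8
```

The gap between the two numbers is the space held only to preserve old versions, and it widens with every block rewritten while a snapshot exists.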
This architecture has another side: performance. When the system needs to read a block, for example C, it first looks at the most recent differential disk. If the block is not there, it moves to the next disk in the chain, and so on until it reaches the base disk. With long chains, each read involves multiple hops, resulting in higher latencies and slower operations, especially with mechanical disks and very active VMs.
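The lookup order described above can be sketched as follows. This is a simplified model that assumes one dictionary per disk and counts how many disks are consulted; the function name `read_block` is hypothetical, and real hypervisors use block maps and caches rather than a linear scan.

```python
def read_block(chain, name):
    """Return (value, disks_checked), searching the newest disk first."""
    for hops, disk in enumerate(reversed(chain), start=1):
        if name in disk:
            return disk[name], hops
    raise KeyError(name)


# Hypothetical chain, oldest first: base disk plus two difference disks.
base = {"A": "A'", "B": "B", "C": "C", "D": "D", "E": "E", "F": "F"}
delta1 = {"A": "A''", "B": "B'", "G": "G"}
delta2 = {"A": "A'''", "D": "D'", "E": "E'", "H": "H"}
chain = [base, delta1, delta2]

print(read_block(chain, "A"))  # ("A'''", 1): found in the newest disk
print(read_block(chain, "G"))  # ('G', 2)
print(read_block(chain, "C"))  # ('C', 3): never modified, falls through to base
```

Blocks that were never modified are the ones that pay the full chain length, which is why long chains hurt even on VMs that change little data.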
Copy-on-Write, Redirect-on-Write and their impact
Snapshot implementations typically rely on two main techniques: Copy-On-Write (COW) and Redirect-On-Write (ROW). Both seek the same goal (preserving the previous state) but with different trade-offs between performance and fragmentation.
In Copy-On-Write based systems, when you modify an existing block, the system first copies the old block to the snapshot area and then writes the new data in its original position. This preserves the previous version but penalizes each write, because it becomes a read-copy-write sequence.
The Redirect-On-Write approach, used in many modern systems and some NAS and advanced file systems, works in reverse: the old data is left where it is, and the new data is written to a different physical location. Snapshots point to the "old" blocks, and the active volume updates its pointers to the new ones. Write performance is much better, but this comes at the cost of increased fragmentation.
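To make the difference concrete, here is a toy comparison of both techniques. It is a sketch under strong assumptions: a single snapshot taken when the volume is created, one I/O operation per block read or written, and hypothetical class names that do not correspond to any vendor's implementation.

```python
class CopyOnWriteVolume:
    """Old blocks are copied to a snapshot area before being overwritten."""

    def __init__(self, blocks):
        self.active = dict(blocks)        # live data, overwritten in place
        self.preserved = {}               # snapshot area
        self.snapshot_names = set(blocks)
        self.io_ops = 0

    def write(self, name, data):
        if name in self.snapshot_names and name not in self.preserved:
            self.preserved[name] = self.active[name]
            self.io_ops += 2              # read old block + copy it out
        self.active[name] = data
        self.io_ops += 1                  # write new data in place

    def snapshot_view(self):
        return {n: self.preserved.get(n, self.active[n])
                for n in self.snapshot_names}


class RedirectOnWriteVolume:
    """New data goes to a new location; only pointers change."""

    def __init__(self, blocks):
        self.store = list(blocks.values())                 # physical slots
        self.snapshot_map = {n: i for i, n in enumerate(blocks)}
        self.active_map = dict(self.snapshot_map)
        self.io_ops = 0

    def write(self, name, data):
        self.store.append(data)                            # new location
        self.active_map[name] = len(self.store) - 1        # pointer update
        self.io_ops += 1

    def snapshot_view(self):
        return {n: self.store[i] for n, i in self.snapshot_map.items()}


initial = {"A": "a0", "B": "b0"}
cow, row = CopyOnWriteVolume(initial), RedirectOnWriteVolume(initial)
for vol in (cow, row):
    vol.write("A", "a1")
    vol.write("A", "a2")
    vol.write("C", "c0")

print(cow.io_ops, row.io_ops)                     # 5 3
print(cow.snapshot_view() == row.snapshot_view()) # True
```

Both volumes preserve the same snapshot contents. The COW volume pays two extra operations the first time each snapshotted block changes, while the ROW volume keeps appending to new locations, which is where the fragmentation comes from.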
This fragmentation is not usually dramatic with all-flash storage, but in classic or hybrid HDD arrays, a large number of snapshots and sustained changes can degrade read speed. That's why many manufacturers recommend combining these technologies with automatic reorganization systems, or simply opting for flash if you want to work with many points in time.
Why snapshots are NOT backups
One of the most important messages is that, however convenient they may be, snapshots do not replace traditional backups. Technically they do not fulfill the same role, nor do they offer the same level of protection against disasters.
A full backup creates a replica of the data on another medium: another storage array, another datastore, tape, cloud, etc. The 3-2-1 rule can be applied to minimize risks (three copies, two different media, one off-site). If the data center burns down, the storage array breaks, or someone erases the volume, the copy is still there because it lives somewhere else.
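The 3-2-1 rule is simple enough to express as a check. The sketch below assumes a hypothetical inventory in which every copy of the data (counting the production copy) is described by its media type and whether it is off-site; the `DataCopy` record and its field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    location: str   # hypothetical label, e.g. "primary array"
    media: str      # e.g. "disk", "tape", "cloud"
    offsite: bool


def satisfies_3_2_1(copies):
    """Three copies, on at least two media types, at least one off-site."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


snapshots_only = [
    DataCopy("primary array", "disk", False),
    DataCopy("snapshot on primary array", "disk", False),
    DataCopy("second snapshot on primary array", "disk", False),
]
with_backups = [
    DataCopy("primary array", "disk", False),
    DataCopy("backup repository", "disk", False),
    DataCopy("tape vault", "tape", True),
]

print(satisfies_3_2_1(snapshots_only))  # False
print(satisfies_3_2_1(with_backups))    # True
```

Three snapshots on the same array count as three copies but fail the other two conditions, and because they share one array they are not truly independent copies either.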
The snapshot, on the other hand, depends on the same hardware and the same storage pool. If the storage array fails, if the datastore becomes corrupted, or if an attack wipes the volume, all the snapshots go down with it. Furthermore, as we have already seen, in many implementations all snapshots depend on a chain of disks; if one link in that chain becomes corrupted, you can lose all associated versions.
They are also not equivalent when it comes to recovering individual items. A snapshot is usually designed to return a complete VM to a point in time, not to restore a single file, a database table, or a mailbox in a granular way. That capability is offered by backup solutions that work with agents, application integrations, or backup browsers.
In serious environments, both approaches must be combined: snapshots for quick, short-term recovery points, and backups to provide history, long retention, out-of-band protection, and granular recovery.
Best practices for using snapshots to avoid corrupting machines
Assuming that a snapshot is not a backup, the next question is how to use them safely. There are a number of recommendations shared by manufacturers such as VMware, Microsoft, and storage providers that help avoid problems of corruption, uncontrolled space consumption, and performance drops.
The first rule is common sense: don't accumulate old snapshots if you're not going to revert to them. If you know you wouldn't restore a critical VM to a point in time from months ago (because its data state would no longer be relevant), that snapshot is unnecessary and only adds risk. The sensible thing to do is delete it.
Manufacturers typically recommend that each snapshot be kept for a limited time. VMware, for example, expressly discourages keeping a single snapshot for more than 72 hours, because the delta file grows as the VM makes changes. The longer it lives, the larger it becomes and the more expensive it will be to consolidate.
Another clear good practice is to limit the number of simultaneous snapshots per machine. Although technically you can create dozens (in VMware the maximum is 32 per chain), in practice it's best to stick to 2 or 3 at most. More points mean longer chains, higher read latency, and more complex consolidation.
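These two limits (age and count) lend themselves to an automated check. The sketch below works on a hypothetical inventory of `(vm_name, snapshot_name, created_at)` tuples, which you would have to export from your own management tooling; it does not call any vSphere or Hyper-V API, and the constants simply encode the 72-hour and 3-snapshot figures discussed above.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=72)      # assumed policy value
MAX_SNAPSHOTS_PER_VM = 3           # assumed policy value


def policy_violations(snapshots, now):
    """Return human-readable violations for a list of snapshot records."""
    violations = []
    per_vm = {}
    for vm, name, created_at in snapshots:
        per_vm[vm] = per_vm.get(vm, 0) + 1
        if now - created_at > MAX_AGE:
            violations.append(f"{vm}/{name}: older than 72 hours")
    for vm, count in per_vm.items():
        if count > MAX_SNAPSHOTS_PER_VM:
            violations.append(
                f"{vm}: {count} snapshots (limit {MAX_SNAPSHOTS_PER_VM})"
            )
    return violations


now = datetime(2024, 1, 10, 12, 0)
inventory = [
    ("db01", "pre-patch", now - timedelta(hours=80)),
    ("web01", "s1", now - timedelta(hours=5)),
    ("web01", "s2", now - timedelta(hours=4)),
    ("web01", "s3", now - timedelta(hours=3)),
    ("web01", "s4", now - timedelta(hours=2)),
]
for line in policy_violations(inventory, now):
    print(line)
# db01/pre-patch: older than 72 hours
# web01: 4 snapshots (limit 3)
```

A report like this is safer than automatic deletion: someone still decides whether each flagged snapshot can be removed.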
It is highly recommended to create snapshots when the VM is not subjected to very high I/O spikes. Taking the snapshot right in the middle of a burst of writes can cause the creation to fail or leave the system in an inconsistent state. For critical databases or applications, it's best to schedule snapshots during periods of lower activity or use application-consistent snapshots.
How long to keep them, how many to use, and how to manage them
Managing snapshots is not just about clicking "take snapshot" and forgetting about it. A clear policy on creation, use, and disposal is needed, especially in environments with dozens or hundreds of VMs.
In production, the most sensible approach is to create snapshots just before specific operations: configuration changes, system updates, application patches, internal migrations, etc. Once everything is validated as working correctly, the recommendation is to delete the snapshot without delay. The fewer and shorter the chains, the less that can go wrong.
It's also advisable to supplement manual management with automated cleanup mechanisms. In vCenter or Hyper-V managers, it's possible to periodically review the snapshot tree, see which snapshots are active, their size and age, and eliminate those that have exceeded the maximum time set in the internal policy.
It is essential to always delete snapshots using official tools (vSphere, Hyper-V Manager, storage array manufacturer's CLI, etc.) and let the system handle consolidation. Deleting them in any other way, or interrupting the consolidation process midway, can leave the disks in an inconsistent state and even corrupt the VM.
Regarding quantity, a balanced approach is to work with a maximum of one or two active points in sensitive environments, and a more relaxed approach (three or four) in laboratory environments, where the risk is lower and flexibility is valued more than extreme performance.
Impact on performance, space, and disk consolidation
Besides the risk of corruption, the most common side effect of snapshot abuse is poor performance. Each time you add a new difference disk, you also add one more hop to the read chain. This is especially noticeable in VMs with many random disk operations, such as databases or file servers.
In read operations, the hypervisor or storage array has to search for the requested block, starting with the most recent snapshot and working backward to the base disk. With two snapshots, the impact might be small, but with long chains and highly fragmented disks, the user experience can become frustrating.
In terms of storage, real-world examples are illustrative. It's relatively common to find VMs whose logical disk is, for example, 2 TB, but which take up 4 or 5 TB in the datastore because of snapshots forgotten for years. Every daily change to a database, every file that comes in or goes out, adds to the difference disks without anyone noticing until the storage array runs out of space.
When it's time to delete old snapshots, the consolidation process can be lengthy and delicate. Consolidation involves merging all the changes stored on the difference disks into their parent disk, one by one, until the entire chain is integrated into the base disk. On VMs with a large volume of data and years of changes, this translates to hours of intensive read/write operations.
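As a rough illustration of why consolidation cost scales with the size of the difference disks, the following sketch merges a chain into a single disk and counts how many blocks have to be read and rewritten. The dictionary-per-disk model and the `consolidate` function are assumptions for illustration, not a description of any hypervisor's merge algorithm.

```python
def consolidate(chain):
    """Merge difference disks into the base disk, oldest delta first.

    Returns the merged disk and the number of blocks that had to be
    copied from the difference disks.
    """
    merged = dict(chain[0])
    blocks_copied = 0
    for delta in chain[1:]:
        merged.update(delta)        # newer versions replace older ones
        blocks_copied += len(delta)
    return merged, blocks_copied


# Hypothetical chain, oldest first.
base = {"A": "A'", "B": "B", "C": "C", "D": "D", "E": "E", "F": "F"}
delta1 = {"A": "A''", "B": "B'", "G": "G"}
delta2 = {"A": "A'''", "D": "D'", "E": "E'", "H": "H"}

merged, copied = consolidate([base, delta1, delta2])
print(len(merged), copied)   # 8 7
print(merged["A"])           # A'''
```

Every block held in a difference disk must be processed once, so a snapshot that has absorbed years of writes means a merge proportional to all of that data.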
This prolonged consolidation not only affects performance; any interruption or error in the middle of the process can also leave the machine in an inconsistent state. Hence the importance of not letting snapshots grow indefinitely and of planning consolidation tasks during periods of lower load and with adequate monitoring.
Snapshots, ransomware and cyber resilience
In recent years, ransomware has added another layer of complexity to the issue. Attackers are no longer content with simply encrypting user files: they search for administrator credentials, locate backups, delete snapshots, and try to leave you with no clean point to revert to.
This highlights the limitations of models based solely on traditional backups without extra protection. If the backups are accessible using the same compromised credentials, or if the snapshots can be easily deleted from the administration panel, the attacker can leave the environment without a safety net in a matter of minutes.
In this context, many storage manufacturers are opting for immutable snapshots integrated into the storage array itself. These points in time, once created and locked, cannot be modified or deleted during the retention period, not even by accounts with maximum privileges. It's a kind of WORM (Write Once, Read Many) applied to snapshots.
Immutability provides a very powerful defense: even if the attacker gains administrative access, they will not be able to delete protected snapshots until their lock window expires. This ensures that you will have at least one set of clean recovery points to rebuild systems after the attack.
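The lock-window behavior can be modeled in a few lines. This is a conceptual sketch only: `ImmutableSnapshotStore` and `SnapshotLockedError` are hypothetical names, and a real array enforces the lock in firmware or in the storage operating system rather than in application code.

```python
from datetime import datetime, timedelta


class SnapshotLockedError(Exception):
    """Raised when a delete is attempted inside the retention window."""


class ImmutableSnapshotStore:
    def __init__(self):
        self._locked_until = {}

    def create(self, name, now, retention):
        self._locked_until[name] = now + retention

    def delete(self, name, now):
        # Deliberately no "is_admin" parameter: the lock applies to every
        # caller, whatever its privileges.
        if now < self._locked_until[name]:
            raise SnapshotLockedError(
                f"{name} is locked until {self._locked_until[name]}"
            )
        del self._locked_until[name]

    def names(self):
        return sorted(self._locked_until)


store = ImmutableSnapshotStore()
t0 = datetime(2024, 1, 1)
store.create("daily-01", t0, retention=timedelta(days=7))

try:
    store.delete("daily-01", now=t0 + timedelta(days=1))
except SnapshotLockedError as exc:
    print("refused:", exc)

store.delete("daily-01", now=t0 + timedelta(days=8))
print(store.names())  # []
```

The design point is that the delete path has no override: the only way to remove a snapshot is to wait for the window to expire.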
In addition to immutability, some advanced systems use entropy analysis and AI on the blocks written to detect typical patterns of mass encryption. When they detect an anomalous spike in random writes and abrupt changes in entropy, they can automatically trigger the creation of protected snapshots or even cut off write access to the suspicious client.
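The entropy signal mentioned here can be illustrated with Shannon entropy measured in bits per byte. The sketch below is a simplification: the 7.5 threshold and the function names are assumptions, and a single high-entropy block proves nothing by itself, since compressed files and media also score high. Real products look at trends across many writes.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, at most 8.0."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


def looks_encrypted(block: bytes, threshold: float = 7.5) -> bool:
    """Flag blocks whose byte distribution is close to uniform."""
    return shannon_entropy(block) > threshold


text_block = b"hello world " * 100         # few distinct byte values
uniform_block = bytes(range(256)) * 16      # stand-in for encrypted data

print(looks_encrypted(text_block))          # False
print(looks_encrypted(uniform_block))       # True
```

The uniform block is a deterministic stand-in for ciphertext: every byte value appears equally often, which is the statistical signature that well-encrypted data approaches.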
Storage snapshots vs. operating system snapshots
A snapshot managed on the hypervisor or in the storage array is not the same as operating system volume shadow copies (such as VSS in Windows). This distinction is also relevant in ransomware scenarios.
Many modern malware programs, upon entering a Windows server, execute commands to delete all shadow copies and prevent the system itself from being restored to a previous point using its native tools. This leaves the user without the easy option of "recovering a previous version".
However, snapshots managed on NAS or SAN are typically independent of the operating system. They are controlled out-of-band, directly from the storage itself. Even if the attacker deletes the VSS shadow copies within Windows, those storage array snapshots will remain intact, and as long as we have a well-defined restoration procedure, we can return to a previous point.
That is why modern architectures incorporate a double level of protection: snapshots in storage for resilience against attacks on the operating system, and external backups, with immutability and logical separation, to cover the scenario of physical disaster or deep compromise.
In any case, it must be emphasized: even the most sophisticated storage snapshots still depend on the system where they reside. If the storage array is lost or completely corrupted, its snapshots are lost too. This reinforces once again the need to maintain backup policies that comply with the 3-2-1 principle and, in critical environments, to even work with air-gapped copies.
Snapshots in test, development, and production environments
Not all environments benefit from snapshots in the same way, nor should they be managed in the same way. In laboratories and development environments, snapshots are pure gold, because they allow you to test aggressive changes, deploy experimental software versions, or simulate failures and return to the starting point in seconds.
In these scenarios, there is usually less pressure for performance and more tolerance for risk. Even so, it's still sensible to limit excessively long chains and clean up old snapshots so as not to fill the storage array with technical junk that nobody remembers the purpose of.
In production, the philosophy should be much more conservative. Here, snapshots should be used as a temporary safety net around controlled changes: before a major patch, a database update, a sensitive configuration change, etc. Once stability is validated, the snapshot is deleted and protection is left to the backup system.
Application servers with multiple dependencies (external databases, web front ends, LDAP, etc.) require special care. Reverting only one of the nodes to an old snapshot can leave the whole set in an inconsistent state. Before using snapshots in this type of architecture, it is necessary to assess whether the application supports this kind of partial rollback or whether more granular restores are preferable.
Finally, in mission-critical systems with high availability and very strict RTO requirements, many organizations combine immutable snapshots in flash storage with mechanisms for instant recovery from backups. This achieves a balance between very low recovery times and robust protection against larger-scale disasters.
Managing snapshots effectively involves understanding that they are a powerful tool, but with a catch: when misused, they increase disk space consumption, degrade performance, and can leave you without a working VM just when you need it most. Used wisely, with short retention policies, short chains, planned consolidations, and always supported by a good backup system (ideally with immutability and logical separation), they become a key ally against human errors, risky changes, and ransomware attacks, without corrupting machines or compromising the stability of the environment.
