Ensure torch.save() deterministic output (#57536)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42163.
## 🔥 Pitch
Currently, the binary outputs produced by `torch.save()` are non-deterministic (as pointed out in https://github.com/pytorch/pytorch/issues/42163). This means that running a simple snippet that creates a tensor (or a model) twice will produce output files with different `md5` sums.
**Why does this occur?**
The cause of this behavior is that `obj._cdata` is used to identify a tensor and is written to the file, but the `_cdata` attribute depends on the runtime memory address and is therefore non-deterministic:
https://github.com/pytorch/pytorch/blob/a80b215a9ac089cdd3586060467615fa3c4bffe2/torch/serialization.py#L416
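The idea behind the fix can be illustrated with a small pure-Python sketch (hypothetical names; the actual change lives in `torch/serialization.py`): instead of keying each tensor by its runtime memory address, assign each distinct storage a sequential key the first time it is encountered, so the value written to the file depends only on traversal order.

```python
import itertools


class DeterministicIdMap:
    """Sketch: map each distinct storage to a stable, sequential key.

    Keys depend only on the order in which storages are first seen,
    never on memory addresses, so repeated runs produce identical keys.
    """

    def __init__(self):
        self._counter = itertools.count()
        self._keys = {}  # id(storage) -> sequential key

    def key_for(self, storage):
        # id() is used only for in-memory identity lookup; the value
        # written out is the deterministic counter, not id() itself.
        if id(storage) not in self._keys:
            self._keys[id(storage)] = next(self._counter)
        return self._keys[id(storage)]


id_map = DeterministicIdMap()
buf_a, buf_b = bytearray(8), bytearray(8)
assert id_map.key_for(buf_a) == 0  # first storage seen -> key 0
assert id_map.key_for(buf_b) == 1  # second storage -> key 1
assert id_map.key_for(buf_a) == 0  # same storage -> same key again
```

Because the counter resets on every run, two identical save calls emit identical keys, which is exactly what makes the file bytes reproducible.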
**Why does this matter?**
Reproducibility is essential for many Machine Learning projects.
For instance, when using [`dvc`](https://dvc.org/) you would expect that if none of the dependencies of a stage of a ML pipeline has changed, then running the same stage another time will produce the same binary output. For the reasons explained above, with `torch` this was not the case, so this PR tries to fix this issue.
## 📌 Content of this PR
### What changes?
- The `persistent_id()` function now returns a deterministic value, rather than `obj._cdata` (which depends on runtime).
- As a consequence, `torch.save(obj, "output.pt")` produces a deterministic output, i.e. the `md5` hash of `output.pt` is deterministic. See **Test 1** and **Test 2** below.
### What does not change?
- If an `obj` contains several tensors that share the same underlying data (e.g. they are views of the same tensor), the `obj_key` returned by `persistent_id()` is still going to be the same for all of them.
- As a consequence, serialization optimizes disk storage by storing only necessary tensors, rather than writing one tensor per view. See **Test 3** below.
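The deduplication behaviour described above can be mimicked with a toy serializer (hypothetical, pure Python, not the real `torch.save` code): tensors that share a storage map to the same key, so the underlying bytes are recorded only once no matter how many views reference them.

```python
def toy_serialize(objs):
    """Toy model of view-aware serialization (illustrative only).

    objs: dict mapping name -> (storage, view_metadata). A storage
    shared by several views is stored once; each view records only
    its storage key plus its own metadata.
    """
    storage_keys = {}  # id(storage) -> deterministic key
    storages = []      # payload, one entry per distinct storage
    manifest = {}
    for name, (storage, meta) in objs.items():
        if id(storage) not in storage_keys:
            storage_keys[id(storage)] = len(storages)
            storages.append(storage)
        manifest[name] = (storage_keys[id(storage)], meta)
    return manifest, storages


data = bytearray(16)  # one shared underlying buffer
manifest, storages = toy_serialize({
    "x": (data, "shape=(4, 4)"),              # original
    "y": (data, "shape=(4, 4), transposed"),  # view of x
    "z": (data, "shape=(16, 1)"),             # another view of x
})
assert len(storages) == 1  # bytes stored once, not three times
assert manifest["x"][0] == manifest["y"][0] == manifest["z"][0]
```

This mirrors why the two files in **Test 3** below have the same size: the extra entries in the second file are just small view records pointing at an already-stored storage.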
## How to test
### Test 1: snippet from https://github.com/pytorch/pytorch/issues/42163
Consider the following `snippet_1.py` (from https://github.com/pytorch/pytorch/issues/42163).
```python
import hashlib

import torch


def get_sha256_hash(file: str, chunk_size: int = 4096) -> str:
    hasher = hashlib.sha256()
    with open(file, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


file = "tensor.pt"
hashes = []
for _ in range(5):
    obj = torch.ones(1)
    torch.save(obj, file)
    hashes.append(get_sha256_hash(file)[:8])
    del obj

hash = hashes[0]
assert all(other == hash for other in hashes[1:])
print(hash)
```
On `master` this fails with an `AssertionError`:
```bash
$ python snippet_1.py
Traceback (most recent call last):
  File "save_tensor.py", line 84, in <module>
    assert all(other == hash for other in hashes[1:])
AssertionError
```
while on this PR branch you should get the following consistent behaviour:
```bash
$ for run in {1..2}; do python snippet_1.py; done
600a83cb
600a83cb
```
### Test 2: Deterministic save of `Tensor` and `nn.Module` instances
Consider the following `snippet_2.py`
```python
import torch

torch.manual_seed(0)

x = torch.tensor([8., 8., 5., 0.])
torch.save(x, "out_tensor.pt")

model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)
torch.save(model, "out_model.pt")
```
On the `master` branch, the `md5` hashes of `out_tensor.pt` and `out_model.pt` are non-deterministic; for instance you may get
```bash
$ for run in {1..2}; do python snippet_2.py; md5 out_*pt; done
MD5 (out_model.pt) = 92dca4a310b691e893f3cb41d64d5af1
MD5 (out_tensor.pt) = a4ef290583f50a9c203a42d0cfc078af
MD5 (out_model.pt) = de3cb9791a66af8aed77ed7224bd1d5c
MD5 (out_tensor.pt) = 3b8a6009d3a0be5b9dd94152dcc0c7cb
```
while on this PR branch you should get the following consistent behaviour:
```bash
$ for run in {1..2}; do python snippet_2.py; md5 out_*pt; done
MD5 (out_model.pt) = dba75fd50a190e4e7fa89b7a2477bab7
MD5 (out_tensor.pt) = 029f52f0706d6c813cc796d3cdcd3eb0
MD5 (out_model.pt) = dba75fd50a190e4e7fa89b7a2477bab7
MD5 (out_tensor.pt) = 029f52f0706d6c813cc796d3cdcd3eb0
```
### Test 3: Views of the same tensor are not re-written to file
Consider the following `snippet_3.py`.
```python
import torch
torch.manual_seed(0)
x = torch.rand(1_000, 1_000)
y = x.T
z = x.view(1_000_000, 1)
torch.save({"x": x}, "out_tensor_x.pt")
torch.save({"x": x, "y": y, "z": z}, "out_tensor_xyz.pt")
```
Both on the `master` branch and on this PR branch you should get two output files of the same size:
```bash
$ python snippet_3.py && du -sh out_tensor*pt && md5 out_*pt
3.8M out_tensor_x.pt
3.8M out_tensor_xyz.pt
MD5 (out_tensor_x.pt) = eda516d9156177b27bdc2a75c9064d9b
MD5 (out_tensor_xyz.pt) = 333b869f5b93ced7b8649ab1571eb8e3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57536
Reviewed By: bdhirsh
Differential Revision: D28304728
Pulled By: ailzhang
fbshipit-source-id: 49788e566a3cd2c6c36dc801e6bdd8f42c9459cb