PersistentJSONDict survive with empty file by Jimilian · Pull Request #199 · frerich/clcache

Jimilian · 2016-08-10T12:15:49Z

Now PersistentJSONDict stops to work then it finds empty file.
Empty file could be generated due some issues (OOM killer, aborted builds in CI, etc).

codecov-io · 2016-08-10T12:20:40Z

Current coverage is 89.02% (diff: 100%)

Merging #199 into master will increase coverage by 0.05%

@@             master       #199   diff @@
==========================================
  Files             1          1          
  Lines          1006       1011     +5   
  Methods           0          0          
  Messages          0          0          
  Branches        164        166     +2   
==========================================
+ Hits            895        900     +5   
+ Misses           83         82     -1   
- Partials         28         29     +1

Powered by Codecov. Last update a84b2ae...166551a

webmaster128 · 2016-08-10T12:43:26Z

Looks good. Could you please

Move test to unittests. integrationtests are used to call clcache via the command line interface.
Check in an empty_file.txt in git and don't create it in the test. This simplifies the test code

webmaster128 · 2016-08-10T14:19:50Z

At least for Python 3.4, a ValueError is raised

If the data being deserialized is not a valid JSON document, a ValueError will be raised.

https://docs.python.org/3.4/library/json.html#json.load

Python 3.5 introduces a subclass json.JSONDecodeError

If the data being deserialized is not a valid JSON document, a JSONDecodeError will be raised.

https://docs.python.org/3.5/library/json.html#json.load

There is nothing documented for Python 3.3: https://docs.python.org/3.3/library/json.html#json.load

So I think we need to catch ValueError and hope it's the same for Python 3.3.

…lueError

Jimilian · 2016-08-10T14:30:17Z

@webmaster128, yes, it's same. I checked it with python3.3. I think it's better to squash my commits :)

webmaster128 · 2016-08-10T14:43:35Z

I see that we can easily treat an empty file like a non-existing file. But do we really want to silently throw away broken JSON content? I mean, whenever a user-edited JSON file misses a comma, the entire content is replaced with default values and no error is shown at all.

Are half-written JSON files a real world problem?

Jimilian · 2016-08-10T14:55:13Z

I can change behaviour:

if content is broken - print content, warning and continue execution.
if file is empty - just ignore.

I know that we had empty file two hours ago after 2-3 aborted execution, and I don't see big difference in probability between empty file and half-written, because root cause is same.

I'm thinking mainly about CI system. Nobody should try to edit anything manually there :)

Jimilian · 2016-08-11T11:33:35Z

@webmaster128 , yes, it looks like half-written JSON file is real world problem.

         Traceback (most recent call last):
           File "clcache.py", line 1518, in <module>
           File "clcache.py", line 1409, in main
           File "clcache.py", line 1439, in processCompileRequest
           File "clcache.py", line 1472, in processDirect
           File "clcache.py", line 129, in getManifest
           File "json\__init__.py", line 268, in load
           File "json\__init__.py", line 318, in loads
           File "json\decoder.py", line 343, in decode
           File "json\decoder.py", line 359, in raw_decode
         ValueError: Expecting ',' delimiter: line 279 column 112 (char 32708)
         Failed to execute script clcache

I know that it's a different part of code, but root cause is same. It's interesting that job was not aborted in this time, so, file should be ok. But it's not.

I would very appreciate any ideas how to fix such issues or at least minimise negative effect. Because I think we can't just ignore broken manifest file, because it will break existed cache.

webmaster128 · 2016-08-11T11:41:47Z

This one is tricky ;) I did not answer yet because I did not have a good idea.

Is line 279 column 112 (char 32708) the position in the JSON file?

I think manifest are somehow easy, because you can just throw away broken ones and let them be recreated.

But why the hack are your ccache processes crashing? Wouldn't the solution be to run less jobs in parallel or upgrade ram?

…ent skipping broken json files

Jimilian · 2016-08-11T12:12:14Z

Is line 279 column 112 (char 32708) the position in the JSON file?

Yes, we have a lot of files in repo :)
So, it's unreal to write file in one call.

But why the hack are your ccache processes crashing?

Very good question, I'm not familiar with Windows IO API at all, so I even don't have any ideas. All calls of manifestSection.setManifest is under lock defence. json.dump didn't raise any exception, it means that system call passed and returned OK. So I really don't know why...

Wouldn't the solution be to run less jobs in parallel or upgrade ram?

Yes, right now we are running a lot of jobs in parallel, but in any case file could be broken in multiple ways in Continuous Integration System. The most stupid one (and frequently one I think) is not enough free space. We can decrease number of jobs or rent new nodes, but it would not guarantee anything. Because in last case build passed, clcache didn't print any issues, but manifest file became broken.

webmaster128 · 2016-08-11T12:28:55Z

Yes, we have a lot of files in repo :)

(off topic but still interesting) The amount of line in a given manifest is the number of included files for one compilation unit (a .c/.cpp file). This is typically 100–1000 includes per compilation unit, no matter how many compilation units there are.

We can decrease number of jobs or rent new nodes

Do you share the cache between different machines? This is not supported

Jimilian · 2016-08-11T12:33:44Z

No, not yet at least. My colleague is working on such support, but now we are using pure clcache.pyfrom master. Each node has one instance of cache and each node can execute only one build at same time. One thing that could be interesting - we are using clcache as a frontend for Incdedibuild. That's why we have a lot of jobs/processes.

webmaster128 · 2016-08-11T12:45:44Z

we are using clcache as a frontend for Incdedibuild. That's why we have a lot of jobs/processes

What does IncrediBuild do exactly? Does clcache call IncrediBuild or does IncrediBuild call clcache?

Jimilian · 2016-08-11T13:16:57Z

Incredebuild calls ccache

Jimilian · 2016-08-18T13:51:45Z

So, I can split this PR on two:

ignore empty files
ignore broken files

In this way you can merge first PR and keep second one in pending status for awhile.

webmaster128 · 2016-08-18T13:53:53Z

We need to wait for frerich anyway, but it makes definitely sense to me. Treating empty files like non-existing ones should be safe to do.

frerich · 2016-08-22T13:34:06Z

Thanks for your work, I'm now back from vacation and slowly catching up on what happened during the last weeks.

This PR seems to deal with a special case of the cache getting corrupted, right? I.e. a very similar situation might arise if e.g. an object file is missing from a cache entry or something like that.

In general, I'd love if we could reduce the chance of this happening in the first place. Maybe it would be sufficient to write the JSON file (or any other files) to a temporary location and do a final 'rename' step (which is very quick) to reduce the likelyhood of being interrupted?

That aside, of course it would be good if clcache would not choke on corrupted caches. In the case of the JSON files, if they are unexpected in some way (e.g. they don't exist, the JSON is syntactically invalid or some expected entry is missing) - how about we just log a warning message and go with some defaults? E.g. both empty or missing JSON files correspond to a file with all the fields set to zero.

The only concern I have is that it's easy to miss warning messages during a build with thousands of compiler invocations. Future queries for the statistics won't show that these statistics don't actually reflect the real situation, so you may wonder why there should be statistics at all...

Jimilian · 2016-08-22T14:00:28Z

First of all, I'm totally agree that it's better to reduce chance to corrupt file, but:

I'm not sure that rename is atomic on Windows( http://stackoverflow.com/questions/167414/is-an-atomic-file-rename-with-overwrite-possible-on-windows).
I'm not sure that we can achieve it (because some bugs would be in code in any case)

According your suggestion:

We are using clcache in our CI system. We build more than 5k builds with clcache + this PR and everything looks good. Without this PR node started to fail after 10 builds, because it was not able to recover json. If I understood correctly, empty file / default value and non-existed file are same thing on the most of cases.
We can introduce new statistics - something like "clcache warnings" - and increment it only in such cases.

Jimilian · 2016-11-23T14:09:09Z

Just short note. On last master still incorrect json could be produced (in this case it was empty file).

13:28:23 Traceback (most recent call last):
13:28:23   File "C:\Python34\lib\site-packages\cx_Freeze\initscripts\Console.py", line 27, in <module>
13:28:23   File "clcache.py", line 1639, in <module>
13:28:23   File "clcache.py", line 1466, in main
13:28:23   File "clcache.py", line 1317, in cleanCache
13:28:23   File "clcache.py", line 597, in __enter__
13:28:23   File "clcache.py", line 508, in __init__
13:28:23   File "C:\Python34\lib\json\__init__.py", line 268, in load
13:28:23   File "C:\Python34\lib\json\__init__.py", line 318, in loads
13:28:23   File "C:\Python34\lib\json\decoder.py", line 343, in decode
13:28:23   File "C:\Python34\lib\json\decoder.py", line 361, in raw_decode
13:28:23   ValueError: Expecting value: line 1 column 1 (char 0)

I will re-work this change to keep the safest behaviour (empty file == doesn't exist) and provide new PR.

frerich · 2016-11-23T14:14:06Z

I'm sorry, I must admit that the lack of activity in this PR made me kind of forget about it... it would be much appreciated if you could rebase the PR so that we can resume the work again. I'll re-read the past comments later today to refresh my memories of what this is all about :-)

frerich · 2016-11-23T19:28:12Z

Thanks! I think this is certainly moving things into the right direction. Ideally, I'd love to be able to avoid this issue in the common case (some research suggests that the ReplaceFile function on Windows would be useful for this purpose), but it's better to have at least some handling in place than none. Thanks a lot for bringing this up again1

PersistentJSONDict survive with empty file

53d1653

Jimilian added 2 commits August 10, 2016 15:57

Skip broken json as well as empty file

b51535e

Explicitly import JSONDecodeError

e6f1327

Jimilian added 2 commits August 10, 2016 16:28

JSONDecodeError is supported in Python3.5 only. Before it was pure Va…

3cbad12

…lueError

Remove redundant import

7cb95f1

Manifest file could be broken too. Print error message instead of sil…

71dbe6d

…ent skipping broken json files

Jimilian mentioned this pull request Aug 16, 2016

Global cache lock hurts cache hits during heavily concurrent builds #160

Closed

Jimilian mentioned this pull request Aug 22, 2016

Don't hold the cache lock while computing the manifest hash for the current object #208

Closed

Alexander Akbashev added 3 commits November 23, 2016 16:26

Resolve conflict

c77517e

Merge branch 'frerich-master'

c4e2b03

Remove trailing space

166551a

frerich merged commit 41354f1 into frerich:master Nov 23, 2016

Conversation

Jimilian commented Aug 10, 2016

Uh oh!

codecov-io commented Aug 10, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Current coverage is 89.02% (diff: 100%)

Uh oh!

webmaster128 commented Aug 10, 2016

Uh oh!

webmaster128 commented Aug 10, 2016

Uh oh!

Jimilian commented Aug 10, 2016

Uh oh!

webmaster128 commented Aug 10, 2016

Uh oh!

Jimilian commented Aug 10, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Jimilian commented Aug 11, 2016

Uh oh!

webmaster128 commented Aug 11, 2016

Uh oh!

Jimilian commented Aug 11, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

webmaster128 commented Aug 11, 2016

Uh oh!

Jimilian commented Aug 11, 2016

Uh oh!

webmaster128 commented Aug 11, 2016

Uh oh!

Jimilian commented Aug 11, 2016

Uh oh!

Jimilian commented Aug 18, 2016

Uh oh!

webmaster128 commented Aug 18, 2016

Uh oh!

frerich commented Aug 22, 2016

Uh oh!

Jimilian commented Aug 22, 2016

Uh oh!

Jimilian commented Nov 23, 2016

Uh oh!

frerich commented Nov 23, 2016

Uh oh!

frerich commented Nov 23, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-io commented Aug 10, 2016 •

edited

Loading

Jimilian commented Aug 10, 2016 •

edited

Loading

Jimilian commented Aug 11, 2016 •

edited

Loading