Skip to content

Fix some race conditions in xca6ll/hot/amcssth.#83

Open
gareth-rees wants to merge 2 commits intomasterfrom
59-memory-aware
Open

Fix some race conditions in xca6ll/hot/amcssth.#83
gareth-rees wants to merge 2 commits intomasterfrom
59-memory-aware

Conversation

@gareth-rees
Copy link
Copy Markdown
Member

@gareth-rees gareth-rees commented Dec 24, 2022

  • Use memory-model-aware builtins in GCC and Clang when a memory location may be written by one thread and read by another, avoiding race conditions due to out-of-order updates on ARM.
  • Call dylan_make_wrappers while the test is still single-threaded, preventing multiple threads from racing to call it.
  • Prevent dylan_init from creating a padding object, as we must not have an exact root pointing at a padding object.

Note that this does not fix all the race conditions—there is at least one more that we have not figured out—but it reduces the frequency of failures in amcssth to less than one in a hundred on my 8-core Apple M2.

Work towards #59 (Rare failure in amcssth in hot variety on Linux)

@gareth-rees gareth-rees changed the title [#59] Fix some race conditions in xca6ll/hot/amcssth. Fix some race conditions in xca6ll/hot/amcssth. Dec 24, 2022
@gareth-rees gareth-rees force-pushed the branch/2022-12-23/hardened-runtime branch from f66d805 to 2ab4580 Compare January 6, 2023 18:32
@rptb1 rptb1 linked an issue Jan 13, 2023 that may be closed by this pull request
@rptb1 rptb1 added test Issue with a test case arch.a6 Relates to arm64/aarch64 labels Jan 13, 2023
@gareth-rees
Copy link
Copy Markdown
Member Author

@rptb1 You added #59 to the "Successfully merging this pull request may close these issues" list in the GitHub interface. I've removed it again because as far as I know this pull request does not fix that issue—it fixes some of the races but not all of them.

@gareth-rees gareth-rees changed the base branch from branch/2022-12-23/hardened-runtime to master January 13, 2023 20:31
* Use memory-model-aware builtins in GCC and Clang when a memory
  location may be written by one thread and read by another, avoiding
  race conditions due to out-of-order updates on ARM.
* Call dylan_make_wrappers while the test is still single-threaded,
  preventing multiple threads from racing to call it.
* Prevent dylan_init from creating a padding object, as we must not
  have an exact root pointing at a padding object.
@rptb1
Copy link
Copy Markdown
Member

rptb1 commented Jan 14, 2023

@rptb1 You added #59 to the "Successfully merging this pull request may close these issues" list in the GitHub interface. I've removed it again because as far as I know this pull request does not fix that issue—it fixes some of the races but not all of them.

Do you think it's better practice (in GitHub) to only link completely fixed issues using the "Development" field, and link partially fixed issues from comments? It would make sense if GitHub closes those issues automatically.

(This could affect #97 )

@gareth-rees
Copy link
Copy Markdown
Member Author

Do you think it's better practice (in GitHub) to only link completely fixed issues using the "Development" field, and link partially fixed issues from comments? It would make sense if GitHub closes those issues automatically.

GitHub automatically closes the issues in the "Development" field when the pull request is merged, so if you don't want the issue to be closed automatically then you shouldn't put it there.

@rptb1
Copy link
Copy Markdown
Member

rptb1 commented Feb 1, 2023

Executing proc.review.entry

  1. Applying entry.universal and entry.impl

  2. Source documents are available, and include: "Built-in Functions for Memory Model Aware Atomic Operations" (GCC), "LLVM Atomic Instructions and Concurrency Guide" (LLVM), and "Clang Language Extensions", section on "Atomic operations". Review status unknown! Two of these are nicely referenced from the code.

Entry passed.

Entry took 15 mins.

@rptb1
Copy link
Copy Markdown
Member

rptb1 commented Feb 1, 2023

Executing proc.review.plan.

  1. proc.review.plan.time: Less than 100 lines have changed. The change is low risk, since it only introduces some barriers, but reviewing them will need a good understanding of the source documents, which are approx 6500 words. So 100 lines @ 10 lines/min is about 10 mins of checking the code lines, plus 50 mins for source checking, so 1 hour for checking.

  2. proc.review.plan.roles: @thejayps @rptb1 and @UNAA008 reviewing. Let's have at least two people doing proc.review.role.check.source, and one other for all the others.

  3. proc.review.plan.schedule: Review will start 2023-02-02 15:00 UTC for about 2h.

Planning took 15 mins.

@rptb1
Copy link
Copy Markdown
Member

rptb1 commented Feb 2, 2023

Executing proc.review.kickoff

  1. Start 15:05.
  2. proc.review.ko.roles: @UNAA008 and @thejayps concentrate on source consistency, @rptb1 does the rest and is proc.review.role.leader.
  3. Reconvene for logging at 16:00.

Kickoff took 15 minutes.

Copy link
Copy Markdown
Contributor

@UNAA008 UNAA008 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concentrated on understanding consistency of the atomic operations with the source material.
Macros defined in testlib.h appear consistent.
I was troubled by the comment about MSVC being Intel only.
I don't know if the MPS in general is asserted to conform to the C++11 memory model.
It took me 30 minutes to be reasonably sure I understood what the source material was saying.
I didn't understand the purpose of the changes in fmtdytst.c.
Possible minor defect in amcss.c - multiple delclarations on one line? https://github.com/Ravenbrook/mps/pull/83/files#diff-2fe9be832421a20cd9426718eb3a93a0c04a930fa37eed15207bdeace863dfefR111

Copy link
Copy Markdown
Member

@rptb1 rptb1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Executing proc.review.check

  1. Read #59
  2. Partially read https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins
  3. Examining use of __atomics.
  4. M: I think amcssth may not be consistent with its purpose, which is to stress the AMC pool. Instead, it seems to be busy stressing itself, and not acting like a common mutator program by using locks. This is major because we might spend a lot of time fixing amcssth. #59 (comment) says "Still occurs if we turn off garbage collection by parking the area." Why should we be debugging such a case? Also, rule.code.simple, rule.code.justified, rule.code.independent.
  5. M: The fact that the above arises suggests a problem with rule.generic.purpose.
  6. Transformations appear to preserve correctness.
  7. mi: branch/2021-01-25/amcssth-races seems to be an undocumented source?
  8. proc.review.check.metrics: 2 Major, 3 minor, 35 mins checking, read entire doc, but focussed on diffs really. Problem: poor concentration due to ME/CFS.

/* See <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>
* and <https://clang.llvm.org/docs/LanguageExtensions.html> */
#define atomic_load(SRC, DEST) __atomic_load(SRC, DEST, __ATOMIC_ACQUIRE)
#define atomic_store(DEST, SRC) __atomic_store(DEST, SRC, __ATOMIC_RELEASE)
Copy link
Copy Markdown
Member

@rptb1 rptb1 Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: Unclear why __ATOMIC_ACQUIRE and __ATOMIC_RELEASE as opposed to other possibilities. rule.generic.clear

#elif defined(MPS_BUILD_MV)

/* Microsoft Visual C/C++ does not need memory-model-aware load and store as
* loads and stores of register-sized values are atomic on Intel. */
Copy link
Copy Markdown
Member

@rptb1 rptb1 Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: Is the atomicity important, or is it memory ordering? Should at least say which and why. Possible reasoning error also. rule.generic.self

Copy link
Copy Markdown
Contributor

@thejayps thejayps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 2 minor (2m)
40 mins checking
Checked source code diffs, PR comments and partially consulted 3 interface documents for atomic operations.

@@ -292,6 +292,25 @@ extern void randomize(int argc, char *argv[]);
extern void testlib_init(int argc, char *argv[]);


Copy link
Copy Markdown
Contributor

@thejayps thejayps Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: Two separate documents describing the interface to llvm atomic operations exist and describe different interfaces.
https://clang.llvm.org/docs/LanguageExtensions.html#c11-atomic-operations
https://www.llvm.org/docs/Atomics.html

The former directs us to use a part of the interface to check that atomic operations exist. It seems that the code below is using the interface described in the second document. If my understanding of this is correct, then it isn't clear to me why one is used and not the other. My knowledge of this area isn't well developed so the clarity issue may be in the documentation of the interfaces rather than in the documentation of this pull request. However, some explanation by way of comment on this pull request of how these interfaces were identified and chosen could be useful for adding extra clarity.

@rptb1 summary: It's not clear to @thejayps why A not B. rule.generic.clarity

@rptb1
Copy link
Copy Markdown
Member

rptb1 commented Feb 2, 2023

Executing proc.review.log

  1. Start time 16:03.
  2. proc.review.log.sums: @UNAA008 0M, 1m. @thejayps 0M, 2m. @rptb1 2M, 3m.
  3. NI for proc.review: You're not restricted to thinking about defects introduced by a change. Find any major defects.
  4. NI for proc.review: The editor can escalate stuff and still get their change through. Review exit doesn't say "did you fix everything mentioned" it says "did you take action..."
  5. "The road to hell is paved with correctness preserving diffs."
  6. NM: @thejayps If this test does not fulfill its purpose then there isn't one that does the job.
  7. Nm: @thejayps Tests are not readily identifiable based on their location within the directory structure. You need knowledge to know what might be a test and what is actually part of the MPS library and what isn't.
  8. End time 16:45. Brainstorm at 16:55.

Logging took 42 mins.

@rptb1
Copy link
Copy Markdown
Member

rptb1 commented Feb 2, 2023

Executing proc.review.brainstorm

  1. Start time 16:55.
  2. Fix some race conditions in xca6ll/hot/amcssth. #83 (review) point 4: @thejayps Regularly, randomly select documents, and argue for their deletion. @rptb1 I like this because it combats accretion. @rptb1 Similar: every document should have a review date, that review should include your point, and purpose checks, etc. Not just "does it work". Would need resource allocation. Regularly put @thejayps in charge of regularly doing things. :) There is a rule that we don't copy and paste code, but amcssth is totally copy-pasted. I, @rptb1 broke that and we're paying the price. How'd that happen? @UNAA008 says perhaps it was lack of rigor.
  3. Fix some race conditions in xca6ll/hot/amcssth. #83 (comment) point 6: What jobs? What jobs are needed? How can we tell if any are not being done? There's a gap there. (That's just broadening the issue.) @UNAA008 Whenever you write a non-trivial function you should write test code alongside it. Connecting that code would justify that code, and we'd know there was coverage. Some sort of coverage mapping. @thejayps Tests need to be driven by requirements. There should be links between the two. @rptb1 These are breaks in the chain of justification: the "why links". Maybe we could be better at enforcing that any new stuff (or changed stuff) has an unbroken "why" chain. Especially new files.
  4. @thejayps Brainstorm doesn't seem to permit new issues to come up. We could mention this possibility in proc.review.brainstorm. Capture and move on.
  5. @UNAA008 What about stuff that occurs to me a day later? proc.review could mention that possibility. The conversation should continue, but how? Or maybe a followup meeting or one-to-ones by the leader. Or the proc.review.role.improver could ask people.

Possible new issues:

  • there's no index of tests
  • design.mps.tests doesn't talk about individual tests
  • tests designs are missing

End at 17:25. 30 mins bang on!

@rptb1
Copy link
Copy Markdown
Member

rptb1 commented Feb 2, 2023

We haven't assigned proc.review.role.editor or proc.review.role.improver and @thejayps is away next week. We will perhaps do some of these as a group for training purposes and pair programming.

rptb1 added a commit that referenced this pull request Feb 3, 2023
@rptb1 rptb1 added the blocked Unable to proceed. See comment for reason. label Feb 14, 2023
@rptb1 rptb1 added the pending Something needs doing, even if closed. label Feb 17, 2023
@rptb1
Copy link
Copy Markdown
Member

rptb1 commented Feb 19, 2023

4. @thejayps Brainstorm doesn't seem to permit new issues to come up. We could mention this possibility in proc.review.brainstorm. Capture and move on.

Fixed in 8848b70

@thejayps
Copy link
Copy Markdown
Contributor

thejayps commented Feb 21, 2023

5. @UNAA008 What about stuff that occurs to me a day later? proc.review could mention that possibility. The conversation should continue, but how? Or maybe a followup meeting or one-to-ones by the leader. Or the proc.review.role.improver could ask people.

Comment written by @rptb1 on @thejayps session:

So this is a bit meta-circular: @thejayps asked me about improving proc.review to mention capture of brainstorming thoughts that come up later. (And his asking me is an example of that!) I'm writing this to demonstrate what I would do about that:

  1. located the relevant brainstorm (above) Fix some race conditions in xca6ll/hot/amcssth. #83 (comment)
  2. I checked that the review log (this pull request conversation) is still active (marked pending) and so will be seen again at some point
  3. I added my thought (this comment).
  4. The above could be summarised in proc.review as something like "If you think of something after review then append to the review log as long as it hasn't been closed, otherwise raise an issue."
  5. Done!

@thejayps thejayps added needs analaysis The issue needs analysis before it can be resolved. optional Will cause failures / of benefit. Worth assigning resources. labels Mar 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arch.a6 Relates to arm64/aarch64 blocked Unable to proceed. See comment for reason. needs analaysis The issue needs analysis before it can be resolved. optional Will cause failures / of benefit. Worth assigning resources. pending Something needs doing, even if closed. test Issue with a test case

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants