gh-121459: Deferred LOAD_GLOBAL#123128
Conversation
picnixz
left a comment
There was a problem hiding this comment.
I can only nitpick on things I understand :') But I will, at some point, try to understand the magic of bytecodes.c.
|
|
||
|
|
||
| def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str) -> None: | ||
| def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, allow_unbalanced_parens: bool = False) -> None: |
There was a problem hiding this comment.
| def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, allow_unbalanced_parens: bool = False) -> None: | |
| def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, *, allow_unbalanced_parens: bool = False) -> None: |
Objects/dictobject.c
Outdated
| goto read_failed; | ||
| } | ||
|
|
||
| if (ix >= 0) { |
There was a problem hiding this comment.
Maybe put the case if (ix < 0) first to reflect the if (rc < 0) constructions that we usually do?
Objects/dictobject.c
Outdated
| ix = _Py_dict_lookup(mp, key, hash, &value); | ||
| *value_addr = value == NULL ? PyStackRef_NULL : PyStackRef_FromPyObjectNew(value); |
There was a problem hiding this comment.
Can't you use the _Py_dict_lookup_threadsafe_stackref helper here since you acquired the lock? This would save the PyObject *value declaration (and you could consider inlining the helper actually)
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
…ython into deferred_globals_final
Objects/dictobject.c
Outdated
| goto read_failed; | ||
| } | ||
| if (ix >= 0) { | ||
| *value_addr = PyStackRef_FromPyObjectNew(DK_ENTRIES(dk)[ix].me_value); |
There was a problem hiding this comment.
I'm not quite sure I understand the TSAN failures on dictionaries here. @colesbury do you have a clue?
There was a problem hiding this comment.
This is not thread-safe it the value is not deferred or immortal.
There was a problem hiding this comment.
Wait but doesn't PyStackRef_FromPyObjectNew incref the value if it's not deferred or immortal, which make it thread safe? Or maybe I'm missing something.
There was a problem hiding this comment.
Calling Py_INCREF() on a value that might be concurrently freed is not thread-safe.
There was a problem hiding this comment.
Ahh okay that makes sense now. So one possible solution would be to introduce another variant that tries to incref and is allowed to fail? Like in https://github.com/python/cpython/blob/main/Objects/dictobject.c#L1435
Or we can check if it's deferred/immortal, and only if it's not we do the TryXGetRef thing, what do you think?
There was a problem hiding this comment.
Yeah, do the check that it's deferred/immortal and if not do the TryXGetRef thing.
I think it's best to write it out explicitly first. If there are other uses of it, we can refactor it into a function later.
Objects/dictobject.c
Outdated
| dk = _Py_atomic_load_ptr(&mp->ma_keys); | ||
| kind = dk->dk_kind; | ||
|
|
||
| if (kind != DICT_KEYS_GENERAL) { |
There was a problem hiding this comment.
This is complicated and duplicates logic elsewhere. I think it might be worth only supporting a a single case or two. For example, unicode-only and non-split dict. Fallback to _Py_dict_lookup_threadsafe + PyStackRef_FromPyObjectSteal for other cases.
There was a problem hiding this comment.
I don't think it is worth changing anything. LOAD_GLOBAL is specialized incredibly well, so this only applies to a rare slow path.
There was a problem hiding this comment.
It's not specializing at all in the free-threaded build so LOAD_GLOBAL is a scaling bottleneck now that fewer types are immortalized.
I don't think we want to wait for specialization to be made thread-safe, but we can treat this as a temporary measure and structure the code to reflect that.
There was a problem hiding this comment.
I changed this case to fall back to _Py_dict_lookup_threadsafe
There was a problem hiding this comment.
This looks backwards to me. The common case should be DICT_KEYS_UNICODE for globals dictionaries, but this only handles DICT_KEYS_GENERAL efficiently.
There was a problem hiding this comment.
I think this is still too complicated. We only need to handle the common case of unicode keys. Globals are almost never split dictionaries.
I think this should look something like the following. You can refactor it further by extracting a _Py_TryXGetStackRef() that's like _Py_TryXGetRef but returns a _PyStackRef.
PyDictKeysObject *dk = _Py_atomic_load_ptr(&mp->ma_keys);
if (dk->dk_kind == DICT_KEYS_UNICODE && PyUnicode_CheckExact(key)) {
Py_ssize_t ix = unicodekeys_lookup_unicode_threadsafe(dk, key, hash);
if (ix == DKIX_EMPTY) {
*value_addr = PyStackRef_NULL;
return ix;
}
else if (ix >= 0) {
PyObject **addr_of_value = &DK_UNICODE_ENTRIES(dk)[ix].me_value;
PyObject *value = _Py_atomic_load_ptr(addr_of_value);
if (value == NULL) {
*value_addr = PyStackRef_NULL;
return DKIX_EMPTY;
}
if (_Py_IsImmortal(value) || _PyObject_HasDeferredRefcount(value)) {
*value_addr = (_PyStackRef){ .bits = (uintptr_t)value | Py_TAG_DEFERRED };
return ix;
}
if (_Py_TryIncrefCompare(addr_of_value, value)) {
*value_addr = PyStackRef_FromPyObjectSteal(value);
return ix;
}
}
}
PyObject *obj;
Py_ssize_t ix = _Py_dict_lookup_threadsafe(mp, key, hash, &obj);
if (ix >= 0 && obj != NULL) {
*value_addr = PyStackRef_FromPyObjectSteal(obj);
}
else {
*value_addr = PyStackRef_NULL;
}
return ix;There was a problem hiding this comment.
I don't really see the point of these changes. LOAD_GLOBAL is something like 1 in 5000 of executed instructions. Its performance is irrelevant.
What is STACK_ENTRY for? All outputs are already declared in the stack comment for the op.
Include/internal/pycore_dict.h
Outdated
| extern Py_ssize_t _PyDict_LookupIndex(PyDictObject *, PyObject *); | ||
| extern Py_ssize_t _PyDictKeys_StringLookup(PyDictKeysObject* dictkeys, PyObject *key); | ||
| PyAPI_FUNC(PyObject *)_PyDict_LoadGlobal(PyDictObject *, PyDictObject *, PyObject *); | ||
| PyAPI_FUNC(void)_PyDict_LoadGlobalStackRef(PyDictObject *, PyDictObject *, PyObject *, _PyStackRef *); |
There was a problem hiding this comment.
| PyAPI_FUNC(void)_PyDict_LoadGlobalStackRef(PyDictObject *, PyDictObject *, PyObject *, _PyStackRef *); | |
| PyAPI_FUNC(_PyStackRef )_PyDict_LoadGlobalStackRef(PyDictObject *, PyDictObject *, PyObject *); |
Likewise
Python/bytecodes.c
Outdated
| (PyDictObject *)BUILTINS(), | ||
| name); | ||
| if (v_o == NULL) { | ||
| _PyDict_LoadGlobalStackRef((PyDictObject *)GLOBALS(), |
There was a problem hiding this comment.
This is inconsistent with PyMapping_GetOptionalItem above.
I think it would be better to change the whole instruction to use stack refs, or leave it as it is.
There was a problem hiding this comment.
I removed stackrefs from the instruction to leave it as is.
Python/bytecodes.c
Outdated
| _PyDict_LoadGlobalStackRef((PyDictObject *)GLOBALS(), | ||
| (PyDictObject *)BUILTINS(), | ||
| name, | ||
| STACK_ENTRY(v)); |
There was a problem hiding this comment.
What does STACK_ENTRY do? Does it have any side effects, or is it just equivalent to &v?
There was a problem hiding this comment.
Practically equivalent to &stack_pointer[whatever_v_index_is]. It has no side effects.
|
|
||
|
|
||
| def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str) -> None: | ||
| def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, *, allow_unbalanced_parens: bool = False) -> None: |
There was a problem hiding this comment.
Don't we want balanced parentheses? What is the construct that doesn't balance parentheses?
There was a problem hiding this comment.
Specifically it's when we replace something in the middle of the line but still want the rest of the line (and all it's parantheses)
There was a problem hiding this comment.
Can you give me an example?
There was a problem hiding this comment.
E.g. we replace the before foo(STACK_ENTRY(x));, with foo(stack_pointer[-1]);, we want the replacement function to write until the semicolon ;, ignoring all balanced brackets from the replacement point till that semicolon.
There was a problem hiding this comment.
Instead of adding allow_unbalanced_parens can we push the logic into the caller (stack_entry)? Like replace up to the matching ) and then consume/emit tokens up to the end of the statement?
Objects/dictobject.c
Outdated
| dk = _Py_atomic_load_ptr(&mp->ma_keys); | ||
| kind = dk->dk_kind; | ||
|
|
||
| if (kind != DICT_KEYS_GENERAL) { |
There was a problem hiding this comment.
I don't think it is worth changing anything. LOAD_GLOBAL is specialized incredibly well, so this only applies to a rare slow path.
|
@colesbury I think the PR is in better shape now. Can you take a look please? |
colesbury
left a comment
There was a problem hiding this comment.
Lets ensure that this actually improves the scalability of LOAD_GLOBAL calls.
Python/bytecodes.c
Outdated
| PyObject *name = GETITEM(FRAME_CO_NAMES, oparg>>1); | ||
| PyObject *res_o = _PyEval_LoadGlobal(GLOBALS(), BUILTINS(), name); | ||
| ERROR_IF(res_o == NULL, error); | ||
| _PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, STACK_ENTRY(res)); |
There was a problem hiding this comment.
Two suggestions:
- Make
_PyEval_LoadGlobalStackRefand_PyDict_LoadGlobalStackRefreturninterror codes - Use
&STACK_ENTRY(res)because it makes it more clear that the value is a pointer and because it means theSTACK_ENTRY()transformation preserves the type.
There was a problem hiding this comment.
I will apply 2. but not 1. The reasoning is to keep consistency with _PyEval_LoadGlobal (non-stackref version).
|
|
||
|
|
||
| def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str) -> None: | ||
| def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, *, allow_unbalanced_parens: bool = False) -> None: |
There was a problem hiding this comment.
Instead of adding allow_unbalanced_parens can we push the logic into the caller (stack_entry)? Like replace up to the matching ) and then consume/emit tokens up to the end of the statement?
Objects/dictobject.c
Outdated
| dk = _Py_atomic_load_ptr(&mp->ma_keys); | ||
| kind = dk->dk_kind; | ||
|
|
||
| if (kind != DICT_KEYS_GENERAL) { |
There was a problem hiding this comment.
This looks backwards to me. The common case should be DICT_KEYS_UNICODE for globals dictionaries, but this only handles DICT_KEYS_GENERAL efficiently.
|
Main: This branch: Create_pyobject became a lot faster! |
|
Benchmarking results courtesy of Mike: no slowdown on JIT build https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20240828-3.14.0a0-bfd4400-JIT |
colesbury
left a comment
There was a problem hiding this comment.
Sorry for the delay in reviewing this. I think we should simplify this further to only focus on the case we care about.
Objects/dictobject.c
Outdated
| dk = _Py_atomic_load_ptr(&mp->ma_keys); | ||
| kind = dk->dk_kind; | ||
|
|
||
| if (kind != DICT_KEYS_GENERAL) { |
There was a problem hiding this comment.
I think this is still too complicated. We only need to handle the common case of unicode keys. Globals are almost never split dictionaries.
I think this should look something like the following. You can refactor it further by extracting a _Py_TryXGetStackRef() that's like _Py_TryXGetRef but returns a _PyStackRef.
PyDictKeysObject *dk = _Py_atomic_load_ptr(&mp->ma_keys);
if (dk->dk_kind == DICT_KEYS_UNICODE && PyUnicode_CheckExact(key)) {
Py_ssize_t ix = unicodekeys_lookup_unicode_threadsafe(dk, key, hash);
if (ix == DKIX_EMPTY) {
*value_addr = PyStackRef_NULL;
return ix;
}
else if (ix >= 0) {
PyObject **addr_of_value = &DK_UNICODE_ENTRIES(dk)[ix].me_value;
PyObject *value = _Py_atomic_load_ptr(addr_of_value);
if (value == NULL) {
*value_addr = PyStackRef_NULL;
return DKIX_EMPTY;
}
if (_Py_IsImmortal(value) || _PyObject_HasDeferredRefcount(value)) {
*value_addr = (_PyStackRef){ .bits = (uintptr_t)value | Py_TAG_DEFERRED };
return ix;
}
if (_Py_TryIncrefCompare(addr_of_value, value)) {
*value_addr = PyStackRef_FromPyObjectSteal(value);
return ix;
}
}
}
PyObject *obj;
Py_ssize_t ix = _Py_dict_lookup_threadsafe(mp, key, hash, &obj);
if (ix >= 0 && obj != NULL) {
*value_addr = PyStackRef_FromPyObjectSteal(obj);
}
else {
*value_addr = PyStackRef_NULL;
}
return ix;Co-Authored-By: Sam Gross <655866+colesbury@users.noreply.github.com>
markshannon
left a comment
There was a problem hiding this comment.
Can we go back to defining res as an array of one.
Python/bytecodes.c
Outdated
| PyObject *name = GETITEM(FRAME_CO_NAMES, oparg>>1); | ||
| PyObject *res_o = _PyEval_LoadGlobal(GLOBALS(), BUILTINS(), name); | ||
| ERROR_IF(res_o == NULL, error); | ||
| _PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, &STACK_ENTRY(res)); |
There was a problem hiding this comment.
| _PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, &STACK_ENTRY(res)); | |
| _PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, &res[0])); |
Can we go back to the earlier design of making res an array?
Please also add a comment explaining why we res needs to be trated specially (we need pass a pointer to _PyEval_LoadGlobalStackRef).
The code generator already understands arrays, and it avoids confusion as to where it should insert stack spills for fully deferred reference counting.
You'll able to remove def stack_entry below.
Sorry for the back and forth on this.
| # unused portions of the stack to NULL. | ||
| stack.flush_single_var(self.out, target, uop.stack.outputs) | ||
|
|
||
| def stack_entry( |
There was a problem hiding this comment.
I'm not sure that the stack calculations are correct here, and I don't think it will be able to handle code that writes to multiple results, like UNPACK....
If we go back to using arrays, we can remove this.
|
When you're done making the requested changes, leave the comment: |
|
This reverts commit 8810e28.
This reverts commit 8810e28.
This implements PEP 703 deferred refcounting for LOAD_GLOBAL in an agnostic way using the cases generator. So it won't block full deferring of references in the future.
For now, we need to write directly to a stackref entry to keep the global alive. Changing the dictionary lookup to stackref variants is also a prerequisite for LOAD_ATTR and friends.