Questions about Verifier Development, Search as Data Generation Tool, and Model Family Alignment
Really fascinated by your research on test-time optimization! After reading through the implementation details, I have a few technical questions:
- Regarding PRMs and verifier strength:
  - How do current PRMs handle problems with multiple valid solution approaches where initial steps might look very different?
  - How are stronger verifiers currently being developed - is it primarily through diverse training data, or are there other architectural/methodological approaches?
- Could you expand on using search as a data generation tool? I'm particularly interested in:
  - How the search process might generate more diverse/higher-quality training examples
  - Whether the search-generated data could help improve verifier robustness
- Regarding model architecture:
  - How important is it for the PRM and base model to share the same architecture family?
  - Would using a PRM from a different model family significantly impact performance or model interaction?
Thanks for sharing such detailed insights into your approaches!
Hello @bird-of-paradise, thank you for the questions! Here are some partial answers:
How do current PRMs handle problems with multiple valid solution approaches where initial steps might look very different?
If you're scoring complete solutions with a PRM, you will generally find that solutions with correct answers AND correct reasoning are scored higher than solutions with correct answers but incorrect reasoning (e.g. due to hallucinations). As long as the initial steps are valid, the PRM should score them highly, whichever approach they take.
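To make this concrete, here is a minimal sketch of how per-step PRM scores might be reduced to a single solution score. The step scores are made-up illustrative numbers, not outputs of any real PRM, and the `aggregate` helper with its three reductions (`min`, `prod`, `last`) is a hypothetical name for commonly used options, not a specific library's API.

```python
import math
from typing import List


def aggregate(step_scores: List[float], how: str = "min") -> float:
    """Reduce per-step PRM scores to one score for the whole solution."""
    if how == "min":
        return min(step_scores)        # the weakest step dominates
    if how == "prod":
        return math.prod(step_scores)  # every step contributes
    if how == "last":
        return step_scores[-1]         # only the final step counts
    raise ValueError(f"unknown aggregation: {how}")


# Illustrative (invented) step scores for three solutions to the same problem.
algebraic = [0.95, 0.92, 0.97]     # valid steps, one approach
geometric = [0.93, 0.96, 0.94]     # valid steps, a very different approach
hallucinated = [0.94, 0.21, 0.90]  # right final answer, one invalid step

for name, scores in [("algebraic", algebraic),
                     ("geometric", geometric),
                     ("hallucinated", hallucinated)]:
    print(name, aggregate(scores, "min"), aggregate(scores, "last"))
```

Under `min`, both valid approaches keep a high score even though their steps differ, while the solution with one invalid step is pulled down. Under `last`, that same solution would still look fine, which is one reason the choice of reduction matters.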
How are stronger verifiers currently being developed
This is still an under-explored topic, but the current best recipe is Math-Shepherd's, which uses MCTS to generate the stepwise annotations.
@plaguss has implemented this in distilabel if you're interested: https://distilabel.argilla.io/dev/sections/pipeline_samples/papers/math_shepherd/
Aside from that, better domain-specific base models are likely the key to having improved annotations.
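For intuition, here is a rough sketch of rollout-based step labelling in the spirit of that recipe: from each partial solution, sample several completions and label the step by how often they reach the known answer. This is not the paper's or distilabel's code. `annotate_steps` and `toy_completer` are hypothetical names, and the completer is a deterministic stand-in for sampling from an LLM.

```python
from typing import Callable, Dict, List


def annotate_steps(
    question: str,
    steps: List[str],
    gold_answer: str,
    completer: Callable[[str, List[str]], str],
    n_rollouts: int = 8,
) -> List[Dict[str, float]]:
    """Label each step by how often rollouts from it reach the gold answer."""
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        # Count rollouts from this prefix that end in the gold answer.
        hits = sum(
            completer(question, prefix) == gold_answer
            for _ in range(n_rollouts)
        )
        labels.append({
            "soft": hits / n_rollouts,         # fraction of successful rollouts
            "hard": 1.0 if hits > 0 else 0.0,  # 1 if any rollout succeeds
        })
    return labels


def toy_completer(question: str, prefix: List[str]) -> str:
    # Hypothetical stand-in for an LLM: once the prefix contains the
    # wrong arithmetic, every continuation ends in the wrong answer.
    return "5" if any("2 + 2 = 5" in s for s in prefix) else "4"


steps = ["We need to add 2 and 2.", "2 + 2 = 5."]
labels = annotate_steps("What is 2 + 2?", steps, "4", toy_completer, n_rollouts=4)
print(labels)
```

No human labels the intermediate steps here: the only supervision is the final answer, and the cost moves into the number of rollouts per step.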
How the search process might generate more diverse/higher-quality training examples
The basic idea here is that people typically generate synthetic datasets in a Best-of-N approach, so if you have a method to obtain better solutions (e.g. beam search), then you get higher-quality data that can be used for SFT etc.
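As a sketch of that idea, the snippet below builds a tiny SFT dataset with Best-of-N: sample N candidates per prompt, keep the one the verifier scores highest. `best_of_n`, `build_sft_dataset`, `toy_generate`, and `toy_score` are all hypothetical names. The toy generator and scorer stand in for a real LLM and PRM, and a stronger search method such as beam search could replace `best_of_n` behind the same interface.

```python
from typing import Callable, Dict, List


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidates and return the one the verifier scores highest."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


def build_sft_dataset(
    prompts: List[str],
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> List[Dict[str, str]]:
    """Keep one selected completion per prompt as an SFT example."""
    return [
        {"prompt": p, "completion": best_of_n(p, generate, score, n)}
        for p in prompts
    ]


def toy_generate(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for sampling from an LLM.
    options = ["4", "4, because 2 + 2 = 4", "5", "22"]
    return options[seed % len(options)]


def toy_score(prompt: str, completion: str) -> float:
    # Hypothetical stand-in for a verifier: prefers a justified answer.
    if "because" in completion:
        return 1.0
    return 0.5 if completion.startswith("4") else 0.0


dataset = build_sft_dataset(["What is 2 + 2?"], toy_generate, toy_score, n=4)
print(dataset)
```

The quality of the resulting dataset is bounded by both the candidate pool and the verifier, which is why a better search procedure or a stronger PRM translates into better training data.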
Whether the search-generated data could help improve verifier robustness
I'm not sure about this since the existing methods like Math-Shepherd already use search in the form of MCTS.
How important is it for the PRM and base model to share the same architecture family?
I don't think this matters too much unless you're doing something like online training with RL, where it is common to give the policy and reward model the same architecture (and often to initialise them from the same weights).
Would using a PRM from a different model family significantly impact performance or model interaction?
The model family matters less; what matters more is the quality of the base model and the training data.
Great questions @bird-of-paradise and thanks for the answers + helpful resources @lewtun
I would like to be a part of the HF community and previously contributed to BigScience. Please suggest similar initiatives @lewtun
Hi @lewtun,
Thank you for pointing to the Math-Shepherd paper! And thank you @plaguss for the implementation!
Reading through it, I've noticed some fascinating patterns that made me wonder about broader implications:
Both test-time optimization and Math-Shepherd achieve better results through more thorough exploration (beam search/multiple completions) rather than requiring more training data. They developed PRMs without human-annotated intermediate steps, and their experiments show that the quality of reasoning might be more important than the quantity of examples (as shown in the completer experiments with different training sets). Is this move towards 'compute over data' a promising direction for improving reasoning/mathematical tasks?
The paper demonstrates Math-Shepherd being successful as both a reward model and a verifier. This made me wonder: could we push this further and use similar frameworks to improve a model's generation capabilities? I'm thinking of how AlphaGo used self-play, learning by evaluating its own moves. While there might be technical challenges (balancing multiple objectives, architecture constraints, training dynamics), could this kind of self-evaluation approach help develop stronger mathematical reasoning capabilities?
The core idea is whether focusing on developing strong evaluation/critical thinking capabilities might naturally lead to better problem-solving abilities, similar to how humans often learn mathematics through understanding why solutions work rather than just memorizing more examples.
Has there been any research exploring these directions? Or am I thinking about this in the wrong way?
Thank you again for your insights!