Hacker News — vinext + Cloudflare Workers

▲Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model (microsoft.com)

91 points by tosh 5 days ago | 11 comments

onlyrealcuzzo 1 day ago [-]

Does anyone else get more excited by the progress of these small models than the frontier models?

It seems like a lot of the most exciting research is happening here - making unbelievable progress with such small parameter sizes.

barneybooroo 1 day ago [-]

Absolutely. I look forward to a time where we have on-device small models as an OS-level service you can rely on (a bit like what Apple's doing with Foundation Models). I was recently playing around with some game dev prototyping where I wish I could rely on a player having access to a local model for doing some classification tasks or generating small amounts of playthrough-specific copy without just populating the same few templates.

DonsDiscountGas 4 hours ago [-]

I'm very happy to read about this progress but I don't find it particularly surprising. The big labs optimize for accuracy/high scores on benchmarks first; I automatically expect that (with some research effort) a model with 100x fewer parameters can achieve the same scores.

nextzck 1 day ago [-]

I get excited for every new vision model, especially those that work better and more efficiently. Vision is where we are so very far behind... I can’t wrap my head around it.

thot_experiment 23 hours ago [-]

What do you mean far behind? Far behind what? The new (actually the old one too) Qwen can give you bounding rectangular prisms around things in a scene, correctly OCR text with ink spilled on it, read graphs and understand spatial relationships. I think it's pretty impressive for something I'm running on like a 5-year-old GPU.

nextzck 22 hours ago [-]

yeah i know lol, that’s kind of my point. impressive that it runs on your gpu, but it still can’t tell you what happens if you tilt a glass. that’s what world models are working toward. but even then... so what? you get a perfect simulator. it knows the glass tips. it still doesn’t know why someone tipped it, or what happens if they don’t. a four year old can do this and we’re just barely on step one and a half.

kanemcgrath 21 hours ago [-]

Small local models are the only thing that still has that magic feeling to me. While large models are still useful and impressive, it makes more sense that they are happening on a giant supercomputer in a datacenter somewhere. But all the intelligence and capability that can run on my mid-level gaming PC is astonishing to me.

htsh 1 day ago [-]

yes! especially b/c i want to process a lot of email and directories full of old, personal documents

mlnj 1 day ago [-]

I absolutely love it.

Am so much more excited about tiny models gaining real intelligence. Just today I have been running the Qwen3.5 0.8B model on images and am pleasantly surprised by how good it is compared to even 4B and 8B models from a few months ago.

lemonish97 1 day ago [-]

Always love a new good SLM. Reasoning + vision under 20B sounds promising, will be testing this out

lostmsu 5 hours ago [-]

It seems worse than Qwen3-VL-8B in every way?
