This LLM is a test maxer, not a general purpose AI model.
You should probably add descriptors to your model names, such as STEM, Math, or Code, if you're going to grossly overfit your models to a handful of domains rather than make a balanced general purpose AI model.
For example, this model's general popular knowledge is extremely low for its size, and it scores lower on broad knowledge tests, including my own, than much smaller general purpose AI models like Gemma 4 26b-a4b, and even Qwen3.5 34b-a3b.
And it's not only that your corpus was very lopsided for the sake of STEM test maxing (e.g. MMLU-Pro); you didn't even account for broad popular knowledge when training the model for thinking, causing it to loop endlessly during thinking, chasing tangents that make no sense given the simple knowledge-retrieval question being asked of it.
It's perfectly understandable to make a model for specific tasks like STEM, coding, and math, but please don't pass it off as a general purpose AI model, because outside the domains you test maxed for, this model is reduced to little more than a hallucination generator that almost always enters an endless thinking loop in response to simple prompts.
Edit: You can even see a huge regression compared to v2.5 on the LMSYS arena, such as falling from rank 79 to 108 in creative writing and from 60 to 91 in math. As you well know, LLMs are a balancing act: you can't just train hard on coding and advanced math without laying waste to the model's general abilities and knowledge. I'm not trying to single you out, since the bulk of the AI community has shifted from making general purpose AI models to test maxing specialist tools as parameter counts increased. Training on countless coding and math tokens, followed by a large number of CoT examples with thinking enabled, is an easy way to climb higher on STEM tests like coding, but in the end you're left with an unreliable, hallucinating LLM across most other domains despite using hundreds of billions of parameters.
The blog posts and README are heavily biased towards software engineering, and the MiniMax in the name is a reference to the minimax algorithm.
The materials released with the model are explicit that it's for software engineering; asking for a name change is pedantic, and it should be obvious what this model is for.
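For anyone who hasn't run into it, minimax just means choosing the move that maximizes your worst-case outcome in a two-player game. A minimal sketch, with a hypothetical `game` interface invented purely for illustration:

```python
# Minimal minimax sketch: return the value of a position assuming both players
# play optimally. `game` is a hypothetical interface, not a real API:
#   game.is_terminal() -> bool, game.score() -> float (from the maximizer's
#   point of view), game.moves() -> iterable, game.play(move) -> new state.
def minimax(game, maximizing: bool) -> float:
    if game.is_terminal():
        return game.score()
    values = (minimax(game.play(m), not maximizing) for m in game.moves())
    return max(values) if maximizing else min(values)
```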
@domcx That's true. The model card was clear about its focus, as were the tests touting its abilities. However, this doesn't change the fact that the general shift toward test maxing, especially on coding and math, has all but ended the era of general purpose open-source AI models. Even Gemma 4 has a much smaller knowledge horizon than Gemma 3, which in turn has a smaller horizon than Gemma 2.
Since the open-source community is >95% autistic coders, there's strong pressure to trade a notable amount of generalist ability and knowledge for small gains in select domains (e.g. test maxing coding).
And the end result is Microsoft and Google trying to integrate grossly overfit models into their respective operating systems, which contributes to the general population's dislike of AI. For example, <5% of the population codes, and those who do wouldn't code on their phones or with an edge model, yet Google spent a huge share of parameter capacity on coding when making E4B and E2B, resulting in models that do little but hallucinate about humanity's most popular knowledge. So when normal people use E4B on their phones it's going to be a very frustrating experience. Point being, this widespread shift from generalist to autistic specialist is doing real damage to how the general population views AI.
Domain specific models are a good thing. This provides better relative results for the use case with lower inference costs. A model trained for software engineering will do well at software engineering. You pick the tool for the job. Why hit everything with a hammer?
The rest of your argument does not resonate with me.
You never look a gift horse in the mouth.
The model has been working well for me. Curious what quant you are running it at? It could be quantized down too far. Also, what's the KV cache quantization? I have noticed anything below Q8 for KV cache quantization will lobotomize models, even after the llama.cpp update.
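For reference, here's roughly how to pin the KV cache at Q8_0. A minimal sketch assuming recent llama-cpp-python bindings; the model filename is a placeholder, and on the llama.cpp CLI the equivalent flags are -ctk/--cache-type-k and -ctv/--cache-type-v:

```python
# Minimal sketch: hold the KV cache at Q8_0 instead of letting it drop lower.
# Assumes llama-cpp-python bindings recent enough to expose type_k/type_v.
import llama_cpp

llm = llama_cpp.Llama(
    model_path="minimax-m2.7-q4_k_m.gguf",  # placeholder filename
    n_gpu_layers=-1,                        # offload all layers that fit
    flash_attn=True,                        # quantized V cache generally needs flash attention
    type_k=llama_cpp.GGML_TYPE_Q8_0,        # K cache at Q8_0
    type_v=llama_cpp.GGML_TYPE_Q8_0,        # V cache at Q8_0
)
out = llm("Q: Who directed Jurassic Park? A:", max_tokens=8)
print(out["choices"][0]["text"])
```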
@EclipseMist In what areas has it worked well for you?
This model, by its makers' own admission, was optimized for code. It's simply not a general purpose AI model, and on LMsys it scores much lower at most things, such as creative writing, than far smaller models like Gemma 4. And when I asked it basic questions about very popular things, such as popular movies, music, TV shows, and games, the full-float version of MiniMax M2.7 hallucinated like crazy, especially for its size (>200b parameters).
In short, this model is incredibly weak for its size across most tasks and domains of knowledge, so anybody attempting to use it as a general purpose AI model is oblivious. This is for coding. Nothing more.
This model is not as good for coding as the benchmarks claim; it could not solve my Rust + TUI coding issues for days. I am evaluating Qwen/Qwen3.6-35B-A3B, and I'll either go with Qwen or roll back to M2.5.
@weisunding I came here to ask why not Qwen3.6 vs 3, but you already made the correction.
Also, why not Gemma 4. I found that across the board Qwen's benchmarks don't reflect real-world performance. For example, tiny Qwen3.5 4b has a very impressive MMLU-pro score of 80, but when testing the same domains covered by the test in the real world it hallucinates like crazy and performs notably worse than large models with MMLU-pro scores of only 70, like Llama 3.1 70b. There's a large and highly suspicious disconnect between the test scores of Qwen models and their real-world performance.
Quantization might be the issue. I have been using M2.7 with their coding plan, and for a 229B-parameter model it's pretty solid. Compared to Stepfun 3.5 Flash, which is also ~200B parameters, it's slower but better at coding.
I also use it to test the app I am creating and as a personal assistant, and it's doing okay. I don't understand what creative writing you are expecting.
It's twice the size of gpt-oss-120b but it feels ten times more powerful.
@edwarddddr Yeah, gpt-oss is no good, and certainly in my experience issues with model/KV quantization, chat templates, llama.cpp bugs... have all significantly degraded performance. From what I'm hearing this model is unusually good at coding, at least in certain use cases, but that's outside my wheelhouse. My primary interest is general performance per active/total parameter count, and that's clearly not where this model shines.
Hey guys, I tested Gemma4-31B. Its coding ability is good (better than M2.5, and now even M2.7), but since it's a dense model, running it on an A100 is still a bit slow.
Qwen/Qwen3.6-35B-A3B just came out, and I am struggling to run it on the A100 GPU: the fp8 version is not supported by the A100, so I have to download the bf16 version. From the benchmarks, its coding ability is better than Gemma4, and since it's a MoE, running it with vLLM will be fast. I will try it out soon.
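In case it helps anyone else on A100s, this is roughly the vLLM setup I mean. A minimal sketch: the dtype is forced to bf16 because fp8 kernels need a newer compute capability than the A100's sm80, and the prompt and sampling values are placeholders:

```python
# Minimal vLLM sketch: load the bf16 checkpoint, since the A100 (sm80)
# cannot run fp8 weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-35B-A3B",  # model ID from this thread
    dtype="bfloat16",              # fp8 is unsupported on A100
    tensor_parallel_size=1,        # raise this if one A100 can't hold the weights
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a Rust function that reverses a string."], params)
print(outputs[0].outputs[0].text)
```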
Would you perhaps share a performant model?
I thought the same, that models maxed on benchmark tests don't carry that performance over to daily usage, but the truth seems to be that the better they score on the tests, the better they are overall.
I used Qwen 3.6 Plus before seeing any graphs and I really enjoyed it, along with MiMo v2 Pro. MiMo Flash was also nice but way less accurate.
So what is your take on this?
I have only used models from providers, not run them myself, but saying Gemma 4 is better than M2.7 is nonsense. It barely manages the action calls to read files and understand a project, so it's hard to believe it can think through solutions and make edits in real usage (big project, Kilo Code).
Benchmarks showed programming-related scores, they didn't suggest it would win at Jeopardy.
If you want a general-purpose chatbot lots of those exist. The main thing people want local models for is programming.
I have not been having issues with even general-purpose things. Using a recent llama.cpp build with Vulkan, the model does great with both general questions and technical/coding problems.
I did hit an issue where, if my NVIDIA eGPU was detected as vulkan0 instead of my 8060S, the accuracy and quality dropped off a cliff. I ran the browser-based OS test three times with the NVIDIA card detected as vulkan0, and all three times the output was just a skeleton with nonexistent/terrible icons and no functionality whatsoever, with an average output of 5k-6k tokens each time. With the 8060S as vulkan0 and the NVIDIA card as vulkan1, I reran the exact same test, same settings, same quant, and all three times got a fully functional, good-looking output with an average length of 11k-14k tokens. Absolutely no idea why this happened. I also noticed that ngram speculative decoding makes the model dumber; again, no idea why.
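If anyone wants to rule out device-ordering weirdness, here's the workaround I'd sketch. It assumes your llama.cpp Vulkan build honors the GGML_VK_VISIBLE_DEVICES environment variable (check your build's docs; the device index below is illustrative, and you can confirm indices with vulkaninfo):

```python
# Sketch: pin llama.cpp's Vulkan backend to a single GPU so device ordering
# can't silently change which card ends up as vulkan0.
# Assumption: the build honors GGML_VK_VISIBLE_DEVICES (comma-separated indices).
import os
import subprocess

env = os.environ.copy()
env["GGML_VK_VISIBLE_DEVICES"] = "0"  # illustrative: index of the GPU to use

subprocess.run(
    ["llama-server", "-m", "model.gguf", "--port", "8080"],  # placeholder model path
    env=env,
    check=True,
)
```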
@edwarddddr Gemma 4 26b is currently my favorite.
It's much faster than G4 31b and nearly as good, and in some ways better. For example, since it has dedicated experts it's better at narrow domains like math, as well as long form recollection (e.g. quotes, movie casts, and song lyrics).
Also, one of the biggest weaknesses of G4 26b relative to 31b (periodically falling off the rails) really isn't a problem if the user is capable of constructing effective and unambiguous prompts, especially the initial one. This ensures the router activates the appropriate experts.
They both can do more than just regurgitate stories and poems from their training data, which is all many models can handle; they can also construct original ones that align with complex prompts filled with inclusions and exclusions. The same goes for explaining science and technology and responding appropriately to the nuance in any follow-up prompt. And the coding help is reliably on point and usually provides multiple options. And so on, such as good grammar checking and synonym lists.
The biggest weakness of both of them is pop culture. I understand the benefit that comes from training on highbrow data, but any model capable of serving the needs of the general population needs to at least include core pop-culture knowledge (RAG isn't a viable solution).
Qwen3.5 34b (and likely Qwen3.6) is a close second. Their abilities are comparable, but with the slight edge reliably going to Gemma. Qwen3 used to be way behind in broad knowledge, but Qwen kept adding more and more, while Gemma kept losing a little from G2 to G3 to G4. So now they're about equal in broad knowledge. Lastly, Qwen3.5 identified the subject in a broader spectrum of images than Gemma 4.
Bear in mind that I'm not a real coder. These are just what I consider to be the best general purpose AI models when factoring in broad ability, speed, and size.