AI & ML interests

The AI community building the future.

Recent Activity

evalstate 
updated a bucket about 3 hours ago
alvarobartt 
posted an update 6 days ago
Open agents on AWS SageMaker AI with open models from the Hugging Face Hub!

> Deploy an open model from the Hugging Face Hub on SageMaker AI
> Connect the deployed model to Strands Agents
> Add built-in and custom tools for tool calling
> Expose external capabilities through MCP integration
> Bonus: talk to your agent and visualize traces with Gradio

https://alvarobartt.com/agents-on-aws-sagemaker
danieldk 
posted an update 8 days ago
Two large changes in kernel-builder this week:

kernel-builder now links libstdc++ dynamically. To support a wide range of systems, we build against libstdc++ from manylinux_2_28 (EL 8 and later).

In line with our Torch support policy, which covers the current and previous Torch versions, Torch 2.10 support was removed. We will soon also support the Torch stable ABI, which makes it possible to write kernels that work across a large number of Torch versions.
alvarobartt 
posted an update 10 days ago
Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!

TL;DR: MoEs can be misleading to reason about from active parameters alone. Each token activates only a subset of experts, but the serving setup still needs to account for the full resident memory footprint.

🧠 hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
🏗️ Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
⚡ Active parameter count isn't the same as memory footprint, especially for sparse architectures
📦 Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
📚 KV cache can still dominate depending on context length, batch size, and concurrency
🔀 Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
🚀 Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
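
The gap between active parameters and resident memory can be made concrete with a back-of-the-envelope calculation. The sketch below is not how hf-mem computes its breakdown; it is a minimal illustration under stated assumptions. The `MoEConfig` fields, the helper names, and every number in the example are hypothetical. It assumes all layers are MoE layers with identically sized experts, bf16 weights and KV cache, and it ignores activations and runtime overhead.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # Hypothetical fields; real model configs name and group these differently.
    base_params: int         # attention, embeddings, routers, other shared weights
    params_per_expert: int   # parameters in one routed expert of one layer
    num_experts: int         # routed experts per MoE layer
    experts_per_token: int   # experts each token is routed to (top-k)
    num_layers: int          # assumed: every layer is an MoE layer with a KV cache
    num_kv_heads: int
    head_dim: int
    bytes_per_param: int = 2  # assumed bf16 weights
    bytes_per_kv: int = 2     # assumed bf16 KV cache


def resident_weight_bytes(cfg: MoEConfig) -> int:
    """Weights that must be loaded: base weights plus ALL routed experts."""
    experts = cfg.params_per_expert * cfg.num_experts * cfg.num_layers
    return (cfg.base_params + experts) * cfg.bytes_per_param


def active_weight_bytes(cfg: MoEConfig) -> int:
    """Weights a single token touches: base weights plus its top-k experts."""
    experts = cfg.params_per_expert * cfg.experts_per_token * cfg.num_layers
    return (cfg.base_params + experts) * cfg.bytes_per_param


def kv_cache_bytes(cfg: MoEConfig, context_len: int, batch_size: int) -> int:
    """KV cache size: one key and one value vector per layer, head, and token."""
    per_token = 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_kv
    return per_token * context_len * batch_size


# Made-up model, chosen only to make the arithmetic visible.
cfg = MoEConfig(
    base_params=2_000_000_000,
    params_per_expert=50_000_000,
    num_experts=64,
    experts_per_token=4,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)

GB = 1e9
print(f"resident weights: {resident_weight_bytes(cfg) / GB:.1f} GB")
print(f"active weights:   {active_weight_bytes(cfg) / GB:.1f} GB")
print(f"KV cache:         {kv_cache_bytes(cfg, context_len=8192, batch_size=16) / GB:.1f} GB")
```

For this made-up configuration the resident weights are more than ten times the active weights, and the KV cache at 8192 tokens and batch size 16 is about as large as the active weights themselves.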

Check the repository at https://github.com/alvarobartt/hf-mem
erikkaum 
posted an update 10 days ago
Releasing my first kernel 🔥 MaxSim

Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full similarity matrix. This kernel avoids that with tiled scoring, using simdgroup_matrix (Metal) and WMMA (CUDA).

The result is a 3–5× speedup over a naive PyTorch baseline 🔥

Benchmarks:
- SmallRerank (B=32, C=10): up to 3.2× (M3 Pro) / 2.8× (A100)
- HeavyRerank (B=32, C=100): up to 3.8× (M3 Pro) / 5.3× (A100)
- LongDocStress (Ld=1024): up to 6.2× (L4)
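
To illustrate the idea behind tiled scoring, here is a minimal NumPy sketch. It is not the kernel's implementation. It only shows the standard MaxSim score (for each query token, take the maximum similarity over document tokens, then sum over query tokens) computed two ways: once by materializing the full similarity tensor, and once by streaming over tiles of document tokens while keeping a running maximum. The function names, shapes, and tile size are assumptions for illustration. The sketch scores a single query against `C` candidate documents and ignores padding masks.

```python
import numpy as np


def maxsim_naive(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """query: (Lq, D), docs: (C, Ld, D) -> scores: (C,).

    Materializes the full (C, Lq, Ld) similarity tensor.
    """
    sim = np.einsum("qd,cld->cql", query, docs)
    return sim.max(axis=2).sum(axis=1)


def maxsim_tiled(query: np.ndarray, docs: np.ndarray, tile: int = 32) -> np.ndarray:
    """Same scores, but only a (C, Lq, tile) block of similarities exists at a time."""
    num_docs, doc_len, _ = docs.shape
    running_max = np.full((num_docs, query.shape[0]), -np.inf, dtype=query.dtype)
    for start in range(0, doc_len, tile):
        block = docs[:, start:start + tile, :]           # (C, t, D), t <= tile
        sim = np.einsum("qd,ctd->cqt", query, block)     # (C, Lq, t)
        np.maximum(running_max, sim.max(axis=2), out=running_max)
    return running_max.sum(axis=1)


rng = np.random.default_rng(0)
query = rng.standard_normal((16, 64))       # Lq=16 query tokens, D=64
docs = rng.standard_normal((10, 100, 64))   # C=10 candidates, Ld=100 tokens each

naive_scores = maxsim_naive(query, docs)
tiled_scores = maxsim_tiled(query, docs, tile=32)
print(np.allclose(naive_scores, tiled_scores))
```

Because the maximum over document tokens can be accumulated tile by tile, the peak intermediate size depends on the tile width rather than on `Ld`. In a GPU kernel the running maximum and the tile would typically live in registers or on-chip memory, which this NumPy version does not model.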

Try it out 👇
https://huggingface.co/kernels/erikkaum/maxsim
evalstate 
posted an update 12 days ago
Hugging Face MCP Server v0.3.13
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The hf_jobs tool now allows mounting volumes with hf:// URIs, and adds some notes on use for data analysis.
evalstate 
posted an update 22 days ago
Hugging Face MCP Server v0.3.12
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The hub_repo_details tool now enables Dataset inspection (view splits, sample rows).
evalstate 
posted an update 29 days ago
Hugging Face MCP Server v0.3.10
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reverted the mcp bucket feature in favour of the upcoming MCP App integration.
evalstate 
posted an update about 1 month ago
Hugging Face MCP Server v0.3.9
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Users with a bucket named mcp will get an additional list_files tool that returns the public URL of contained files. This is primarily intended for use with Gradio Spaces that need URLs as inputs.