- Implementation and Applications of WakeWords Integrated with Speaker Recognition: A Case Study This paper explores the application of artificial intelligence techniques in audio and voice processing, focusing on the integration of wake words and speaker recognition for secure access in embedded systems. With the growing prevalence of voice-activated devices such as Amazon Alexa, ensuring secure and user-specific interactions has become paramount. Our study aims to enhance the security framework of these systems by leveraging wake words for initial activation and speaker recognition to validate user permissions. By incorporating these AI-driven methodologies, we propose a robust solution that restricts system usage to authorized individuals, thereby mitigating unauthorized access risks. This research delves into the algorithms and technologies underpinning wake word detection and speaker recognition, evaluates their effectiveness in real-world applications, and discusses the potential for their implementation in various embedded systems, emphasizing security and user convenience. The findings underscore the feasibility and advantages of employing these AI techniques to create secure, user-friendly voice-activated systems. 4 authors · Jul 24, 2024
- Implementation of the rROF denoising method in the cWB pipeline for gravitational-wave data analysis The data collected by the current network of gravitational-wave detectors are largely dominated by instrumental noise. Total variation methods based on L1-norm minimization have recently been proposed as a powerful technique for noise removal in gravitational-wave data. In particular, the regularized Rudin-Osher-Fatemi (rROF) model has proven effective to denoise signals embedded in either simulated Gaussian noise or actual detector noise. Importing the rROF model to existing search pipelines seems therefore worth considering. In this paper, we discuss the implementation of two variants of the rROF algorithm as two separate plug-ins of the coherent Wave Burst (cWB) pipeline designed to conduct searches of unmodelled gravitational-wave burst sources. The first approach is based on a single-step rROF method and the second one employs an iterative rROF procedure. Both approaches are calibrated using actual gravitational-wave events from the first three observing runs of the LIGO-Virgo-KAGRA collaboration, namely GW1501914, GW151226, GW170817, and GW190521, encompassing different types of compact binary coalescences. Our analysis shows that the iterative version of the rROF denoising algorithm implemented in the cWB pipeline effectively eliminates noise while preserving the waveform signals intact. Therefore, the combined approach yields higher signal-to-noise values than those computed by the cWB pipeline without the rROF denoising step. The incorporation of the iterative rROF algorithm in the cWB pipeline might hence impact the detectability capabilities of the pipeline along with the inference of source properties. 6 authors · Feb 21, 2022
7 A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner where each block consists of a coarse Kronecker product approximation to full-matrix AdaGrad for each parameter of the neural network. In this work, we provide a complete description of the algorithm as well as the performance optimizations that our implementation leverages to train deep networks at-scale in PyTorch. Our implementation enables fast multi-GPU distributed data-parallel training by distributing the memory and computation associated with blocks of each parameter via PyTorch's DTensor data structure and performing an AllGather primitive on the computed search directions at each iteration. This major performance enhancement enables us to achieve at most a 10% performance reduction in per-step wall-clock time compared against standard diagonal-scaling-based adaptive gradient methods. We validate our implementation by performing an ablation study on training ImageNet ResNet50, demonstrating Shampoo's superiority over standard training recipes with minimal hyperparameter tuning. 8 authors · Sep 12, 2023
1 OpenDCVCs: A PyTorch Open Source Implementation and Performance Evaluation of the DCVC series Video Codecs We present OpenDCVCs, an open-source PyTorch implementation designed to advance reproducible research in learned video compression. OpenDCVCs provides unified and training-ready implementations of four representative Deep Contextual Video Compression (DCVC) models--DCVC, DCVC with Temporal Context Modeling (DCVC-TCM), DCVC with Hybrid Entropy Modeling (DCVC-HEM), and DCVC with Diverse Contexts (DCVC-DC). While the DCVC series achieves substantial bitrate reductions over both classical codecs and advanced learned models, previous public code releases have been limited to evaluation codes, presenting significant barriers to reproducibility, benchmarking, and further development. OpenDCVCs bridges this gap by offering a comprehensive, self-contained framework that supports both end-to-end training and evaluation for all included algorithms. The implementation includes detailed documentation, evaluation protocols, and extensive benchmarking results across diverse datasets, providing a transparent and consistent foundation for comparison and extension. All code and experimental tools are publicly available at https://gitlab.com/viper-purdue/opendcvcs, empowering the community to accelerate research and foster collaboration. 2 authors · Aug 6
1 Scattered Mixture-of-Experts Implementation We present ScatterMoE, an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs. ScatterMoE builds upon existing implementations, and overcoming some of the limitations to improve inference and training speed, and memory footprint. This implementation achieves this by avoiding padding and making excessive copies of the input. We introduce ParallelLinear, the main component we use to build our implementation and the various kernels used to speed up the operation. We benchmark our implementation against Megablocks, and show that it enables a higher throughput and lower memory footprint. We also show how ParallelLinear enables extension of the Mixture-of-Experts concept by demonstrating with an implementation of Mixture of Attention. 4 authors · Mar 13, 2024
1 PINN surrogate of Li-ion battery models for parameter inference. Part I: Implementation and multi-fidelity hierarchies for the single-particle model To plan and optimize energy storage demands that account for Li-ion battery aging dynamics, techniques need to be developed to diagnose battery internal states accurately and rapidly. This study seeks to reduce the computational resources needed to determine a battery's internal states by replacing physics-based Li-ion battery models -- such as the single-particle model (SPM) and the pseudo-2D (P2D) model -- with a physics-informed neural network (PINN) surrogate. The surrogate model makes high-throughput techniques, such as Bayesian calibration, tractable to determine battery internal parameters from voltage responses. This manuscript is the first of a two-part series that introduces PINN surrogates of Li-ion battery models for parameter inference (i.e., state-of-health diagnostics). In this first part, a method is presented for constructing a PINN surrogate of the SPM. A multi-fidelity hierarchical training, where several neural nets are trained with multiple physics-loss fidelities is shown to significantly improve the surrogate accuracy when only training on the governing equation residuals. The implementation is made available in a companion repository (https://github.com/NREL/pinnstripes). The techniques used to develop a PINN surrogate of the SPM are extended in Part II for the PINN surrogate for the P2D battery model, and explore the Bayesian calibration capabilities of both surrogates. 9 authors · Dec 28, 2023
- Efficient Implementation of Gaussian Process Regression Accelerated Saddle Point Searches with Application to Molecular Reactions The task of locating first order saddle points on high-dimensional surfaces describing the variation of energy as a function of atomic coordinates is an essential step for identifying the mechanism and estimating the rate of thermally activated events within the harmonic approximation of transition state theory. When combined directly with electronic structure calculations, the number of energy and atomic force evaluations needed for convergence is a primary issue. Here, we describe an efficient implementation of Gaussian process regression (GPR) acceleration of the minimum mode following method where a dimer is used to estimate the lowest eigenmode of the Hessian. A surrogate energy surface is constructed and updated after each electronic structure calculation. The method is applied to a test set of 500 molecular reactions previously generated by Hermez and coworkers [J. Chem. Theory Comput. 18, 6974 (2022)]. An order of magnitude reduction in the number of electronic structure calculations needed to reach the saddle point configurations is obtained by using the GPR compared to the dimer method. Despite the wide range in stiffness of the molecular degrees of freedom, the calculations are carried out using Cartesian coordinates and are found to require similar number of electronic structure calculations as an elaborate internal coordinate method implemented in the Sella software package. The present implementation of the GPR surrogate model in C++ is efficient enough for the wall time of the saddle point searches to be reduced in 3 out of 4 cases even though the calculations are carried out at a low Hartree-Fock level. 5 authors · May 18
- An Implementation of Werewolf Agent That does not Truly Trust LLMs Werewolf is an incomplete information game, which has several challenges when creating a computer agent as a player given the lack of understanding of the situation and individuality of utterance (e.g., computer agents are not capable of characterful utterance or situational lying). We propose a werewolf agent that solves some of those difficulties by combining a Large Language Model (LLM) and a rule-based algorithm. In particular, our agent uses a rule-based algorithm to select an output either from an LLM or a template prepared beforehand based on the results of analyzing conversation history using an LLM. It allows the agent to refute in specific situations, identify when to end the conversation, and behave with persona. This approach mitigated conversational inconsistencies and facilitated logical utterance as a result. We also conducted a qualitative evaluation, which resulted in our agent being perceived as more human-like compared to an unmodified LLM. The agent is freely available for contributing to advance the research in the field of Werewolf game. 3 authors · Sep 2, 2024
- Real-time Object Detection: YOLOv1 Re-Implementation in PyTorch Real-time object detection is a crucial problem to solve when in comes to computer vision systems that needs to make appropriate decision based on detection in a timely manner. I have chosen the YOLO v1 architecture to implement it using PyTorch framework, with goal to familiarize with entire object detection pipeline I attempted different techniques to modify the original architecture to improve the results. Finally, I compare the metrics of my implementation to the original. 1 authors · May 28, 2023
45 SageAttention2++: A More Efficient Implementation of SageAttention2 The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster instruction of FP8 Matmul accumulated in FP16. The instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention. 8 authors · May 27 3
20 FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. The feature implementation requires LLMs to simultaneously possess code completion capabilities for new components and code editing abilities for other relevant parts in the code repository, providing a more comprehensive evaluation method of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse in the FEA-Bench, highlighting considerable challenges in such repository-level incremental code development. 9 authors · Mar 9 7
6 Splatfacto-W: A Nerfstudio Implementation of Gaussian Splatting for Unconstrained Photo Collections Novel view synthesis from unconstrained in-the-wild image collections remains a significant yet challenging task due to photometric variations and transient occluders that complicate accurate scene reconstruction. Previous methods have approached these issues by integrating per-image appearance features embeddings in Neural Radiance Fields (NeRFs). Although 3D Gaussian Splatting (3DGS) offers faster training and real-time rendering, adapting it for unconstrained image collections is non-trivial due to the substantially different architecture. In this paper, we introduce Splatfacto-W, an approach that integrates per-Gaussian neural color features and per-image appearance embeddings into the rasterization process, along with a spherical harmonics-based background model to represent varying photometric appearances and better depict backgrounds. Our key contributions include latent appearance modeling, efficient transient object handling, and precise background modeling. Splatfacto-W delivers high-quality, real-time novel view synthesis with improved scene consistency in in-the-wild scenarios. Our method improves the Peak Signal-to-Noise Ratio (PSNR) by an average of 5.3 dB compared to 3DGS, enhances training speed by 150 times compared to NeRF-based methods, and achieves a similar rendering speed to 3DGS. Additional video results and code integrated into Nerfstudio are available at https://kevinxu02.github.io/splatfactow/. 3 authors · Jul 17, 2024 2
6 The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization This work is the first to openly reproduce the Reinforcement Learning from Human Feedback (RLHF) scaling behaviors reported in OpenAI's seminal TL;DR summarization work. We create an RLHF pipeline from scratch, enumerate over 20 key implementation details, and share key insights during the reproduction. Our RLHF-trained Pythia models demonstrate significant gains in response quality that scale with model size, with our 2.8B, 6.9B models outperforming OpenAI's released 1.3B checkpoint. We publicly release the trained model checkpoints and code to facilitate further research and accelerate progress in the field (https://github.com/vwxyzjn/summarize_from_feedback_details). 6 authors · Mar 23, 2024
2 SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM Recent advances of reasoning models, exemplified by OpenAI's o1 and DeepSeek's R1, highlight the significant potential of Reinforcement Learning (RL) to enhance the reasoning capabilities of Large Language Models (LLMs). However, replicating these advancements across diverse domains remains challenging due to limited methodological transparency. In this work, we present two-Staged history-Resampling Policy Optimization (SRPO), which successfully surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks. SRPO achieves this using the same base model as DeepSeek (i.e. Qwen2.5-32B) and relies solely on RL, without prior Supervised Fine-Tuning (SFT). Building upon Group Relative Policy Optimization (GRPO), we introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique to address ineffective samples. Our comprehensive experiments validate the effectiveness of our approach, dedicating to offer valuable insights into scaling LLM reasoning capabilities across diverse tasks. 17 authors · Apr 19
1 GWKokab: An Implementation to Identify the Properties of Multiple Population of Gravitational Wave Sources The rapidly increasing sensitivity of gravitational wave detectors is enabling the detection of a growing number of compact binary mergers. These events are crucial for understanding the population properties of compact binaries. However, many previous studies rely on computationally expensive inference frameworks, limiting their scalability. In this work, we present GWKokab, a JAX-based framework that enables modular model building with independent rate for each subpopulation such as BBH, BNS, and NSBH binaries. It provides accelerated inference using the normalizing flow based sampler called flowMC and is also compatible with NumPyro samplers. To validate our framework, we generated two synthetic populations, one comprising spinning eccentric binaries and the other circular binaries using a multi-source model. We then recovered their injected parameters at significantly reduced computational cost and demonstrated that eccentricity distribution can be recovered even in spinning eccentric populations. We also reproduced results from two prior studies: one on non-spinning eccentric populations, and another on the BBH mass distribution using the third Gravitational Wave Transient Catalog (GWTC-3). We anticipate that GWKokab will not only reduce computational costs but also enable more detailed subpopulation analyses such as their mass, spin, eccentricity, and redshift distributions in gravitational wave events, offering deeper insights into compact binary formation and evolution. 3 authors · Sep 16
1 Nerva: a Truly Sparse Implementation of Neural Networks We introduce Nerva, a fast neural network library under development in C++. It supports sparsity by using the sparse matrix operations of Intel's Math Kernel Library (MKL), which eliminates the need for binary masks. We show that Nerva significantly decreases training time and memory usage while reaching equivalent accuracy to PyTorch. We run static sparse experiments with an MLP on CIFAR-10. On high sparsity levels like 99%, the runtime is reduced by a factor of 4times compared to a PyTorch model using masks. Similar to other popular frameworks such as PyTorch and Keras, Nerva offers a Python interface for users to work with. 5 authors · Jul 24, 2024
1 A Review and Efficient Implementation of Scene Graph Generation Metrics Scene graph generation has emerged as a prominent research field in computer vision, witnessing significant advancements in the recent years. However, despite these strides, precise and thorough definitions for the metrics used to evaluate scene graph generation models are lacking. In this paper, we address this gap in the literature by providing a review and precise definition of commonly used metrics in scene graph generation. Our comprehensive examination clarifies the underlying principles of these metrics and can serve as a reference or introduction to scene graph metrics. Furthermore, to facilitate the usage of these metrics, we introduce a standalone Python package called SGBench that efficiently implements all defined metrics, ensuring their accessibility to the research community. Additionally, we present a scene graph benchmarking web service, that enables researchers to compare scene graph generation methods and increase visibility of new methods in a central place. All of our code can be found at https://lorjul.github.io/sgbench/. 4 authors · Apr 15, 2024
1 Design and implementation of intelligent packet filtering in IoT microcontroller-based devices Internet of Things (IoT) devices are increasingly pervasive and essential components in enabling new applications and services. However, their widespread use also exposes them to exploitable vulnerabilities and flaws that can lead to significant losses. In this context, ensuring robust cybersecurity measures is essential to protect IoT devices from malicious attacks. However, the current solutions that provide flexible policy specifications and higher security levels for IoT devices are scarce. To address this gap, we introduce T800, a low-resource packet filter that utilizes machine learning (ML) algorithms to classify packets in IoT devices. We present a detailed performance benchmarking framework and demonstrate T800's effectiveness on the ESP32 system-on-chip microcontroller and ESP-IDF framework. Our evaluation shows that T800 is an efficient solution that increases device computational capacity by excluding unsolicited malicious traffic from the processing pipeline. Additionally, T800 is adaptable to different systems and provides a well-documented performance evaluation strategy for security ML-based mechanisms on ESP32-based IoT systems. Our research contributes to improving the cybersecurity of resource-constrained IoT devices and provides a scalable, efficient solution that can be used to enhance the security of IoT systems. 6 authors · May 30, 2023
- Advancing Multi-Agent Systems Through Model Context Protocol: Architecture, Implementation, and Applications Multi-agent systems represent a significant advancement in artificial intelligence, enabling complex problem-solving through coordinated specialized agents. However, these systems face fundamental challenges in context management, coordination efficiency, and scalable operation. This paper introduces a comprehensive framework for advancing multi-agent systems through Model Context Protocol (MCP), addressing these challenges through standardized context sharing and coordination mechanisms. We extend previous work on AI agent architectures by developing a unified theoretical foundation, advanced context management techniques, and scalable coordination patterns. Through detailed implementation case studies across enterprise knowledge management, collaborative research, and distributed problem-solving domains, we demonstrate significant performance improvements compared to traditional approaches. Our evaluation methodology provides a systematic assessment framework with benchmark tasks and datasets specifically designed for multi-agent systems. We identify current limitations, emerging research opportunities, and potential transformative applications across industries. This work contributes to the evolution of more capable, collaborative, and context-aware artificial intelligence systems that can effectively address complex real-world challenges. 1 authors · Apr 25
- CascadeV: An Implementation of Wurstchen Architecture for Video Generation Recently, with the tremendous success of diffusion models in the field of text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly in generating high-resolution videos with high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM), that is capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information, ensuring superior consistency across the generated video frames. Furthermore, our model can be cascaded with existing T2V models, theoretically enabling a 4times increase in resolution or frames per second without any fine-tuning. Our code is available at https://github.com/bytedance/CascadeV. 6 authors · Jan 27
- Navigating Knowledge Management Implementation Success in Government Organizations: A type-2 fuzzy approach Optimal information and knowledge management is crucial for organizations to achieve their objectives efficiently. As a rare and valuable resource, effective knowledge management provides a strategic advantage and has become a key determinant of organizational success. The study aims to identify critical success and failure factors for implementing knowledge management systems in government organizations. This research employs a descriptive survey methodology, collecting data through random interviews and questionnaires. The study highlights the critical success factors for knowledge management systems in government organizations, including cooperation, an open atmosphere, staff training, creativity and innovation, removal of organizational constraints, reward policies, role modeling, and focus. Conversely, failure to consider formality, staff participation, collaboration technologies, network and hardware infrastructure, complexity, IT staff, and trust can pose significant obstacles to successful implementation. 6 authors · Jun 18, 2024
- The Design and Implementation of XiaoIce, an Empathetic Social Chatbot This paper describes the development of Microsoft XiaoIce, the most popular social chatbot in the world. XiaoIce is uniquely designed as an AI companion with an emotional connection to satisfy the human need for communication, affection, and social belonging. We take into account both intelligent quotient (IQ) and emotional quotient (EQ) in system design, cast human-machine social chat as decision-making over Markov Decision Processes (MDPs), and optimize XiaoIce for long-term user engagement, measured in expected Conversation-turns Per Session (CPS). We detail the system architecture and key components including dialogue manager, core chat, skills, and an empathetic computing module. We show how XiaoIce dynamically recognizes human feelings and states, understands user intent, and responds to user needs throughout long conversations. Since her launch in 2014, XiaoIce has communicated with over 660 million active users and succeeded in establishing long-term relationships with many of them. Analysis of large scale online logs shows that XiaoIce has achieved an average CPS of 23, which is significantly higher than that of other chatbots and even human conversations. 4 authors · Dec 21, 2018
2 CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms CleanRL is an open-source library that provides high-quality single-file implementations of Deep Reinforcement Learning algorithms. It provides a simpler yet scalable developing experience by having a straightforward codebase and integrating production tools to help interact and scale experiments. In CleanRL, we put all details of an algorithm into a single file, making these performance-relevant details easier to recognize. Additionally, an experiment tracking feature is available to help log metrics, hyperparameters, videos of an agent's gameplay, dependencies, and more to the cloud. Despite succinct implementations, we have also designed tools to help scale, at one point orchestrating experiments on more than 2000 machines simultaneously via Docker and cloud providers. Finally, we have ensured the quality of the implementations by benchmarking against a variety of environments. The source code of CleanRL can be found at https://github.com/vwxyzjn/cleanrl 4 authors · Nov 16, 2021
36 Large Action Models: From Inception to Implementation As AI continues to advance, there is a growing demand for systems that go beyond language-based assistance and move toward intelligent agents capable of performing real-world actions. This evolution requires the transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), designed for action generation and execution within dynamic environments. Enabled by agent systems, LAMs hold the potential to transform AI from passive language understanding to active task completion, marking a significant milestone in the progression toward artificial general intelligence. In this paper, we present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. This generalizable workflow can serve as a blueprint for creating functional LAMs in various application domains. We conclude by identifying the current limitations of LAMs and discussing directions for future research and industrial deployment, emphasizing the challenges and opportunities that lie ahead in realizing the full potential of LAMs in real-world applications. The code for the data collection process utilized in this paper is publicly available at: https://github.com/microsoft/UFO/tree/main/dataflow, and comprehensive documentation can be found at https://microsoft.github.io/UFO/dataflow/overview/. 18 authors · Dec 13, 2024 5
7 TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation Modern natural language processing models have achieved unprecedented scale, yet the tools for their evaluation often remain a computational bottleneck, limiting the pace of research. This is particularly acute for in-training evaluation metrics, such as per-sentence reward signals in Reinforcement Learning, which must operate efficiently on batches of token IDs directly on the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the BLEU metric designed from the ground up for this specific use case. Our approach is fully vectorized for GPU-accelerated, per-sentence computation within PyTorch and introduces a memory-efficient counting mechanism. By creating a compact, batch-specific dictionary of n-grams using torch.unique, our method avoids the prohibitive memory costs of traditional hashing-based vectorization, making it practical for large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard library for token-ID-based BLEU calculation on the CPU. Experiments show that TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and exceeding 40x on data-center-class hardware (NVIDIA A100). This performance transforms a significant bottleneck into a negligible part of the training loop. By clearly defining its role as a "Token-ID BLEU" for development purposes and open-sourcing our implementation, we provide a powerful tool for accelerating research in areas like RL-based model fine-tuning. Reactive AI · Oct 6 2
2 Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism The Mixture of Experts (MoE) model becomes an important choice of large language models nowadays because of its scalability with sublinear computational complexity for training and inference. However, existing MoE models suffer from two critical drawbacks, 1) tremendous inner-node and inter-node communication overhead introduced by all-to-all dispatching and gathering, and 2) limited scalability for the backbone because of the bound data parallel and expert parallel to scale in the expert dimension. In this paper, we systematically analyze these drawbacks in terms of training efficiency in the parallel framework view and propose a novel MoE architecture called Pipeline MoE (PPMoE) to tackle them. PPMoE builds expert parallel incorporating with tensor parallel and replaces communication-intensive all-to-all dispatching and gathering with a simple tensor index slicing and inner-node all-reduce. Besides, it is convenient for PPMoE to integrate pipeline parallel to further scale the backbone due to its flexible parallel architecture. Extensive experiments show that PPMoE not only achieves a more than 1.75times speed up compared to existing MoE architectures but also reaches 90% throughput of its corresponding backbone model that is 20times smaller. 6 authors · Apr 22, 2023
- FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as "vibe coding," where users interact with coding agents through high-level natural language. However, existing evaluation benchmarks for code generation inadequately assess an agent's vibe coding capabilities. Existing benchmarks are misaligned, as they either require code-level specifications or focus narrowly on issue-solving, neglecting the critical scenario of feature implementation within the vibe coding paradiam. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: 1. Pure Natural Language Prompts. Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. 2. A Rigorous & Evolving Data Collection Process. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination. 3. Comprehensive Test Cases. Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. 4. Diverse Application Domains. The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%. Our analysis also reveals a tendency for "aggressive implementation," a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research. 3 authors · Sep 26
- Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel Recent progress in sparse attention mechanisms has demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), a state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers substantial system-level performance gains while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA relies on a query-grouping strategy that is efficient only with large Grouped Query Attention (GQA) sizes, whereas modern LLMs typically adopt much smaller GQA groups, which limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), which includes an alternative kernel design that enables efficient NSA computation across a wide range of popular LLMs with varied smaller GQA group sizes on modern GPUs. Compared to vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5times and on average 1.6times kernel-level latency reduction, (ii) up to 1.25times and 1.09times on average end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36times and 1.11times on average end-to-end prefill speedup on state-of-the-art LLMs. The source code is open-sourced and publicly available at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention. 3 authors · Aug 25
- iEDA: An Open-Source Intelligent Physical Implementation Toolkit and Library Open-source EDA shows promising potential in unleashing EDA innovation and lowering the cost of chip design. This paper presents an open-source EDA project, iEDA, aiming for building a basic infrastructure for EDA technology evolution and closing the industrial-academic gap in the EDA area. iEDA now covers the whole flow of physical design (including Floorplan, Placement, CTS, Routing, Timing Optimization etc.), and part of the analysis tools (Static Timing Analysis and Power Analysis). To demonstrate the effectiveness of iEDA, we implement and tape out three chips of different scales (from 700k to 1.5M gates) on different process nodes (110nm and 28nm) with iEDA. iEDA is publicly available from the project home page http://ieda.oscc.cc. 56 authors · Aug 3, 2023
- Transfer of Representations to Video Label Propagation: Implementation Factors Matter This work studies feature representations for dense label propagation in video, with a focus on recently proposed methods that learn video correspondence using self-supervised signals such as colorization or temporal cycle consistency. In the literature, these methods have been evaluated with an array of inconsistent settings, making it difficult to discern trends or compare performance fairly. Starting with a unified formulation of the label propagation algorithm that encompasses most existing variations, we systematically study the impact of important implementation factors in feature extraction and label propagation. Along the way, we report the accuracies of properly tuned supervised and unsupervised still image baselines, which are higher than those found in previous works. We also demonstrate that augmenting video-based correspondence cues with still-image-based ones can further improve performance. We then attempt a fair comparison of recent video-based methods on the DAVIS benchmark, showing convergence of best methods to performance levels near our strong ImageNet baseline, despite the usage of a variety of specialized video-based losses and training particulars. Additional comparisons on JHMDB and VIP datasets confirm the similar performance of current methods. We hope that this study will help to improve evaluation practices and better inform future research directions in temporal correspondence. 6 authors · Mar 10, 2022
- Do Transformer Modifications Transfer Across Implementations and Applications? The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing. Surprisingly, we find that most modifications do not meaningfully improve performance. Furthermore, most of the Transformer variants we found beneficial were either developed in the same codebase that we used or are relatively minor changes. We conjecture that performance improvements may strongly depend on implementation details and correspondingly make some recommendations for improving the generality of experimental results. 16 authors · Feb 23, 2021
- Learning computationally efficient dictionaries and their implementation as fast transforms Dictionary learning is a branch of signal processing and machine learning that aims at finding a frame (called dictionary) in which some training data admits a sparse representation. The sparser the representation, the better the dictionary. The resulting dictionary is in general a dense matrix, and its manipulation can be computationally costly both at the learning stage and later in the usage of this dictionary, for tasks such as sparse coding. Dictionary learning is thus limited to relatively small-scale problems. In this paper, inspired by usual fast transforms, we consider a general dictionary structure that allows cheaper manipulation, and propose an algorithm to learn such dictionaries --and their fast implementation-- over training data. The approach is demonstrated experimentally with the factorization of the Hadamard matrix and with synthetic dictionary learning experiments. 2 authors · Jun 20, 2014
- Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models - from acting as guardrails to enabling effective AI governance. We also deep dive into inherent challenges in their development and discuss future work aimed at making the detectors more reliable and broadening their scope. 38 authors · Mar 9, 2024
- Practical randomness amplification and privatisation with implementations on quantum computers We present an end-to-end and practical randomness amplification and privatisation protocol based on Bell tests. This allows the building of device-independent random number generators which output (near-)perfectly unbiased and private numbers, even if using an uncharacterised quantum device potentially built by an adversary. Our generation rates are linear in the repetition rate of the quantum device and the classical randomness post-processing has quasi-linear complexity - making it efficient on a standard personal laptop. The statistical analysis is also tailored for real-world quantum devices. Our protocol is then showcased on several different quantum computers. Although not purposely built for the task, we show that quantum computers can run faithful Bell tests by adding minimal assumptions. In this semi-device-independent manner, our protocol generates (near-)perfectly unbiased and private random numbers on today's quantum computers. 5 authors · Sep 14, 2020
16 Neural Circuit Diagrams: Robust Diagrams for the Communication, Implementation, and Analysis of Deep Learning Architectures Diagrams matter. Unfortunately, the deep learning community has no standard method for diagramming architectures. The current combination of linear algebra notation and ad-hoc diagrams fails to offer the necessary precision to understand architectures in all their detail. However, this detail is critical for faithful implementation, mathematical analysis, further innovation, and ethical assurances. I present neural circuit diagrams, a graphical language tailored to the needs of communicating deep learning architectures. Neural circuit diagrams naturally keep track of the changing arrangement of data, precisely show how operations are broadcast over axes, and display the critical parallel behavior of linear operations. A lingering issue with existing diagramming methods is the inability to simultaneously express the detail of axes and the free arrangement of data, which neural circuit diagrams solve. Their compositional structure is analogous to code, creating a close correspondence between diagrams and implementation. In this work, I introduce neural circuit diagrams for an audience of machine learning researchers. After introducing neural circuit diagrams, I cover a host of architectures to show their utility and breed familiarity. This includes the transformer architecture, convolution (and its difficult-to-explain extensions), residual networks, the U-Net, and the vision transformer. I include a Jupyter notebook that provides evidence for the close correspondence between diagrams and code. Finally, I examine backpropagation using neural circuit diagrams. I show their utility in providing mathematical insight and analyzing algorithms' time and space complexities. 1 authors · Feb 8, 2024 1
2 Bridging Theory and Practice in Quantum Game Theory: Optimized Implementation of the Battle of the Sexes with Error Mitigation on NISQ Hardware Implementing quantum game theory on real hardware is challenging due to noise, decoherence, and limited qubit connectivity, yet such demonstrations are essential to validate theoretical predictions. We present one of the first full experimental realizations of the Battle of the Sexes game under the Eisert-Wilkens-Lewenstein (EWL) framework on IBM Quantum's ibm sherbrooke superconducting processor. Four quantum strategies (I, H, R(pi/4), R(pi)) were evaluated across 31 entanglement values gamma in [0, pi] using 2048 shots per configuration, enabling a direct comparison between analytical predictions and hardware execution. To mitigate noise and variability, we introduce a Guided Circuit Mapping (GCM) method that dynamically selects qubit pairs and optimizes routing based on real-time topology and calibration data. The analytical model forecasts up to 108% payoff improvement over the classical equilibrium, and despite hardware-induced deviations, experimental results with GCM preserve the expected payoff trends within 3.5%-12% relative error. These findings show that quantum advantages in strategic coordination can persist under realistic NISQ conditions, providing a pathway toward practical applications of quantum game theory in multi-agent, economic, and distributed decision-making systems. 5 authors · Aug 12 2
- BreastDCEDL: A Comprehensive Breast Cancer DCE-MRI Dataset and Transformer Implementation for Treatment Response Prediction Breast cancer remains a leading cause of cancer-related mortality worldwide, making early detection and accurate treatment response monitoring critical priorities. We present BreastDCEDL, a curated, deep learning-ready dataset comprising pre-treatment 3D Dynamic Contrast-Enhanced MRI (DCE-MRI) scans from 2,070 breast cancer patients drawn from the I-SPY1, I-SPY2, and Duke cohorts, all sourced from The Cancer Imaging Archive. The raw DICOM imaging data were rigorously converted into standardized 3D NIfTI volumes with preserved signal integrity, accompanied by unified tumor annotations and harmonized clinical metadata including pathologic complete response (pCR), hormone receptor (HR), and HER2 status. Although DCE-MRI provides essential diagnostic information and deep learning offers tremendous potential for analyzing such complex data, progress has been limited by lack of accessible, public, multicenter datasets. BreastDCEDL addresses this gap by enabling development of advanced models, including state-of-the-art transformer architectures that require substantial training data. To demonstrate its capacity for robust modeling, we developed the first transformer-based model for breast DCE-MRI, leveraging Vision Transformer (ViT) architecture trained on RGB-fused images from three contrast phases (pre-contrast, early post-contrast, and late post-contrast). Our ViT model achieved state-of-the-art pCR prediction performance in HR+/HER2- patients (AUC 0.94, accuracy 0.93). BreastDCEDL includes predefined benchmark splits, offering a framework for reproducible research and enabling clinically meaningful modeling in breast cancer imaging. 5 authors · Jun 13
- A Tale of Two Sides of Wafer: Physical Implementation and Block-Level PPA on Flip FET with Dual-sided Signals As the conventional scaling of logic devices comes to an end, functional wafer backside and 3D transistor stacking are consensus for next-generation logic technology, offering considerable design space extension for powers, signals or even devices on the wafer backside. The Flip FET (FFET), a novel transistor architecture combining 3D transistor stacking and fully functional wafer backside, was recently proposed. With symmetric dual-sided standard cell design, the FFET can deliver around 12.5% cell area scaling and faster but more energy-efficient libraries beyond other stacked transistor technologies such as CFET. Besides, thanks to the novel cell design with dual-sided pins, the FFET supports dual-sided signal routing, delivering better routability and larger backside design space. In this work, we demonstrated a comprehensive FFET evaluation framework considering physical implementation and block-level power-performance-area (PPA) assessment for the first time, in which key functions are dual-sided routing and dual-sided RC extraction. A 32-bit RISC-V core was used for the evaluation here. Compared to the CFET with single-sided signals, the FFET with single-sided signals achieved 23.3% post-P&R core area reduction, 25.0% higher frequency and 11.9% lower power at the same utilization, and 16.0 % higher frequency at the same core area. Meanwhile, the FFET supports dual-sided signals, which can further benefit more from flexible allocation of cell input pins on both sides. By optimizing the input pin density and BEOL routing layer number on each side, 10.6% frequency gain was realized without power degradation compared to the one with single-sided signal routing. Moreover, the routability and power efficiency of FFET barely degrades even with the routing layer number reduced from 12 to 5 on each side, validating the great space for cost-friendly design enabled by FFET. 10 authors · Jan 25
- Review of deep learning models for crypto price prediction: implementation and evaluation There has been much interest in accurate cryptocurrency price forecast models by investors and researchers. Deep Learning models are prominent machine learning techniques that have transformed various fields and have shown potential for finance and economics. Although various deep learning models have been explored for cryptocurrency price forecasting, it is not clear which models are suitable due to high market volatility. In this study, we review the literature about deep learning for cryptocurrency price forecasting and evaluate novel deep learning models for cryptocurrency stock price prediction. Our deep learning models include variants of long short-term memory (LSTM) recurrent neural networks, variants of convolutional neural networks (CNNs), and the Transformer model. We evaluate univariate and multivariate approaches for multi-step ahead predicting of cryptocurrencies close-price. We also carry out volatility analysis on the four cryptocurrencies which reveals significant fluctuations in their prices throughout the COVID-19 pandemic. Additionally, we investigate the prediction accuracy of two scenarios identified by different training sets for the models. First, we use the pre-COVID-19 datasets to model cryptocurrency close-price forecasting during the early period of COVID-19. Secondly, we utilise data from the COVID-19 period to predict prices for 2023 to 2024. Our results show that the convolutional LSTM with a multivariate approach provides the best prediction accuracy in two major experimental settings. Our results also indicate that the multivariate deep learning models exhibit better performance in forecasting four different cryptocurrencies when compared to the univariate models. 5 authors · May 18, 2024
- Perpetual Observational Studies: New strategies to support efficient implementation of observational studies and randomized trials in the infectious diseases arena The increasing threat of emerging infectious diseases and antimicrobial resistance requires more efficient, high-quality research. Perpetual Observational Studies (POS) nested within a clinical research network can improve planning, quality and efficiency of interventional and observational studies, although real-life benefits and challenges need to be assessed. Ecraid (European Clinical Research Alliance on Infectious Diseases) has initiated POS and will monitor the impact for five specific infectious syndromes. 20 authors · Aug 8, 2022
- Enhancing Multilingual Information Retrieval in Mixed Human Resources Environments: A RAG Model Implementation for Multicultural Enterprise The advent of Large Language Models has revolutionized information retrieval, ushering in a new era of expansive knowledge accessibility. While these models excel in providing open-world knowledge, effectively extracting answers in diverse linguistic environments with varying levels of literacy remains a formidable challenge. Retrieval Augmented Generation (RAG) emerges as a promising solution, bridging the gap between information availability and multilingual comprehension. However, deploying RAG models in real-world scenarios demands careful consideration of various factors. This paper addresses the critical challenges associated with implementing RAG models in multicultural environments. We delve into essential considerations, including data feeding strategies, timely updates, mitigation of hallucinations, prevention of erroneous responses, and optimization of delivery speed. Our work involves the integration of a diverse array of tools, meticulously combined to facilitate the seamless adoption of RAG models across languages and literacy levels within a multicultural organizational context. Through strategic tweaks in our approaches, we achieve not only effectiveness but also efficiency, ensuring the accelerated and accurate delivery of information in a manner that is tailored to the unique requirements of multilingual and multicultural settings. 1 authors · Jan 2, 2024
- A Novel Approach to Balance Convenience and Nutrition in Meals With Long-Term Group Recommendations and Reasoning on Multimodal Recipes and its Implementation in BEACON "A common decision made by people, whether healthy or with health conditions, is choosing meals like breakfast, lunch, and dinner, comprising combinations of foods for appetizer, main course, side dishes, desserts, and beverages. Often, this decision involves tradeoffs between nutritious choices (e.g., salt and sugar levels, nutrition content) and convenience (e.g., cost and accessibility, cuisine type, food source type). We present a data-driven solution for meal recommendations that considers customizable meal configurations and time horizons. This solution balances user preferences while accounting for food constituents and cooking processes. Our contributions include introducing goodness measures, a recipe conversion method from text to the recently introduced multimodal rich recipe representation (R3) format, learning methods using contextual bandits that show promising preliminary results, and the prototype, usage-inspired, BEACON system." 7 authors · Dec 23, 2024
- AI Foundation Model for Heliophysics: Applications, Design, and Implementation Deep learning-based methods have been widely researched in the areas of language and vision, demonstrating their capacity to understand long sequences of data and their usefulness in numerous helio-physics applications. Foundation models (FMs), which are pre-trained on a large-scale datasets, form the basis for a variety of downstream tasks. These models, especially those based on transformers in vision and language, show exceptional potential for adapting to a wide range of downstream applications. In this paper, we provide our perspective on the criteria for designing an FM for heliophysics and associated challenges and applications using the Solar Dynamics Observatory (SDO) dataset. We believe that this is the first study to design an FM in the domain of heliophysics. 17 authors · Sep 30, 2024
- FireRedChat: A Pluggable, Full-Duplex Voice Interaction System with Cascaded and Semi-Cascaded Implementations Full-duplex voice interaction allows users and agents to speak simultaneously with controllable barge-in, enabling lifelike assistants and customer service. Existing solutions are either end-to-end, difficult to design and hard to control, or modular pipelines governed by turn-taking controllers that ease upgrades and per-module optimization; however, prior modular frameworks depend on non-open components and external providers, limiting holistic optimization. In this work, we present a complete, practical full-duplex voice interaction system comprising a turn-taking controller, an interaction module, and a dialogue manager. The controller integrates streaming personalized VAD (pVAD) to suppress false barge-ins from noise and non-primary speakers, precisely timestamp primary-speaker segments, and explicitly enable primary-speaker barge-ins; a semantic end-of-turn detector improves stop decisions. It upgrades heterogeneous half-duplex pipelines, cascaded, semi-cascaded, and speech-to-speech, to full duplex. Using internal models, we implement cascaded and semi-cascaded variants; the semi-cascaded one captures emotional and paralinguistic cues, yields more coherent responses, lowers latency and error propagation, and improves robustness. A dialogue manager extends capabilities via tool invocation and context management. We also propose three system-level metrics, barge-in, end-of-turn detection accuracy, and end-to-end latency, to assess naturalness, control accuracy, and efficiency. Experiments show fewer false interruptions, more accurate semantic ends, and lower latency approaching industrial systems, enabling robust, natural, real-time full-duplex interaction. Demos: https://fireredteam.github.io/demos/firered_chat. 15 authors · Sep 8
- The Definitive Guide to Policy Gradients in Deep Reinforcement Learning: Theory, Algorithms and Implementations In recent years, various powerful policy gradient algorithms have been proposed in deep reinforcement learning. While all these algorithms build on the Policy Gradient Theorem, the specific design choices differ significantly across algorithms. We provide a holistic overview of on-policy policy gradient algorithms to facilitate the understanding of both their theoretical foundations and their practical implementations. In this overview, we include a detailed proof of the continuous version of the Policy Gradient Theorem, convergence results and a comprehensive discussion of practical algorithms. We compare the most prominent algorithms on continuous control environments and provide insights on the benefits of regularization. All code is available at https://github.com/Matt00n/PolicyGradientsJax. 1 authors · Jan 24, 2024