Upload feature extractor

Files changed (3)

README.md +199 -0
feature_extraction_gramt_binaural_time.py +145 -0
preprocessor_config.json +12 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

feature_extraction_gramt_binaural_time.py ADDED

	@@ -0,0 +1,145 @@

+from typing import Optional, Union
+import numpy as np
+from transformers import SequenceFeatureExtractor
+from transformers import BatchFeature
+from transformers.utils import TensorType
+import torch
+import torchaudio
+class BinauralFeatureExtractor(SequenceFeatureExtractor):
+    r"""
+    Constructs a binaural (two-channel) log-mel spectrogram feature extractor.
+    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
+    most of the main methods. Users should refer to this superclass for more information regarding those methods.
+    This class normalizes each raw waveform to a target RMS level (-14 dBFS by default), extracts log-mel
+    spectrogram features with TorchAudio, and pads the features in a batch to a common length.
+    Mono input is duplicated to two channels; for four-channel input only the first channel is used and duplicated.
+    Args:
+        feature_size (`int`, *optional*, defaults to 1):
+            The feature dimension of the extracted features.
+        sampling_rate (`int`, *optional*, defaults to 32000):
+            The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).
+        num_mel_bins (`int`, *optional*, defaults to 128):
+            Number of Mel-frequency bins.
+        padding_value (`float`, *optional*, defaults to 0.0):
+            The value used for padding.
+    """
+    in_channels = 2
+    feature_extractor_type = "gram-binaural"
+    def __init__(
+        self,
+        feature_size=1,
+        sampling_rate=32000,
+        num_mel_bins=128,
+        padding_value=0.0,
+        **kwargs,
+    ):
+        super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
+        self.num_mel_bins = num_mel_bins
+    def _extract_fbank_features(
+        self,
+        waveform: np.ndarray,
+    ) -> np.ndarray:
+        """
+        Get log-mel spectrogram features using TorchAudio. The waveform is normalized to a target RMS level before
+        feature extraction, and the result always has two channels, with shape `(2, num_frames, num_mel_bins)`.
+        """
+        melspec = torchaudio.transforms.MelSpectrogram(
+                sample_rate=self.sampling_rate,
+                n_fft=1024,
+                win_length=1024,
+                hop_length=320,
+                f_min=50,
+                f_max=self.sampling_rate // 2,
+                n_mels=self.num_mel_bins,
+                power=2.0,
+            )
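+        # Accept NumPy arrays, lists, or tensors; `np.ndarray` has no `.clone()`, so convert with `as_tensor`.
+        waveform = torch.as_tensor(waveform, dtype=torch.float32).detach()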
+        waveform = self._normalize_audio(waveform)
+        # If waveform has two channels, but the channel information is not the first dimension, transpose.
+        if (waveform.ndim == 2) and (waveform.shape[0] > 100):
+            waveform = waveform.transpose(1, 0)
+        if waveform.ndim == 1:
+            waveform = waveform.unsqueeze(0)
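+        # Always return two channels: duplicate mono, pass stereo through unchanged, and for
+        # four-channel input use only the first channel (duplicated).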
+        if waveform.shape[0] == 1:
+            mel = melspec(waveform).transpose(2, 1)
+            log_mel = (mel + torch.finfo().eps).log()
+            log_mel = torch.cat((log_mel, log_mel), dim=0)
+            return log_mel
+        elif waveform.shape[0] == 2:
+            mel = melspec(waveform).transpose(2, 1)
+            log_mel = (mel + torch.finfo().eps).log()
+            return log_mel
+        elif waveform.shape[0] == 4:
+            mel = melspec(waveform[[0]]).transpose(2, 1)
+            log_mel = (mel + torch.finfo().eps).log()
+            log_mel = torch.cat((log_mel, log_mel), dim=0)
+            return log_mel
+        else:
+            raise ValueError(f"Unknown channel count: {waveform.shape[0]}")
+    def _normalize_audio(self, audio_data, target_dBFS=-14.0):
+        rms = torch.sqrt(torch.mean(audio_data**2))  # Calculate the RMS of the audio
+        if rms == 0:  # Avoid division by zero in case of a completely silent audio
+            return audio_data
+        current_dBFS = 20 * torch.log10(rms)  # Convert RMS to dBFS
+        gain_dB = target_dBFS - current_dBFS  # Calculate the required gain in dB
+        gain_linear = 10 ** (gain_dB / 20)  # Convert gain from dB to linear scale
+        normalized_audio = audio_data * gain_linear  # Apply the gain to the audio data
+        return normalized_audio
+    def __call__(
+        self,
+        raw_speech: Union[np.ndarray, list[float], list[np.ndarray], list[list[float]]],
+        sampling_rate: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        **kwargs,
+    ) -> BatchFeature:
+        """
+        Main method to featurize and prepare for the model one or several sequence(s).
+        Args:
+            raw_speech (`np.ndarray`, `list[float]`, `list[np.ndarray]`, `list[list[float]]`):
+                The batch of waveforms to featurize. The method iterates over `raw_speech`, so a single waveform
+                should be wrapped in a list. Each waveform may have one, two, or four channels.
+            sampling_rate (`int`, *optional*):
+                The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
+                `sampling_rate` at the forward call to prevent silent errors.
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors instead of list of python integers. Acceptable values are:
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return Numpy `np.ndarray` objects.
+        """
+        if sampling_rate is not None:
+            if sampling_rate != self.sampling_rate:
+                raise ValueError(
+                    f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
+                    f" {self.sampling_rate}. Please make sure that the provided `raw_speech` input was sampled with"
+                    f" {self.sampling_rate} and not {sampling_rate}."
+                )
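+        # Extract log-mel features per waveform; each item has shape (2, num_frames, num_mel_bins).
+        features = [self._extract_fbank_features(waveform) for waveform in raw_speech]
+        # pad_sequence pads along the first dimension, so move time to the front, pad to the longest item,
+        # then restore the (batch, channel, time, mel) layout.
+        features = torch.nn.utils.rnn.pad_sequence(
+            [f.transpose(0, 1) for f in features], batch_first=True, padding_value=self.padding_value
+        ).transpose(1, 2)
+        inputs = BatchFeature({"input_values": features}, tensor_type=return_tensors)
+        return inputs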
+__all__ = ["BinauralFeatureExtractor"]
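
The level normalization in `_normalize_audio` can be illustrated outside of PyTorch. The following is a minimal standard-library sketch of the same arithmetic (RMS to dBFS, gain in dB, gain on a linear scale); the helper names `rms_dbfs` and `gain_to_target_dbfs` and the test tone are hypothetical and not part of this commit. Note that the extractor computes a single RMS over the whole tensor, so both channels of a stereo signal receive the same gain.

```python
import math


def rms_dbfs(samples):
    """RMS level of `samples` in dBFS (full scale = 1.0). Hypothetical helper."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms)


def gain_to_target_dbfs(samples, target_dbfs=-14.0):
    """Linear gain that moves the RMS level of `samples` to `target_dbfs`."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return 1.0  # silent input: leave it unchanged, as the extractor does
    current_dbfs = 20.0 * math.log10(rms)
    gain_db = target_dbfs - current_dbfs
    return 10.0 ** (gain_db / 20.0)


# Illustrative input: 0.1 s of a quiet 440 Hz sine at 32 kHz.
sr = 32000
tone = [0.1 * math.sin(2 * math.pi * 440 * n / sr) for n in range(3200)]

gain = gain_to_target_dbfs(tone)
normalized = [s * gain for s in tone]
print(f"before: {rms_dbfs(tone):.2f} dBFS, after: {rms_dbfs(normalized):.2f} dBFS")
```

Because the gain is derived from the measured RMS, the normalized signal lands on the target level regardless of the input amplitude; only all-zero input is passed through untouched.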

preprocessor_config.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "auto_map": {
+    "AutoFeatureExtractor": "feature_extraction_gramt_binaural_time.BinauralFeatureExtractor"
+  },
+  "feature_extractor_type": "BinauralFeatureExtractor",
+  "feature_size": 1,
+  "num_mel_bins": 128,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "return_attention_mask": true,
+  "sampling_rate": 32000
+}
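
As a rough illustration of how this config relates to the extractor, the sketch below parses the same JSON with the standard library, splits the `auto_map` entry into its module and class parts, and estimates the feature shape for a 10-second clip. It assumes the `hop_length=320` used in the extractor and TorchAudio's default `center=True` framing, which gives `1 + num_samples // hop_length` frames; the helper `expected_feature_shape` is hypothetical and not part of this commit.

```python
import json

# Same content as preprocessor_config.json in this commit.
config_text = """
{
  "auto_map": {
    "AutoFeatureExtractor": "feature_extraction_gramt_binaural_time.BinauralFeatureExtractor"
  },
  "feature_extractor_type": "BinauralFeatureExtractor",
  "feature_size": 1,
  "num_mel_bins": 128,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 32000
}
"""
config = json.loads(config_text)

# "module.ClassName": the module is the .py file in the repo, the class lives inside it.
module_name, class_name = config["auto_map"]["AutoFeatureExtractor"].rsplit(".", 1)

HOP_LENGTH = 320  # taken from the MelSpectrogram settings in the extractor


def expected_feature_shape(num_samples, num_mel_bins, hop_length=HOP_LENGTH):
    """Per-item shape (channels, frames, mel bins), assuming center=True framing."""
    num_frames = 1 + num_samples // hop_length
    return (2, num_frames, num_mel_bins)


shape = expected_feature_shape(10 * config["sampling_rate"], config["num_mel_bins"])
print(module_name, class_name, shape)
```

At 32 kHz a hop of 320 samples corresponds to 100 frames per second, so the time axis grows by roughly 100 frames for each second of audio.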