src.model.concept_history module

This module defines logic for initializing and updating the concept history.

src.model.concept_history.get_initial_concepts(n_best_concepts: int, n_random_concepts: int, model_layer_activations_path: str, neuron_id: int) → Sequence[str]

Get initial concepts combining best and random selections.

Returns a sequence of initial concepts consisting of the n_best_concepts concepts that most strongly activate the neuron, plus n_random_concepts concepts selected at random from all available control concepts.

Parameters:
  • n_best_concepts – Number of top-activating concepts to include.

  • n_random_concepts – Number of random concepts to include.

  • model_layer_activations_path – Path to directory containing activation files.

  • neuron_id – Index of the neuron to select concepts for.

Returns:

Sequence of initial concept names combining best and random selections.
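The selection described above can be sketched as follows. This is a minimal illustration and not the module's implementation: the helper name select_initial_concepts is hypothetical, and the activations are assumed to be already loaded into a mapping from concept name to activation for the chosen neuron, since the on-disk format under model_layer_activations_path is not documented here. The sketch also assumes the random picks are drawn only from concepts not already chosen as best, which avoids duplicates.

```python
import random
from typing import Mapping, Optional, Sequence


def select_initial_concepts(
    activations: Mapping[str, float],
    n_best: int,
    n_random: int,
    rng: Optional[random.Random] = None,
) -> Sequence[str]:
    """Hypothetical helper: top-activating concepts plus random extras."""
    rng = rng or random.Random()
    # Rank concept names from strongest to weakest activation.
    ranked = sorted(activations, key=activations.get, reverse=True)
    best = ranked[:n_best]
    # Assumption: sample from the concepts not already selected as "best".
    remaining = ranked[n_best:]
    picked = rng.sample(remaining, min(n_random, len(remaining)))
    return best + picked


# Made-up activations for illustration only.
example = {"dog": 0.9, "cat": 0.7, "tree": 0.3, "sky": 0.2, "car": 0.1}
initial = select_initial_concepts(example, n_best=2, n_random=2, rng=random.Random(0))
```

Passing a seeded random.Random instance keeps the random part reproducible, which can help when comparing runs.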

src.model.concept_history.update_concept_history(concept_history: Mapping[str, float], new_concept: str, score: float) → Mapping[str, float]

Update concept history by replacing the worst or randomly selected concept.

If the new concept’s score is better than that of the worst existing concept, the worst concept is replaced. Otherwise, a concept is removed at random, with the removal probability weighted by its distance from the maximum score.

Parameters:
  • concept_history – Dictionary mapping concept names to their scores.

  • new_concept – Name of the new concept to add.

  • score – Score of the new concept.

Returns:

Updated concept history dictionary with the new concept added.
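The update rule above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the name update_history_sketch is hypothetical, higher scores are assumed to be better, the removal weight of each concept is assumed to be the maximum score minus its own score, and a uniform choice is used as a fallback when all scores are equal (all weights zero).

```python
import random
from typing import Dict, Mapping, Optional


def update_history_sketch(
    history: Mapping[str, float],
    new_concept: str,
    score: float,
    rng: Optional[random.Random] = None,
) -> Dict[str, float]:
    """Hypothetical helper: swap one existing concept for the new one."""
    rng = rng or random.Random()
    updated = dict(history)  # copy so the caller's mapping is not mutated
    if updated:
        worst = min(updated, key=updated.get)
        if score > updated[worst]:
            # The new concept beats the worst one: replace it.
            del updated[worst]
        else:
            # Assumption: weight = distance from the max score, so
            # lower-scoring concepts are more likely to be removed.
            max_score = max(updated.values())
            names = list(updated)
            weights = [max_score - updated[name] for name in names]
            if sum(weights) == 0:
                victim = rng.choice(names)  # all scores equal: pick uniformly
            else:
                victim = rng.choices(names, weights=weights, k=1)[0]
            del updated[victim]
    updated[new_concept] = score
    return updated


# Made-up scores for illustration only.
history = {"a": 0.9, "b": 0.5, "c": 0.1}
better = update_history_sketch(history, "d", 0.6)  # "c" is replaced by "d"
```

Under these assumptions the top-scoring concept has weight zero in the random branch, so it is never the one removed unless all scores are tied.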