llvm-project
[BOLT][AArch64] Introduce SPE mode in BasicAggregation
#120741
Open

paschalis-mpeis wants to merge 4 commits into main from users/paschalis-mpeis/bolt-spe-mode
paschalis-mpeis 149 days ago (edited 117 days ago)

BOLT gains the ability to process Arm SPE data using the BasicAggregation format.

Example usage is:

perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY

New branch data and compatibility:

Since Linux 6.13, perf reports branch pairs (PC -> TGT) for SPE, where:

  • PC:
    • it is the source branch; it may be taken or not-taken.
    • Because of how SPE operates and what it can collect, any filtering on the
      PC (i.e., considering only the taken branches) would result in data loss that
      BOLT cannot later infer.
  • TGT:
    • it is the target address of the destination block.
    • this is the new information that perf can now report.

DataAggregator processes this information by creating two basic samples.
Any other event types will have ADDR field set to 0x0. For those a single sample
will be created.
Such events can be either SPE or non-SPE, like l1d-access and cycles respectively.

The format of the input perf entries is:

PID   EVENT-TYPE   ADDR   IP
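
As a rough illustration of the splitting described above (this is a standalone sketch, not the code in this patch), one entry can be turned into one or two basic samples as follows. `BasicSample` and `splitSpeEntry` are hypothetical names, and the sketch assumes whitespace-separated fields with hexadecimal ADDR and IP values.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for BOLT's PerfBasicSample (illustration only).
struct BasicSample {
  std::string Event;
  uint64_t PC;
};

// Split one "PID EVENT-TYPE ADDR IP" line into zero, one, or two samples.
// A zero address yields no sample; malformed lines yield an empty vector.
std::vector<BasicSample> splitSpeEntry(const std::string &Line) {
  std::istringstream In(Line);
  uint64_t PID = 0, Addr = 0, IP = 0;
  std::string Event;
  if (!(In >> std::dec >> PID >> Event >> std::hex >> Addr >> IP))
    return {};

  std::vector<BasicSample> Samples;
  if (IP != 0)
    Samples.push_back({Event, IP});   // source of the pair (IP field)
  if (Addr != 0)
    Samples.push_back({Event, Addr}); // branch target (ADDR field)
  return Samples;
}

// Usage example: a branch entry gives two samples, a cycles entry gives one.
void demoSplitSpeEntry() {
  assert(splitSpeEntry("1234  instructions:  a002  a001").size() == 2);
  assert(splitSpeEntry("1234  cycles:u:  0  b001").size() == 1);
}
```

The real parser additionally filters by PID and adjusts addresses for the binary's load address, which this sketch leaves out.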

When in SPE mode and:

  • host is not AArch64, BOLT will exit with a relevant message
  • ADDR field is unavailable, BOLT will exit with a relevant message
  • no branch pairs were recorded, BOLT will present a warning

Examples of generating profiling data for the SPE mode:

Profiles can be captured with perf on AArch64 machines with SPE enabled.
They can be combined with other events, SPE or not.
In the future we might restrict processing to just the branch packets.

Capture only SPE branch data events:

perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY

Use more filters, add some jitter, and specify a count to control overhead and quality:

perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY

Capture any SPE events:

perf record -e 'arm_spe_0//u' -- BINARY

Capture any SPE events and cycles:

perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY
paschalis-mpeis requested a review from aaupov 149 days ago
paschalis-mpeis requested a review from maksfb 149 days ago
paschalis-mpeis requested a review from rafaelauler 149 days ago
paschalis-mpeis requested a review from ayermolo 149 days ago
paschalis-mpeis requested a review from dcci 149 days ago
paschalis-mpeis requested a review from yota9 149 days ago
llvmbot added the BOLT label
llvmbot 149 days ago

@llvm/pr-subscribers-bolt

Author: Paschalis Mpeis (paschalis-mpeis)

Changes

BOLT gains the ability to process branch target information generated by
Arm SPE data, using the BasicAggregation format.

Example usage is:

perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY

New branch data and compatibility:

SPE branch entries in perf data contain a branch pair (IP -> ADDR)
for the source and destination branches. DataAggregator processes those
by creating two basic samples. Any other event types will have ADDR
field set to 0x0. For those a single sample will be created. Such
events can be either SPE or non-SPE, like l1d-access and cycles
respectively.

The format of the input perf entries is:

PID   EVENT-TYPE   ADDR   IP

When in SPE mode and:

  • host is not AArch64, BOLT will exit with a relevant message
  • ADDR field is unavailable, BOLT will exit with a relevant message
  • no branch pairs were recorded, BOLT will present a warning

Examples of generating profiling data for the SPE mode:

Profiles can be captured with perf on AArch64 machines with SPE enabled.
They can be combined with other events, SPE or not.

Capture only SPE branch data events:

perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY

Capture any SPE events:

perf record -e 'arm_spe_0//u' -- BINARY

Capture any SPE events and cycles:

perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY

Use more filters, add jitter, and specify a count to control overhead and quality:

perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY

Full diff: https://github.com/llvm/llvm-project/pull/120741.diff

7 Files Affected:

  • (modified) bolt/include/bolt/Profile/DataAggregator.h (+14)
  • (modified) bolt/lib/Profile/DataAggregator.cpp (+132-11)
  • (added) bolt/test/perf2bolt/AArch64/perf2bolt-spe.test (+14)
  • (added) bolt/test/perf2bolt/X86/perf2bolt-spe.test (+9)
  • (modified) bolt/tools/driver/llvm-bolt.cpp (+9)
  • (modified) bolt/unittests/Profile/CMakeLists.txt (+14)
  • (added) bolt/unittests/Profile/PerfSpeEvents.cpp (+173)
diff --git a/bolt/include/bolt/Profile/DataAggregator.h b/bolt/include/bolt/Profile/DataAggregator.h
index 320623cfa15af1..be6e0fbd6347a0 100644
--- a/bolt/include/bolt/Profile/DataAggregator.h
+++ b/bolt/include/bolt/Profile/DataAggregator.h
@@ -78,6 +78,8 @@ class DataAggregator : public DataReader {
   static bool checkPerfDataMagic(StringRef FileName);
 
 private:
+  friend struct PerfSpeEventsTestHelper;
+
   struct PerfBranchSample {
     SmallVector<LBREntry, 32> LBR;
     uint64_t PC;
@@ -294,6 +296,15 @@ class DataAggregator : public DataReader {
   /// and a PC
   ErrorOr<PerfBasicSample> parseBasicSample();
 
+  /// Parse an Arm SPE entry into the non-lbr format by generating two basic
+  /// samples. The format of an input SPE entry is:
+  /// ```
+  /// PID   EVENT-TYPE   ADDR   IP
+  /// ```
+  /// SPE branch events will have 'ADDR' set to a branch target address while
+  /// other perf or SPE events will have it set to zero.
+  ErrorOr<std::pair<PerfBasicSample,PerfBasicSample>> parseSpeAsBasicSamples();
+
   /// Parse a single perf sample containing a PID associated with an IP and
   /// address.
   ErrorOr<PerfMemSample> parseMemSample();
@@ -343,6 +354,9 @@ class DataAggregator : public DataReader {
   /// Process non-LBR events.
   void processBasicEvents();
 
+  /// Parse Arm SPE events into the non-LBR format.
+  std::error_code parseSpeAsBasicEvents();
+
   /// Parse the full output generated by perf script to report memory events.
   std::error_code parseMemEvents();
 
diff --git a/bolt/lib/Profile/DataAggregator.cpp b/bolt/lib/Profile/DataAggregator.cpp
index 2b02086e3e0c99..7038ca5b1452ab 100644
--- a/bolt/lib/Profile/DataAggregator.cpp
+++ b/bolt/lib/Profile/DataAggregator.cpp
@@ -49,6 +49,13 @@ static cl::opt<bool>
                      cl::desc("aggregate basic samples (without LBR info)"),
                      cl::cat(AggregatorCategory));
 
+cl::opt<bool> ArmSPE(
+    "spe",
+    cl::desc(
+        "Enable Arm SPE mode. Used in conjunction with no-lbr mode, i.e. `--spe "
+        "--nl`"),
+    cl::cat(AggregatorCategory));
+
 static cl::opt<std::string>
     ITraceAggregation("itrace",
                       cl::desc("Generate LBR info with perf itrace argument"),
@@ -180,11 +187,19 @@ void DataAggregator::start() {
 
   findPerfExecutable();
 
-  if (opts::BasicAggregation) {
-    launchPerfProcess("events without LBR",
-                      MainEventsPPI,
+  if (opts::ArmSPE) {
+    if (!opts::BasicAggregation) {
+      errs() << "PERF2BOLT-ERROR: Arm SPE mode is combined only with "
+                "BasicAggregation.\n";
+      exit(1);
+    }
+    launchPerfProcess("branch events with SPE", MainEventsPPI,
+                      "script -F pid,event,ip,addr --itrace=i1i",
+                      /*Wait = */ false);
+  } else if (opts::BasicAggregation) {
+    launchPerfProcess("events without LBR", MainEventsPPI,
                       "script -F pid,event,ip",
-                      /*Wait = */false);
+                      /*Wait = */ false);
   } else if (!opts::ITraceAggregation.empty()) {
     std::string ItracePerfScriptArgs = llvm::formatv(
         "script -F pid,ip,brstack --itrace={0}", opts::ITraceAggregation);
@@ -192,10 +207,9 @@ void DataAggregator::start() {
                       ItracePerfScriptArgs.c_str(),
                       /*Wait = */ false);
   } else {
-    launchPerfProcess("branch events",
-                      MainEventsPPI,
+    launchPerfProcess("branch events", MainEventsPPI,
                       "script -F pid,ip,brstack",
-                      /*Wait = */false);
+                      /*Wait = */ false);
   }
 
   // Note: we launch script for mem events regardless of the option, as the
@@ -531,14 +545,20 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
               "not read one from input binary\n";
   }
 
-  auto ErrorCallback = [](int ReturnCode, StringRef ErrBuf) {
+  const Regex NoData("Samples for '.*' event do not have ADDR attribute set. "
+                     "Cannot print 'addr' field.");
+
+  auto ErrorCallback = [&NoData](int ReturnCode, StringRef ErrBuf) {
+    if (opts::ArmSPE && NoData.match(ErrBuf)) {
+      errs() << "PERF2BOLT-ERROR: perf data are incompatible for Arm SPE mode "
+                "consumption. ADDR attribute is unset.\n";
+      exit(1);
+    }
     errs() << "PERF-ERROR: return code " << ReturnCode << "\n" << ErrBuf;
     exit(1);
   };
 
   auto MemEventsErrorCallback = [&](int ReturnCode, StringRef ErrBuf) {
-    Regex NoData("Samples for '.*' event do not have ADDR attribute set. "
-                 "Cannot print 'addr' field.");
     if (!NoData.match(ErrBuf))
       ErrorCallback(ReturnCode, ErrBuf);
   };
@@ -579,7 +599,8 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
     exit(0);
   }
 
-  if ((!opts::BasicAggregation && parseBranchEvents()) ||
+  if (((!opts::BasicAggregation && !opts::ArmSPE) && parseBranchEvents()) ||
+      (opts::BasicAggregation && opts::ArmSPE && parseSpeAsBasicEvents()) ||
       (opts::BasicAggregation && parseBasicEvents()))
     errs() << "PERF2BOLT: failed to parse samples\n";
 
@@ -1226,6 +1247,66 @@ ErrorOr<DataAggregator::PerfBasicSample> DataAggregator::parseBasicSample() {
   return PerfBasicSample{Event.get(), Address};
 }
 
+ErrorOr<
+    std::pair<DataAggregator::PerfBasicSample, DataAggregator::PerfBasicSample>>
+DataAggregator::parseSpeAsBasicSamples() {
+  while (checkAndConsumeFS()) {
+  }
+
+  ErrorOr<int64_t> PIDRes = parseNumberField(FieldSeparator, true);
+  if (std::error_code EC = PIDRes.getError())
+    return EC;
+
+  constexpr PerfBasicSample EmptySample = PerfBasicSample{StringRef(), 0};
+  auto MMapInfoIter = BinaryMMapInfo.find(*PIDRes);
+  if (MMapInfoIter == BinaryMMapInfo.end()) {
+    consumeRestOfLine();
+    return std::make_pair(EmptySample, EmptySample);
+  }
+
+  while (checkAndConsumeFS()) {
+  }
+
+  ErrorOr<StringRef> Event = parseString(FieldSeparator);
+  if (std::error_code EC = Event.getError())
+    return EC;
+
+  while (checkAndConsumeFS()) {
+  }
+
+  ErrorOr<uint64_t> AddrResTo = parseHexField(FieldSeparator);
+  if (std::error_code EC = AddrResTo.getError())
+    return EC;
+  consumeAllRemainingFS();
+
+  ErrorOr<uint64_t> AddrResFrom = parseHexField(FieldSeparator, true);
+  if (std::error_code EC = AddrResFrom.getError())
+    return EC;
+
+  if (!checkAndConsumeNewLine()) {
+    reportError("expected end of line");
+    return make_error_code(llvm::errc::io_error);
+  }
+
+  auto genBasicSample = [&](uint64_t Address) {
+    // When fed with non SPE branch events the target address will be null.
+    // This is expected and ignored.
+    if (Address == 0x0)
+      return EmptySample;
+
+    if (!BC->HasFixedLoadAddress)
+      adjustAddress(Address, MMapInfoIter->second);
+    return PerfBasicSample{Event.get(), Address};
+  };
+
+  // Show more meaningful event names on boltdata.
+  if (Event->str() == "instructions:")
+    Event = *AddrResTo != 0x0 ? "branch-spe:" : "instruction-spe:";
+
+  return std::make_pair(genBasicSample(*AddrResFrom),
+                        genBasicSample(*AddrResTo));
+}
+
 ErrorOr<DataAggregator::PerfMemSample> DataAggregator::parseMemSample() {
   PerfMemSample Res{0, 0};
 
@@ -1703,6 +1784,46 @@ std::error_code DataAggregator::parseBasicEvents() {
   return std::error_code();
 }
 
+std::error_code DataAggregator::parseSpeAsBasicEvents() {
+  outs() << "PERF2BOLT: parsing SPE data as basic events (no LBR)...\n";
+  NamedRegionTimer T("parseSPEBasic", "Parsing SPE as basic events",
+                     TimerGroupName, TimerGroupDesc, opts::TimeAggregator);
+  uint64_t NumSpeBranchSamples = 0;
+
+  // Convert entries to one or two basic samples, depending on whether there is
+  // branch target information.
+  while (hasData()) {
+    auto SamplePair = parseSpeAsBasicSamples();
+    if (std::error_code EC = SamplePair.getError())
+      return EC;
+
+    auto registerSample = [this](const PerfBasicSample *Sample) {
+      if (!Sample->PC)
+        return;
+
+      if (BinaryFunction *BF = getBinaryFunctionContainingAddress(Sample->PC))
+        BF->setHasProfileAvailable();
+
+      ++BasicSamples[Sample->PC];
+      EventNames.insert(Sample->EventName);
+    };
+
+    if (SamplePair->first.PC != 0x0 && SamplePair->second.PC != 0x0)
+      ++NumSpeBranchSamples;
+
+    registerSample(&SamplePair->first);
+    registerSample(&SamplePair->second);
+  }
+
+  if (NumSpeBranchSamples == 0)
+    errs() << "PERF2BOLT-WARNING: no SPE branches found\n";
+  else
+    outs() << "PERF2BOLT: found " << NumSpeBranchSamples
+           << " SPE branch sample pairs.\n";
+
+  return std::error_code();
+}
+
 void DataAggregator::processBasicEvents() {
   outs() << "PERF2BOLT: processing basic events (without LBR)...\n";
   NamedRegionTimer T("processBasic", "Processing basic events", TimerGroupName,
diff --git a/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
new file mode 100644
index 00000000000000..d7cea7ff769b8e
--- /dev/null
+++ b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
@@ -0,0 +1,14 @@
+## Check that Arm SPE mode is available on AArch64 with BasicAggregation.
+
+REQUIRES: system-linux,perf,target=aarch64{{.*}}
+
+RUN: %clang %cflags %p/../../Inputs/asm_foo.s %p/../../Inputs/asm_main.c -o %t.exe
+RUN: touch %t.empty.perf.data
+RUN: perf2bolt -p %t.empty.perf.data -o %t.perf.boltdata --nl --spe --pa %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-SPE-NO-LBR
+
+CHECK-SPE-NO-LBR: PERF2BOLT: Starting data aggregation job
+
+RUN: perf record -e cycles -q -o %t.perf.data -- %t.exe
+RUN: not perf2bolt -p %t.perf.data -o %t.perf.boltdata --spe %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-SPE-LBR
+
+CHECK-SPE-LBR: PERF2BOLT-ERROR: Arm SPE mode is combined only with BasicAggregation.
diff --git a/bolt/test/perf2bolt/X86/perf2bolt-spe.test b/bolt/test/perf2bolt/X86/perf2bolt-spe.test
new file mode 100644
index 00000000000000..f31c17f411137d
--- /dev/null
+++ b/bolt/test/perf2bolt/X86/perf2bolt-spe.test
@@ -0,0 +1,9 @@
+## Check that Arm SPE mode is unavailable on X86.
+
+REQUIRES: system-linux,x86_64-linux
+
+RUN: %clang %cflags %p/../../Inputs/asm_foo.s %p/../../Inputs/asm_main.c -o %t.exe
+RUN: touch %t.empty.perf.data
+RUN: not perf2bolt -p %t.empty.perf.data -o %t.perf.boltdata --nl --spe --pa %t.exe 2>&1 | FileCheck %s
+
+CHECK: BOLT-ERROR: -spe is available only on AArch64.
diff --git a/bolt/tools/driver/llvm-bolt.cpp b/bolt/tools/driver/llvm-bolt.cpp
index efa06cd68cb997..60b813f6f858d7 100644
--- a/bolt/tools/driver/llvm-bolt.cpp
+++ b/bolt/tools/driver/llvm-bolt.cpp
@@ -51,6 +51,8 @@ static cl::opt<std::string> InputFilename(cl::Positional,
                                           cl::Required, cl::cat(BoltCategory),
                                           cl::sub(cl::SubCommand::getAll()));
 
+extern cl::opt<bool> ArmSPE;
+
 static cl::opt<std::string>
 InputDataFilename("data",
   cl::desc("<data file>"),
@@ -245,6 +247,13 @@ int main(int argc, char **argv) {
       if (Error E = RIOrErr.takeError())
         report_error(opts::InputFilename, std::move(E));
       RewriteInstance &RI = *RIOrErr.get();
+
+      if (opts::AggregateOnly && !RI.getBinaryContext().isAArch64() &&
+          opts::ArmSPE == 1) {
+        errs() << "BOLT-ERROR: -spe is available only on AArch64.\n";
+        exit(1);
+      }
+
       if (!opts::PerfData.empty()) {
         if (!opts::AggregateOnly) {
           errs() << ToolName
diff --git a/bolt/unittests/Profile/CMakeLists.txt b/bolt/unittests/Profile/CMakeLists.txt
index e0aa0926b49c03..ce01c6c4b949ee 100644
--- a/bolt/unittests/Profile/CMakeLists.txt
+++ b/bolt/unittests/Profile/CMakeLists.txt
@@ -1,11 +1,25 @@
+set(LLVM_LINK_COMPONENTS
+  DebugInfoDWARF
+  Object
+  ${LLVM_TARGETS_TO_BUILD}
+  )
+
 add_bolt_unittest(ProfileTests
   DataAggregator.cpp
+  PerfSpeEvents.cpp
 
   DISABLE_LLVM_LINK_LLVM_DYLIB
   )
 
 target_link_libraries(ProfileTests
   PRIVATE
+  LLVMBOLTCore
   LLVMBOLTProfile
+  LLVMTargetParser
+  LLVMTestingSupport
   )
 
+foreach (tgt ${BOLT_TARGETS_TO_BUILD})
+  string(TOUPPER "${tgt}" upper)
+  target_compile_definitions(ProfileTests PRIVATE "${upper}_AVAILABLE")
+endforeach()
diff --git a/bolt/unittests/Profile/PerfSpeEvents.cpp b/bolt/unittests/Profile/PerfSpeEvents.cpp
new file mode 100644
index 00000000000000..807a3bb1e07f40
--- /dev/null
+++ b/bolt/unittests/Profile/PerfSpeEvents.cpp
@@ -0,0 +1,173 @@
+//===- bolt/unittests/Profile/PerfSpeEvents.cpp ---------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifdef AARCH64_AVAILABLE
+
+#include "bolt/Core/BinaryContext.h"
+#include "bolt/Profile/DataAggregator.h"
+#include "llvm/BinaryFormat/ELF.h"
+#include "llvm/DebugInfo/DWARF/DWARFContext.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/TargetSelect.h"
+#include "gtest/gtest.h"
+
+using namespace llvm;
+using namespace llvm::bolt;
+using namespace llvm::object;
+using namespace llvm::ELF;
+
+namespace opts {
+extern cl::opt<std::string> ReadPerfEvents;
+} // namespace opts
+
+namespace llvm {
+namespace bolt {
+
+/// Perform checks on perf SPE branch events combined with other SPE or perf
+/// events.
+struct PerfSpeEventsTestHelper : public testing::Test {
+  void SetUp() override {
+    initalizeLLVM();
+    prepareElf();
+    initializeBOLT();
+  }
+
+protected:
+  void initalizeLLVM() {
+    llvm::InitializeAllTargetInfos();
+    llvm::InitializeAllTargetMCs();
+    llvm::InitializeAllAsmParsers();
+    llvm::InitializeAllDisassemblers();
+    llvm::InitializeAllTargets();
+    llvm::InitializeAllAsmPrinters();
+  }
+
+  void prepareElf() {
+    memcpy(ElfBuf, "\177ELF", 4);
+    ELF64LE::Ehdr *EHdr = reinterpret_cast<typename ELF64LE::Ehdr *>(ElfBuf);
+    EHdr->e_ident[llvm::ELF::EI_CLASS] = llvm::ELF::ELFCLASS64;
+    EHdr->e_ident[llvm::ELF::EI_DATA] = llvm::ELF::ELFDATA2LSB;
+    EHdr->e_machine = llvm::ELF::EM_AARCH64;
+    MemoryBufferRef Source(StringRef(ElfBuf, sizeof(ElfBuf)), "ELF");
+    ObjFile = cantFail(ObjectFile::createObjectFile(Source));
+  }
+
+  void initializeBOLT() {
+    Relocation::Arch = ObjFile->makeTriple().getArch();
+    BC = cantFail(BinaryContext::createBinaryContext(
+        ObjFile->makeTriple(), std::make_shared<orc::SymbolStringPool>(),
+        ObjFile->getFileName(), nullptr, /*IsPIC*/ false,
+        DWARFContext::create(*ObjFile.get()), {llvm::outs(), llvm::errs()}));
+    ASSERT_FALSE(!BC);
+  }
+
+  char ElfBuf[sizeof(typename ELF64LE::Ehdr)] = {};
+  std::unique_ptr<ObjectFile> ObjFile;
+  std::unique_ptr<BinaryContext> BC;
+
+  /// Return true when the expected \p SampleSize profile data are generated and
+  /// contain all the \p ExpectedEventNames.
+  bool checkEvents(uint64_t PID, size_t SampleSize,
+                   const StringSet<> &ExpectedEventNames) {
+    DataAggregator DA("<pseudo input>");
+    DA.ParsingBuf = opts::ReadPerfEvents;
+    DA.BC = BC.get();
+    DataAggregator::MMapInfo MMap;
+    DA.BinaryMMapInfo.insert(std::make_pair(PID, MMap));
+
+    DA.parseSpeAsBasicEvents();
+
+    for (auto &EE : ExpectedEventNames)
+      if (!DA.EventNames.contains(EE.first()))
+        return false;
+
+    return SampleSize == DA.BasicSamples.size();
+  }
+};
+
+} // namespace bolt
+} // namespace llvm
+
+// Check that DataAggregator can parseSpeAsBasicEvents for branch events when
+// combined with other event types.
+
+TEST_F(PerfSpeEventsTestHelper, SpeBranches) {
+  // Check perf input with SPE branch events.
+  // Example collection command:
+  // ```
+  // perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY
+  // ```
+
+  opts::ReadPerfEvents =
+      "1234          instructions:              a002    a001\n"
+      "1234          instructions:              b002    b001\n"
+      "1234          instructions:              c002    c001\n"
+      "1234          instructions:              d002    d001\n"
+      "1234          instructions:              e002    e001\n";
+
+  EXPECT_TRUE(checkEvents(1234, 10, {"branch-spe:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeBranchesAndCycles) {
+  // Check perf input with SPE branch events and cycles.
+  // Example collection command:
+  // ```
+  // perf record -e cycles:u -e 'arm_spe_0/branch_filter=1/u' -- BINARY
+  // ```
+
+  opts::ReadPerfEvents =
+      "1234          instructions:              a002    a001\n"
+      "1234              cycles:u:                 0    b001\n"
+      "1234              cycles:u:                 0    c001\n"
+      "1234          instructions:              d002    d001\n"
+      "1234          instructions:              e002    e001\n";
+
+  EXPECT_TRUE(checkEvents(1234, 8, {"branch-spe:", "cycles:u:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeAnyEventAndCycles) {
+  // Check perf input with any SPE event type and cycles.
+  // Example collection command:
+  // ```
+  // perf record -e cycles:u -e 'arm_spe_0//u' -- BINARY
+  // ```
+
+  opts::ReadPerfEvents =
+      "1234              cycles:u:                0     a001\n"
+      "1234              cycles:u:                0     b001\n"
+      "1234          instructions:                0     c001\n"
+      "1234          instructions:                0     d001\n"
+      "1234          instructions:              e002    e001\n";
+
+  EXPECT_TRUE(
+      checkEvents(1234, 6, {"cycles:u:", "instruction-spe:", "branch-spe:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeNoBranchPairsRecorded) {
+  // Check perf input that has no SPE branch pairs recorded.
+  // Example collection command:
+  // ```
+  // perf record -e cycles:u -e 'arm_spe_0/load_filter=1,branch_filter=0/u' --
+  // BINARY
+  // ```
+
+  testing::internal::CaptureStderr();
+  opts::ReadPerfEvents =
+      "1234          instructions:                 0    a001\n"
+      "1234              cycles:u:                 0    b001\n"
+      "1234          instructions:                 0    c001\n"
+      "1234              cycles:u:                 0    d001\n"
+      "1234          instructions:                 0    e001\n";
+
+  EXPECT_TRUE(checkEvents(1234, 5, {"instruction-spe:", "cycles:u:"}));
+
+  std::string Stderr = testing::internal::GetCapturedStderr();
+  EXPECT_EQ(Stderr, "PERF2BOLT-WARNING: no SPE branches found\n");
+}
+
+#endif
github-actions 149 days ago (edited 149 days ago)

✅ With the latest revision this PR passed the C/C++ code formatter.

paschalis-mpeis 149 days ago

This PR is an implementation of the (4a) approach of:

We did some limited, quick testing and there was no clear winner between the two approaches, but the --spe flag is introduced in a way to accommodate both.

I believe @kaadam had some work on (4b)? Maybe at some point we could additionally have that merged, so that the community can test on a wider set of apps/workloads. I believe there won't be dramatic performance changes.

Please give SPE a try along with this patch and report any feedback. To check if SPE is available on your machine, see point (3) on the issue. Let us know if more information is needed on how to enable or use SPE!

yota9 commented on 2024-12-20
yota9 149 days ago

Thanks for your amazing job!

Conversation is marked as resolved
bolt/lib/Profile/DataAggregator.cpp
yota9 149 days ago 👍 1
Suggested change:
-  while (checkAndConsumeFS()) {
+  while (checkAndConsumeFS());
paschalis-mpeis 123 days ago

Thanks for the suggestion. Could do that but clang-format would require the semicolon to be on the next line anyway. There is similar code to this in other parsing functions like parseBasicSample().

Conversation is marked as resolved
bolt/tools/driver/llvm-bolt.cpp
                                           cl::Required, cl::cat(BoltCategory),
                                           cl::sub(cl::SubCommand::getAll()));

+extern cl::opt<bool> ArmSPE;
yota9 149 days ago

Probably should be moved to CommandLineOpts.h

Conversation is marked as resolved
bolt/tools/driver/llvm-bolt.cpp
       RewriteInstance &RI = *RIOrErr.get();
+
+      if (opts::AggregateOnly && !RI.getBinaryContext().isAArch64() &&
+          opts::ArmSPE == 1) {
yota9 149 days ago
Suggested change:
-          opts::ArmSPE == 1) {
+          opts::ArmSPE) {
Conversation is marked as resolved
bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
+
+CHECK-SPE-NO-LBR: PERF2BOLT: Starting data aggregation job
+
+RUN: perf record -e cycles -q -o %t.perf.data -- %t.exe
yota9 149 days ago (edited 149 days ago)

Maybe it would be better to add a static perf record rather than launching perf every time for this test...
UPD: although it might be incompatible between machines, so probably NWMD

paschalis-mpeis 123 days ago

Agreed, I would have preferred that as well. I was not able to work around this for the CHECK-SPE-LBR test.

Conversation is marked as resolved
bolt/lib/Profile/DataAggregator.cpp
yota9 149 days ago

Please add spaces after conditions and loops, with or without {}.

paschalis-mpeis commented on 2024-12-20
Conversation is marked as resolved
bolt/lib/Profile/DataAggregator.cpp
+
+  // Show more meaningful event names on boltdata.
+  if (Event->str() == "instructions:")
+    Event = *AddrResTo != 0x0 ? "branch-spe:" : "instruction-spe:";
paschalis-mpeis 149 days ago (edited 123 days ago)

Just adding that this might cause some incompatibility with instruction count normalization detection. Unrelated to this patch, but I see that cycles:u are not detected by that pattern either. (EDIT: with another look I see that this should be detected.)


Back to this patch, an alternative approach to 'tag' those events could be to avoid using --itrace=i1i for aggregation in perf2bolt, keep the event name given by perf and create the additional sample only for the branch type.

paschalis-mpeis 123 days ago

Hey @yota9,

Thanks a lot for your review!

I addressed your comments except this one (left a comment there).
Please have another look and let me know of any further changes.

paschalis-mpeis requested a review from yota9 123 days ago
maksfb commented on 2025-01-16
Conversation is marked as resolved
bolt/lib/Profile/DataAggregator.cpp
   findPerfExecutable();
maksfb 121 days ago (edited 121 days ago)

What's the expected output of this perf script? On my system ip is always 0 for the data collected using perf record -e arm_spe_0/branch_filter=1/u -- ....

maksfb 121 days ago

The zeros on my side could be b/c we don't have kernel patches listed in #115333.

paschalis-mpeis 121 days ago

Hey Maks,

We expect:

  • branch entries to have non-zero values for both addr and ip
  • non-branch entries to have non-zero value only for ip

This unit test illustrates such an output.

On my system ip is always 0 for the data collected ...

The zeros you are seeing actually refer to addr and not ip. It is a bit confusing, but the printed order differs from the order of the fields specified on the command line. You can verify this using:

perf script -F pid,event,ip ..
perf script -F pid,event,addr .. # seeing zeros on unpatched perf

The zeros on my side could be b/c we don't have kernel patches listed in #115333.

Correct. We believe these patches might make it to the next kernel RC.
In the meantime, I've added some sample instructions on the issue to quickly use it locally, without needing to re-flash a kernel.

paschalis-mpeis 117 days ago (edited 117 days ago)

Update: As of yesterday, the required patches are part of Linux 6.13 (see comment on issue)

aaupov 121 days ago

Hi Paschalis, thank you for working on this.
The benefit that SPE has over IP sampling is the edge frequency information. So instead of creating two basic (IP) samples we should create branch samples (LBR) with stack depth one. Branch samples are later attached to CFG edges. This should improve the resulting performance when using SPE profiling.
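
To make the distinction concrete, here is a minimal sketch of the two ways an SPE pair could be aggregated. The types and function names are hypothetical and do not correspond to BOLT's data structures; the point is only that per-address counts discard the edge, while per-edge counts keep it.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical SPE source/target pair (illustration only).
struct SpePair {
  uint64_t From;
  uint64_t To;
};

typedef std::pair<uint64_t, uint64_t> Edge;

// Two basic samples per pair: each address is counted independently.
std::map<uint64_t, uint64_t> asBasicSamples(const std::vector<SpePair> &Pairs) {
  std::map<uint64_t, uint64_t> Hits;
  for (const SpePair &P : Pairs) {
    ++Hits[P.From];
    ++Hits[P.To];
  }
  return Hits;
}

// One branch sample per pair (stack depth one): the edge itself is counted.
std::map<Edge, uint64_t> asBranchSamples(const std::vector<SpePair> &Pairs) {
  std::map<Edge, uint64_t> Edges;
  for (const SpePair &P : Pairs)
    ++Edges[std::make_pair(P.From, P.To)];
  return Edges;
}

// Usage example: the same source branches to two different targets.
void demoAggregation() {
  std::vector<SpePair> Pairs;
  Pairs.push_back({0xa001, 0xa002});
  Pairs.push_back({0xa001, 0xa002});
  Pairs.push_back({0xa001, 0xb000});
  // Basic samples only say the source was hit three times.
  assert(asBasicSamples(Pairs).at(0xa001) == 3);
  // Branch samples keep how often each edge was observed.
  assert(asBranchSamples(Pairs).at(std::make_pair<uint64_t, uint64_t>(0xa001, 0xa002)) == 2);
}
```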

aaupov 121 days ago

This PR is an implementation of the (4a) approach of:

We did some limited, quick testing and there was no clear winner between the two approaches, but the --spe flag is introduced in a way to accommodate both.

Missed this comment. Am I reading it right that you didn't see a perf difference between registering SPE as two basic events or one branch event? In this case, can you please try -infer-fall-throughs option with the latter?

maksfb commented on 2025-01-17
Conversation is marked as resolved
bolt/tools/driver/llvm-bolt.cpp
+
+      if (opts::AggregateOnly && !RI.getBinaryContext().isAArch64() &&
+          opts::ArmSPE == 1) {
+        errs() << "BOLT-ERROR: -spe is available only on AArch64.\n";
maksfb 121 days ago

nit: use ToolName instead of "BOLT-ERROR".

paschalis-mpeis 121 days ago (edited 102 days ago)

Hey Amir and Maks,

Thank you for taking a look at this!

Am I reading it right that you didn't see a perf difference between registering SPE as two basic events or one branch event? In this case, can you please try -infer-fall-throughs option with the latter?

Correct, in some preliminary internal tests we found both approaches to be close to each other.

Thanks for your suggestion to use -infer-fall-throughs. I thought LBR mode was inferring fall-through branches by default. But it looks like this has to be manually specified?

Let me share my understanding on the LBR format to see if I got this right:
Each LBR event gets a contiguous stack of taken branches. Any other branches that may lie in between them are known to be fall-throughs, which BOLT can infer. E.g., if we have:

  • TK1 -> TK2 -> TK3, then BOLT can propagate CFG hotness to:
  • TK1 -> FT1a -> FT1b .. -> TK2 -> FT2a -> FT2b .. -> TK3
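
A minimal sketch of that inference, under the assumption that the taken branches are given in execution order (oldest first): the code between the target of one taken branch and the source of the next one must have run sequentially. `TakenBranch`, `FallThroughRange`, and `inferFallThroughs` are hypothetical names and this is not BOLT's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical taken-branch record: source and target addresses.
struct TakenBranch {
  uint64_t From;
  uint64_t To;
};

// Address range [Start, End] that executed without a taken branch.
struct FallThroughRange {
  uint64_t Start;
  uint64_t End;
};

// For a contiguous stack of taken branches (oldest first), everything between
// the target of entry I and the source of entry I+1 is a fall-through range.
std::vector<FallThroughRange>
inferFallThroughs(const std::vector<TakenBranch> &Stack) {
  std::vector<FallThroughRange> Ranges;
  for (size_t I = 0; I + 1 < Stack.size(); ++I) {
    uint64_t Start = Stack[I].To;
    uint64_t End = Stack[I + 1].From;
    if (Start <= End) // skip entries that cannot form a forward range
      Ranges.push_back({Start, End});
  }
  return Ranges;
}

// Usage example: three taken branches yield two fall-through ranges.
void demoInferFallThroughs() {
  std::vector<TakenBranch> Stack;
  Stack.push_back({0x100, 0x200}); // TK1
  Stack.push_back({0x220, 0x300}); // TK2
  Stack.push_back({0x310, 0x400}); // TK3
  assert(inferFallThroughs(Stack).size() == 2);
}
```

A single isolated pair gives no such range, since there is no neighbouring entry to bound it.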

SPE, on the other hand, is a statistical sampling method, meaning the collected packets are not captured contiguously. Each pair comes from a packet that looks like:

.  00000040:         PC 0xAB0 el2 ns=1
.  00000049:         PAD
.  00000053:         B COND
.  00000055:         EV RETIRED NOT-TAKEN
.  0000005a:         LAT 7 ISSUE
.  0000005d:         LAT 8 TOT
.  00000060:         TGT 0xAB4 el2 ns=1
.  00000069:         PAD
.  00000077:         TS 1234

(note: you can inspect native SPE packets w/ perf script -D)

From this example we have 0xAB0 -> 0xAB4 (a src/tgt pair), where 0xAB0 is a branch that was NOT-TAKEN.
The tgt 0xAB4 is the target address of some block (i.e., not a branch). We have no information on whether the branch of that target block will be taken or not. Therefore, my understanding is that we cannot infer any branches in between src/tgt. And I believe that is why we found the two approaches to be close to each other.
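
For readers who want to pull the src/tgt pair out of a decoded packet like the one above, here is a small sketch. It assumes the textual layout shown in this comment (a leading dot, an offset, then the packet kind) and treats a branch as taken unless the EV line contains NOT-TAKEN; both are assumptions based on this single example, and `SpeBranchRecord` and `parseSpePacket` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical record type; the field names are not taken from perf or BOLT.
struct SpeBranchRecord {
  uint64_t PC;
  uint64_t Target;
  bool Taken;
};

// Extract PC, TGT and the taken/not-taken outcome from a decoded packet dump.
// Returns false if either the PC or the TGT line is missing.
bool parseSpePacket(const std::string &Dump, SpeBranchRecord &Out) {
  bool SawPC = false, SawTarget = false;
  Out = SpeBranchRecord{0, 0, true};
  std::istringstream Lines(Dump);
  std::string Line;
  while (std::getline(Lines, Line)) {
    std::istringstream Fields(Line);
    std::string Dot, Offset, Kind, Value;
    if (!(Fields >> Dot >> Offset >> Kind))
      continue;
    if (Kind == "PC" || Kind == "TGT") {
      if (!(Fields >> Value))
        continue;
      // Base 16 accepts an optional "0x" prefix.
      uint64_t Addr = std::strtoull(Value.c_str(), nullptr, 16);
      if (Kind == "PC") {
        Out.PC = Addr;
        SawPC = true;
      } else {
        Out.Target = Addr;
        SawTarget = true;
      }
    } else if (Kind == "EV") {
      while (Fields >> Value)
        if (Value == "NOT-TAKEN")
          Out.Taken = false;
    }
  }
  return SawPC && SawTarget;
}

// Usage example with a shortened version of the packet above.
void demoParseSpePacket() {
  SpeBranchRecord R;
  bool Ok = parseSpePacket(".  00000040:         PC 0xAB0 el2 ns=1\n"
                           ".  00000055:         EV RETIRED NOT-TAKEN\n"
                           ".  00000060:         TGT 0xAB4 el2 ns=1\n",
                           R);
  assert(Ok && R.PC == 0xAB0 && R.Target == 0xAB4 && !R.Taken);
  (void)Ok;
}
```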

Please do share your thoughts on this.

Do you think there are any other benefits when using the LBR format? It can additionally utilize prediction information (miss/hit), but we haven't found this to be that beneficial for the quite-limited SPE branch data (when compared to LBR traces).

aaupov
aaupov121 days ago

@paschalis-mpeis is there a way to configure SPE to only collect taken branches? My impression was that it's possible, e.g. based on this: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/introduction-to-statistical-profiling-support-in-streamline

Event packets, which provide important information about each sampled instruction.
This information includes:
...
Was a mis-predicted or not-taken branch

But I couldn't find any info regarding configuring perf filter to collect it.

aaupov
aaupov121 days ago (edited 120 days ago)

Let me share my understanding on the LBR format to see if I got this right:
Each LBR event gets a contiguous stack of taken branches. Any other branches that lie in between them are known to be fall-throughs, which BOLT can infer. E.g., if we have:

Right, with taken branch stacks, we automatically "infer" fall-throughs between entries, and that becomes part of profile data that gets attached.

With SPE, if we're able to distinguish taken branches from not taken, I think we can similarly make that part of profile data so won't need infer-fall-through. If we can't distinguish them, but can filter by taken branches only, then we'd need to use infer-fall-throughs to assign fallthrough counts after taken branch counters are attached to a CFG.

paschalis-mpeis
paschalis-mpeis118 days ago (edited 102 days ago)

is there a way to configure SPE to only collect taken branches?

What I believe you are asking here is to configure SPE to get us a pair of $\bf\textsf{\color{blue}SRC}$ -> $\textsf{\color{blue}TGT}$, where $\bf\textsf{\color{blue}SRC}$ and $\bf\textsf{\color{blue}TGT}$ are two taken branches captured in sequential order and are not necessarily directly linked in the CFG. In other words, make SPE act as an LBR-like buffer with a branch stack depth of 1.

I don't think that is possible. SPE does a periodic, non-contiguous capture of event packets, in our case branches. Please consider the example below:

.  000007c0:  PC 0xAFF el2 ns=1
.  000007c9:  PAD
.  000007d3:  B COND
.  000007d5:  EV RETIRED NOT-TAKEN MISPRED
.  000007da:  LAT 12 ISSUE
.  000007dd:  LAT 13 TOT
.  000007e0:  TGT 0xB03 el2 ns=1
.  000007e9:  PAD
.  000007f7:  TS 12345
  • PC:
    • is the instruction that was captured, in our case the source branch
    • we can know whether the branch at PC was taken or not, and if it was a prediction miss/hit
    • this aligns with the blogpost you've provided
  • TGT:
    • is the target address of the landing block, ie where execution will continue next
    • SPE has no information whether the next branch will be taken or not
    • we cannot configure SPE to capture two taken branches in sequential program order, which are not necessarily directly linked in the CFG
  • PBT: optional HW feature pointing to the previous block (FEAT_SPE_PBT)
    • it is a statistical profiling of the Previous Branch Target
    • To my understanding, there is no known HW implementation of this optional feature.

I could re-word some points in the PR/patch to make the above more clear.

(@mikewilliams-arm feel free to correct me if I missed anything)

paschalis-mpeis
paschalis-mpeis118 days ago (edited 117 days ago)

With SPE, if we're able to distinguish taken branches from not taken, I think we can similarly make that part of profile data so won't need infer-fall-through. If we can't distinguish them, but can filter by taken branches only, then we'd need to use infer-fall-throughs to assign fallthrough counts after taken branch counters are attached to a CFG.

Currently, there is no such information but we could expose it with more follow-up patches on perf/linux. Please note that if we filter-out any non-taken branches, then we'll exclude information we cannot later infer.

Given the SPE limitations I've explained in previous comments, will this taken/not-taken additional information (or the infer flag) help propagating additional CFG hotness data?
Let's assume we captured the below entries:

  • Branch1 (taken) -> Block1
  • Branch2 (fallthrough) -> Block2

Regardless of whether the source branch was FT or Taken, we still don't know what will happen in Block1 and Block2 in terms of branching. I think BOLT will not be able to propagate information past these blocks, unless those are part of some extended basic block (EBB). In that case, it depends on how BOLT deals with EBBs and whether it lacks any support for them in BasicAggregation (thus giving an advantage to the LBR-format)?

mikewilliams-arm
mikewilliams-arm118 days ago👍 1

is there a way to configure SPE to only collect taken branches?

For the avoidance of doubt, and benefit of anyone finding this and reading it out of context, you can configure SPE to collect only taken branches, but only from FEAT_SPEv1p2. That's a relatively new feature in the field. From looking at the kernel sources, you need to check for /sys/devices/arm_spe_0/format/inv_event_filter and the syntax would be something like perf record -e arm_spe_0/branch_filter=1,inv_event_filter=64/. (I might be wrong - I don't have access to such a system.)
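The check described above could be scripted roughly as follows. This is a sketch under the same caveats as the comment itself: the sysfs path and the `inv_event_filter=64` value are copied from the text above and are unverified, and `spe_event_spec` is a hypothetical helper name.

```python
import os
import tempfile


def spe_event_spec(format_dir: str = "/sys/devices/arm_spe_0/format") -> str:
    """Return a perf event string for SPE branch sampling, adding the
    taken-only hardware filter when the PMU advertises it."""
    base = "arm_spe_0/branch_filter=1"
    if os.path.exists(os.path.join(format_dir, "inv_event_filter")):
        # Assumption: the value 64 is taken from the comment above
        # and has not been verified on real hardware.
        return base + ",inv_event_filter=64/"
    return base + "/"


# Demonstrate both outcomes using a temporary stand-in for the sysfs dir.
with tempfile.TemporaryDirectory() as fake_format_dir:
    without_filter = spe_event_spec(fake_format_dir)
    open(os.path.join(fake_format_dir, "inv_event_filter"), "w").close()
    with_filter = spe_event_spec(fake_format_dir)

print(without_filter)
print(with_filter)
```

The returned string would then be passed to `perf record -e`.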

You can always do this filtering post-hoc in software. perf record -e arm_spe_0/branch_filter=1/ should work on all SPE implementations, and according to Google's AI, 60% of branches are taken, so it's about a 66% overhead to store all the not taken branches and filter them out. perf script --inject=b used to do a poor job of preserving all the branch information through the injected events, making this harder to do. I believe that is something being looked into, if it's not already addressed.

However, even so, each sampled branch is exactly that - a single sampled branch. It does not collect sequences of branches other than through the aforementioned optional PBT extension. So, you can only infer that where you came from and where you branched to were executed.
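A small sketch of the post-hoc filtering and of the overhead arithmetic mentioned above. The tuple layout and function names are assumptions made for the example, and the 60% taken fraction is simply the figure quoted in this comment.

```python
from typing import Iterable, List, Tuple

# Assumed layout for one sampled branch: (pc, tgt, not_taken).
Sample = Tuple[int, int, bool]


def keep_taken(samples: Iterable[Sample]) -> List[Sample]:
    """Drop samples whose source branch was not taken."""
    return [sample for sample in samples if not sample[2]]


def storage_overhead(taken_fraction: float) -> float:
    """Extra samples stored, relative to a taken-only capture, when
    not-taken branches are recorded and filtered out afterwards."""
    return (1.0 - taken_fraction) / taken_fraction


samples = [
    (0xA00, 0xA04, True),   # conditional branch that fell through
    (0xA00, 0xBBB, False),  # same branch, taken this time
]
taken_only = keep_taken(samples)
overhead = storage_overhead(0.6)
print(len(taken_only), round(overhead, 2))
```

With 60% of branches taken, the remaining 40% are stored on top of them, so the overhead is 0.4 / 0.6, or roughly two thirds.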

paschalis-mpeis
paschalis-mpeis117 days ago (edited 117 days ago)

Great, thanks a lot Michael for filling in with details!

Indeed the differences are subtle. I've answered a slightly different question, which I've now refined as it wasn't fully correct:

What I believe you are asking here is to configure SPE to get us a pair of $\bf\textsf{\color{blue}SRC}$ -> $\textsf{\color{blue}TGT}$, where $\bf\textsf{\color{blue}SRC}$ and $\bf\textsf{\color{blue}TGT}$ are two taken branches captured in sequential order and are not necessarily directly linked in the CFG.

Whether we filter-out the non-taken branches at the HW collection level (i.e., with the inv_event_filter interface), or in post-processing SW, the information loss holds:

Please note that if we filter-out any non-taken branches, then we'll exclude information we cannot later infer.

And this is because we'll end up with all the taken branch pairs that have direct links in the CFG.
In other words, we cannot get two branches that are indirect ancestors in the CFG, which would have left opportunities for inferring FTs.

ilinpv
ilinpv commented on 2025-02-03
bolt/lib/Profile/DataAggregator.cpp
ilinpv104 days ago

Am I correct in understanding that this is the case where we have a sample for a branch SRC -> TGT which may or may not have been taken? However, we increase the hotness of both the SRC and TGT nodes in any case, always registering samples for both nodes, without taking into account the ratio of samples where this branch was taken versus not taken?

paschalis-mpeis102 days ago

Hey Pavel,

Reading this back, you are concerned that storing samples on TGT for branches that were NOT-TAKEN might increase the hotness of a block that shouldn't receive it. Correct?

That should not be a concern: regardless of whether a branch is taken or not, the reported TGT is what was architecturally executed. In other words, NOT-TAKEN (or its absence) characterizes what happened in the src branch (PC), while TGT will always point to the path we end up taking.

So, for fall-through SPE packets, the TGT address would always be the next address from PC (ie, 0xA00 + 4, which is the instruction size in AArch64):

PC 0xA00
B COND
EV RETIRED NOT-TAKEN
TGT 0xA04

For taken branches, the TGT can be at a distance other than 4 from PC:

PC 0xA00
B COND
EV RETIRED
TGT 0xBBB

In my previous examples I was using mock addresses for PC/TGT, so I've updated any relevant examples to avoid confusion.
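To make the sample accounting concrete, here is a minimal sketch of how basic samples could be counted from (ADDR, IP) entries. It is not the DataAggregator code: `count_basic_samples` is a hypothetical helper, and it assumes IP holds the sampled PC and ADDR holds TGT, with ADDR set to 0x0 for non-branch events as stated in the PR description.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_basic_samples(entries: Iterable[Tuple[int, int]]) -> Counter:
    """Count basic samples per address from (addr, ip) entries.

    Branch entries contribute two samples (one on IP, one on ADDR);
    entries with ADDR == 0 contribute a single sample on IP."""
    hits: Counter = Counter()
    for addr, ip in entries:
        hits[ip] += 1
        if addr != 0:
            # TGT is always the address that was actually executed next,
            # whether or not the source branch was taken.
            hits[addr] += 1
    return hits


entries = [
    (0xA04, 0xA00),  # not-taken: TGT is PC + 4
    (0xBBB, 0xA00),  # taken: TGT is the branch destination
    (0x0, 0xC00),    # non-branch event: ADDR is 0x0
]
hits = count_basic_samples(entries)
print({hex(address): count for address, count in sorted(hits.items())})
```

Both branch entries add a sample to the source at 0xA00, while each target receives a sample only on the path that was really executed.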

ilinpv96 days ago👍 1

Right, thank you @paschalis-mpeis for clarifying the taken/not-taken information and updating the examples. @aaupov @maksfb would you like any additional explanations regarding SPE packets? Generally speaking, SPE provides event-based sampling for branches and doesn't have enough information to create a trace of N>1 branches and infer fall-throughs. We are aiming to add BRBE (Branch Record Buffer Extension) support for this in BOLT and provide an LBR-like branch stack trace with it.

kaadam90 days ago (edited 90 days ago)👍 1

Hi, thanks Paschalis for your example.
Maybe it's worth highlighting that the not-taken event only relates to conditional instructions (conditional branch or compare-and-branch): it indicates that the instruction failed its condition code check, that's it. Since TGT (as you mentioned) "will always point to the path we end up taking", the presence of the not-taken event type is not relevant to us in this case; accordingly, we will always get the 'taken paths'. In theory this branch information supports our optimization, and BOLT will be able to rely on it.

paschalis-mpeis90 days ago

Correct, thanks Adam. This is irrelevant to any unconditional branching (including call/ret).
Skipping 'non-taken' conditional branches is the optimization LBR/BRBE can do, as that can be inferred in post-processing.

paschalis-mpeis
paschalis-mpeis95 days ago (edited 90 days ago)👍 1

Just adding that we are in the process of upstreaming 'brstack' support for SPE, which would handle the PBT feature nicely for us, and would sit nicely within bolt sources. The latest revision of these patches could also be found here:

Once upstreamed, we can adapt the patch to work for the LBR-format (cc: @kaadam).

paschalis-mpeis [BOLT][AArch64] Introduce SPE mode in BasicAggregation
4c68b93d
paschalis-mpeis clang-format fix
07829117
paschalis-mpeis Addressing reviewers (1)
e74d2ae7
paschalis-mpeis Addressing reviewers (2)
47a986de
paschalis-mpeis paschalis-mpeis force pushed from 10f72191 to 47a986de 80 days ago
paschalis-mpeis
paschalis-mpeis80 days ago

Forced push to rebase to latest main to address conflicts (PCs/IPs were removed from the LBR samples).

Will be proceeding soon with an LBR patch, which for now will be stacked on top of this PR (cc: @kaadam).
