reveal.js

# Week 7  
## Fan Mail and Corpus Building

Notes:
> Introduce the shift from analyzing comics to building datasets. Emphasize that today is methodological and practical.

---

## Where We Are in the Course

Week 5 → Concepts (paratexts, text as data)  
Week 7 → Corpus (building datasets)  
Week 8 → Analysis (methods + interpretation)

Notes:
> Bridge slide: today connects theory (Week 5) to computation (Week 8).

---

### Today’s Focus

Turning sources into structured data for analysis

- Fan mail as data  
- OCR workflow  
- What is a corpus?  
- Building a corpus  
- File naming + organization

Notes:
> Walk through agenda. Frame as moving from interpretation to infrastructure.

---

## Key Question

What turns a pile of fan mail into a *corpus*?

Notes:
> Push students to think about intentionality, selection, and structure—not just collecting stuff.

---

## From Text to Data to Analysis

Week 5 → What counts as the text?  
Week 7 → How do we structure that text as data?  
Week 8 → How do we analyze it?

---

scan → OCR → text → corpus → analysis

Notes:
> In Week 5, we expanded what counts as the “text” of a comic—especially paratexts like letter columns.  
>
> This week, we turn that expanded notion of text into something structured and analyzable.
>
> Next week, we actually analyze it.

---

## What Is Fan Mail?

- Letter columns  
- “Letterhacks”  
- Fan–publisher interaction

Notes:
> Connect to prior weeks on paratexts and participatory culture.

---

## Text Analysis (Preview)

**Next week**, we will:

- Identify patterns across many texts  
- Compare language across issues and decades  
- Use computational tools to support interpretation

**Today**, we prepare the data

Notes:
> Keep high-level—don’t go deep into methods yet. This is setup.

---

## Important

Most DH projects fail at the **corpus stage**, not analysis.

Notes:
> This is the key takeaway of the day. Repeat it later.

---

## Activity 1

### Explore Fan Mail

Find 2–3 examples:

- Title, issue, date  
- Topics discussed, e.g., story, plot, dialogue, art, characters, creators, process, current events, mistakes, criticisms  
- Writer details, e.g., gender, occupation  
- Would this be an interesting find in a corpus?

After working individually, we will share our findings in small groups, and then each group will report back to the full class.

Notes:
> Circulate and push students to justify inclusion/exclusion decisions.

---

## What Is a Corpus?

- Structured dataset  
- Selected materials  
- Designed for analysis

Notes:
> Distinguish from archive or random collection.

---

## Key Idea

Your corpus is an **argument**

Notes:
> Emphasize interpretive stakes. Every corpus encodes assumptions.

---

## Corpus Decisions

- Scope  
- Unit of analysis  
- Inclusion  
- Exclusion  
- Normalization

Notes:
> Give quick examples: letters vs pages vs issues; Marvel only vs multiple publishers.

---

## Bridge to Computational Analysis

Computational methods require:

- Clean text  
- Consistent structure  
- Defined units  
- Metadata

---

No corpus → No computation

Notes:
> Explicit link to next week.

---

## Inclusion and Exclusion

Two ways to think about building a corpus:

---

### Model 1: Selective Corpus

- Choose specific items  
- Filter by theme or interest  
- Example: letters about continuity errors

---

### Model 2: Bounded Exhaustive Corpus

- Define a scope  
- Include *everything within that scope*  
- Example: all Spider-Man fan mail, 1963–1996

Notes:
> Students often assume corpus-building means selecting interesting examples.  
> That’s one valid approach—what we might call a selective corpus.  
> 
> But there’s another approach that I use in my own research:  
> define a scope carefully, and then include everything within that scope.  
> 
> So instead of selecting individual letters, I select the *frame*—and then everything inside it counts.

---

## Corpus Strategies

| Strategy | Description | Example |
|----------|------------|--------|
| Sampling | Select a subset | 100 letters across decades |
| Thematic selection | Filter by topic | Letters about continuity |
| Convenience corpus | Use what’s available | Digitized issues only |
| **Bounded exhaustive** | Define scope, include all | All ASM fan mail 1963–1996 |

---

### Key Idea

> You’re not choosing individual items—you’re identifying the rules for what counts

Notes:
> Here are a few different strategies people use.
>
> Sampling and thematic selection are probably the most intuitive.  
> Convenience corpora are also very common—just using what’s available digitally.  
> 
> What I tend to do is this last approach: bounded exhaustiveness.  
> 
> In my own research, for example, I’ve worked with:
> - all Spider-Man fan mail from 1963 to 1996  
> - all poetry by Shelley and Swinburne  
> - all issues of specific 1970s prozines  
> 
> Within that scope, I’m not selecting individual items—I’m including everything.  
> 
> But—and this is important—that doesn’t mean there’s no exclusion.  
> The exclusion happens when I define the scope itself.

---

## Activity 2

### Define a Corpus

In groups:

- Research question  
- Scope  
- Inclusion/exclusion  
- Unit of analysis

Notes:
> Have groups report one decision and one difficulty.

---

## File Naming

Bad:

`scan1.jpg`  
`IMG_2045.png`

Good:

`asm_1965_i003_p012.txt`

Notes:
> Students often underestimate this—frame it as critical infrastructure.

---

## Naming Pattern

`[title]_[year]_i###_p###.ext`

(`i` = issue, `p` = page, `###` = zero-padded number)

Notes:
> Tell them they can adapt, but consistency matters more than perfection.

---
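
## Naming Pattern: A Python Sketch

A minimal sketch of how names following this pattern could be generated instead of typed by hand. The function name `corpus_filename` and its parameters are hypothetical, and the three-digit padding is an assumption taken from the example `asm_1965_i003_p012.txt`.

```python
def corpus_filename(title: str, year: int, issue: int, page: int, ext: str = "txt") -> str:
    """Build a filename such as asm_1965_i003_p012.txt."""
    # Lowercase the title and replace spaces so the name stays shell-friendly.
    slug = title.strip().lower().replace(" ", "_")
    # Zero-pad issue and page to three digits (assumed width) so names sort correctly.
    return f"{slug}_{year}_i{issue:03d}_p{page:03d}.{ext}"


print(corpus_filename("ASM", 1965, 3, 12))  # asm_1965_i003_p012.txt
```

Generating every name from one function keeps the whole corpus on the same pattern.

---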

## Naming Rules

- lowercase  
- no spaces  
- consistent  
- meaningful  
- zero-padded numbers

Notes:
> Reinforce simplicity. Avoid overengineering.

---

## Zero-padding

```
issue1
issue10
issue2
```

Sorts as `issue1, issue10, issue2` ❌

```
issue01
issue02
issue10
```

Sorts as `issue01, issue02, issue10` ✅

---
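
## Zero-padding: A Python Sketch

The sorting behavior can be checked directly. This is a small illustration, assuming plain alphabetical (string) sorting, which is how most file browsers and scripts order names by default. The variable names are made up for the example.

```python
unpadded = ["issue1", "issue2", "issue10"]
# Format each number with at least two digits: 1 -> "01", 10 -> "10".
padded = [f"issue{n:02d}" for n in (1, 2, 10)]

print(sorted(unpadded))  # ['issue1', 'issue10', 'issue2']
print(sorted(padded))    # ['issue01', 'issue02', 'issue10']
```

Strings are compared character by character, so `"10"` lands before `"2"` unless the numbers share a width.

---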
## Filenames vs Metadata

Filenames:
- short
- stable
- consistent
- just enough information

Metadata:
- detailed
- flexible
- authoritative
- where meaning lives

Notes:
> Early on, it’s helpful to include more information in filenames so you can keep things straight.
> But in more advanced work, the database becomes the authoritative source, and filenames become simpler and more stable.

---

## Folder Organization

Option A:

```
corpus/
  asm/
  foom/
```

Notes:
> Good for small, publication-centered corpora.

---

## Folder Organization

Option B:

```
corpus/
  raw/
  ocr/
  cleaned/
```

Notes:
> Better for workflows and larger datasets.

---
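
## Folder Organization: A Python Sketch

One possible way to create the Option B layout with a script, so every project starts from the same structure. The helper name `make_corpus_folders` is hypothetical. The folder names come from the slide above.

```python
from pathlib import Path


def make_corpus_folders(root: str = "corpus") -> list[Path]:
    """Create raw/, ocr/, and cleaned/ under the given root folder."""
    stages = ["raw", "ocr", "cleaned"]
    folders = [Path(root) / stage for stage in stages]
    for folder in folders:
        # parents=True creates the root if needed; exist_ok=True makes reruns safe.
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling `make_corpus_folders()` creates `corpus/raw`, `corpus/ocr`, and `corpus/cleaned` in the current working directory.

---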

## Metadata

Notes:
> Explain that spreadsheets are the simplest database.

---

## Why Metadata?

- Searchable  
- Sortable  
- Scalable

Notes:
> Ask: what questions become possible with good metadata?

---
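
## Metadata: A Python Sketch

If filenames follow a consistent pattern, a starter metadata table can be derived from them. The sketch below assumes the `asm_1965_i003_p012.txt` style of name. The column names and the helpers `parse_filename` and `metadata_csv` are hypothetical, and richer fields (writer, topic, date) would still need to be entered by hand.

```python
import csv
import io
import re

# Assumed pattern: title_year_i<issue>_p<page>.ext
PATTERN = re.compile(
    r"^(?P<title>[a-z0-9]+)_(?P<year>\d{4})_i(?P<issue>\d+)_p(?P<page>\d+)\.(?P<ext>\w+)$"
)


def parse_filename(name: str) -> dict:
    """Split one filename into metadata fields."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"Filename does not follow the naming pattern: {name}")
    row = match.groupdict()
    for key in ("year", "issue", "page"):
        row[key] = int(row[key])  # store numbers as numbers, not padded strings
    row["filename"] = name
    return row


def metadata_csv(names: list[str]) -> str:
    """Return CSV text with one row per file, sorted by filename."""
    fields = ["filename", "title", "year", "issue", "page", "ext"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for name in sorted(names):
        writer.writerow(parse_filename(name))
    return buffer.getvalue()
```

The resulting text can be saved as a `.csv` file and opened in any spreadsheet program.

---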

## Activity 3

### Build a Mini Corpus

- Rename files  
- Create folders  
- Build metadata

For this activity you may use these [raw image and OCR files](https://indiana.sharepoint.com/:u:/r/sites/msteams_36a456/Shared%20Documents/z616/ff_1967-1968_fan_pages_and_bullpen_bulletins.zip).

Notes:
> This is hands-on. Help students who get stuck early.

---

## Discussion

- What was difficult?  
- What felt arbitrary?  
- What would break at scale?

Notes:
> Push toward reflection on method, not just task completion.

---

## Key Takeaway

A well-built corpus enables analysis  
A messy corpus distorts it

Notes:
> Repeat the central idea from earlier.

---

## Looking Ahead (Week 8)

With a corpus, we can:

- Count words and phrases  
- Compare across issues or decades  
- Identify themes and patterns  
- Track changes over time

---

Today: build the dataset  
Next: analyze it

Notes:
> Preview only.

---

## Week 8 Teaser: What We’ll Do With Your Corpus

Example: Letter Column Analysis

- Count most frequent words (e.g., “Stan”, “story”, “art”)  
- Compare across years (1960s vs 1970s)  
- Track topics (praise, criticism, continuity errors)

---
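
## Week 8 Teaser: A Python Sketch

A rough preview of what counting frequent words might look like, using only the Python standard library. The function name `top_words` is hypothetical, the sample sentence is invented for illustration, and the tokenizer is deliberately crude (lowercase letters and apostrophes only).

```python
import re
from collections import Counter


def top_words(text: str, n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent words in the text with their counts."""
    # Lowercase the text, then keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Invented sample sentence, not taken from a real letter column.
sample = "Stan, the story was great. The art was great too, Stan!"
print(top_words(sample))
```

The same function works on one letter or on a whole cleaned corpus, which is why consistent text files matter.

---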

> Same corpus structure → different questions become possible

Notes:
> Keep this light. The goal is to make students see payoff: their corpus enables analysis next week.

---

## Next

- Final Project Proposal  
- Tales from the Crypt

Notes:
> Explicitly connect today’s work to upcoming assignments.