# Week 7

## Fan Mail and Corpus Building

Notes:
> Introduce the shift from analyzing comics to building datasets. Emphasize that today is methodological and practical.

---

## Where We Are in the Course

Week 5 → Concepts (paratexts, text as data)

Week 7 → Corpus (building datasets)

Week 8 → Analysis (methods + interpretation)

Notes:
> Bridge slide: today connects theory (Week 5) to computation (Week 8).

---

### Today’s Focus

Turning sources into structured data for analysis

- Fan mail as data
- OCR workflow
- What is a corpus?
- Building a corpus
- File naming + organization

Notes:
> Walk through agenda. Frame as moving from interpretation to infrastructure.

---

## Key Question

What turns a pile of fan mail into a *corpus*?

Notes:
> Push students to think about intentionality, selection, and structure, not just collecting stuff.

---

## From Text to Data to Analysis

Week 5 → What counts as the text?

Week 7 → How do we structure that text as data?

Week 8 → How do we analyze it?

---

scan → OCR → text → corpus → analysis

Notes:
> In Week 5, we expanded what counts as the “text” of a comic, especially paratexts like letter columns.
>
> This week, we turn that expanded notion of text into something structured and analyzable.
>
> Next week, we actually analyze it.

---

## What Is Fan Mail?

- Letter columns
- “Letterhacks”
- Fan–publisher interaction

Notes:
> Connect to prior weeks on paratexts and participatory culture.

---

## Text Analysis (Preview)

**Next week**, we will:

- Identify patterns across many texts
- Compare language across issues and decades
- Use computational tools to support interpretation

**Today**, we prepare the data

Notes:
> Keep high-level; don’t go deep into methods yet. This is setup.

---

## Important

Most DH projects fail at the **corpus stage**, not analysis.

Notes:
> This is the key takeaway of the day. Repeat it later.

---

## Activity 1

### Explore Fan Mail

Find 2–3 examples:

- Title, issue, date
- Topics discussed, e.g., story, plot, dialogue, art, characters, creators, process, current events, mistakes, criticisms
- Writer details, e.g., gender, occupation
- Would this be an interesting find in a corpus?

After working individually, we will get together in small groups to share our findings, and then groups will report back to the full class.

Notes:
> Circulate and push students to justify inclusion/exclusion decisions.

---

## What Is a Corpus?

- Structured dataset
- Selected materials
- Designed for analysis

Notes:
> Distinguish from archive or random collection.

---

## Key Idea

Your corpus is an **argument**

Notes:
> Emphasize interpretive stakes. Every corpus encodes assumptions.

---

## Corpus Decisions

- Scope
- Unit of analysis
- Inclusion
- Exclusion
- Normalization

Notes:
> Give quick examples: letters vs pages vs issues; Marvel only vs multiple publishers.

---

## Bridge to Computational Analysis

Computational methods require:

- Clean text
- Consistent structure
- Defined units
- Metadata

---

No corpus → No computation

Notes:
> Explicit link to next week.

---

## Inclusion and Exclusion

Two ways to think about building a corpus:

---

### Model 1: Selective Corpus

- Choose specific items
- Filter by theme or interest
- Example: letters about continuity errors

---

### Model 2: Bounded Exhaustive Corpus

- Define a scope
- Include *everything within that scope*
- Example: all Spider-Man fan mail, 1963–1996

Note:
> Students often assume corpus-building means selecting interesting examples.
> That’s one valid approach, what we might call a selective corpus.
>
> But there’s another approach that I use in my own research:
> define a scope carefully, and then include everything within that scope.
>
> So instead of selecting individual letters, I select the *frame*, and then everything inside it counts.

---

## Corpus Strategies

| Strategy | Description | Example |
|----------|-------------|---------|
| Sampling | Select a subset | 100 letters across decades |
| Thematic selection | Filter by topic | Letters about continuity |
| Convenience corpus | Use what’s available | Digitized issues only |
| **Bounded exhaustive** | Define scope, include all | All ASM fan mail 1963–1996 |

---

### Key Idea

> You’re not choosing individual items; you’re identifying the rules for what counts

Note:
> Here are a few different strategies people use.
> Sampling and thematic selection are probably the most intuitive.
> Convenience corpora are also very common: just using what’s available digitally.
>
> What I tend to do is this last approach: bounded exhaustiveness.
>
> In my own research, for example, I’ve worked with:
> - all Spider-Man fan mail from 1963 to 1996
> - all poetry by Shelley and Swinburne
> - all issues of specific 1970s prozines
>
> Within that scope, I’m not selecting individual items; I’m including everything.
>
> But (and this is important) that doesn’t mean there’s no exclusion.
> The exclusion happens when I define the scope itself.

---

## Activity 2

### Define a Corpus

In groups:

- Research question
- Scope
- Inclusion/exclusion
- Unit of analysis

Notes:
> Have groups report one decision and one difficulty.

---

## File Naming

Bad:

`scan1.jpg`

`IMG_2045.png`

Good:

`asm_1965_i003_p012.txt`

Notes:
> Students often underestimate this; frame it as critical infrastructure.

---

## Naming Pattern

`[title]_[year]_i[issue ###]_p[page ###].ext`

Notes:
> Tell them they can adapt, but consistency matters more than perfection.

---

## Naming Rules

- lowercase
- no spaces
- consistent
- meaningful
- zero-padded numbers

Notes:
> Reinforce simplicity. Avoid overengineering.
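
---

## Naming Sketch (Python)

The naming rules can be applied by a small helper rather than by hand. This is a minimal sketch, assuming three-digit zero-padding as in `asm_1965_i003_p012.txt`; the function name `make_filename` and its parameters are hypothetical and can be adapted.

```python
import re


def make_filename(title: str, year: int, issue: int, page: int, ext: str = "txt") -> str:
    """Build a name such as asm_1965_i003_p012.txt (hypothetical helper)."""
    # Lowercase the title code and drop spaces and punctuation.
    code = re.sub(r"[^a-z0-9]+", "", title.lower())
    # Zero-pad issue and page to three digits (an assumption taken from the example name).
    return f"{code}_{year:04d}_i{issue:03d}_p{page:03d}.{ext.lower().lstrip('.')}"


# Issues given out of order still sort correctly once they are zero-padded.
names = [make_filename("ASM", 1965, issue, 12) for issue in (10, 2, 1)]
print(sorted(names))
```

Sorting the three generated names puts issue 001 first, then 002, then 010, which is the order that unpadded names would get wrong.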

---

## Zero-padding

```
issue1
issue10
issue2
```

Sorts as `issue1, issue10, issue2` ❌

```
issue01
issue02
issue10
```

Sorts as `issue01, issue02, issue10` ✅

---

## Filenames vs Metadata

Filenames:

- short
- stable
- consistent
- just enough information

Metadata:

- detailed
- flexible
- authoritative
- where meaning lives

Note:
> Early on, it’s helpful to include more information in filenames so you can keep things straight.
> But in more advanced work, the database becomes the authoritative source, and filenames become simpler and more stable.

---

## Folder Organization

Option A:

```
corpus/
  asm/
  foom/
```

Notes:
> Good for small, publication-centered corpora.

---

## Folder Organization

Option B:

```
corpus/
  raw/
  ocr/
  cleaned/
```

Notes:
> Better for workflows and larger datasets.

---

## Metadata File

| title | issue | year | page | type |
|-------|-------|------|------|------|

Notes:
> Explain that spreadsheets are the simplest database.

---

## Why Metadata?

- Searchable
- Sortable
- Scalable

Notes:
> Ask: what questions become possible with good metadata?

---

## Activity 3

### Build a Mini Corpus

- Rename files
- Create folders
- Build metadata

For this activity you may use these [raw image and OCR files](https://indiana.sharepoint.com/:u:/r/sites/msteams_36a456/Shared%20Documents/z616/ff_1967-1968_fan_pages_and_bullpen_bulletins.zip).

Notes:
> This is hands-on. Help students who get stuck early.

---

## Discussion

- What was difficult?
- What felt arbitrary?
- What would break at scale?

Notes:
> Push toward reflection on method, not just task completion.

---

## Key Takeaway

A well-built corpus enables analysis

A messy corpus distorts it

Notes:
> Repeat the central idea from earlier.

---

## Looking Ahead (Week 8)

With a corpus, we can:

- Count words and phrases
- Compare across issues or decades
- Identify themes and patterns
- Track changes over time

---

Today: build the dataset

Next: analyze it

Notes:
> Preview only.
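
---

## Sketch: Filenames to Metadata

One way to start the metadata file from Activity 3 is to parse the fields back out of consistently named files. This is a minimal sketch, assuming names that follow the `asm_1965_i003_p012.txt` shape; the `filename` column, the function names, and the second example file are illustrative assumptions, and `type` is left blank because it cannot be read from the name.

```python
import csv
import io
import re

# Assumed pattern: title_year_iISSUE_pPAGE.ext, all lowercase.
PATTERN = re.compile(
    r"^(?P<title>[a-z0-9]+)_(?P<year>\d{4})_i(?P<issue>\d+)_p(?P<page>\d+)\.[a-z0-9]+$"
)
FIELDS = ["filename", "title", "issue", "year", "page", "type"]


def parse_filename(name: str) -> dict:
    """Turn one filename into one metadata row (hypothetical helper)."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"does not follow the naming pattern: {name}")
    parts = match.groupdict()
    return {
        "filename": name,
        "title": parts["title"],
        "issue": int(parts["issue"]),
        "year": int(parts["year"]),
        "page": int(parts["page"]),
        "type": "",  # e.g. letter column; filled in by hand
    }


def metadata_csv(names) -> str:
    """Return CSV text with one row per file, sorted by filename."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for name in sorted(names):
        writer.writerow(parse_filename(name))
    return buffer.getvalue()


print(metadata_csv(["asm_1965_i003_p012.txt", "asm_1965_i003_p011.txt"]))
```

A file such as `scan1.jpg` raises a `ValueError` here, which is a quick way to find names that still need fixing before the spreadsheet is built.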

---

## Week 8 Teaser: What We’ll Do With Your Corpus

Example: Letter Column Analysis

- Count most frequent words (e.g., “Stan”, “story”, “art”)
- Compare across years (1960s vs 1970s)
- Track topics (praise, criticism, continuity errors)

---

> Same corpus structure → different questions become possible

Notes:
> Keep this light. The goal is to make students see payoff: their corpus enables analysis next week.

---

## Next

- Final Project Proposal
- Tales from the Crypt

Notes:
> Explicitly connect today’s work to upcoming assignments.